Introduction

01/28/2010

Updated: November 14, 2002

Component failure, power outages, user errors, application memory leaks, and other circumstances can make a data center unavailable. To increase the availability of a data center, you must develop and follow procedures to prepare for and minimize the impact of downtime. You must also perform regular backups to prepare for catastrophic disasters and database corruption. You can prevent component failures from causing the unavailability of a data center by deploying redundant components. You can minimize the time a data center is unavailable due to server failures and planned maintenance by using redundant servers.

Increasing the availability of a data center saves money by keeping it available to business users and customers, and it may save the business in the event of a catastrophic disaster. The availability of a data center can also have an impact far beyond your own company, for example by affecting access to client accounts and medical records.

The budget for increasing the availability of a data center is dictated by the cost of the database not being available. The higher the cost of unavailability, the more sophisticated and expensive the solutions should be to prepare for all types of disasters and to minimize the resulting downtime.

Microsoft SQL Server 2000 High Availability Series

The Microsoft SQL Server 2000 High Availability Series helps you plan and deploy a highly available data center that uses Microsoft SQL Server 2000. If you are a consultant, designer, or systems engineer involved in designing and deploying a SQL Server 2000 high-availability data center, this series is for you.

The Microsoft SQL Server 2000 High Availability Series is designed for business decision makers, Microsoft Certified Solution Providers, Microsoft Consulting Services, IT professionals, and developers responsible for application or infrastructure development and deployment.

This series assumes you have a basic understanding of the following areas of SQL Server 2000 and the Microsoft Windows 2000 operating system:

Windows 2000 Active Directory service and DNS
Microsoft Cluster Service (MSCS)
Database design for online transaction processing (OLTP) and online analytical processing (OLAP)
Application design for databases
SQL Server operational procedures
SQL Server backup and restore procedures
SQL Server security model
Redundant Array of Independent Disks (RAID)
Storage Area Networks (SANs)
Network Load Balancing (NLB)

The series consists of a Planning Guide and a Solution Guide. The Planning Guide helps you design a data center to achieve the level of availability needed for the business environment. The Solution Guide helps you implement selected server redundancy solutions to minimize unavailability caused by server failures and planned downtime.

The series provides guidance in achieving a highly available data center, including:

Understanding barriers to high availability
Developing procedures to minimize the risk and length of downtime
Using backups to recover from server failure and database corruption
Using redundant components to prevent downtime
Evaluating server redundancy solutions
Implementing server redundancy solutions

Note: This documentation is not intended to replace the product documentation for Windows 2000 or SQL Server 2000.

The Planning Guide provides a solid springboard for making effective decisions to help you build a highly available data center.

Planning Guide

The Planning Guide analyzes the barriers to availability and provides solutions to each of these barriers. It contains the following chapters:

Chapter 1: "Introduction to High Availability" - This chapter defines high availability, discusses setting availability goals, and lists the barriers you must address to successfully deploy a highly available data center. By the end of this chapter, you will understand high availability and how to set goals for achieving the level of availability required for the business environment.
Chapter 2: "Overcoming Barriers to High Availability" - This chapter discusses each of the barriers to high availability and the solutions to overcome these barriers. By the end of this chapter, you will understand the procedures that should be part of every SQL Server 2000 high-availability data center and the steps that you must take to prepare for, prevent, and minimize disaster.
Chapter 3: "Recovering a Data Center by Using Database Backups" - This chapter discusses using database backups to recover from catastrophic disaster caused by hard-disk failure, by application or user error, or by hardware-induced database corruption. By the end of this chapter, you will understand how the SQL Server backups increase availability, and how third-party backup solutions increase restoration performance and functionality.
Chapter 4: "Preventing Downtime by Using Redundant Components" - This chapter discusses using component redundancy to increase the availability of a single data-center server. This redundancy includes using server-class computers to provide component redundancy, using RAID arrays to provide storage redundancy, and using multiple networks to provide network redundancy. By the end of this chapter, you will understand the steps that you can take to prevent downtime caused by the failure of a single component.
Chapter 5: "Minimizing Downtime by Using Redundant Servers" - This chapter discusses server redundancy solutions that minimize the time required to recover from a server failure. By the end of this chapter, you will understand the redundant server solutions that you can deploy to minimize downtime. You will also be able to decide among these solutions based on the unique requirements of the business environment.

After you read the Planning Guide, you will understand how proper procedures increase availability, backups help recover from catastrophic disasters, redundant components prevent downtime, and redundant servers minimize downtime.

Solution Guide

The Solution Guide documents the steps required to implement each of the server redundancy solutions discussed in Planning Guide Chapter 5. The Solution Guide contains the following chapters:

Chapter 1: "Implementing Failover Clustering" - This chapter provides the steps to implement failover clustering, including the steps to fail over to a standby node and to fail back to the primary node.
Chapter 2: "Implementing Log Shipping" - This chapter provides the steps to implement log shipping, including the steps to fail over to a standby server and to fail back to the primary server.
Chapter 3: "Implementing Transactional Replication" - This chapter provides the steps to implement transactional replication for high availability, including the steps to fail over to a subscriber and fail back to the original server.
Chapter 4: "Implementing Network Load Balancing" - This chapter provides the steps to implement network load balancing, including the steps to fail over to a standby server and to fail back to the primary server.
Chapter 5: "Implementing Remote Mirroring and Stretch Clustering" - This chapter provides the steps to implement a stretch cluster, including the steps to fail over to a remote standby node and to fail back to the primary node.

Understanding High Availability

In the context of this series, high availability means increasing the availability of the data center itself. You increase the availability of a data center by:

Developing and following well-documented procedures
Ensuring the ability to recover from any type of disaster
Avoiding downtime and the need for restoration by using component and server redundancy
Minimizing how long data is unavailable after an accident or emergency
Providing the ability to sustain operations during a recovery period after an accident or emergency
Avoiding system downtime caused by planned maintenance

Even if you ensure the actual availability of the data center, users may perceive the data center as unavailable when they cannot access it. Some reasons users may not be able to access the data center are improper application design, inadequate security, network failures, and DNS problems. Although this series identifies these issues as barriers to high availability, resolving these issues is beyond the scope of this series.

Designing a High-Availability Data Center

To design a high-availability data center, start by doing the following:

Set high-availability goals for scheduled and unscheduled downtime.
Identify and analyze the actual and perceived barriers to high availability.
Determine and evaluate the solutions that overcome each barrier to high availability.

After completing these tasks, you are ready to implement the appropriate processes and solutions to increase the availability of the data center.

Setting High-Availability Goals

The first step in increasing the availability of a data center is to set high-availability goals. To do this, the decision-making group must thoroughly understand how the end-user community uses the data center and the importance of the data center to the profitability of the company.

Do the following when setting high-availability goals:

Identify stakeholders - Setting high-availability goals is the responsibility of many parties, and these goals must be appropriate to all stakeholders. The impact of high-availability goals on database administrators as well as on business users and customers must be evaluated. For example, although business users and customers might want 99.999-percent availability, database administrators must make clear the cost of achieving this high-availability goal.
Establish the value of availability - The value of availability determines the budget for achieving that availability. To decide how much money to invest in a high-availability data center, you must understand the value of the data center to the business. For example, each unavailable hour might cost a heavily used commerce Web site $100,000 in sales. This business impact can justify a major investment to prevent and minimize downtime and to ensure continuing customer goodwill. You must also understand that the costs of downtime are nonlinear. For example, a single five-minute outage may have far wider consequences and costs than five separate one-minute breaks in service.
Evaluate recovery point versus recovery time - When setting high-availability goals, you must determine whether it is more important to restore the data center to its exact state before failure, to recover quickly, or both. The answer to this question is a critical factor in determining the server redundancy solution. You must determine whether a solution that results in lost transactions is inconvenient, damaging, or catastrophic.
Plan when to maintain the data center - To determine the best high-availability solution, you must understand when users need the data center. For example, if the data center is not heavily used or is not used at all at certain times, you can perform maintenance operations during these low-use times at reduced cost. You must, however, pay attention to whether people in different time zones use the data center in different time windows. If use in different time zones eliminates a time window, you must find an alternative approach for maintenance.
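
The value-of-availability step in the list above can be sketched as a small calculation. This illustrative Python snippet takes the $100,000-per-hour figure from the commerce Web site example and estimates the yearly cost of downtime at several availability levels. It assumes a flat cost per unavailable hour, which is a simplification because real costs are nonlinear, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a 365-day year


def yearly_downtime_cost(availability_percent: float, cost_per_hour: float) -> float:
    """Estimate yearly downtime cost, assuming a flat cost per unavailable hour."""
    downtime_hours = HOURS_PER_YEAR * (1 - availability_percent / 100)
    return downtime_hours * cost_per_hour


COST_PER_HOUR = 100_000  # figure from the commerce Web site example in the text

for percent in (99.0, 99.9, 99.99):
    cost = yearly_downtime_cost(percent, COST_PER_HOUR)
    print(f"{percent}% availability: about ${cost:,.0f} of lost sales a year")
```

Comparing these estimates with the price of each candidate solution gives the decision-making group a first, rough sense of how large a budget the availability goal can justify.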

The ideal is round-the-clock, year-round availability (24 x 7 x 365), or 100-percent availability. The percentage of uptime you should strive for is some variation of 99.x percent, with an ultimate goal of five nines, or 99.999 percent. Three nines (99.9 percent) is an achievable level of availability using a single data-center server. Achieving five nines (99.999 percent) is unrealistic for a single data-center server because this level of availability permits only about five total minutes of downtime in a calendar year. Four nines (99.99 percent) is achievable, however, by using fault-resilient clusters with automatic failover. Five nines is achievable using advanced fault-tolerant computers. This guide addresses the steps required to achieve these levels of availability, including the prerequisites to these technology solutions.
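
To make the nines concrete, the following minimal Python sketch converts an availability percentage into the downtime it permits per year. It assumes a 365-day year, and the function name `allowed_downtime_minutes` is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap, 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by an availability goal."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


for label, percent in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    minutes = allowed_downtime_minutes(percent)
    print(f"{label} ({percent}%): about {minutes:.1f} minutes per year")
```

Under this assumption, three nines allows a little under nine hours of downtime a year, four nines less than an hour, and five nines roughly five minutes.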

Even minimal scheduled downtime, such as 2 hours a month (24 hours a year), reduces availability to 99.73 percent. You can increase availability to 99.93 percent by reducing scheduled downtime to 30 minutes a month, or 6 hours a year. If you use the primary data-center server only for production purposes and perform database backups, health checks, and other tasks on secondary servers that have copies of the same data, the chances of achieving 99.999-percent availability increase.
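
The percentages in this paragraph follow from simple arithmetic, sketched here in Python. The calculation assumes a 365-day year and counts scheduled downtime only; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a 365-day year


def availability_percent(downtime_hours_per_year: float) -> float:
    """Return availability as a percentage for a given amount of yearly downtime."""
    return 100 * (1 - downtime_hours_per_year / HOURS_PER_YEAR)


# 2 hours a month and 30 minutes a month, expressed as hours per year
for monthly_hours in (2.0, 0.5):
    yearly_hours = monthly_hours * 12
    print(f"{yearly_hours:g} hours a year -> {availability_percent(yearly_hours):.2f}% availability")
```

Any unscheduled downtime would lower these figures further, which is why the scheduled portion alone already rules out four or five nines in this example.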

So how many nines should you realistically pursue for a data center? Table 1.1 shows the levels of availability that have been achieved by some leading companies whose businesses depend upon high availability.

Table 1.1 Achievable levels of availability

Company	Level of availability	Description
NASDAQ	99.97%	Technology stock exchange with 2 million transactions a day (200 a second)
Barnes & Noble	99.98%	Electronic retailer with 5.6 million visitors a month
Quote.com	99.99%	Online financial site delivers 8.6 million page views a day
Buy.com	99.99%	Internet superstore with more than 2,000 concurrent visitors per minute

Because you can schedule planned system outages to have the least possible impact on a business, planned downtime is frequently treated differently than unplanned downtime. Whether planned downtime must be factored into the availability equation depends on business needs. A goal of four or five nines of availability for unplanned outages during scheduled business hours requires less of an investment than 24 x 7 availability, which must include both planned and unplanned system outages.

Identifying and Analyzing Barriers to High Availability

A high-availability barrier is anything that limits a data center’s availability. Because it is impossible to protect a business from every barrier that might arise, you must estimate the effect of each barrier in advance and determine the barriers that are cost effective to overcome. To determine an appropriate high-availability solution, you must first identify and analyze each barrier for the following:

The probable length of time the system will be down when the barrier causes a problem
The probability that the barrier will occur and cause unavailability
The estimated cost to overcome the barrier compared to the estimated cost of the unavailability

For example, to analyze the high-availability risk that a user will accidentally delete a portion of their data, you might do the following:

Estimate the time the data center will be unavailable because of this barrier to high availability.
1. If a copy of the data exists on a redundant server (and the error is discovered before it is duplicated to the redundant server), the deleted data can simply be copied back to the production server.
2. If a copy of the data does not exist on a redundant server or the data deletion has already occurred on the redundant server, you can restore the deleted data to an alternative server. After the accidentally deleted data is restored to the alternative server, it can be copied from that server to the production server by using Transact-SQL.
3. If recovering from this user error requires a restoration from backup on the primary server, the time the data center is unavailable includes the time to restore from backup plus the time needed to resubmit the transactions that occurred after the deletion (if these transactions are available).
Estimate the probability that this barrier will occur. The probability is affected by the application design and the training provided to the data center users.
Estimate the cost to overcome this barrier. The cost to prevent unavailability resulting from this barrier depends on the solution you choose. In addition, the cost to overcome this barrier may include additional user training and perhaps a redesign of the application.
1. If you maintain a redundant copy of the production database on a secondary server, the cost to overcome the barrier is measured by the cost of keeping a redundant copy of the data plus the time it takes to restore that data to the production server. This solution minimizes downtime, but it costs more to implement.
2. If you rely on database backups, the cost to overcome the barrier is measured by the time it takes to restore the data from backup plus the time it takes to resubmit the transactions that occurred after the deletion. This solution results in more downtime, but it costs less to implement.
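
The analysis above can be expressed as a rough expected-cost comparison. The Python sketch below is illustrative only: every number (incident rate, hours of downtime, hourly cost, solution cost) is a hypothetical placeholder, and the `Option` class and `expected_yearly_cost` function are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    downtime_hours: float        # hours unavailable each time the barrier occurs
    yearly_solution_cost: float  # cost to implement and run the solution


def expected_yearly_cost(option: Option, incidents_per_year: float, cost_per_hour: float) -> float:
    """Solution cost plus the expected cost of the downtime that still occurs."""
    expected_downtime_cost = incidents_per_year * option.downtime_hours * cost_per_hour
    return option.yearly_solution_cost + expected_downtime_cost


# Hypothetical inputs for the accidental-deletion barrier
INCIDENTS_PER_YEAR = 2   # assumed probability, expressed as expected incidents a year
COST_PER_HOUR = 10_000   # assumed cost of an unavailable hour

options = [
    Option("redundant copy on a secondary server", downtime_hours=0.5, yearly_solution_cost=50_000),
    Option("restore from database backup", downtime_hours=6.0, yearly_solution_cost=5_000),
]

for option in options:
    total = expected_yearly_cost(option, INCIDENTS_PER_YEAR, COST_PER_HOUR)
    print(f"{option.name}: expected yearly cost ${total:,.0f}")
```

With these placeholder figures the more expensive redundant server comes out ahead, but a lower incident rate or a lower hourly cost would reverse the result, which is why each estimate needs to reflect the actual business environment.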

Note: When you evaluate the cost to overcome a barrier, remember that a solution that overcomes one barrier may also overcome numerous additional barriers. The cost of all of the barriers resolved by a solution must therefore be weighed against the cost of the solution. For example, keeping a redundant copy of the production database on a secondary server can overcome many barriers.
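
The point of this note, that a solution should be weighed against all the barriers it resolves, can be sketched as follows. The barrier names echo the text, but every cost figure is a hypothetical placeholder and `is_cost_effective` is a name invented for this example.

```python
# Hypothetical expected yearly cost of unavailability for each barrier
# that a redundant secondary server would help overcome.
barrier_costs = {
    "accidental data deletion": 40_000,
    "disk failure": 30_000,
    "planned maintenance": 25_000,
}

SOLUTION_COST = 60_000  # assumed yearly cost of the redundant server


def is_cost_effective(solution_cost: float, resolved_barrier_costs: dict) -> bool:
    """A solution pays off when the combined cost it avoids exceeds its own cost."""
    return sum(resolved_barrier_costs.values()) > solution_cost


# Judged against any single barrier, the solution looks too expensive ...
for barrier, cost in barrier_costs.items():
    print(f"{barrier} alone: {is_cost_effective(SOLUTION_COST, {barrier: cost})}")

# ... but judged against all the barriers it resolves, it is worthwhile.
print(f"all barriers: {is_cost_effective(SOLUTION_COST, barrier_costs)}")
```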

Barriers to high availability can be actual barriers or perceived barriers. Actual unavailability means that the data center is actually down. Perceived unavailability means that the data center is functioning, but is not available to the business user or customer because of intervening problems, such as a network, Web site, or DNS failure.

You must carefully evaluate each process and system element to identify and analyze actual and perceived barriers to availability. These barriers include the following:

Environmental issues - Problems with the data-center environment itself can reduce availability. Environmental issues include inadequate cabling, power outages, communication line failures, fires, and other disasters.
Hardware issues - Problems with any piece of hardware used by the data center can reduce availability. Hardware issues include power supply failures, inadequate processors, memory failures, inadequate disk space, disk failures, network card failures, and incompatible hardware.
Communication and connectivity issues - Problems with the network can prevent users from connecting to the data center. Communication and connectivity issues include network cable failures, inadequate bandwidth, router or switch failure, DNS configuration errors, and authentication issues.
Software issues - Software failures and upgrades can reduce the availability of a data center. Software failure issues include downtime caused by memory leaks, database corruption, viruses, and denial of service attacks. Software upgrade issues include downtime caused by application software upgrades and service pack installations.
Service issues - Services that you obtain from outside a business can exacerbate a failure and increase unavailability. Service issues include poorly trained staff, slow response time, and out-of-date contact information.
Process issues - The lack of proper processes can cause unnecessary downtime and increase the length of downtime caused by a hardware or software failure. Process issues include inadequate or nonexistent operational processes, inadequate or nonexistent recovery plans, inadequate or nonexistent recovery drills, and deploying changes without testing.
Application design issues - Poor application design can reduce the perceived availability of a data center. Application issues include excessive blocking and locking, hard-coding of server names and IP addresses, and use of duplicate SQL Server logins.
Staffing issues - Insufficient, untrained, or unqualified staff can cause unnecessary downtime and lengthen the time to restore availability. Staffing issues include insufficient training materials, inadequate training budget, insufficient time for training, and inadequate communication skills.

This series addresses the identification and analysis of each of these barriers to high availability.

Determining and Evaluating High-Availability Solutions

The high-availability solutions discussed in this series include procedural processes, database backups, redundant components, and redundant servers. All these solutions are required to achieve a highly available data center. The remaining chapters in the Planning Guide discuss issues related to these solutions.

After reading the Planning Guide, see the Solution Guide to learn how to implement the server redundancy solutions presented in Planning Guide Chapter 5.