
Chapter 16 - Five Nines: The Ultimate in High Availability

Achieving true 24 x 7 uptime (100 percent of the time) is virtually impossible, if only because at some point an upgrade such as a service pack must be applied to a production server. This is where the concept of "nines" comes into play. The percentage of uptime all companies should strive for is some variation of 99.x percent, where x represents additional nines. Five nines is 99.999 percent uptime for your systems, which, to many, is the ultimate in availability. It is difficult to achieve, since it allows only about five total minutes of downtime in a calendar year. Three nines is a more practical number to shoot for: just short of nine hours of downtime per year, a very respectable number for any production environment.

Creating redundancy for your databases and servers can help you get to five nines. If, for example, you use your production system only for production, and you perform database backups, health checks, and other tasks on other servers that have copies of the same data, your chances of getting to 99.999 percent uptime increase.

Keep in mind, however, that not all downtime is planned. You may encounter a disk controller failure, a power failure, or memory leaks in the application causing a failure. These events need to be taken into account when you develop a disaster recovery plan. Such a plan will mitigate the risk of a failure.

Mitigating the risk and the chance of an unplanned failure is the ultimate goal. In this chapter, you will learn more about the process of creating a highly available system, with a corresponding disaster recovery plan to protect from the worst-case scenario.

Determine Your Desired Level of Nines

There are 8,760 hours in a year (24 hours a day x 365 days a year). It is not uncommon for a business like an e-commerce Web site to schedule one night per month for a few hours of planned system maintenance. If eight hours a month are scheduled as planned downtime, that is 96 hours over the course of a year, which means the company would have 98.9 percent uptime (or only one nine). If the maintenance window can be brought down to two hours per month by doing such things as having servers dedicated to checking database health and to testing and timing upgrades, the uptime percentage rises to 99.7. These examples demonstrate how difficult it is to achieve more than two nines, even with minimal planned downtime.

So how many nines can your organization realistically pursue? If you can make an educated guess as to the number of total hours down (planned and unplanned) you are likely to have, you can use these formulas (depending on whether you want to calculate your uptime for a year, a month, or a week):

% Uptime/year = ((8760 - number of total hours down per year)/8760) x 100

% Uptime/month = (((24 x number of days in the month) - number of total hours down in that calendar month)/(24 x number of days in the month)) x 100

% Uptime/week = ((168 - number of total hours down in that week)/168) x 100
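
As an illustration of the yearly formula, the following T-SQL sketch uses hypothetical figures (24 hours of planned maintenance plus 2 hours of unplanned outages over a non-leap year) to compute the resulting uptime percentage.

    -- Hypothetical figures: 24 hours planned plus 2 hours unplanned downtime
    -- over a non-leap year of 8,760 hours.
    DECLARE @HoursInYear decimal(9,3), @HoursDown decimal(9,3)
    SET @HoursInYear = 8760
    SET @HoursDown = 24 + 2
    SELECT CAST((@HoursInYear - @HoursDown) / @HoursInYear * 100
        AS decimal(6,3)) AS UptimePercentPerYear
    -- Returns 99.703, which is above two nines but short of three.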

Achieving High Availability with SQL Server 2000

To achieve high availability, you must be aware of the various components involved and how they are used together to deliver best results.

Application Design

Before buying redundant hardware and coming up with failover and failback scenarios, consider the application itself—it is why you are striving for five nines in the first place, so it should drive how you design your high availability solution. What is the architecture of the application? Does it lend itself to working in a high availability solution? Does it already have problems that may be magnified by a more complex hardware and/or software solution? Hardware or other software solutions (such as load balancers or disk mirroring tools) should not be used to solve problems in the application design.

Another issue to consider is data access (whether the access is needed by the application, a user, or a Web server). How is the database designed—does it all sit on one server, or is it partitioned over multiple servers? Any type of partitioned or distributed data, or the use of distributed partition views to create federated servers, complicates the five nines picture. An example of a federated server in a high availability solution appears later in this chapter.

There are many other application design issues to consider as you try to achieve high availability. Here are some recommendations:

  • Avoid anything that would cause blocking, especially in tempdb or system tables, since these conditions may cause long running transactions and, if left unattended, can potentially bring your SQL Server down. 

  • Do not hardcode server names, instance names, and IP addresses into the application. Instead, allow the connection to be made through a COM+ object, giving you not only a more flexible application, but also a more flexible disaster recovery plan. 

  • Since Microsoft® SQL Server™ 2000 supports collations at a more granular level than just the server level, make sure the database backup plan takes the proper collations into account (a quick check is sketched at the end of this list). 

  • If extended stored procedures are used, make sure the design is sound and will not accidentally consume additional system resources. 

  • If the application requires batch, scheduled, or other jobs run at various times—mainly in off hours—make sure that they will not interfere with other applications that may be part of the disaster recovery plan. 

  • Do not use the system or sample databases for the creation of application objects. 

  • When coding custom error messages, if the backup/failover server is going to host another application or database, make sure the error message numbers of one do not conflict with those of the other. 

  • Make sure application users are unique to prevent potential conflicts; with certain high availability solutions, conflicts may occur if two applications share the same username with different rights and responsibilities. 

  • Use integrated security. If special cases arise where other users would need access to the application (such as UNIX or Macintosh users), create logic in the COM layer to handle it. Using the standard SQL Server system administrator (SA) account in the application and its related tasks, packages, scripts, and so on, is not recommended.

  • Do not code for a specific service pack or release level. If the backup scenario requires one server to host more than one application, and if the application is not compatible with a certain service pack, it will not be a highly available application. 

  • If you require that the backend servers be transparent to the client, and no client configuration can be done, that fact will dictate which technology or technologies can be used (such as clustering or log shipping). This type of consideration should be done at the design stage. 

  • Use Microsoft Message Queuing (also known as MSMQ) as a component in an application. It can sit between the application and database to store and forward transactions to one or more SQL Servers.

  • Use XML to take advantage of a Web application and store the state of a transaction. SQL Server 2000 has built-in XML support. XML is an important part of Microsoft's .NET strategy. 

  • Use Component Services, available in Microsoft Windows® 2000, in conjunction with a COM+ object to achieve a two-phase commit. Remember, however, that two-phase commits may affect performance. 
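
As a quick check related to the collation recommendation above, the server-level and database-level collations can be inspected before the backup plan is finalized. This is only a sketch; the database name is hypothetical.

    -- Compare the instance default collation with a specific database's collation.
    SELECT SERVERPROPERTY('Collation') AS ServerCollation,
           DATABASEPROPERTYEX('Orders', 'Collation') AS DatabaseCollation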

Underlying Hardware and Software

After ensuring the application will support a high availability solution, you can focus on the underlying technology, both from a hardware and software perspective.

Choosing the Right High Availability Technology for Your Environment

One of the toughest decisions is to choose which method of achieving high availability is the right one for your environment. Cost is often a factor.

The Options

The combination of SQL Server 2000 and your operating system can deliver a number of technologies designed to improve your availability solution. These technology options include:

  • Failover clustering (SQL Server 2000 Enterprise Edition only) 

  • Log shipping (SQL Server 2000 Enterprise Edition only) 

  • Replication

  • Backup and recovery

  • Network Load Balancing (Windows® 2000 Advanced Server and Windows 2000 Datacenter Server) 

Failover Clustering 

Failover clustering is the most common way to create a highly available system. SQL Server 2000 failover clustering is built on top of industry-standard hardware and software technologies and Microsoft Windows Clustering, available in Windows 2000 Advanced Server and Windows 2000 Datacenter Server, or Microsoft Cluster Services (MSCS), available in Windows NT® 4.0 Enterprise Edition. It allows up to four servers, or nodes (with Windows 2000 Datacenter Server), to be linked together through a private network and a shared disk array so that services such as a SQL Server instance can fail over from one node to another in the event of a hardware or software failure on the owning node. Failover clustering protects from logical and physical server failures, but does not protect from denial of service attacks, corrupt data, or flawed software installations.

To an end user, application, Web server, or middle-tier component such as Component Services, the connection is broken during a failover, but reconnecting is all that is needed once the failover is complete. This is one of the strengths of this technology—the failover is virtually transparent to users accessing the database. Transactions in flux during the failover may be lost, but any committed transactions will be applied as part of the failover process. However, the servers must be located physically near each other unless hardware-assisted clustering is used.

For detailed information on SQL Server 2000 failover clustering, see Chapter 12, "Failover Clustering."

Log Shipping 

Another popular and common option for creating a high availability solution is SQL Server 2000 log shipping. This is a method in which a warm standby server is created from an initial backup and restore from a primary server. After that, subsequent transaction logs are backed up, "shipped," and applied to this warm standby server. This creates a server that is only a small delta of time out of sync with the primary server. Like failover clustering, it can protect from logical and physical failures, but does not protect against things like viruses and logical data corruption. Although log shipping and failover clustering are similar, they differ in many ways. For a direct comparison of the two options, see the next section in this chapter.

Log shipping is basically a manual failover: an operator must initiate it, and transactions on the primary server may not be able to be applied before the warm standby is brought online. End users will potentially have to reconfigure their clients to talk to a new server to get to their data, or middle-tier application servers or applications must be redirected to the new database. The hardest part of log shipping is the failback process needed to use the primary server again. Log shipping is easy to implement and administer, and does not saturate a network or server, making it an attractive option.
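
As a rough sketch of the manual failover step, bringing a log shipped warm standby online amounts to applying any remaining transaction log backups and then recovering the database. The database name and file path below are hypothetical, and a real plan would also synchronize logins and redirect clients.

    -- Apply the last shipped transaction log backup, leaving the database able
    -- to accept further log restores if more files turn up.
    RESTORE LOG Sales
        FROM DISK = 'D:\LogShip\Sales_tlog_latest.trn'
        WITH NORECOVERY
    -- Recover the database so users can connect to the standby server.
    RESTORE DATABASE Sales WITH RECOVERY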

For detailed information on SQL Server 2000 log shipping, see Chapter 13, "Log Shipping."

Replication 

Replication is a feature in all versions of SQL Server, although it is not specifically designed as a high availability solution like log shipping and failover clustering. It can create redundancy for data by creating duplicates of production databases, but a method would have to be devised to point an application, client, or Web server to the SQL Server, much like in log shipping. Moreover, there is a greater amount of administrative overhead and complexity involved in setting up and maintaining replication, especially if subscribers are allowed to update data. Using read-only replication options such as snapshot and transactional replication produces a better high availability solution.

Managing a large number of subscribers takes planning. If the version of SQL Server needs to be upgraded with a service pack, under most circumstances, upgrade the Distributor first, then the Publisher, and finally the Subscribers. In some cases, you may need to roll out all of them at the same time, if the underlying application requires the upgrade. Certain replication methods may also require making sure schema and security changes make it to the Subscribers.

Another consideration for using replication as part of a high availability solution involves the size of the data being pushed out over the network. With log shipping, the size of the files should be relatively small, but replication may generate more traffic on your network. Finally, if the Distributor server goes down prior to an update, transactions may potentially be lost; this is not an issue with failover clustering, since all transactions in flux are rolled forward or back during a cluster failover. The Distributor is the server that pushes the changes out to the Subscribers; the Distributor and Publisher should be separate servers if possible. If the tail of the transaction log cannot be retrieved, replication may provide a more current solution than log shipping, provided replication fits your architecture. The replicated database can also be available for reporting purposes without any special configuration or considerations.

Backup and Recovery 

If you use backup and recovery as your disaster recovery plan, consider the following:

  • Your recovery plan is only as good as your last known good backups. If backups are being made but never tested, and they turn out to be invalid, there may be no way to recover, depending on the final high availability solution implemented (a basic verification command is sketched after this list). 

  • You need to assess how long it will take to restore the backup. If it is going to take four hours to restore your very large database, and your e-commerce site is losing $50,000 an hour, you will want to consider this fact in your plan. Also consider that something may happen during the restore process that increases the restore time. Plus, if you are not using transaction marks in your database (commonly used with partitioned data), and you are restoring due to data issues rather than a hardware failure, you will have to restore, check the data, restore, check the data, find the error, and then do it all over again.

  • You need to consider how long it takes to roll any transaction logs forward. 

  • You should know where all backups of the database in question are stored, and when each was made. For more detail, see "Creating a Run Book" in this chapter. 
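
As a minimal check for the first point in this list, a backup file can at least be verified as readable; the path below is hypothetical. RESTORE VERIFYONLY does not prove the data itself can be restored, so periodic test restores are still needed.

    -- Confirm that the backup set is complete and the file is readable.
    RESTORE VERIFYONLY FROM DISK = 'E:\Backups\Orders_full.bak'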

SQL Server 2000 supports snapshot backups, which involve mirroring your disks and then breaking a complete set of disks off the mirror to use as a backup. This requires specialized hardware, but is fully supported within SQL Server. Because they greatly reduce backup and restore times, snapshot backups can contribute to higher availability.

If failover clustering is used, any third-party backup programs used must be cluster aware. Should the software not be cluster aware, it may result in unplanned downtime.

Network Load Balancing 

While Network Load Balancing evenly distributes processing load among different servers, SQL Server 2000 itself cannot be load balanced. Network Load Balancing is intended for TCP/IP based services that either do not maintain state on each server or are designed to avoid storing state on the server, such as Internet Information Services (IIS) and Terminal Server. This option would work well as a high availability front end connected to a highly available clustered SQL Server 2000 back end (which may employ some of the technologies detailed here), since the load is balanced between servers, moves automatically in the event of failure, and renders SQL Server completely transparent to the end user.

The Choice

The choice of operating system and version of SQL Server affects which high availability options are available. If possible, use Windows 2000 Advanced Server and SQL Server 2000 Enterprise Edition. (As mentioned, failover clustering and log shipping both require SQL Server 2000 Enterprise Edition.) You may want to choose more than one high availability method.

When choosing a method, ask yourself the following questions:

  1. How are we going to switch to the standby server, and do we have the infrastructure and people to accomplish the switch successfully? 

  2. How many transactions can we afford to lose? 

  3. How long can we afford to be down? 

  4. Does the backup server have enough capacity? 

  5. And if the backup server fails, do we have a backup for that server? How redundant does our environment need to be?

  6. Is the application able to function if it is pointed to another database server? 

In any high availability solution, the first option considered should be a failover cluster. This provides the quickest and most automatic failover and is virtually transparent to the end users. A potentially less expensive alternative is log shipping, which gives a high level of protection, but not as much as failover clustering, because in an unplanned failure only failover clustering can guarantee that no completed transactions are lost. A tradeoff also comes in if and when you need to flip the switch on the warm standby server.

The following list compares failover clustering and log shipping on several criteria.

  • Failover. Failover clustering: fast and automatic; virtually transparent except for a small disconnect. Log shipping: slower and manual; involves changing the application, connectivity, or client, and requires additional processes to be run to bring the warm standby online. 

  • Cost. Failover clustering: more expensive. Log shipping: potentially less expensive. 

  • Full-Text, DTS, replication. Failover clustering: fails over cleanly. Log shipping: does not work automatically in a failover; needs manual intervention. 

  • Logins. Failover clustering: the entire server fails over, so all logins are transferred. Log shipping: a job must be created to synchronize users. 

  • Distance. Failover clustering: short unless distance clustering is used. Log shipping: servers can be geographically dispersed. 

  • Secondary server. Failover clustering: the failover node may have other active virtual servers, but is considered passive for the virtual server currently running on the designated primary. Log shipping: the warm standby can be used for other databases, and the log-shipped database can be used for maintenance tasks. 

  • Transactions. Failover clustering: all transactions are rolled forward if completed, or rolled back, during the failover process. Log shipping: if the primary server is unavailable, the last transactions cannot be retrieved, and the warm standby will be a small delta of time out of sync. 

  • Failing back to the primary server. Failover clustering: automatic or manual, with minimal downtime. Log shipping: manual, with unknown downtime. 

Many companies implement both failover clustering and log shipping to have a backup to the primary redundancy option, even though technically the other node(s) in a failover cluster solution are the backup plan.

If merge or transactional replication is a requirement of the overall application architecture, log shipping complements those technologies. For more information, see the topics "Strategies for Backing Up and Restoring Merge Replication" and "Transactional Replication and Log Shipping" in SQL Server Books Online.

Log shipping is also a good choice with Network Load Balancing. This scenario assumes that the SQL Server used by the Network Load Balancing servers is read-only, since the combination is best suited to a Web-based environment, such as an e-commerce site that uses catalog servers. Replication, using transactional or snapshot replication, is also a choice for read-only data and, like log shipping, can be used with Network Load Balancing. Network Load Balancing can be used to mask a server name change.

Design either of these effective combinations:

  • Distance clustering with log shipping. This solution may provide the ultimate in protection, since all nodes may be completely geographically dispersed, and the solution is built completely on enterprise class servers and software. 

  • Failover clustering with hardware mirroring. This solution will protect the primary instance with redundancy, and hardware mirroring can be used to create the failover/standby copy of the database if the failover cluster fails completely. 

Other combinations and solutions are possible. The ones listed above are only some suggestions that could be implemented in your environment.

Designing Hardware for High Availability

After choosing the right SQL Server high availability method, you must choose reliable and scalable hardware to support it. The hardware must be fault tolerant, meaning the system can gracefully respond to an unexpected event. A few factors come into play when designing a highly available hardware solution:

  • What is the current workload of the application or Web site, and what does it project to be six months, a year, or even two years from now? 

    Most people do not have this information prior to implementing a solution. Having benchmarks on how an application or site performs is critical in determining which operating system and what hardware to buy. The best way to benchmark an application prior to production is in a lab environment, but that is not always an option for some environments. Using tools such as System Monitor (Performance Monitor in Windows NT 4.0) can also establish performance trends. This method may not be as accurate as metrics obtained in a controlled lab environment, but would still serve as a baseline. Without a baseline, it will be hard to determine exactly what is needed. Also, any application issues affecting performance, either in current production versions or updates being planned, should be taken into account; SQL Server and its underlying hardware platform will only perform as well as the code that was written to access it. 

  • How much money is budgeted to the project? 

    In theory, organizations would all build the biggest and best boxes on every project. In reality, budget considerations often place constraints on what you can do. A good rule of thumb is to plan for at least six months' worth of capacity, but also take into account the full period, whether it is a year, two years, or more, during which the server being configured will still be employed and expected to perform. 

Processors

Depending on which operating system you choose, different numbers of processors would be available for use.

  • Windows NT 4.0 Enterprise Edition: 8 processors 

  • Windows 2000 Advanced Server: 8 processors 

  • Windows 2000 Datacenter Server: 32 processors 

Testing the performance of an application in a lab or some sort of controlled environment should determine the basic processing power needed. Another way to do this is to analyze the growth of your system over time by regularly recording pertinent statistics and graphing them. This is an important step, since the configuration of a production server should take into account the current workload as well as growth over time. From an operating system perspective, failover clustering requires Windows 2000 Advanced Server or Windows 2000 Datacenter Server (or Windows NT 4.0 Enterprise Edition with MSCS). Windows 2000 Datacenter Server requires a Service Level Agreement and is specifically meant to be used for high availability. If the operating system is not listed above, it does not support more than four processors.

Note Windows 2000 Datacenter Server is part of the Windows Datacenter Program (WDP). For more information, see http://www.microsoft.com/windows2000/.

Memory

Depending on which operating system is employed, SQL Server can take advantage of different maximum amounts of memory. An installation of SQL Server 2000 Enterprise Edition supports up to 64 gigabytes (GB) of memory on Windows 2000 Datacenter Server (with AWE enabled). Below is a list detailing the maximum amount of memory available to SQL Server per operating system.

  • Windows NT 4.0 Enterprise Edition: 3 GB 

  • Windows 2000 Advanced Server: 8 GB (with AWE enabled) 

  • Windows 2000 Datacenter Server: 64 GB (with AWE enabled) 

Address Windowing Extensions Memory 

Windows 2000 servers introduced the enhanced Address Windowing Extensions (AWE) API. AWE allows applications to access large amounts of physical memory. Due to limitations of 32-bit memory addressing, only up to 4 GB of physical memory can be used by Windows NT 4.0 and Windows 2000. By default, 2 GB of memory is dedicated to the operating system and 2 GB of memory to the application.

With a /3GB switch in the BOOT.INI used by the operating system, the application can access up to 3 GB of memory, and the operating system is reduced to 1 GB of memory. The difficulty presented by this design is that even if a server was configured with 8 GB of memory, anything past 4 GB was virtually unusable. This problem limited the scalability of an application. AWE is the support built into the operating system for exposing extended memory to Win32-based applications.

With AWE, however, a memory-intensive application can now run much more efficiently under SQL Server 2000 to increase performance.

AWE requires an application to be coded specifically for AWE, like SQL Server 2000. AWE support within SQL Server 2000 must be configured using the sp_configure option awe enabled. This is set per instance. By default, awe enabled is set to 0, or off. Enabling AWE support in SQL Server 2000 also requires some additional operating system configuration. Consult SQL Server Books Online for more information. If enabling AWE memory, it is highly recommended that the configuration be tested prior to bringing the server(s) online in a production capacity.
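
The following sketch shows the general shape of enabling AWE and capping memory for an instance. The 7168 MB value is only an example for a hypothetical server with 8 GB of RAM, and the awe enabled setting takes effect only after the SQL Server service is restarted.

    -- 'awe enabled' is an advanced option, so expose advanced options first.
    EXEC sp_configure 'show advanced options', 1
    RECONFIGURE
    GO
    EXEC sp_configure 'awe enabled', 1
    RECONFIGURE
    GO
    -- Because an AWE-enabled instance does not manage memory dynamically, set
    -- an explicit ceiling (in megabytes) to leave room for the operating system.
    EXEC sp_configure 'max server memory', 7168
    RECONFIGURE
    GO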

There are considerations to take into account when implementing AWE memory:

  • The SQL Server instance does not dynamically manage the size of the memory address space used.

    When AWE is enabled with SQL Server 2000, if the max server memory configuration parameter is not set, it will grab the total memory available (except 128 MB to allow the base operating system to function), potentially depriving the operating system and any other processes that would be running on the same server. For detailed information on configuring AWE in a SQL Server 2000 failover clustering environment, see Chapter 12, "Failover Clustering." 

  • Once initialized, AWE memory holds all the memory acquired at startup until the instance is shut down. 

    If AWE is enabled and is taking too much memory, SQL Server would need to be shut down to reconfigure it, causing downtime (which makes a high availability option like failover clustering less available). Be sure to test for this scenario. 

  • Since the memory pages used by the SQL Server instance are taken from the nonpageable pool of Windows memory, none of the memory can be swapped out to the page file. 

    This means that if the physical memory is filled up, it cannot use the page file set up on a physical disk to account for the surplus in memory usage. Planning how AWE is used is vital with any SQL Server instance. 

For more information on configuring AWE memory on your server, see "Using AWE Memory on Windows 2000" in SQL Server Books Online.

Disk Storage

The heart of any database system is its storage—it contains the valuable data used and inserted by an application. For high availability, disks should be part of an external array and should be fault tolerant. The disks should be high speed for performance and support a large capacity.

The disks can be configured with either a small computer system interface (SCSI) or Fibre Channel; Fibre Channel is the preferred implementation. Historically, SCSI has been the popular method for implementing a disk array, and while it is generally cheaper, it has a distance limitation and does not perform as well as Fibre Channel, which was specifically designed for high bandwidth and capacity. Storage Area Networks (SANs) are disk arrays that use networking protocols over Fibre Channel for all I/O. SANs may be supported in conjunction with failover clustering as a Cluster/Multi-Cluster Device. SCSI is not supported for Windows 2000 Datacenter Server, which is a consideration to take into account when deciding on hardware and the operating system.

When configuring the disks in the array, they should be configured as a redundant array of independent disks (RAID). There are different levels of RAID. The most popular and common options are:

  • RAID 0 – Striping 

    RAID Level 0 stripes data across multiple disk drives with no redundant information, so it is not fault tolerant. Performance is maximized when data is striped over multiple controllers with one drive per controller, and all of the storage capacity is available, but if one disk fails, the entire array becomes unavailable. RAID Level 0 requires a minimum of two drives to implement. 

  • RAID 1 – Mirroring (Duplexing) 

    RAID Level 1 gives you the highest availability but can be expensive, because it requires twice the desired disk space. With RAID 1, data is duplicated over separate disk drives (hence the mirroring). Reads may be even faster if the RAID array controller allows simultaneous reads from both members of a mirrored pair. If the disks are on separate I/O buses, also known as duplexing, higher availability may be achieved. RAID Level 1 requires a minimum of two drives to implement. 

  • RAID 5 – Striping and Parity 

    This is the most common, and most affordable, RAID level. Availability is average. Parity and data are striped across a set of disks, so multiple read requests can be handled independently of each other, providing high read performance. Because parity information is maintained, a single disk in the set can fail without losing access to the data. RAID Level 5 requires a minimum of three disk drives to implement, and the parity information is distributed across the drives in the set.

    The tradeoff comes with writing, which is slower than in other forms of RAID. Each write requires four independent disk accesses to complete: the old data and old parity are read from separate disks, the new parity is calculated, and the new data and new parity are written to separate disks. Some hardware vendors compensate by adding write caching to their hardware. However, because the set can survive the loss of only a single drive, RAID 5 is not the best choice for high availability.

  • RAID 0+1 – Mirroring and Striping 

    This RAID level provides very high reliability combined with high performance. Essentially, it is a combination of Levels 0 and 1. One expense lies in the fact that you will use half of your physical disks for the mirror. Now that the standard disk sizes are larger than 18 GB, space should no longer be an issue for sites looking to invest in a scalable and available system. You can lose fully half the drives in the set without losing access to the data, provided no two are in the same mirror set. RAID 0+1 (sometimes called RAID 10) can read data from both the hot disk and the mirror as a performance boost, and it performs simultaneous writes to the hot disk and the mirror. This is the best solution for a database server requiring high performance and fault tolerance, but capacity may be an issue due to cost and the number of disks required. It requires a minimum of four disk drives to implement. 

The RAID controller cards should have multiple channels, and can be internal to the server or part of the RAID array itself; it is generally recommended that they be part of the array rather than internal to the server. The controller should have multiple channels, or independent buses, so that different disks can essentially be configured to "talk" over different pipes in and out. This becomes important in the final configuration, because if the disks become a bottleneck due to contention, that is a potential point of failure as well as a performance problem. If possible, use more than one disk controller card to split the I/O to the disks.

If failover clustering is used, make sure writeback caching is disabled if the RAID controller is internal to the server and not the disk array. The reason is that, even with battery backup, items could remain in the cache when the resources fail over to another node. If the services are later failed back to that node, corruption could occur because the controller will attempt to write the stale cached data over what is on the disk. Always get a guarantee from the hardware vendor that data in the cache will be written to disk in the event of a failure.

For more information on hardware, see Chapter 33, "The Data Tier: An Approach to Database Optimization."

File Placement 

Data and log devices, as well as tempdb, should be placed on separate disks. If possible, put them all on different channels. If your system is very large, or has hotspots, you may decide to use filegroups as a method of splitting up the disk I/O. Further segmenting this by putting filegroups on different disks and different channels can result in a performance boost. It is important to keep the file placement in mind when you are analyzing your high availability design. Performance issues caused by bottlenecks can be misinterpreted as availability problems.
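
A minimal sketch of this separation at database creation time follows; the database name, drive letters, paths, and sizes are hypothetical and simply assume the data and log drives sit on different channels.

    -- Place the data file and the transaction log on physically separate drives.
    CREATE DATABASE Orders
    ON PRIMARY
        (NAME = Orders_data, FILENAME = 'F:\SQLData\Orders_data.mdf', SIZE = 2048MB)
    LOG ON
        (NAME = Orders_log, FILENAME = 'G:\SQLLogs\Orders_log.ldf', SIZE = 512MB)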

Data should be configured with RAID 0+1 for high availability and performance, but RAID 5 is also acceptable (however RAID 5 may result in much slower restore times). Logs should be placed on RAID 1.

For more information on this subject, see Chapter 33, "The Data Tier: An Approach to Database Optimization."

Controller Configuration

Choose a card with enough channels to split the logical grouping of disks (for example, data and logs) to reduce I/O contention. If the Fibre Channel/RAID controller is internal to the server, and not in the shared disk array, writeback caching should be disabled. This is most important with failover clustering because, even with battery backup, items could remain in the cache when the resources fail over to another node; if the services are later failed back to that node, corruption could occur because the controller will attempt to write the stale cached data over what is on the disk. Writeback caching can also cause data loss in a failover if transactions were in the cache but not yet written to disk.

Network Design

All high availability options require a reliable, high-performing network. Especially if options like log shipping and replication are involved, network latency looms large in the picture. If the primary production SQL Server cannot connect to the server that may be the backup, there is no point in implementing the high availability solution until connectivity is reliable. Additionally, if there are connection problems between the clients and the SQL Server, these will be interpreted as availability problems, and create the perception that you are not meeting your five nines service level agreement.

If larger bandwidth is required, SQL Server 2000 has support for higher bandwidth networking with Giganet or Compaq Servernet II technology on Compaq hardware. If used, these will create higher performance between multiple SQL Servers. Giganet support is built in, and the update to enable Servernet II is located at http://www.microsoft.com/sql/.

Hardware Location

For the purposes of disaster recovery, the primary server and backup/warm/standby server(s) should never sit in the same location (with rare exceptions, such as some clustering solutions). A high availability solution works best by dispersing the parts of the puzzle in different places. This ensures that in the event of a catastrophe at the primary site, the backup plan does not go up in smoke with it. If all servers must sit in the same data center, make sure they are on completely separate power grids. That will at least provide some protection against one kind of disaster.

When placing the hardware in your data center, treat it as you would a mainframe. Place it on a raised floor to protect the cabling. Make sure that an uninterruptible power supply (UPS) is attached to each server, and that the data center has proper backup generators in place should a disaster happen. Also ensure that the data center has the proper cooling facilities; failures due to system overheating are preventable.

For additional information about best practices for your data center, see Chapter 14, "Data Center Availability: Facilities, Staffing, and Operations."

Hardware Compatibility List

Before deciding on all final hardware, consult the Hardware Compatibility List (HCL). All hardware must appear on the HCL, or the cluster configuration will not be supported. Specific cluster categories are listed, including "Cluster." Only complete cluster solutions listed under the "Cluster" category are supported for use with SQL Server 2000 failover clustering.

The HCL can be found at http://www.microsoft.com/whdc/hcl/default.mspx.

Service Level Agreements (SLA)

Make sure the level of support purchased for your hardware matches your availability requirements. In the event of a hardware problem, a 24-hour turnaround can be unacceptable for a mission-critical server. Any vendor selling a Windows 2000 Datacenter solution is required to provide an SLA, and this is something to consider.

Creating a Disaster Recovery Plan

Along with the technical portion of your high availability solution, put a detailed disaster recovery plan in place as well. Proper planning will also aid in troubleshooting what went wrong so it can be corrected and prevented in the future. All problems and their resolutions, whether trivial or major, should be logged as they occur, since they may assist if the system goes down at a later date.

Preparing Your Environment

Operational excellence is a prerequisite to devising a disaster recovery plan. Below is a list of actions to take to ensure this operational excellence.

  • Make an investment in continued professional training and internal cross training of your current staff. 

  • Keep a record of all the contact people (including developers) for an application, with notes on what they typically can deal with. 

  • Keep a record of all related systems. For example, if this system fails over to a remote site, who would you need to notify? If there was a production change (say, a table changed), who would you need to notify so all related reports could be updated, so all data pushes could be adjusted, and so on?

  • Make sure all support contracts are current and the numbers to call for support are documented along with any pertinent information and the justification process for placing a support call. This will ensure that the proper hardware and software support will be available at a moment's notice. This becomes more important if hardware fails and a component would need to be replaced.

    Treat your production boxes seriously.

    • Lock down security—make sure only the people who need access to the boxes can get to them. This means no blank sa passwords, obvious administrator passwords, and so on. If need be, only allow access to critical servers from certain known and trusted IP addresses. 

    • Change control should also be applied here—any changes to the production environment such as a service pack should be held to the same standard as changes that come from development, and should be tested in a staging environment. 

For more information, see Chapter 8, "Managing Database Change." For a full discussion of managing operational changes, visit http://www.microsoft.com/mof/.

The Failover Plan

By now, you have chosen one or more technologies on which to build your high availability solution. Now the question is: How are you going to switch over to the designated backup server? If the solution is configured properly and tested, and the process is documented, it should be an easy switch. Part of the plan, if applicable, is to ensure that no other users have access to the database or server while bringing it online. This may involve using single user mode. Another part of the plan is having a way to notify end users in the event of a failover, especially if it will take any length of time.
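
Where single user mode is appropriate, the restriction can be put in place and removed with ALTER DATABASE, as in this sketch; the database name is hypothetical.

    -- Disconnect other users and prevent new connections while the switch is made.
    ALTER DATABASE Inventory SET SINGLE_USER WITH ROLLBACK IMMEDIATE
    -- ...perform and verify the failover steps documented in the run book...
    -- Reopen the database to all users once the failover is complete.
    ALTER DATABASE Inventory SET MULTI_USER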

For detailed information on configuring failovers with failover clustering and log shipping, consult Chapter 12, "Failover Clustering" and Chapter 13, "Log Shipping," respectively.

The Failback Plan

Consider the following questions when you develop a disaster recovery plan: What do you do to get back to your original primary server? Do you do it immediately? On a scheduled basis? At a lull? Or do you go back at all? Have a hierarchy for recovery: If multiple failures happen at once, what gets the priority? Is one dependent on another?

If your servers are of equal hardware and capacity, you may have no reason to ever fail back to the original server, unless you have a business need or desire. Analyze the failed server to determine the root cause and prevent failures from recurring; this can provide valuable information in the event that you decide to involve technical support. You may find that changes are driven back into the design or code of the system to make it more robust.

For detailed information on configuring a failback with failover clustering and log shipping, consult Chapter 12, "Failover Clustering" and Chapter 13, "Log Shipping," respectively.

Personnel

The most important aspect of any disaster recovery plan is the human one: choosing the right people to execute and staff the plan. Only trained employees should be entrusted with the responsibility of carrying out the final disaster recovery plan in the event of an emergency. A well-established shift rotation should be in place and documented, and contact information should be entered into the run book (described below) to ensure easy accessibility to that information.

Also be sure you have an up-to-date escalation structure, and/or a clear chain of command, for dealing with disaster recovery. Assign specific tasks and roles to individuals, and ensure that each person knows what is expected of them. This will minimize any confusion. If you live in an area that has predictable disasters, make plans that accommodate this.

Creating a Run Book

Crucial to the success of a disaster recovery plan is having a document or collection of documents that details not only the steps of the plan, but also any related documentation that would aid in the recovery and diagnosis of a disaster. Such a document is sometimes referred to as a "run book," and should be kept up to date and readily available to the whole team. It includes, but is not limited to:

  • The failover plan. 

  • The failback plan. 

  • Backup file information and primary, secondary, and tertiary locations. 

    In the event of a system failure, a backup would need to be restored. 

  • Location of software, including documented licensing keys and support numbers. 

    In case of complete system failure, where everything would need to be reinstalled, the operating system and SQL Server 2000 software should be readily available with its corresponding licensing information. 

  • Contact information. 

    This includes a list of all personnel and all relevant contact information for them. This should include everyone: administrators, developers, help desk employees, users, and also other related systems that depend on information in this system. 

  • Server configuration. 

    If a software installation or hardware fails, the server itself may need to be rebuilt. You may choose to include the worksheets for log shipping and failover clustering, found in Chapter 12, "Failover Clustering" and Chapter 13, "Log Shipping," respectively. Other things that may be helpful:

    • Operating system version, with service pack level, and settings at the OS level 

    • Registry settings (if applicable) 

    • Physical and logical disk configuration, including RAID levels and disk controller information (including write cache settings) 

    • Drivers and firmware 

    • SQL Server installation configuration, including the installation and service pack levels, and any hotfixes that were applied 

    • SQL Server instance names, IP addresses, ports, configuration options, database file locations, and so on 

    • File shares. Also note any shares, no matter whether they are mapped, reached by universal naming convention (UNC) names or through some special protocol, to which the service login must have permissions. 

    • Collations 

      Prior to SQL Server 2000, collations were set only at the server level. With SQL Server 2000, collations can also be set at the database, column, and expression level. This information is crucial for restoring a database properly. 

  • Database Schemas. 

    These should also be saved out to script files for easy recreation. 

  • Jobs. 

    A list of all automated SQL Server Agent jobs, what they do, their corresponding code, and what time(s) they are executed is necessary in case they need to be recreated. They could also be scripted out for possible recreation.

  • Replication setup (if applicable). 

  • Log shipping setup (if applicable).

  • Information related to distributed databases or partitions (if applicable): such as Data Dependent Routing Tables, distributed transaction marks, and so on. 

  • Linked server information and related security. 

  • Information on maintenance (for example, whether a remote server controls your maintenance jobs). 

  • DTS packages, including associated login and password information, and any other custom-built code that runs on the server. 

  • Configuration information for any other software that runs on the same server. Make sure complete installation and configuration documentation is available, and/or that qualified support personnel are listed as contacts. Also list support numbers and Web sites for each piece of software. 

  • Any custom-made DBA objects that you depend on to maintain or administer the server. 

  • Any specialized configurations, such as XML support for IIS, Microsoft Active Directory™ support, and so on. 

Other information can be in the run book as well, but the above is a helpful guideline to start with.

Testing the Plan

Last, but not least, test the disaster recovery plan. The business consequences of an untested plan can be very significant.

The following procedures will be useful as you conduct your "fire drill":

  • Schedule a test. 

    After the solution is put in place, testing of that solution should be done on a scheduled, regular basis. This is the only way to make sure that all personnel are trained and confident in what needs to be done. No testing should be done during a normal business day unless the servers involved have not been put into production yet. If the servers are not in production, testing is easier and can be done at any time prior to going live. If the servers are already in production, the testing should be done in off-peak hours so business activities are not adversely affected.

  • Log the test. 

    Using the aforementioned run book, log all tests of the disaster recovery plan. Note the day, start time, end time, whether or not it was a success, and, in the event of a failure, why it happened. This log is crucial for tracking as well as diagnosing. No greater confidence can be gained than knowing the plan works and is documented to work. And if a test fails with a problem that has happened before, you already know how to deal with the anomaly.

    At some point you may have to put your disaster recovery plan into action. If this occurs, make sure to document the entire process. 

Diagnosing a Failure

In the event of a failure, you should try to determine the root cause, although this may be difficult depending on the failure itself. To assist in this effort, make sure the following measures are in place:

  • Whatever was running on the server at the time of the failure should also be known, whether it was just the application, specific application tasks (such as daily processes), or anything else that would make a possible suspect from the application/database angle. An effort should also be made to document whether anyone on the operations staff or among the application administrators was running a special process at the time.

  • Back up the database and save it off for analysis, and, if possible, get a transaction log backup as well (see the sketch after this list).

  • Any error message present on the screen when the operator arrives should be carefully and completely documented. 

  • Make sure the Windows application, system, and security event logs are available. They should be backed up from time to time to make sure that they can be used for historical research in the event of a failure. 

  • Similarly, SQL Server error logs should be backed up and archived for potential use to see what was going on at the SQL Server level at the time of the failure.

  • If C2 auditing is enabled, make sure those logs are available and were part of a backup scheme. This will help to determine who or what may have caused the problem. 

  • The location of all log files should be standardized, well known and documented in the run book. 
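
As a sketch of the database backup point above, both the database and its transaction log can be captured for later analysis; the database name and paths are hypothetical.

    -- Preserve the state of the database and its log for root cause analysis.
    BACKUP DATABASE Orders TO DISK = 'E:\Diag\Orders_postfailure.bak'
    BACKUP LOG Orders TO DISK = 'E:\Diag\Orders_postfailure_log.trn'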

High Availability Scenarios

This section provides some typical scenarios for implementation of high availability solutions. All examples assume no deficiencies in application logic. The companies cited in the following scenarios are fictional.

Corporate Web Site with Dynamic Content, no E-Commerce

Background

A global recruiting company recently put up its official Web site. The site is an application that provides information about the company and its recruiters, and allows potential candidates to search for jobs and send in resumes. The company has offices in New York, Boston, Chicago, Los Angeles, Atlanta, London, Tokyo, Paris, Toronto, and elsewhere.

The application defaults to the local database, but jobs can be searched from all over the world. Each location has its own single instance two node failover cluster, and that database is log shipped to a regional server sized to handle all databases for its region (that is, all of the U.S. east coast, Europe, Southeast Asia, and so on). Log shipping sends the transaction logs to the warm standby server every 15 minutes, and the database name on the warm standby corresponds to the city name. Each virtual server is linked to the others around the world. Web users connect to a farm of Web servers in each location that point to the proper database. The log shipped servers all employ snapshot backups to one server in the global headquarters in New York City as a last line of defense. While the Web site is not a strict 24 x 7 shop, it is expected to be up as many hours of the day as possible, even in the event of an emergency. Some loss of transactions is considered acceptable.

Technologies Used
  • SQL Server 2000 Enterprise Edition 

  • Windows 2000 Advanced Server 

  • Network Load Balancing 

  • Internet Information Services 

  • Failover clustering 

  • Log shipping 

  • Snapshot backups 

Scenario

The company: 

Imagine that Tokyo experiences an earthquake. The data center powering the failover cluster loses all power—no connectivity whatsoever. The Global Operations Center (GOC) detects the failure, and uses Terminal Server to remotely access the Southeast Asia warm standby server. The GOC also puts up a special page on the site indicating that the Tokyo search database is temporarily offline, and will be up shortly. Because a proper disaster recovery plan is in place, a job to bring the database online, with all of its proper steps, is started. The data source name (DSN) on all Web servers for the Tokyo database is redirected to the warm standby. The temporary Web page is removed.

The user: 

Back in San Francisco, John goes to the site at the moment the earthquake happens. He looks for local jobs, but he is also considering a radical change, so he decides to search for all QA jobs in Tokyo. Unfortunately, he receives the temporary offline notice on the site. Since this page notes that the database will be searchable again soon, he decides to surf the Web for a few minutes, and then return to the site. He searches again for jobs in Tokyo, and succeeds.

Failing Back

After 48 hours, all power is restored to the Tokyo data center. The health of the servers is deemed to be fine. The GOC puts the temporary Web page up again, and sets the database to single user mode to ensure that no errant connections can be made. A full database backup is made, copied to the Tokyo virtual server, and restored. The backup is also restored on the warm standby server, where the database is left in standby mode, and log shipping is re-enabled. The GOC tests functionality to ensure all is well, redirects all Web servers to the primary database, turns off single user mode, and removes the temporary Web page.

E-Commerce Web Site

Background

A successful manufacturer of widgets has a Web site selling its products. The database grows rapidly each month, and currently consumes 15 GB of disk space. Microsoft Commerce Server 2000 handles the commerce piece. It accesses not only the shopping database, but also a series of LDAP membership databases spread out over three dedicated servers. A custom application sits on top of Commerce Server, and uses Component Services to queue transactions to be entered into the shopper database. The application also accesses three catalog servers, which are fed by a master catalog database. Daily changes, and not a whole copy of the catalog, are sent using transactional replication to all three catalog servers.

The following also are important features of the environment at the company:

  • Catalog servers. 

    These are protected with log shipping to another SQL Server daily at 1 AM, after changes are pushed to them. 

  • Main shopping database, which includes all transactions. 

    This is set up as a two node single instance failover cluster, and is also log shipped to another server every 10 minutes. 

  • LDAP Membership Databases. 

    This is set up as a two node single instance failover cluster, and is also log shipped to another server every 10 minutes. 

The entire environment is mirrored with third-party hardware to a copy of the data center in another state, in case of a failure. Transaction logs are also shipped to these servers at the same time as they are shipped to the other warm standby servers, which allows the databases to be only a small delta of time out of sync with the main environment. No external connections can access these servers unless a DNS change is made. The catalogs are fronted by Network Load Balancing, and the Web servers also employ Network Load Balancing.

Because this is a site that is commerce based, every second that the site is down means lost revenue. Therefore, the site needs airtight backup plans, as well as redundancy from a data standpoint.

Technologies Used
  • Windows NT Server 4.0, Enterprise Edition (for Site Server) 

  • Windows 2000 Advanced Server (for SQL Server) 

  • SQL Server 2000 Enterprise Edition 

  • Microsoft Commerce Server 2000 

  • Internet Information Services 

  • Network Load Balancing 

  • Component Services 

  • Log shipping 

  • Failover clustering 

  • Transactional replication 

  • Hardware server mirroring and disk mirroring 

Scenario 1

The company: 

Monday is a busy day for the company, especially from 11 AM to 4 PM. At 1 PM, the operations center notices ASP queuing starting to happen, and that no connections seem to be getting to the shopping database. Because it is protected with failover clustering, within a short time, the database comes back up and the ASP queuing lessens. An engineer starts to investigate why the primary node went down. Around 3 PM, the shopping database starts to exhibit dropped connections and fails before a final transaction log backup could be made. The company puts up a "store closed" page. Since the primary node is already unavailable, the company follows the instructions found in its run book for bringing the log shipped warm standby server online. After the warm standby is brought online and tested, the "store closed" page is removed.

The user: 

Mary goes to the site at 1:03 PM and finds the site extremely slow. She decides to shop elsewhere. Even if a failover happens cleanly, there is no guarantee of how a user will respond to site conditions.

At 3:05 PM, Ed needs to buy a widget. He goes to the site, only to find a store closed page saying the site will be back up soon. The page indicates that customers can use a special code to get 20 percent off their next order when they come back. Ed goes back to work, and before he leaves his desk at 5:30, he returns to the company Web site.

Failing Back Number 1

The problem on the clustered nodes is traced back to a faulty power circuit. It is repaired, and the servers are rebuilt from scratch the next day to ensure that the software installations were not damaged and will function properly, since they were not shut down cleanly and are well out of sync. Friday night during off hours is the scheduled time to bring the cluster back into service. At that time, the "store closed" page is put up and all site traffic to the back end is stopped. The current production shopping database—the log shipping server—is backed up. Then the files are copied to the cluster and restored to the virtual server. The backup is also restored to the log shipping server to put it into standby mode. Log shipping is enabled on the primary server, the site is tested, the "store closed" page is removed, and the store is back online again.
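The failback described above boils down to a full backup of the current production copy, a restore on the clustered virtual server, and a second restore in standby mode on the log shipping secondary. A minimal Transact-SQL sketch, with hypothetical database, path, and file names:

-- On the log shipping server currently acting as production:
BACKUP DATABASE Shopping TO DISK = N'D:\Failback\Shopping_full.bak'

-- On the clustered virtual server, after the backup file has been copied over:
RESTORE DATABASE Shopping
    FROM DISK = N'E:\Failback\Shopping_full.bak'
    WITH REPLACE, RECOVERY

-- Back on the log shipping secondary, reload the same backup in standby mode
-- so log shipping can resume from the cluster:
RESTORE DATABASE Shopping
    FROM DISK = N'D:\Failback\Shopping_full.bak'
    WITH REPLACE, STANDBY = N'D:\Failback\Shopping_undo.dat'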

Scenario 2

The company: 

At 12:03 AM on Tuesday, the main data center at the company loses all power. A call is made, and it is determined that the outage will last a while. Because of the hardware mirroring and the log shipping of the main databases every 10 minutes, the backup environment is only three minutes out. The log shipped servers are brought online, and the catalog is pushed out from corporate headquarters to the catalog servers using a DTS package created specifically for this purpose. Then the site is tested, and the DNS change is made to bring the backup environment completely live.

The user: 

Susan cannot sleep, so she decides to browse the Web. She attempts to go to the company Web site at 12:10 AM, only to get a "page not found" error. She finds this odd, since she had browsed the site earlier in the evening, and decides to go back to bed. When she gets up in the morning, she checks the site for the widget she wanted more information on, and finds the site up.

Failing Back Number 2

In this case, a failback to the original environment would happen only if the backup environment itself failed. Failing back a single server would be easy, but moving a whole environment is difficult and time consuming. Especially when every second of downtime means lost revenue, taking the site down to switch back while the alternate environment is working does not make sense.

Partitioned Database

Background

The Human Resources department of your company uses an in-house designed application that accesses SQL Server 2000. The SQL Server 2000 database is partitioned over three instances using distributed partitioned views (DPVs). These three instances are on a four-node Datacenter cluster, with the fourth node being the passive failover node. This is the N+1 scenario described in Chapter 12, "Failover Clustering." Each instance is log shipped every 15 minutes to a single standalone server that has enough capacity to run all three databases, but the transaction logs are not applied because the databases may need to be restored to a single point in time. The transaction log in each database is also marked every three minutes, using a known naming scheme, so that a restore can bring all three databases back to the same consistent point.
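The marking job is not shown in the chapter, but in SQL Server 2000 a log mark is created with BEGIN TRANSACTION ... WITH MARK, and the mark is recorded in the log of every database the transaction modifies. A minimal sketch, assuming hypothetical linked servers HR1, HR2, and HR3, hypothetical member databases HRPart1 through HRPart3, and a small utility table dbo.SyncMarker in each; running the updates as one distributed transaction places the same mark in all three logs:

-- Required for data modifications against linked servers in a distributed transaction.
SET XACT_ABORT ON

-- The transaction name (HRSyncMark) is the name used later with STOPATMARK.
BEGIN TRANSACTION HRSyncMark WITH MARK 'HR partition sync point'
    UPDATE HR1.HRPart1.dbo.SyncMarker SET LastMarked = GETDATE()
    UPDATE HR2.HRPart2.dbo.SyncMarker SET LastMarked = GETDATE()
    UPDATE HR3.HRPart3.dbo.SyncMarker SET LastMarked = GETDATE()
COMMIT TRANSACTION HRSyncMark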

Technologies Used
  • SQL Server 2000 Enterprise Edition 

  • Windows 2000 Datacenter Server 

  • Marked transaction logs (for more information, see "Backup and Recovery of Related Databases" and "Recovering to a Named Transaction" in SQL Server Books Online, as well as Chapter 38, "Scaling Out on SQL Server.") 

  • Distributed partitioned views (federated servers) 

  • Failover clustering 

Scenario

The company: 

An air conditioning repairman accidentally pulls the cluster's power plug while walking through the wiring in the data center at 9:26 AM. The data center is three states away from the main operations center. Because the backup scheme for the DPVs is documented in the run book, the operations team restores the databases on the warm standby to the last (and same) distributed mark, redirects any ODBC data sources, and brings the databases online.
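Restoring to the last distributed mark uses the STOPATMARK option of RESTORE LOG. A minimal sketch for one member database, with hypothetical database, file, and mark names; the same sequence, stopping at the same mark, is repeated for the other two members:

-- Restore the most recent full backup without recovering the database.
RESTORE DATABASE HRPart1
    FROM DISK = N'D:\LogShip\HRPart1_full.bak'
    WITH NORECOVERY

-- Apply the copied transaction log backups, stopping at the shared mark on
-- the final restore so all three members recover to the same point.
RESTORE LOG HRPart1
    FROM DISK = N'D:\LogShip\HRPart1_0915.trn'
    WITH STOPATMARK = 'HRSyncMark', RECOVERY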

The user: 

Moments after the failure is detected, Janet receives an e-mail from the corporate IT staff saying that the HR system is temporarily down. Fifteen minutes later she receives an e-mail saying the system is back up.

Failing Back

The power cord problem is corrected. Because the HR department is essentially a 9-to-5 shop, users are notified that the system will be shut down for maintenance that evening. At midnight, the operations staff puts the databases into single-user mode to ensure that no errant connections can be made, backs up the three databases, and copies and restores the backups on their respective virtual servers. The databases are also restored on the log shipped server in standby mode. Log shipping is enabled, the ODBC data sources are redirected back, and the application is tested to ensure that it functions properly.
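The single-user step could be handled with ALTER DATABASE in SQL Server 2000; a minimal sketch with the hypothetical member database names used above (the subsequent backups and restores follow the same BACKUP DATABASE and RESTORE DATABASE pattern shown earlier):

-- Restrict each member database to one connection and roll back anything
-- still running before the failback backups are taken.
ALTER DATABASE HRPart1 SET SINGLE_USER WITH ROLLBACK IMMEDIATE
ALTER DATABASE HRPart2 SET SINGLE_USER WITH ROLLBACK IMMEDIATE
ALTER DATABASE HRPart3 SET SINGLE_USER WITH ROLLBACK IMMEDIATE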

Small Company

Background

A small accounting services business manages the books of 25 local businesses. Joe, the owner of the accounting business, uses a custom application. The data Joe works with is his business, but it is not highly volatile; it changes only in small amounts each day. If a particular database has a large number of changes, he backs it up periodically during the day and copies it to the backup server. He manages his own servers, since he is the owner and sole employee. He has two servers running Windows 2000 Server: one primary and one backup. Both contain SQL Server 2000 Standard Edition instances, because Joe could not afford to purchase SQL Server 2000 Enterprise Edition. Joe has a scheduled job that backs up the databases on the primary server every night to both disk and tape, copies the backup files to the secondary server, and automatically restores them there. An external UPS protects each server. He also keeps a set of saved queries on each server so that, in the event of a failure, he can pull out the records entered each day in each database.
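Joe's nightly job is not spelled out in the chapter, but it could look something like the following Transact-SQL sketch, with hypothetical database, tape, and file names; one such set of statements would run for each client database:

-- On the primary: full backups to disk and to the local tape drive.
BACKUP DATABASE ClientBooks TO DISK = N'D:\Backups\ClientBooks_full.bak'
BACKUP DATABASE ClientBooks TO TAPE = N'\\.\Tape0'

-- On the secondary, after the disk backup file has been copied over:
-- restore and recover so the copy is immediately usable if the primary fails.
RESTORE DATABASE ClientBooks
    FROM DISK = N'D:\Backups\ClientBooks_full.bak'
    WITH REPLACE, RECOVERY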

Technologies Used
  • SQL Server 2000 Standard Edition 

  • Windows 2000 Server 

Scenario

The user and company: 

Joe comes in one morning to find that his primary server will not boot up. In a panic, he sits down at the secondary server to see whether the previous night's backup completed and was copied and restored there. Unfortunately, the primary server went down before it could be backed up, so the secondary is 24 hours out of sync. He goes through the business day doing his work on the secondary, deciding to deal with the failure in off hours.

Failing Back

After the office closes, he attempts to restore the primary server. He discovers that the power supply inside the system failed. He replaces it and boots up the server. He runs DBCC statements and other diagnostics to see whether SQL Server has been damaged; it checks out as healthy. He then runs the saved queries to pull the records newer than the last good backup out of the primary's databases, based on the update_date column in each table, and inserts them into temporary tables. He then backs up all databases on the secondary server and restores them on the primary. Finally, he reinserts all the records from the temporary tables, runs queries to make sure the records were inserted with no problems, and drops the temporary tables.
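A single table's worth of that recovery could look something like the following sketch; the table name, the cutoff time, and the assumption that the table has no IDENTITY column are all hypothetical:

-- On the repaired primary: capture the rows entered after the last backup
-- that made it to the secondary (the 24-hour delta).
SELECT *
INTO   #NewInvoices
FROM   dbo.Invoices
WHERE  update_date > '20010611 18:00'

-- After the secondary's backup has been restored over this database,
-- copy the captured rows back in and verify the count before cleaning up.
INSERT INTO dbo.Invoices
SELECT * FROM #NewInvoices

SELECT COUNT(*) AS ReinsertedRows FROM #NewInvoices

DROP TABLE #NewInvoices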

Conclusion

A production environment is the lifeblood of any establishment that uses computers to power its day-to-day activities. Because the systems are vital, ensuring that they are up and running is paramount. This means that proper considerations must be taken into account to create a highly available SQL Server in conjunction with all other mission-critical systems. Attaining high availability does not happen merely by adding redundant hardware into the server mix. Hardware is only a part of the solution; it must be used in conjunction with other high availability concepts such as proper application design and disaster recovery planning. It is important to strive for the highest amount of uptime possible. Use this chapter as a guideline when designing highly available SQL Servers and the plans to support them.
