High Availability and Disaster Recovery at ServiceU: A SQL Server 2008 Technical Case Study
SQL Server Technical Article
Writer: David P. Smith (ServiceU Corporation)
Contributors: Ron Talmage (Solid Quality Mentors); Sanjay Mishra, Prem Mehra
Technical Reviewers: James Podgorski, Mike Weiner, Peter Carlin
Published: August 2009
Applies to: SQL Server 2008
Summary: ServiceU Corporation is a leading provider of event management software. An essential part of our IT mission is maintaining high availability and disaster recovery capability. This technical case study shows how we use Windows failover clustering and SQL Server 2008 database mirroring to eliminate single points of failure in our data centers and enable fast recovery from a possible disaster at our primary data center. These strategies and solutions will be of interest to database administrators, senior IT managers, project leads, and architects.
ServiceU Corporation, based in Memphis, Tennessee, is a leading provider of online and on-demand event management software. Our software services are used by churches, schools, universities, theaters, and businesses to manage events such as concerts and conferences as well as online payments. We have customers in all 50 states of the United States and in 15 countries worldwide.
Our software services are built and deployed using the Microsoft® Application Platform, including the Microsoft .NET connection software, the Microsoft SQL Server® 2008 database software, and the Windows Server® 2008 operating system. The Microsoft Application Platform helps us provide a seamless user experience and maximum availability of our applications to users. The applications use both the Software as a Service (SaaS) model and the Microsoft Software + Services architecture.
From a security standpoint, we maintain Payment Card Industry (PCI) Level 1 compliance to protect credit card holder and Automated Clearing House (ACH) information. (Details of our PCI Compliance are not covered in this case study.)
Achieving maximum availability and near immediate recovery from a disaster is essential for maintaining our revenue stream. We have worked hard to eliminate all single points of failure in our architecture, and we have developed procedures for patching servers, upgrading software, and implementing application changes that preserve high availability. Based on these efforts, we have achieved 99.99 percent uptime, including both planned and unplanned downtime.
This case study examines the decisions that we made and the procedures we employed to maintain maximum availability with minimal downtime. This information will be of interest to senior IT managers, project leads, architects, and database administrators (DBAs).
The ServiceU Application Architecture
A logical view of our application architecture is shown in Figure 1.
Figure 1: A logical view of the ServiceU application architecture showing the application tier layers
Note the following about our architecture:
- Our customers can access our application directly through their browsers, their own Web servers, and from their own e-commerce servers.
- All customer activity is processed through our Web farm that holds the middle-tier layer.
- The end-user application is built with Microsoft technologies through a series of layers, all of which eventually go through the Data Access Layer to contact the application databases.
- The data layer consists of SQL Server 2008 databases.
In order to maintain Level 1 PCI Compliance, rigorous security measures are enforced to protect user cardholder data. In addition, we maintain service level agreements (SLAs) with customers that specify required levels of performance and availability of the application.
Our revenue stream is based on customer activity. Consequently, it is vitally important that our application maintain maximum uptime and availability.
We keep the following general goals in mind for our availability solutions:
- Ensure that all PCI Compliance security measures are applied throughout the network: If a standby data center is used for disaster recovery, it must also be PCI compliant.
- Eliminate all single points of failure: from the Internet presence to the data center, including network, Web and database servers, and data storage.
To help achieve our uptime goals and meet desired service level agreements (SLAs), we created specific guidelines for allowable data loss and service downtime. These objectives were defined by recovery point objectives (RPOs) and recovery time objectives (RTOs) as discussed in the following list:
- Unplanned downtime:
- Loss of a database server:
- RPO = 0; that is, no data loss
- RTO = 60 seconds maximum
- Loss of the primary data center, or the entire
database storage unit in the primary data center:
- RPO = 3 minutes maximum; lost data may be recovered if the primary data center can be made available
- RTO = 15 minutes total, including evaluation of the issue; 5 minutes maximum for making the necessary changes to bring the standby data center online
- Loss of a database server:
- Planned downtime:
- RPO = 0 (no data loss)
- RTO = 60 seconds maximum; some database changes may require a longer downtime than 60 seconds; in those cases every effort is made to minimize the service interruption
To implement high availability within the data center, we decided to implement Windows® failover clustering and with storage area network (SAN) database storage:
- Windows failover clustering provides database server redundancy within a data center.
- Each failover cluster has fully-redundant SAN storage for data protection within a data center.
- We use three nodes in each cluster to preserve high availability during patches and upgrades.
Figure 2 shows our high availability architecture.
Figure 2: ServiceU uses a three-node Windows failover cluster with one clustered SQL Server instance
The next two sections describe the redundant server and storage strategies illustrated in Figure 2.
Database Server Protection: Failover Clustering
For database high availability, we have chosen to use a three-node Windows failover cluster with one SQL Server 2008 failover cluster instance:
- If any single cluster node fails, there are always two remaining nodes available in the cluster, so the SQL Server instance is still protected by a redundant node on the cluster.
- An upgrade or patch can be applied to one node at a time (commonly called a rolling upgrade or update), leaving two nodes always available. This preserves high availability during scheduled maintenance.
- We can replace an existing node with a newer node and have two nodes available for failover during the replacement.
Each node of the cluster has two host-bus adapters (HBAs) connecting it to the Fibre Channel switches, each following a redundant path to the data. The cluster heartbeat network uses a private network, not a crossover cable.
Before upgrading to SQL Server 2008, we used three-node Windows Server 2003 failover clusters with a single SQL Server 2005 instance. When we upgraded to SQL Server 2008, we also upgraded to Windows Server 2008. To continue using a three-node failover cluster on Windows Server 2008, we implemented Windows Server 2008 failover clustering using the No Majority: Disk Only mode.
With an odd number of nodes in a Windows Server 2008 failover cluster, Microsoft usually recommends using a Node Majority quorum mode. However, for us, the Node Majority quorum mode represents a relative loss of availability when only one node is operational:
- In a Node Majority quorum mode, each node that is available and in communication can vote to determine whether the cluster can continue running.
- For a three-node cluster, Node Majority mode will keep its quorum if one node is not available (two votes being a majority of three nodes).
- However, if two nodes are not available, the cluster will lose its quorum because the one node remaining has only one vote. This means that the entire cluster will be offline.
- We require that even if only one node of the cluster is available, the cluster must remain online.
As a result, we chose to configure our three-node clusters using the No Majority: Disk Only quorum mode. In this mode, a cluster will still run with only one node available. This is equivalent to the quorum disk model of failover clustering on previous versions of Windows Server.
To protect the quorum disk, we place it on the SAN with its own logical unit number (LUN). The database server connects to a SAN with fully redundant hardware by using multiple redundant paths. On the SAN, the quorum disk's LUN volume is mirrored using RAID. Because we have protected the quorum disk, we ignore the warning Windows Server 2008 gives in the Failover Cluster Management utility, stating that the Quorum Disk Only option may be a single point of failure.
We made the following decisions when building our clusters:
- On each three-node cluster, two nodes are designated as preferred owners and have identical memory and CPU configuration. The third node, used primarily as a backup during patches and upgrades, has less memory. We implement a startup stored procedure to set the SQL Server memory based on detection of the active node.
- All resources are set to fail over if a restart is unsuccessful.
- Failover between cluster nodes is automatic, but the cluster is set to , which prevents failback to the preferred node. We will fail back to the preferred node manually when convenient.
Database Storage Protection: SAN and RAID
To eliminate single points of failure for database storage, we use a Fibre Channel SAN. (For an illustration of how the databases are connected to the SAN, see Figure 2.) The SAN configuration has numerous redundant features to prevent a single point of failure, from the servers down to disk storage:
- All SQL Server database servers have two HBAs; each is connected to a different fiber switch.
- Each fiber switch then has a path to each loop of the SAN.
- Two fiber loops connect the disk enclosures.
- Each SAN has multiple storage processors to process the data; these are balanced so that each of the SAN's storage processors has approximately 50 percent of the load.
- The SAN backplane is fully redundant.
- Caching is implemented on the SAN to improve throughput, and the cache is fully mirrored.
- The SANs have additional batteries to back up power supplies. If a battery is not fully charged or if the SAN begins to run on battery power, the SAN cache is flushed to disk and disabled until both batteries are fully charged.
- Most LUNs are configured with RAID 10 and enough spindles are allocated for more than the maximum current throughput.
Web Server Protection: NLB Clustering
To prevent single points of failure within Web servers, we place all Web servers in a load-balanced Web farm using Microsoft Network Load Balancing (NLB):
- The web server session state is stored in a database to prevent issues with load balancing.
- NLB affinity is set to None, allowing each request to go back to a different Web server.
- Web content is replicated between servers and sites using Windows Distributed File System Replication (DFSR).
- Each Web server has “Web gardens”, that is, multiple processes and application pools that isolate code for reliability.
- We have converted most Web servers to virtual servers using the Hyper-V™ technology.
- We have found that for configuring virtual servers, performance is better when the virtual guest machine VHD files reside on a physical disk volume that is separate from those used for the host operating system. This observation has led us to host our Web server VHDs on a local mirrored RAID disk volume that is separate from the operating system disk volume.
In our Web farm, several servers can fail or be removed with little or no impact. As a result, upgrades can be applied to one Web server at a time:
- The Web server is removed from the Web farm.
- Code is applied.
- Testing is performed on the Web server.
- The server is placed back in the Web farm.
To protect against the potential loss of a primary data center, we located a second standby data center in a different geographical location. The standby data center serves as the disaster recovery (DR) site should a natural disaster or other disruptive event result in the primary data center becoming nonoperational. The standby data center is used only in the case of emergencies, when the primary data center is unavailable. When the primary data center becomes available again, we reestablish it as the primary data center and the standby data center takes on its role of protecting the primary data center.
Data from the primary data center is sent to the standby data center in near real-time, and the standby data center is a functional duplicate of the primary data center's hardware, software, and infrastructure. In the event of the loss of the primary data center, the standby data center can be brought online almost immediately, with minimal disruption of customer activity. The following sections detail our disaster recovery strategies.
Data Center Selection
We required that the primary and standby data center sites be located sufficiently far apart from each other that a natural disaster affecting one would not likely affect the other. We also required an Internet Service Provider (ISP) for our data centers that could provide direct connectivity between the qualified cities, and could provide minimal network latency and high availability of each data center to the Internet. Finally, we wanted the primary and standby data center cities to be located near each other and to the corporate headquarters, to limit air fare costs and flight time.
Using these guidelines, we chose to locate our primary data center near our headquarters in Memphis, Tennessee, and selected Atlanta, Georgia, for our standby data center. We chose an ISP that provided direct connectivity to the cities of Memphis and Atlanta and could deliver the high throughput and low latency required to replicate or mirror data between the sites:
- Both data centers are situated at the junction of a dual fiber SONET ring. The ring is autorouting, providing protection against connectivity failure: Multiple cuts in the fiber will not affect Internet service. We have not encountered an unplanned loss of connectivity in nine years.
- The ISP has a Point of Presence (POP) in each city.
- Network latency between the two cities is low (10 milliseconds or less), which is a direct result of the fact that the ISP has a POP in both cities.
- The ISP provides both sites with a 30-megabits-per-second (mbps) connection to the Internet.
PCI Level 1 compliance requires that a disaster recovery site be PCI compliant before it can be used as a failover site. Because we have PCI Compliance as a company goal, we ensure that the standby data center meets the same PCI requirements as the primary data center.
Data Center Infrastructure
Each data center has a distinct production network. Both sites are independent of each other with a fully functional infrastructure, from the firewalls through the Web servers and database servers. A high-level view of the data center infrastructure is shown in Figure 3.
Figure 3: Each data center contains redundant hardware and multiple connectivity paths
We use redundant hardware and multiple connectivity paths to eliminate single points of failure within each data center. Figure 3 shows some, but not all, of the efforts that we have made to remove single points of failure:
- Active/Passive firewalls exist between each network.
- An NLB cluster balances incoming traffic across the Web farm.
- Multiple Web servers host the application code.
- At least two Domain Name System (DNS) servers exist at each data center. Each Web server uses a DNS alias for the server name when connecting to the SQL Server 2008 instance.
- The SQL Server 2008 instances are clustered using Windows Server 2008 failover clustering. If one of the nodes fails, the SQL Server 2008 instance will fail over to another node in the cluster.
- Database data and the Windows failover cluster quorum disk resource are stored on a SAN. The failover cluster has duplicate paths to the SAN, and the SAN LUNs are provisioned using RAID striping and mirroring.
- The database servers and SAN have redundant power supplies with uninterruptible power supplies (UPSs).
- Each data center has its own local power generator to protect against temporary loss of power from the electrical grid.
For more information about the data centers, see Appendix A, "Data Center Infrastructure Details".
Disaster Recovery Solution: Database Mirroring
In our environment, database server disaster recovery is accomplished by means of SQL Server 2008 database mirroring between the primary and standby data centers. (SQL Server instance high availability is accomplished by using Windows Server failover clustering on each database server, and was covered in the previous section.) Figure 4 shows a logical view of the database mirroring between the data centers.
Figure 4: ServiceU implements disaster recovery between data centers by using asynchronous database mirroring
Database mirroring ensures a near real-time copy of all mission critical database data at the standby data center:
- The principal databases are located in the primary data center in Memphis, and the mirror databases are in the standby data center in Atlanta.
- Low latency between the data centers benefits database
mirroring in two major ways:
- The principal server can send a large volume of transaction log records to the mirror quickly.
- The time required to ship transaction logs to the mirror server is decreased when initializing database mirroring.
We chose asynchronous database mirroring because we do not want bandwidth-related delays of synchronous database mirroring to affect application performance. In asynchronous database mirroring, the principal server does not wait for acknowledgement from the mirror server before committing transactions. Therefore delays in sending transaction log records to the mirror databases will not affect the completion of user transactions.
Because we have chosen asynchronous database mirroring, in the event of the loss of the primary data center, some unsent transactions may not be present on the mirror database. If that happens, we will retrieve unsent data from the old primary data center at Memphis if the data can be recovered from it when the Memphis data center databases come back online. The previously unsent data can be loaded into the standby (Atlanta) data center databases from the Memphis databases without any primary key conflicts. This is possible because care was taken in assigning the keys in Atlanta during the period it assumed the production server role. After the failover, when the Atlanta data center assumes the production role, we use a script to skip a generous range of keys that may have been used by the transactions whose log records were unsent from the primary Memphis data center at the time of the failure (for more information, see the section "Identity Column Increment" later in this paper). A gap in the keys' sequencing will exist between the last used at Memphis and the new ones assigned at Atlanta, but this is acceptable to the application. If the databases at the primary data center cannot be recovered, the unsent data will be lost.
There are nearly 30 databases on the main database instance, some of them interrelated, and all are mirrored to the standby data center. We base our database mirroring configuration on extensive testing before deployment.
Configuring the Mirror Server
For logins and SQL Server Agent jobs, we use scripts to keep the mirror SQL Server 2008 instance as current as possible with the principal instance:
- New logins are created at both data centers using scripts.
- SQL Server Agent jobs are scripted and applied to both the principal and mirror servers.
We keep all SQL Server Agent jobs active on both the principal and mirror server. Most job steps (all those that apply to databases) begin with the command:
The user-defined function cf_IsMirrorInstance() accesses the sys.database_mirroring Dynamic Management View (DMV) and returns a value of 1 when executed on the mirror instance. As a result, the SQL Server Agent jobs on the mirror instance that reference mirrored databases can remain active. They will succeed but not do anything while their server remains the mirrored server. (See Appendix D "Scripts", for the cf_IsMirrorInstance() source code.)
New or changed database permissions are also scripted, and scripts are kept in a secured location. If a failover to the standby data center occurs, these scripts are run on the standby data center's SQL Server instance after the databases have been recovered.
Using Log Shipping to Help Set Up Database Mirroring
Our database administrators have found that using SQL Server log shipping to assist in setting up database mirroring enables them to deploy database mirroring at their convenience. We use log shipping to automatically transfer the required databases and transaction logs. After log shipping is running, database mirroring can be initialized at a later time. DBAs do not have to initialize database mirroring immediately after transferring and applying the database and transaction log backups to the standby data center.
We use log shipping to help set up mirroring in several contexts, including setting up mirroring from the primary to the standby data center, as well as within the primary data center when upgrading SQL Server. The following steps show how we use log shipping to help set up database mirroring from the primary to the standby data center:
- Disable the database transaction log backup jobs on the primary data center SQL Server 2008 instance. We back updatabase transaction logs hourly using a SQL Server Agent job. If an hourly transaction log backup job were to run at the same time as the log shipping transaction log backup job, the two backups would cause a break in log shipping's transaction log chain and log shipping would fail. Therefore we disable the hourly maintenance SQL Agent backup job. We schedule the log shipping transaction log backup jobs with a frequency greater than once per hour, usually the default of every 15 minutes, ensuring that the log shipping transaction log backup job provides at least the same or better level of protection as the hourly transaction log backup jobs.
- Set up log shipping for each database at the primary data center to the standby data center. Each database at the primary data center will be a log shipping primary, and with a corresponding log shipping secondary database at the standby data center. When setting up log shipping between SQL Server 2008 instances, we use backup compression. Database and log backups and restores are faster when compressed, and because the resulting backup files are smaller, copying them from the log shipping primary to the log shipping secondary takes less time.
- After log shipping job is restoring transaction log backups on a regular basis, we wait for an appropriate low-traffic time to switch to database mirroring. This helps reduce any delays that database activity might cause to database mirroring initialization.
- Disable the log shipping transaction log backup job on the primary data center's SQL Server 2008 instance (the log shipping primary: LS_Backup job.)
- At the standby data center, run the LS_Copy and LS_Restore jobs until all transaction log backups from the log shipping primary databases have been copied and restored.
- Disable the log shipping jobs on the standby data center's SQL Server 2008 instance.
- Set up database mirroring for each database from the primary to the standby data center. Because the databases have been kept current as of the last transaction log backup by log shipping, this process simply involves stepping through the Database Mirroring wizard (or running a script).
- Remove log shipping for each database.
- Enable the hourly transaction log backup jobs on the principal server.
For more information about using database mirroring with log shipping, see the Microsoft white paper Database Mirroring and Log Shipping Working Together (http://sqlcat.com/whitepapers/archive/2008/01/21/database-mirroring-and-log-shipping-working-together.aspx).
Failing Over to the Standby Data Center
Failover to the standby data center addresses unplanned downtime in the event of the complete loss of the primary data center. The standby data center is normally unused; no traffic is directed to it except when the primary data center is unavailable. We monitor the standby data center continuously. Once per year, a more labor-intensive and thorough failover test is performed. This controlled failover is limited to one or two databases that do not have dependencies on other databases.
We use a Domain Name System (DNS)-based approach to direct customers to the appropriate data center. If the primary data center is lost, this DNS-based solution directs our customers to the standby data center.
Within each data center, we use aliases and internal DNS servers to direct Web servers' database connections to the appropriate SQL Server instance:
- Each data center has at least two DNS servers.
- The Web servers connect to the correct clustered SQL Server 2008 instance using a combination of a DNS alias (a CNAME record) and instance name, rather than the actual server name and instance name.
When the primary data center is active, DNS servers at both data centers direct the Web servers to connect to the primary data center's SQL Server 2008 instance.
We automatically and continuously monitor the application servers at the standby data center with functional tests and security scans:
- The application/Web servers at the standby data center are continuously tested to assure that the Web servers are functioning correctly and are ready to be used.
- We also run vulnerability scans for PCI Compliance at the standby data center to ensure it is always PCI compliant.
In the event of a disaster at the primary data center, we will take the following steps to redirect Web servers to the correct SQL Server database server. The following list includes only the major steps and is not exhaustive:
- The primary data center is confirmed to be down and the decision is made to fail over to the standby data center.
- We use our DNS-based solution to direct customers to the appropriate IP address of the standby data center.
- We then use scripts to perform the following
- Recover the standby SQL Server 2008 databases.
- Apply correct user permissions to the SQL Server 2008 databases.
- Increment the identity columns at the standby data center's SQL Server database tables (see "Identity Column Increment" later in this paper.)
- Change the standby data center's internal DNS alias to the standby data center's database server name.
Bringing the Primary Data Center Back Online Following a Disaster
As noted previously, when the primary data center initially becomes available following a disaster, we will first find and retrieve any unsent data in the primary data center's database tables, and then load that into the appropriate databases at the standby data center. Because the identity columns of the tables at the standby data center will have used values from a higher range, the unsent primary data center data can be loaded directly, keeping all identity values intact. If the primary data center's data cannot be accessed because of damage due to disaster, that unsent data will be lost.
The standby data center is not meant as a permanent substitute for the primary data center, but as a backup in case of emergencies. When the primary data center comes back online, we will perform the previous steps to reverse the roles so that the Memphis data center becomes the primary data center, and then we will reestablish the Atlanta data center to the role of standby.
We will use log shipping from the standby data center to the primary data center in order to prepare the databases for database mirroring. After database mirroring is operational from the standby data center to the primary, during a low-traffic period the direction of mirroring will be reversed, and all DNS aliases adjusted so that the primary data center again assumes its original role.
We have implemented appropriate monitoring to send alerts quickly when potential problems are detected.
Network and Web Servers
We continuously monitor system performance, with most concern devoted to performance during peak periods. Such traffic patterns occur during a five-to-seven-hour window. Rather than judge average performance over 24 hours, we compute averages based on values from the peak periods, and focus on the short bursts of peak activity (3-10 minutes) during that window. Whenever a server begins to run at 60 percent of load capacity during peak periods, we begin the process of planning the upgrade of that equipment.
We monitor the health of each data center from three locations: the primary data center, standby data center, and corporate offices, receiving alerts via e-mail and/or text messages.
General server monitoring includes:
- General server availability: whether the server is running or not.
- Cluster service per node and for the active node for SQL Server services.
- The ability to connect remotely to the event log of each server.
- Excess server memory usage.
- CPU usage above a certain threshold for all processors.
- Hard disk space usage on all local and SAN drives.
- Active Directory® servers via a simple Lightweight Directory Access Protocol (LDAP) query.
Web servers and services monitoring includes:
- Microsoft Internet Information Services (IIS) Admin, WWW Publishing, custom encryption application service, and specific Web sites.
- Monitoring Services (additional services that check on the health of Web sites).
- Response time and content of Web pages.
Monitoring Database Mirroring
We monitor database mirroring by using the SQL Server 2008 Database Mirroring Monitor Job. We use scripts to set up monitoring for a selected pair of counters, one each at the principal and mirror instances. For all mirrored databases, we track the following two counters and set thresholds for their alerts. The resulting SQL Server Agent jobs run once per minute:
- Age of oldest unsent transaction (set up at the
principal server): reports the age in minutes of the oldest unsent transaction
in the send queue at the principal server. We set an alert at three minutes for
each mirrored database.
- We run an initial script on the principal that looks for any databases participating in database mirroring, andsets a baseline value for the "Age of oldest unsent transaction" for each database. All mirrored databases initially get the same setting.
- We run a second script also on the principal server that adjusts the threshold value for any databases which may need a different value. We may assign differing threshold values for databases based on varying patterns of update activity.
- We use this counter to monitor potential data loss in the event of an unplanned loss of the primary data center. The "Age of oldest unsent transaction" counter helps us ensure that it stays within its recovery point objective (RPO) of three minutes (see "Availability Goals" earlier in this paper.)
- Unrestored log threshold (set up at the mirror
server): helps estimate how long it would take the mirror server to roll
forward the log records remaining in its redo queue. We send an alert if the
redo queue exceeds a certain threshold, usually between 250 kilobytes (KB) and
500 KB. The actual value may change for each database depending upon the
database's workload and behavior patterns.
- We run an initial script on the mirror that looks for any databases participating in database mirroring. It sets a baseline log threshold value for each database. Each mirrored database gets the same initial setting.
- We run a second script on the mirror server that adjusts the threshold value for any databases that may need a different value due to differing patterns of update activity.
Because we use asynchronous mirroring, we do not monitor the "Mirror commit overhead" counter. In addition to the database mirroring monitoring counters, we also monitor the space used and available free space for log and data volumes at each server.
We use Windows Management Instrumentation (WMI) Alerts to monitor lock escalation and deadlocks. To minimize lock escalation issues that occur during reporting, we are currently testing Read Committed Snapshot Isolation (RCSI).
Unplanned Downtime Scenarios
In the past, we have encountered a few events causing unplanned downtime, leading us to adopt the following strategies.
We have observed that under certain conditions, database mirroring sessions may enter a suspended state. When a database mirroring session is suspended, the principal database's transaction log records cannot be sent to the mirror. Because transaction log backups on the principal will no longer truncate the transaction logs if transaction log records cannot be sent to the mirror, the log files grow. We have established SQL Server Agent jobs that monitor and alert if any database participating in mirroring is in a state other than SYNCHRONIZING or SYNCHRONIZED (see the query in Appendix D.) After the underlying issue has been addressed, the mirroring session can easily be resumed using a script or SQL Server Management Studio.
Identity Column Increment
We use asynchronous database mirroring. In the case of a disaster involving the loss of connectivity to the primary data center, some log records from the principal databases may be prevented from being applied to the mirrored databases. This could result in inconsistencies between databases that are interrelated from an application perspective. The missing data still exists in the primary data center in Memphis, but it may be days or weeks before it can be applied to the standby data center in Atlanta.
Our databases have many tables that include identity columns. Because the databases are interrelated, one database may refer to identity values in another database. After a failover to the standby data center due to a disaster, unapplied log records that can no longer be sent from the principal could mean that a database may refer to an identity value that does not exist in the table of another database. We consider this extremely significant and have developed a methodology to prevent data integrity issues that could be caused by using the identity values that have not been sent or applied.
During recovery of the mirrored databases on the failed over site, a script runs on every table in every database, reseeding the identity value to increment it by a certain number, and logging the change with new and highest old values. When the primary data center comes online again, we can query the former principal database server's data (assuming it is readable) and retrieve appropriate rows to populate the missing values to bring the tables into consistency across all the databases on the new principal.
Planned Downtime Scenarios
In the past, we have encountered a few events causing unplanned downtime, leading us to adopt the following strategies.
Steps for Upgrading to SQL Server 2008 from SQL Server 2005
When we decided to upgrade from SQL Server 2005 to SQL Server 2008, we also decided to upgrade from Windows Server 2003 to Windows Server 2008. After extensive planning, we accomplished the upgrade with minimal downtime.
When upgrading from Windows Server 2003 to Windows Server 2008, we decided also to reformat the storage LUNs for the Windows Server 2003 failover cluster at the primary data center, and to upgrade to new database servers at the standby data center. As a result, we chose to rebuild the failover cluster at the primary data center, and build a new failover cluster at the standby data center. We built a temporary SQL Server 2008 clustered instance on spare servers and used it to keep the SQL Server databases available while the primary data center's failover cluster was rebuilt.
This section lists the steps we took to perform the upgrade, showing how we were able to preserve high availability while minimizing user downtime. In these steps, the following abbreviations will be used. Each of the following SQL Server instances is clustered:
- primarySQL2005: the legacy SQL Server 2005 instance at the primary data center
- standbySQL2005: the legacy SQL Server 2005 instance at the standby data center
- tempSQL2008: a temporary SQL Server 2008 instance at the primary data center
- primarySQL2008: the new SQL Server 2008 instance at the primary data center
- standbySQL2008: the new SQL Server 2008 instance at the standby data center
The following steps illustrate the process our team used to upgrade to SQL Server 2008.
Phase 1: Redirected Application Users to a Temporary SQL Server 2008 Instance
- Configured a temporary two-node SQL Server
- For the temporary SQL Server 2008 cluster, called tempSQL2008, only two cluster nodes were used. The instance would only be online for a few off-peak hours.
- The servers for tempSQL2008 were configured with Windows Server 2008, clustered, and a clustered instance of SQL Server 2008 installed.
- For tempSQL2008 data storage, we added Disk Access Enclosures (DAEs) with additional disk drives to the existing EMC CX-Series array.
- The tempSQL2008 server used a Fibre Channel path via the same equipment as the production SQL Server 2005 clustered instance, going through the same fiber optic switches.
- The tempSQL2008 server level settings were configured and thoroughly tested.
- Stopped database mirroring to the standby data center.
- Set up log shipping from primarySQL2005 to
- We use log shipping to help prepare databases for database mirroring. (For more information, see "Using Log Shipping to Help Set Up Database Mirroring" earlier in this paper.)
- We could not use backup compression to assist in setting up log shipping in this step, because backup compression is only available between SQL Server 2008 instances, and the primarySQL2005 instance was running SQL Server 2005.
- Initialized asynchronous database mirroring
from primarySQL2005 to tempSQL2008.
- Accomplished by converting from log shipping to database mirroring.
- Waited for a very low traffic period before beginning the upgrade process.
- Converted all database mirroring sessions
to synchronous database mirroring
- Waited for synchronization to occur.
- Used the firewall to redirect all incoming
traffic to a “scheduled downtime” Web site.
- All Web servers have the same configuration, and each hosts a Web site for the purpose of handling downtime messages. This Web site responds appropriately to Web service requests.
- Application downtime now starts.
- Removed all Web Servers from Web farm
- This Web server continued to serve the "scheduled downtime" Web site. Because this Web site was not immediately rebooted, it is temporarily called the StaleWebServer in these steps.
- Rebooted all the remaining Web servers to remove any cached or pooled connections.
- Simultaneously changed the DNS connection
alias and reversed the database mirroring roles.
- Changed the DNS connection alias to redirect the application to the tempSQL2008 instance. For details about how we use DNS connection aliases, see "Data Center Infrastructure" earlier in this paper.
- At the same time, we ran an SQL script to manually fail over database mirroring, reversing the database mirroring roles and making tempSQL2008 the principal for all database mirroring sessions.
- Removed database mirroring.
- We ran a script to remove all database mirroring sessions because tempSQL2008 could not mirror to primarySQL2005. (A SQL Server 2008 instance cannot mirror to a SQL Server 2005 instance.)
- Tested all systems with one of the rebooted
- We now took one of the rebooted Web servers currently outside the Web farm (call it the TestWebServer), and used it for testing the application that now connects to the tempSQL2008 database server. This was the final test to ensure that all application functionality was present when connecting to the new tempSQL2008 database. If the testing failed, we could have reverted back to the SQL Server 2005 instance (by issuing a RESTORE command on each of the SQL Server 2005 databases on primarySQL2005 to bring each database from a loading, nonrecovered state into a recovered state).
- This was effectively the Go/No-go decision point. After we made the decision to allow users back into the application and to connect to tempSQL2008, user updates to the databases would start. After that point, new data in the SQL Server 2008 database would be lost if we decided to roll back to SQL Server 2005 based on restoring from database backups.
- Because we made the decision to proceed, we put
the remaining rebooted Web servers back into the Web farm. We performed the
following two actions simultaneously and as quickly as possible:
- Placed TestWebServer into the Web farm, making it an active Web server.
- Removed StaleWebServer from the Web farm, rebooted it in order to remove any cached or pooled connections, and placed it back in the Web farm.
- All the Web servers were now active, in the Web farm, and ready to connect to tempSQL2008.
- Redirected traffic (via the firewall) back
to the application IP addresses.
- The Web servers now were connecting to tempSQL2008. At this point the system was back up, and users were now able to use the application.
- This first downtime period lasted approximately 10 minutes.
Phase 2: Redirected Application Users to the Permanent SQL Server 2008 Instance at the Primary Data Center
- Built a new SQL Server 2008 cluster (primarySQL2008)
at the primary data center.
- Reconfigured the original primarySQL2005 servers with Windows Server 2008 and SQL Server 2008, applying the appropriate drivers and critical updates. Other IT personnel continued to monitor and test tempSQL2008, currently the production instance.
- Reconfigured the primarySQL2005 server's LUNs on the SAN and reformatted them using Windows Server 2008. We reconfigured the LUNs because we changed the number of disks from the older Windows Server 2003 configuration. If reconfiguration had not been required, just a Quick Format using Windows Server 2008 to clean up the drives and maintain proper LUN disk partition alignment would have been sufficient.
- Created the new Windows Server 2008 cluster as a three-node cluster (using an integrated install), and then installed SQL Server 2008 Enterprise. We added the first SQL Server node using the SQL Server Setup program interactively, and added the other SQL Server nodes using Setup's command-line installation options. We found this faster than using Setup interactively for all nodes. We then configured the SQL Server settings and tested a variety of failover situations to make sure everything was functioning correctly.
- Set up log shipping from tempSQL2008 to
- We were able to use backup compression when setting up log shipping between these two SQL Server 2008 instances, making the log shipping setup process faster when compared with the previous setup of up log shipping from primarySQL2005 to tempSQL2008.
- Initialized asynchronous database mirroring
from tempSQL2008 to primarySQL2008.
- Accomplished by using log shipping to initialize database mirroring.
- Converted the mirroring sessions to synchronous database mirroring, and waited for all mirror databases to synchronize.
- At this point, we were ready to move the application to the primarySQL2008 instance, but it required a second downtime period.
- Used the firewall to redirect all incoming
traffic to the “scheduled downtime” Web page.
- Users were effectively offline again at this point. The second downtime period starts.
- Simultaneously changed the DNS connection
alias and reversed the database mirroring roles.
- We changed the DNS connection alias at the data center DNS servers to point connections to the primarySQL2008 server.
- At the same time, we ran a script to reverse the database mirroring roles, making primarySQL2008 the principal and tempSQL2008 the mirror for all database mirroring sessions.
- We then repeated the processes for testing the application, as well as rebooting all Web servers, as outlined in step 12.
- Redirected traffic (via the firewall) back
to the application IP addresses.
- Users could now access the application, and the Web servers were connecting to the primarySQL2008 instance.
- Downtime duration for this second phase was about six minutes.
- At this point, the major part of the upgrade process was finished and the application was now using the desired primarySQL2008 instance. The following steps in Phase 3 did not need to occur immediately and no user downtime was required.
Phase 3: Prepared a New SQL Server 2008 Instance at the Standby Data Center and Set Up Database Mirroring to it from the Primary Data Center
- Prepared the standby data center SQL Server
2008 instance (standbySQL2008).
- We left mirroring from primarySQL2008 to tempSQL2008 active temporarily, in case any issues arose with the primarySQL2008 cluster.
- We then replaced the standbySQL2005 cluster with new servers, installing Windows Server 2008 and SQL Server 2008, as well as upgrading to a new SAN. This was part of a planned equipment upgrade process.
- Set up log shipping from primarySQL2008 to
- We again were able to use backup compression to improve the speed of the log shipping setup process.
- Established database mirroring to the
standby data center.
- Removed database mirroring from primarySQL2008 to tempSQL2008 instances.
- Removed log shipping and set up asynchronous database mirroring from primarySQL2008 to standbySQL2008.
- At this point, both data centers were live with SQL Server 2008 and the upgrade process was complete.
For several weeks after the upgrade, we left the databases in SQL Server compatibility mode 90. This allowed us to troubleshoot potential database issues without the additional concern of having changed to the new SQL Server 2008 compatibility level as a factor in troubleshooting. After no issues were found, we changed the compatibility level of the databases to 100.
Patches and Cumulative Updates
We apply Windows and SQL Server patches to the mirror instance first, before the principal, and always during off-peak hours.
For SQL Server patches (hotfixes or cumulative updates), we use the following process:
- Start at the standby data center with the failover cluster hosting the mirror SQL Server 2008 instance.
- Assume the cluster nodes are named Node 1, Node 2, and Node 3. Also assume that SQL Server is currently running on Node 1, and that Node 1 and Node 2 are the preferred owners.
- Run the patch installation on a node other than Node 1, for example, Node 2. When the installation is finished, reboot Node 2. Though perhaps not necessary, this will remove any pending reboot requirements on the server.
- Run the patch installation on the other unused node, Node 3, and when finished, reboot Node 3.
- Move the SQL Server 2008 resource group from Node 1 to the other preferred node ( Node 2). This normally takes 30-60 seconds, and is the only downtime in this process.
- Run the patch installation on Node 1, and when finished, reboot the node.
- Verify that the SQL Server instance has the correct version number for the patch by running SELECT @@VERSION on the SQL Server 2008 instance.
- Repeat steps 2-7 for the principal SQL Server 2008 instance (the failover cluster at the primary data center).
For Windows Server 2008 updates (including patches, drivers, and other software updates), we use the following steps:
- Start at the standby data center, on the mirror instance failover cluster.
- Again, assume the cluster nodes are named Node 1, Node 2, and Node 3, and that the SQL Server 2008 instance is running on Node 1.
- Pause an inactive node, for example, Node 2.
- Install any updates and make any required changes.
- Reboot the node.
- Resume the node.
- Repeat steps 2-6 for Node 3.
- Move the SQL Server 2008 resource group from Node 1 to Node 2.
- Repeat steps 2-6 for Node 1.
- Repeat steps 2-9 on the principal instance failover cluster at the primary data center.
Database and Application Changes
We have a number of procedures in place to handle planned downtime resulting from database and application changes.
To determine database schema and other major changes to the databases, we use a third-party database comparison utility to compare the production and final version of the development databases. The comparison utility generates a Transact-SQL script that changes the production database schema to the target database schema.
After the script is generated, we inspect it:
- We ensure that the deployment script changes are correct, and we drill down into details ofchanges.
- If there is any potential for a table lock (due to a table schema change, for example), we run the deployment script on a development system to determine the effects of the changes.
A common scenario of a schema change is adding a new column to a table. When creating the table, we ensure that the CREATE TABLE command includes default values for the new column. We also ensure that the final change script modifies any affected stored procedures and views to reflect the new column. If a stored procedure is changed to reflect the new column in an input parameter, we also initialize the default value of the parameter.
When a database schema change requires downtime, and we are ready to apply the script, we take the following steps:
- Choose an off-peak time.
- Before applying changes, back up all database transaction logs as simultaneously as possible. Because some databases are interrelated, this makes the backup image of all of them as consistent as possible.
- Apply the changes. If the estimated downtime is less than 60 seconds, simply apply the changes without stopping applications from connecting to the SQL Server instance. Some users may see an error message. If the estimated downtime is greater than 60 seconds, redirect the applications to a friendly downtime message until the changes are complete.
- Confirm the functionality of the database changes. The type of confirmation depends on how significant the database changes are.
Most of our database changes occur in the context of a change to the application. In such cases, running scripts to change to the databases must be coordinated with changes to the application. Generally updates to the application can be done in a matter of seconds.
In general, we have two strategies for deploying changes to Web servers. In our application, Web content is replicated using Windows Distributed File System (DSF), so deployment strategies change depending on whether there are changes to Web content or not. We perform the following steps using a team of people:
- When there are no changes to Web content, we remove all Web servers except one from the Web farm. Then we simultaneously deploy the code on one of the removed servers and to SQL Server. After testing, we swap these two Web servers by putting the one with changed code into the farm and removing the existing one. Then we deploy the new code to all the remaining servers and put them back into the farm.
- The case is more difficult if the content get replicated. In this case we go down to a single Web server, taking all of the other web servers out of the web farm. We apply the changes to that Web server and simultaneously apply the SQL Server changes. At this point, the Web applications and the SQL Server databases should work together. We ensure that the new Web content has replicated and then add the other servers back into the Web farm.
When only SQL Server changes are being deployed, we determine how long the SQL Server database changes will take. If only stored procedures or views will be changed, and there are no schema changes, the SQL Server changes typically finish in a matter of seconds. In those cases, it is not necessary to redirect users to a "scheduled downtime" Web site. If the SQL Server deployment is more time-consuming due to schema changes, we will direct users to a "scheduled downtime" site, as illustrated in "Steps for Upgrading to SQL Server 2008 from SQL Server 2005" previously, until the changes are successfully deployed and verified.
We rebuild and reorganize indexes in a selective and balanced manner. Rebuilding or reorganizing indexes generates large amounts of transaction log records, and those log records must then be sent to the remote mirror. With asynchronous mirroring, such a condition can cause the mirror to fall behind the principal significantly.
As a result, we allow index maintenance only during low traffic times. We have a periodic Transact-SQL job to reorganize as well as rebuild indexes. Both actions cause transaction log load that must be sent to the mirror. To reduce and even out this load, the script does not pick all tables and indexes at once, but spreads the task across multiple days and at low usage times.
The script uses a threshold to determine whether to rebuild or reorganize an index, depending on the use of the table (lookup tables, for example, would not require frequent rebuilds) as well as fragmentation percentages. This is a multitenant system, and all customers have data in the same tables. Different usage patterns by different customers can cause some tables to require index rebuilding or reorganizing.
The script rebuilds indexes online whenever possible. Maintenance on tables that cannot be reindexed online is done only during the lowest traffic times.
ServiceU has successfully implemented a sophisticated high-availability and disaster-recovery solution for our applications. Database high availability within each data center is achieved by placing a clustered SQL Server 2008 instance on a three-node Windows Server 2008 failover cluster. A three-node failover cluster maintains high availability during cluster patches and upgrades. Disaster recovery is achieved using SQL Server 2008 database mirroring from a primary to a standby data center. Thoroughly tested procedures are used to maintain maximum availability with minimal downtime during both planned and unplanned downtime scenarios. The application is continuously tested and scanned for vulnerabilities at the standby data center, ensuring it is always ready to go into production should the primary data center become unavailable.
For more information, see the following documents:
- Database Mirroring and Log Shipping Working Together
- How to: Minimize Downtime for Mirrored Databases When Upgrading Server Instances
- Using Warning Thresholds and Alerts on Mirroring Performance Metrics
Appendix A. Data Center Infrastructure Details
ServiceU has taken number of steps to ensure the infrastructure at each data center is fully redundant and secure.
- All power is filtered to provide reliable current.
- Data centers have backup generators with large fuel tanks. Both have emergency contracts in case of a disaster to assure continued fuel.
- Batteries provide additional backup in case of any delays or problems with the generators.
- Power-switching equipment detects not only the availability of utility company power, but the quality of that power before it switches back from the generator.
- Multiple electrical circuits are used to further mitigate any power risk. Each device's power supply is connected to a different electrical circuit. Detailed power diagrams help ensure that mistakes are not made.
- All equipment, when available from the manufacturer, has redundant power supplies.
- Multiple air conditioning units provide redundant temperature and humidity control.
- Capacity is oversized so that even if up to 50 percent of the units are nonfunctional, the data center still maintains acceptable temperature and humidity.
- Multiple badges, or badge plus code access, are required to enter the facility.
- Badge access and video logs are kept for a minimum of 90 days (a PCI Data Security Standard (DSS) requirement).
- Both data centers maintain PCI Compliance 24x7x365. In this way, if we ever have to fail over to the standby data center, we do not have to make any changes to be PCI compliant. It is a PCI Compliance requirement that a standby data center facility also be compliant before beginning to process any transactions.
- Key access is required to access the servers.
- Servers use a password protected KVM.
- Servers lock after 15 minutes of inactivity.
- Databases and transaction logs are backed up to disk files. Those backup files are included in the daily tape backups that are transported offsite and stored in a climate-controlled vault. In the case of a disaster, those tapes are flown to the nearest major airport. They are then shipped overnight to the standby data center. While the tapes should never be needed, this is an additional layer of protection.
- Key members of the company carry satellite phones at all times in the case of a disaster. This allows them to communicate with vendors, service providers, and other employees.
- Key service providers have the satellite phone numbers.
- We use firewalls in a passive high availability configuration: Active/Passive firewalls – If a firewall fails, the other firewall takes over with no loss of connectivity to the client. The client never realizes that there was a problem.
- The firewalls share state, so if one goes down or has to be rebooted, the user should not notice any packet loss or connectivity problems.
- All switches have very few moving parts (this is part of the company specification).
- All have redundant power supplies.
We have established the following backup procedures:
- Databases are backed up daily.
- Database transaction logs are backed up hourly during the active part of the day.
- Backups are made to disk and kept on disk for three days.
- Tape backups are stored off-site and on a regular rotation.
Appendix B. Database Server Additional Information
ServiceU places database data and transaction log files on separate physical LUNS of the SAN, not on virtual LUNs. After extensive testing of our application's data access behavior, we found we could achieve slightly better overall performance by placing the tempdb data files and its log files on the database server's data LUN. Disk spindles previously used for a separate tempdb LUN were then reallocated to the database server's data LUN.
Additional SQL Server 2008 database server decisions we made include:
- We found that giving each database server 64 GB RAM dramatically improved performance.
- Our application requires MSDTC for some middle-tier transactions. Microsoft does not support MSDTC with database mirroring, but we do not use MSDTC in a way that would lead to non-supported behavior.
- We have found that setting MAXDOP to 1 at the server level has been best for online transactional processing (OLTP) performance. On occasion we may set it to 0 temporarily for large data changes such as rebuilding indexes, to take advantage of parallelism.
- We use a Policy-Based Management (PBM) server to centralize policies. Policies are pushed out to the instances of SQL Server via the PBM Server and the Windows PowerShell® command-line interface.
Appendix C. Documentation Procedures
ServiceU has implemented documentation procedures that are crucial to both SLAs and high availability.
Disaster Recovery Process
- Our disaster recovery failover steps, including SQL scripts, processes, and other important data are documented and ready at the standby facility to use at a moment’s notice. It is simple for someone under pressure to run these scripts and activate the standby facility.
Configuration Documents, Steps, and Diagrams
- Configuration documents exist for all equipment.
- These are easy to use, yet complete, and they help achieve consistency across the enterprise.
- These also make sure that infrequently changed equipment is always configured correctly. A good example is the configuration of fiber switches for SQL Server clusters.
- We maintain an internal knowledge base (KB) system, which is used extensively to share information and document core knowledge.
- We have a policy that no updates are applied past noon on Thursday. This helps to prevent errors or problems occurring over the weekend, when the response time by IT personnel may be longer.
- Extensive code reviews and testing processes ensure accuracy of code before its deployment to the production environment.
Appendix D. Scripts
The cf_IsMirrorInstance() Function
CREATE FUNCTION [dbo].[cf_IsMirrorInstance] () RETURNS bit AS BEGIN -- This function determines whether a server is the mirror -- instance or the principal. -- Assumption: All databases reside at either one location -- or the other. -- Replace <db name> with an actual database name. DECLARE @mirroring_role_desc nvarchar(60) DECLARE @IsMirrorInstance bit -- Choose a single critical database and test to see whether -- it is PRINCIPAL or MIRROR -- Because the databases are interrelated, all must be the same; -- other jobs test for this SELECT @mirroring_role_desc = mirroring_role_desc FROM sys.database_mirroring m JOIN sys.databases d ON m.database_id = d.database_id WHERE d.name = '<db name>' -- Evaluate the result IF (@mirroring_role_desc IS NULL) OR (@mirroring_role_desc = 'PRINCIPAL') SET @IsMirrorInstance = 0 ELSE SET @IsMirrorInstance = 1 -- Return the result RETURN @IsMirrorInstance END
Determining the Current State of a Mirror Database
ServiceU uses the following query as part of a larger SQL Server Agent job script that alerts IT personnel when database mirroring states have a value other than SYNCHRONIZED or SYNCHRONIZING. The query lists the actual database names, the database state, the mirroring session state, and the mirroring role of the database. When IT personnel receive the alerts, they can quickly see which databases are having problems and the current mirroring state of each.
SELECT d.name, m.mirroring_state_desc, d.state_desc, mirroring_role_desc FROM sys.database_mirroring m JOIN sys.databases d ON m.database_id = d.database_id WHERE mirroring_guid IS NOT NULL AND mirroring_state_desc NOT IN ('SYNCHRONIZED', 'SYNCHRONIZING') ORDER BY d.name ASC
Appendix E. Feedback
Did this paper help you? Please give us your feedback. Tell us on a scale of 1 (poor) to 5 (excellent), how would you rate this paper and why have you given it this rating? For example:
- Are you rating it high due to having good examples, excellent screen shots, clear writing, or another reason?
- Are you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback will help us improve the quality of white papers we release.