Minimizing Downtime with Redundant Servers

Article
01/28/2010

Updated : November 14, 2002

The final step in increasing the availability of a data center is to deploy one or more redundant servers to minimize downtime when a server fails. Use a redundant server solution only after you have followed the recommendations in the preceding chapters. Deploying and maintaining a redundant server solution adds complexity and cost to a data center, and if you do not follow the recommendations in the preceding chapters, your efforts might actually decrease data center availability. If you follow the recommendations and you deploy one or more of the redundant server solutions discussed in this chapter, you can achieve 99.99 percent to 99.999 percent availability for the data center.

Note: Each of these server redundancy solutions is discussed in this chapter, followed by a decision tree that helps you choose among them.

A redundant server solution can minimize downtime caused by:

Planned maintenance that requires taking a data-center server offline
A hardware failure that causes a data-center server to become unavailable while the site remains available
A disaster that causes an entire site to be unavailable, such as a regional power failure, a natural disaster, or man-made disaster

You can minimize data-center downtime that can result from these types of server unavailability by deploying one or more of the following server redundancy solutions:

Failover clustering
Log shipping
Transactional replication

Using Failover Clustering to Maintain a Redundant Server

Use failover clustering to increase the availability of a data center unless you cannot justify its financial cost or setup complexity. Failover clustering reduces the downtime caused by a server failure to less than one minute by automatically detecting a server failure and initiating the failover to a secondary server. Failover clustering requires no administrative intervention during the failover, and clients connect seamlessly to the promoted secondary server after the automatic failover.

Failover clustering uses the Microsoft Cluster Service (MSCS), which is a feature of Microsoft Windows 2000 Advanced Server and Windows 2000 Datacenter Server. MSCS allows two to four servers (nodes) to be linked in a server cluster utilizing a private network and a shared disk array. MSCS allows cluster resources to fail over from the primary node to one of the secondary nodes if specified cluster resources become unavailable on the primary node. MSCS supports two node clusters on Windows 2000 Advanced Server and four node clusters on Windows 2000 Datacenter Server.

Failover clustering allows you to install up to 16 Microsoft SQL Server virtual servers on one MSCS cluster. A SQL Server virtual server is a clustered instance of SQL Server.

Failover Clustering Architecture

Each node in an MSCS cluster is an independent computer with its own hardware resources and operating system. Each node also shares a hard-disk array with other nodes in the cluster. The primary node controls and uses the shared hard-disk array until MSCS initiates a failover to a secondary node. When you install SQL Server 2000 in a cluster, SQL Server setup installs the executable files on each node and installs the data and log files on the shared disk array. When a node fails, MSCS starts SQL Server on the designated secondary node, and the secondary node takes control of the shared hard disk array. SQL Server attaches the data and log files on the shared disk array to the newly started SQL Server instance.

Figure 5.1 illustrates the components of a SQL Server virtual server in a failover cluster.

Figure 5.1: Virtual instance in a failover cluster

A SQL Server virtual server includes the following components:

Shared storage resource — One or more disks that are part of a cluster group and that are available to multiple nodes of a cluster, but to only one node at a time
Private storage resource — One or more disks that are available only to each node in the cluster and that contain the operating system and executable files
SQL Server network name — The virtual name assigned to the virtual server during installation, which is the name that all clients use to connect the clustered instance of SQL Server regardless of the node on which the instance is running
SQL Server IP address — The virtual IP address of the virtual server, which is the IP address that all clients use to connect the clustered instance of SQL Server regardless of the node on which the instance is running
SQL Server resources — The clustered SQL Server resources for a SQL Server instance include the SQL Server, SQL Server Agent, and Microsoft Search services (the clustered resources)

SQL Server 2000 supports single-instance clusters and multiple-instance clusters.

Single-instance clusters — In a single-instance cluster, only one virtual server is installed in the cluster. The data and log files for the virtual server are installed on the shared storage resource for the cluster, and the executable files for the virtual server are installed on the private storage resource for each node. The virtual server is owned by the primary node, and each secondary node is in a wait state. When the primary node fails or is demoted, a secondary node is enabled. When the node is enabled, SQL Server resources start on this node and take control of the data and log files in the shared storage resource. If each node is configured with identical hardware and software resources, the virtual server performs identically on the secondary node after the failover.
Multiple-instance clusters — In a multiple-instance cluster, two or more virtual servers are installed in the cluster. The data and log files for each virtual server are installed on a shared storage resource dedicated to that virtual server. When the primary node for a virtual server fails or is demoted, the secondary node takes control of the shared storage resource for the virtual server. Because each virtual server has a dedicated shared storage resource, no other virtual servers are affected by this failover.

When determining whether to implement a multiple-instance cluster, assess the expected load of the database applications on each node, and determine whether a single node can handle the combined load during a failover. If a single node cannot handle the combined workload of all virtual servers, use two single-instance clusters instead, each with its own dedicated hardware resources.

MSCS uses a private network between nodes for monitoring. MSCS on the primary node monitors the SQL Server resource group to determine when to initiate a failover. MSCS on the secondary node monitors the shared storage resource to determine when to initiate a failover. MSCS uses the following mechanisms to determine when it should initiate a failover:

Shared storage resource mechanism — MSCS on the primary node renews its reservation of the shared storage resource every three seconds. MSCS on the secondary node attempts to reserve the same shared storage resource every five seconds. MSCS on the secondary node will never reserve the shared storage resource if MSCS on the primary node continues to renew its reservation of the shared storage resource. If MSCS on the primary node fails to renew its reservation of the shared storage resource, however, MSCS on the secondary node seizes control of the shared storage and initiates a failover.
SQL Server resources mechanism — MSCS on the primary node runs two tests to determine whether the SQL Server resources are functioning —a LooksAlive check and an IsAlive check. If either check determines that the SQL Server resources are no longer functioning, MSCS determines that a resource failure has occurred. Depending on the failover threshold configuration for the cluster, MSCS on the primary node either attempts to restart the failed SQL Server resource or immediately initiates a failover to the designated secondary node. If MSCS initiates a failover, it relinquishes its reservation of the shared storage resource.

If a failover occurs because the primary node fails, MSCS starts the resources in the SQL Server resource group on the secondary node. On startup, SQL Server initiates automatic recovery for each database, rolling forward all completed transactions and rolling back all incomplete transactions. When SQL Server completes automatic recovery, SQL Server is available to its consumers with the same virtual server name and IP address as before the failover. The time required for the failover is short, generally under one minute. The actual time is governed by the number and size of the transactions that must be rolled forward or back at startup and by the speed of the shared storage resource.

The end user, client application, Web server, and middle-tier component (such as Microsoft Transaction Server) experiences a break in the connection to the virtual server when a failover initiates. Cluster-aware applications detect the failover and reconnect automatically to the virtual server, which is running on a new node. If a SQL Server client is not cluster aware, the user must manually reconnect the client to the virtual server.

For more information, see "SQL Server 2000 Failover Clustering" on the Microsoft Web site at https://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/failclus.mspx.

Failover Clustering Advantages

There are a number of advantages to using failover clustering rather than log shipping or transactional replication. These advantages include the following:

Automatic failover — Switching between nodes occurs automatically whenever a failure is detected. Failover occurs when a SQL Server resource, the Windows operating system, or essential hardware on the primary node fails.
Simple manual failover — You can manually switch the virtual server from the primary node to a secondary node. Simple manual failover enables you to perform normal maintenance on one node while the virtual server continues to run on another node.
Transparent to clients — Failover is relatively transparent to the clients consuming SQL Server resources. After a failover is complete, clients simply reconnect to the new primary node (the promoted secondary node) with the same virtual name and IP address to resume work. Although clients must resubmit any incomplete transactions, they do not need to connect to a different server name or IP address because the failover cluster uses a virtual server name and IP address.
Transactionally current — When MSCS initiates a failover, all committed transactions are preserved and are available to clients when the failover process is complete.
Rapid failover — Completing the failover is fast, depending upon how many transactions need to be rolled forward or rolled back and the speed of the shared storage resource.

These advantages make failover clustering the first choice for increasing data center availability.

Failover Clustering Limitations and Disadvantages

There are a number of limitations and disadvantages to using failover clustering rather than log shipping or transactional replication. These limitations and disadvantages include the following:

Proximity requirement — The round-trip latency of network communication between nodes must not exceed 500 milliseconds, or MSCS will initiate failovers when no actual failure has occurred. This means that all cluster nodes must be located close to one another, unless a third-party clustering solution is used to place nodes in geographically dispersed locations (a stretch cluster). If you deploy a stretch cluster, you must use a third-party solution that appears on the Hardware Compatibility List. To search the HCL for certified computers, components, and configurations, see the "Hardware Compatibility List" on the Microsoft Web site at https://support.microsoft.com/kb/131900 that have not been certified. For more information, see article 280743, "Windows Clustering and Geographically Separate Sites," in the Microsoft Knowledge Base at https://support.microsoft.com/default.aspx?scid=kb;en-us;280743&sd=tech. Third-party clustering solutions are more expensive and more complex than MSCS.
Hardware requirement — MSCS requires specifically certified hardware.
Software requirement — MSCS is available only with Windows 2000 Advanced Server and Windows 2000 Datacenter Server, and failover clustering is available only with SQL Server 2000 Enterprise Edition.

If these disadvantages outweigh the advantages of failover clustering for a data center, you should consider log shipping to determine whether it provides a better high-availability solution for the data center.

Using Log Shipping to Maintain a Redundant Server

Log shipping is designed as an inexpensive high-availability solution that can be used to protect a database against the failure of the primary server. Use log shipping rather than failover clustering if you cannot justify the financial cost and complexity of failover clustering. Also use log shipping with failover clustering to protect against site-level disasters as well as local server failure. Log shipping enables you to maintain a copy of the production database on one or more secondary servers and to easily promote one of the secondary servers to become the new primary server.

Log shipping uses SQL Server backup and restore capabilities to provide database redundancy between the production database on a primary server and a standby database on one or more secondary (standby) servers. Log shipping requires an administrator to detect a failure and initiate a role change. A role change promotes a standby server to a primary server by recovering the production database on the standby server. The role change process is not automatic and takes at least several minutes to complete. It can take significantly longer with large databases. After a role change, clients must connect to a different server with a different server name and IP address. Unlike with failover clustering, virtual server names and IP addresses are not incorporated into log shipping.

Log Shipping Architecture

SQL Server 2000 Enterprise Edition has built-in support for log shipping, whereas SQL Server 2000 Standard Edition requires a custom implementation of log shipping. The built-in log shipping architecture consists of a primary server, one or more standby servers, and a monitoring server. These servers perform the following roles:

Primary server — The primary server is a server running SQL Server 2000 and containing the production database. You create and use a full database backup of this database to initialize a standby database on each standby server. SQL Server Agent jobs make periodic transaction log backups of the production database to capture changes made to the production database.
Standby servers — Standby servers are servers running SQL Server 2000 and containing an unrecovered copy of the production database from the primary server. SQL Server Agent jobs periodically copy transaction log backups from the primary server and restore these transaction log backups, without recovery, to each standby database. The restoration process updates the standby database on each standby server with new transactions from the primary server. A standby database can be recovered, and the standby server can replace the primary server if the primary server fails or must be taken offline for maintenance.
Monitoring server — A monitoring server is a server running SQL Server 2000 that monitors the status of the log shipping jobs on the primary server and on each standby server. To enable you to detect a failure to the primary server or to any standby server, use a server other than the primary server or a standby server as the monitoring server.

To create a custom log shipping solution, create each of these roles manually.

Note: You can use a standby server to contain standby databases from multiple primary servers and a monitoring server to monitor multiple primary server-standby server pairs.

Figure 5.2 illustrates how log shipping works.

Figure 5.2: Log shipping

When you configure log shipping, you restore a copy of the production database without recovery on each standby server. Thereafter, log shipping uses the following SQL Server Agent jobs and alerts to 1) maintain an up-to-date copy of a production database on one or more standby servers, and 2) report the success or failure of these jobs to the monitoring server and to administrators:

Transaction log backup job — The transaction log backup job runs on the primary server and backs up the transaction log of the production database on the primary server according to a schedule. This job also records the results of each execution in a status table on the monitoring server. To ensure that this backup remains available if a disk fails, use a fault tolerant storage location for the backup file. Share these backup files so that the transaction log copy job on each standby server can copy them, but strictly limit access through this share to protect sensitive information. Schedule the backup job to run every 1 to 15 minutes. The more frequent the backup, the smaller the size of each backup file and the more closely you can synchronize the standby database on each standby server.
Transaction log backup copy job — The transaction log backup copy job runs on a standby server and copies each transaction log backup from the network share on the primary server to the standby server according to a schedule. This job also records the results of each execution in a status table on the monitoring server. Schedule this job to run with the same frequency as the transaction log backup job. Copying each backup to a standby server as soon as it is created is important because the purpose of log shipping is to have an up-to-date copy of the production database on a standby server in case the primary server fails.
Transaction log restore job — The transaction log restore job runs on a standby server and restores transaction log backups to the standby database according to a schedule. Restore the transaction logs without recovery to allow additional transaction logs to be restored. To minimize the time required to complete a role change, schedule the restore job to run with the same frequency as the copy job. This keeps the standby database synchronized to within several minutes of the production database. If you also want to use the standby database for reporting, however, you cannot restore the transaction logs as frequently without interfering with its use for reporting. SQL Server requires exclusive access to the standby database to restore transaction logs or the transaction log restore job will fail. To ensure that the transaction log restore job does not fail, configure the restore job to terminate all users every time it runs.

You must choose between maximum synchronicity and additional functionality. For example, if you run the transaction log restore job once a day, users can access the standby database for reporting, but it will take longer to recover the standby database and bring the standby server online to replace a failed primary server. Delaying the application of transaction log backups to the standby database increases the likelihood that you will discover user errors, application errors, or database corruption before the backups containing those errors are restored to the standby database.
Backup alert — The backup alert runs on the monitoring server and monitors the success and failure of transaction log backup jobs on the primary server. Specify the threshold for alerting an administrator that backup jobs are failing. To reduce the risk of receiving false alerts, set a threshold that avoids false alerts by allowing for large transaction log backup files.
Out of sync alert — The out of sync alert runs on the monitoring server and monitors the synchronicity between the production database and the standby database. Specify the threshold for alerting an administrator that the standby database has exceeded a specified level of synchronicity. To reduce the risk of receiving false alerts, set a threshold that tolerates the time required to copy large transaction logs across the network and to restore them on the standby server.

If you are using SQL Server Enterprise Edition, use the Database Maintenance Plan Wizard to create these jobs and alerts automatically. If you are using SQL Server Standard Edition, create these jobs manually.

Because log shipping copies only database objects in the production database, database objects that are stored in the master and msdb databases or in the file system must be manually copied and recreated on a standby server either before or when a role change is initiated. To accomplish this, copy the following objects on the primary server:

Server logins — Create a Data Transformation Services (DTS) job by using the transfer logins task that transfers logins from the primary server to a standby server. Create a separate DTS job for each standby server. Back up the sysxlogins table by using the Bulk Copy Program (BCP) utility to capture the server-level logins. Store this backup file on each standby server. Update this backup each time new logins are added to the primary server.
Other database objects — Generate Transact-SQL scripts to recreate all jobs, alerts, and user-defined error messages on the primary server. Execute these scripts on each standby server. Update these scripts as these objects change, and use these scripts to update each standby server.
DTS package — Store a copy of each DTS package on each standby server. Update the copy of each DTS package as they change.

The log shipping jobs keep the standby database synchronized or ready to be synchronized. Manually copying and scripting objects stored outside of the production database ensures that the standby database can take the place of the production database during a failure or planned maintenance.

If you are using the standby server for standby databases from multiple SQL Server instances, the following considerations apply:

SQL Server logins from different SQL Server instances may have identical names with different permissions, which can create a security conflict.
Database objects from different SQL Server instances must have different names, or they overwrite each other.
DTS packages from different SQL Server instances or servers must have different names, or they overwrite each other.

If you take these considerations into account, a SQL Server instance on a standby server will be able to contain the standby database from multiple primary servers.

Firewall Considerations

If the primary server and a standby server are separated by a firewall, you need to open the following ports on the firewall:

Port 1433 — This port is required for SQL Server operations between default instances. This port must be open in both directions. If you are using a named instance of SQL Server, you must open the port used by the named instance.
Port 135 — This port is required so that SQL Server Agent can copy transaction log backup files and other synchronization files to the shared folder on the standby server
Ports 137, 138, and 139 or 445 — These ports are required to use Uniform Naming Convention (UNC) shares.

These are the only ports that need to be opened for log shipping. Other Windows services and application may require additional ports to be opened.

Log Shipping Role Change

Unlike failover clustering that supports automatic failover, changing log shipping server roles is a manual process. You must determine when to initiate a role change and manually execute a number of tasks to complete it. Changing log shipping roles requires you to perform the following tasks on the selected standby server:

Back up transaction log — If the primary server is still functioning, terminate all user connections and perform a final backup of the transaction log to capture all completed transactions that have not yet been backed up. Back up the transaction log by using the No Recovery option. Doing so enables you to restore transaction log files to log-shipped database without restoring it from a full database backup. Disable the transaction log backup job, and restrict client access to the production database to ensure that clients cannot change the database. SQL Server Enterprise Edition provides the sp_change_primary_role system-stored procedure to accomplish this task.

If the primary server fails, completed transactions that are not backed up and copied to a standby server are lost. End users have to resubmit these transactions.
Copy transaction log backup — If the primary server is still functioning, copy any transaction log backup files that are not already copied to the standby server that you want to promote. SQL Server Enterprise Edition provides the sp_change_secondary_role system-stored procedure to accomplish this task.

If the primary server fails, you will not be able to restore to the standby database any transaction log backup files that were not copied to a standby server before the primary server failed.
Restore transaction log backups, with recovery — On the standby server, restore, with recovery, all transaction log backup files that are not already restored to the standby database. After recovery, the standby database on the standby server will be able to accept client connections. SQL Server Enterprise Edition provides the sp_change_secondary_role system-stored procedure to accomplish this task.
Resolve SQL logins — Resolve SQL Server logins to user accounts in the standby database. Use the sp_resolve_logins system-stored procedure to load the saved BCP file containing logins from the primary server. This stored procedure resolves the SID values between the production database and the standby database. This task must be performed after the standby database is recovered.

When these tasks are complete, the new primary server is ready to receive client requests. Because the production database is residing on a different server with a different server name and IP address, however, you must manually point clients to this new server. You accomplish this task in different ways, depending on the design of the client applications and the use of middle-tier objects. You may be able to change the server name or IP address in a COM+ object or in DNS, or you may have to make a change to the Data Source Name (DSN) at each client computer. You may also be able to create a Network Load Balancing (NLB) cluster and virtualize the server name and IP address if the primary server and the standby server are on the same subnet. For more information about using NLB with log shipping, see "Redirecting Client Network Traffic to a Promoted Secondary Server" later in this chapter.

Note: If an application has a hard-coded server name, you cannot use log shipping without recoding the application or using an alias of the same name on the client that you redirect to the primary server when the primary server changes.

After a log shipping role change, you can either continue to use the new primary server as the primary server or bring the original primary server back online after it has been repaired or the maintenance task is finished. If you continue to use the new primary server as the primary server, enable log shipping to the original primary server (configure it as a standby server) to provide for database redundancy if the new primary server fails. If you want to bring the original primary server back online, back up and copy the transaction log from the new primary server to the old primary server, and then recover the original primary database. When these tasks are complete, you can point clients back to this original server and re-enable log shipping to the standby server to reinstate redundancy.

For more information about SQL Server log shipping, see article 314515, "Frequently Asked Questions — SQL Server 2000 — Log Shipping," in the Microsoft Knowledge Base at https://support.microsoft.com/default.aspx?scid=KB;EN-US;314515&sd=tech.

Log Shipping Advantages

There are a number of advantages to using log shipping rather than failover clustering or transactional replication. These advantages include the following:

Hardware requirement — Log shipping does not require specifically certified hardware.
Software requirement — Built-in log shipping is a standard component of SQL Server 2000 Enterprise Edition and custom log shipping can be manually configured with any SQL Server 2000 edition.
Less network traffic than transactional replication — Log shipping generates less network traffic than transactional replication does.
Site protection — Log shipping to a remote standby server provides protection against catastrophic site failure.

These advantages make log shipping the logical choice for a high-availability solution if you cannot justify the financial cost and setup complexity of failover clustering, or if you want to protect against server and site disasters in a single solution. You can also use log shipping with failover clustering to provide an inexpensive high-availability solution for site-level disasters with an automatic failover solution for local server failures and routine maintenance operations.

Log Shipping Limitations and Disadvantages

There are a number of limitations and disadvantages to using log shipping rather than failover clustering or replication. These limitations and disadvantages include the following:

Manual detection and role change — Log shipping requires manual intervention by a database administrator (DBA) to detect a failure and to initiate a role change. Initiating a role change requires using a number of stored procedures to recover the standby database. In addition, you must manually ensure that required objects or files not stored in the production database are kept synchronized between the primary server and the standby server.
Dual purpose versus minimal latency — You cannot use a standby database for secondary purposes, such as reporting or routine maintenance tasks, without increasing the transactional latency of the standby server and increasing the time required to bring a standby server online when the primary server fails.
Potential transaction loss — If the primary server fails, completed transactions that have not been backed up and copied to a standby server may be unrecoverable. You can reduce this risk by using fault tolerant storage for the transaction logs, increasing the frequency of transaction log backups, using fault tolerant storage for the transaction log backups, and immediately copying transaction log backups to fault tolerant storage on a standby server.
Lack of transparency to clients — After a role change, clients must connect to a different server with a different server name and IP address. You must manually modify each client to connect to this new server or incorporate an intermediate mechanism to accomplish this task for these clients. Intermediate mechanisms include NLB, DNS, and COM+ components. Each of these intermediate mechanisms requires a manual modification by a DBA to point clients to the new server name and IP address. This lack of transparency increases the time required to complete a role change.

If these disadvantages outweigh the advantages of log shipping for a data center, you should consider transactional replication to determine whether it provides a better high-availability solution for the data center. If neither log shipping nor transactional replication provides the high-availability advantages required for the data center, reconsider failover clustering.

Using Transactional Replication to Maintain a Redundant Server

Use transactional replication rather than failover clustering or log shipping if you want an inexpensive solution that allows a secondary server to be used for reporting as well as for redundancy. Transactional replication can also be used with failover clustering to provide an inexpensive solution to site-level disasters and provide one or more servers that can also be used for reporting. Because transactional replication is not designed for high availability, promoting a secondary server is not simple. In addition, reverting to the original primary server after a secondary server is promoted requires a complete database restoration. Finally, transactional replication requires preparation of the schema for replication. Identity columns and triggers must be set to Not For Replication; literals must be defined with apostrophes rather than quotation marks; primary keys must be defined for every table; and database objects must not include unresolved references.

Transactional replication uses SQL Server Agent jobs to copy each change made to a production database to a copy of the database on one or more secondary servers. Transactional replication requires an administrator to detect a failure, disable replication, and designate one of the subscribers as the new primary server. This process is not automatic and takes at least several minutes to complete. It can take significantly longer with a large database. After a role change, clients must connect to a different server with a different server name and IP address. Unlike with failover clustering, virtual server names and IP addresses are not incorporated into transactional replication.

Transactional Replication Architecture

The transactional replication architecture consists of a primary server (the publisher), a distribution server (the distributor), and one or more secondary servers (the subscribers). These server roles perform the following functions:

Publisher — The publisher is the server containing the production database (the publication database).
Distributor — The distributor is the server containing the snapshot folder and the distribution database. The snapshot folder contains a snapshot of the publication database. The distribution database stores changes made to the publication database since the snapshot was taken. In a high-availability solution, the distributor role should be on the same server as one of the subscribers to eliminate additional points of failure. To increase availability in the event of a site failure, place the distributor/subscriber in a separate site from the publisher.
Subscribers — A subscriber is a server containing a copy of the publication database that is periodically updated from the distribution database.

Figure 5.3 illustrates how transactional replication works. This diagram illustrates the placement of the distributor role on a subscriber. A single subscriber is shown for the sake of exposition. In a high-availability solution, you may use multiple subscribers, which are geographically distributed.

Figure 5.3: Transactional replication

SQL Server uses the following SQL Server Agent jobs (agents) to initialize and maintain an up-to-date copy of the publication database on one or more subscribers:

Snapshot agent — The snapshot agent creates the initial snapshot of the publication database and stores this snapshot in a shared folder on the distributor. The snapshot agent places a mark in the transaction log demarking the time of the snapshot. The log reader agent uses this mark to determine the transactions to copy to update the snapshot on subscribers.
Log reader agent — The log reader agent monitors the transaction log for the publication database and copies new transactions to the distribution database. For minimum latency, run the log reader agent continuously.
Distribution agent — The distribution agent initializes a copy of the publication database on each subscriber by using the initial snapshot. Thereafter, the distribution agent periodically copies transactions stored in the distribution database and applies them to the publication database copy on each subscriber to keep the database current.

Because transactional replication copies only database objects in the publication database, database objects that are stored in the master and msdb databases or in the file system must be manually recreated on a subscriber either before or when a role change is initiated. To accomplish this, perform the following tasks on the publisher:

Server logins — Create a DTS job by using the transfer logins task that transfers logins from the publisher to a subscriber. Create a separate DTS job for each subscriber. Back up the sysxlogins table by using BCP to capture the server-level logins. Store this backup file on each subscriber. Update this backup each time new logins are added to the publisher.
Other database objects — Generate Transact-SQL scripts to recreate all jobs, alerts, and user-defined error messages on the primary server. Execute these scripts on each subscriber. Update these scripts as these objects change, and use these scripts to update each subscriber.
DTS jobs — Store a copy of each DTS job on each subscriber. Update the copy of each DTS job as these jobs change.

The transactional replication jobs keep the standby database synchronized. Manually copying and scripting objects stored outside of the publication database ensures that the database copy on a subscriber can take the place of the publication database if a failure occurs.

Firewall Considerations

If the publisher and the subscriber/distributor, or a subscriber and the subscriber/distributor, are separated by a firewall, open the following ports on the firewall:

Port 1433 — This port is required for SQL Server operations between default instances. This port must be open in both directions. If you are using a named instance of SQL Server, you must open the port used by the named instance.
Ports 137, 138, and 139 or 445 — These ports are required if you are delivering the initial snapshot by using UNC shares.

These are the only ports that need to be opened for transactional replication. Other Windows services and application may require additional ports to be opened.

Transactional Replication Role Change

Using a subscriber as the new primary server is a manual process. You must manually stop the replication process and resolve SQL Server logins and user accounts. When these steps are complete, clients can connect to the new primary server. Changing roles includes the following tasks:

Disabling replication — If the distributor is still functioning, disable the transactional replication jobs and place the publication database in single-user mode.
Resolving SQL logins — Resolve SQL Server logins to user accounts in the database copy on the subscriber. Use the sp_resolve_logins system-stored procedure to load the saved BCP file containing logins from the publisher. This stored procedure resolves the SID values between the publication database and the database copy on the subscriber.

When these tasks are complete, the new primary server is ready to receive client requests. Because the production database resides on a different server with a different server name and IP address, however, you must manually point clients to this new server. You accomplish this task in different ways, depending on the design of the client applications and the use of middle-tier objects. You may be able to change the server name or IP address in a COM+ object or DNS, or you may have to make a change to the DSN at each client computer. You may also be able to use an NLB cluster and virtualize the server name and IP address if the publisher and the subscriber are on the same subnet. For more information about using NLB with transactional replication, see "Redirecting Client Network Traffic to a Promoted Secondary Server" later in this chapter.

Note: If an application has a hard-coded server name, you cannot use transactional replication without recoding the application or using an alias of the same name on the client that you redirect to the primary server when the primary server changes.

After a transactional replication role change, you can either continue to use the new primary server as the primary server or bring the original primary server back online after it has been repaired. If you continue to use the new primary server as the primary server, enable transactional replication to the original primary server (configure it as a subscriber). If you want to bring the original primary server back online, you must place the production database in single-user mode, back up the production database on the new primary server, and restore the database backup to the original primary server. After restoring the original primary server, you can point clients back to this original server and re-enable transactional replication to the original subscriber.

For more information, see "Transactional Replication Performance Tuning and Optimization" on the MSDN Web site at https://msdn2.microsoft.com/library/aa902656.aspx.

Transactional Replication Advantages

There are a number of advantages to using transactional replication rather than failover clustering or log shipping. These advantages include the following:

Dual use — You can use the database copy on a subscriber for reporting or routine database maintenance tasks without reducing the transactional latency of the database copy.
Minimal transactional latency — You can reduce the transactional latency between the publication database and database copy on the subscriber to several seconds on a high-speed network. The minimal transactional latency for a standby database with log shipping is several minutes.
Hardware requirement — Transactional replication does not require specifically certified hardware.
Software requirement — Transactional replication is a standard component of all editions of SQL Server 2000.
Site protection — Transactional replication to a remote subscriber provides protection against catastrophic site failure.

These advantages make transactional replication the logical choice for a high-availability solution if you require dual use or minimal latency. If neither of these advantages is required for the data center, you should reconsider log shipping or failover clustering.

Transactional Replication Limitations and Disadvantages

There are a number of limitations and disadvantages to using transactional replication rather than log shipping or failover clustering. These limitations and disadvantages include the following:

Manual detection and role change — Transactional replication requires manual intervention by a DBA to detect a failure and initiate a role change. In addition, you must manually ensure that required objects or files not stored in the publication database are kept synchronized between the publisher and each subscriber.
More network traffic than log shipping — Transactional replication generates more network traffic than log shipping.
Potential transaction loss — If the publisher or distributor fails, completed transactions that have not been copied to subscribers may be unrecoverable. You can reduce this risk by using fault tolerant storage for the transaction logs.
Lack of transparency to clients — After a server role change, clients must connect to a different server with a different server name and IP address. You must manually modify each client to connect to this new server or to incorporate an intermediate mechanism to accomplish this task for these clients. Intermediate mechanisms include NLB, DNS, and COM+ components. Each of these intermediate mechanisms requires a manual modification by a DBA to point clients to the new server name and IP address. The lack of transparency increases the time required to complete a role change.
Complex setup and maintenance — Setting up and maintaining transactional replication is complex.
Complex failback — Changing back to the original primary server after a role change is complex.

These disadvantages outweigh the advantages of transaction replication unless you absolutely require dual use or extremely minimal transactional latency.

Redirecting Client Network Traffic to a Promoted Secondary Server

After the failure of a primary server, you must redirect client network traffic to the promoted secondary server. Although this happens automatically with failover clustering, you must perform this task manually with log shipping and transactional replication. Although you can perform this task at each client when the secondary server is promoted, you can use a number of mechanisms to expedite this process, including the following:

Network Load Balancing — You can create an NLB cluster to direct client network traffic to a promoted secondary server. NLB is a Windows service that you run on the primary server and on one or more secondary servers to create an NLB cluster. An NLB cluster provides a single server name and address to all clients that want to connect to the data center. While the primary server is functioning properly, you configure NLB to direct all client requests to the primary server. If the primary server fails, you promote a secondary server and then reconfigure the NLB cluster to direct all client requests to the promoted secondary server. With NLB, clients are not redirected to a secondary server until that server is ready to receive client requests.

Note: Because NLB cannot redirect client requests across subnets, all servers in an NLB cluster must reside on the same subnet. To use NLB to redirect client network traffic to a remote site, you can configure a virtual private network connection between the sites.
Alternate server list in client application — You can design client applications with a list of alternate servers. With this approach, if a client application fails in an attempt to connect to the primary server, you specify a number of retries. After the retries fail, the client application attempts to connect to a secondary server from its list of alternate servers. If a connection to the first secondary server fails, the client application attempts to connect to the next secondary server on the list.

With an alternate server list, clients may attempt to connect to a secondary server that is not ready to receive client requests and attempt to connect when the primary server has not actually failed. Although custom applications can easily be modified to include a list of alternate servers, third-party applications and utilities cannot be easily modified. The users of these applications need to be aware of the alternate list and make the alternate connection manually.
DNS redirection — You can use DNS to redirect client network traffic to a promoted secondary server. If the primary server fails, you promote a secondary server and then modify the DNS entry for the primary server name to point to the standby servers IP address. Then you rename the promoted secondary server to the name of the primary server. This approach does not work for clients using an IP address rather than a server name to connect to the data center.

Note: You must change the DNS entry for the primary server in both the forward lookup zone and the reverse lookup.
ODBC DSN redirection — You can use ODBC for client connections to the data center. With ODBC, the client application uses a Data DSN to connect to the data center. The DSN specifies the name of the server to which the client application connects. If the primary server fails, you promote a secondary server and then modify the DSN to point to the promoted secondary server. You can automate the modification of these DSNs by using batch files to copy the modified DSN to all clients.
COM+ component redirection — You can use COM+ components for client connections to the data center. With a COM+ component, the client application connects to the COM+ component on an intermediate server, and the COM+ component specifies the name of the server to which the client application connects. If the primary server fails, you promote a secondary server and then modify the COM+ component to point to the promoted secondary server.

In addition to these methods for redirecting client network traffic in general, DTS packages have their own options for redirection. You can use an alias for the server name or use a dynamic properties task to set your connection object:

Alias — Use the Client Network Utility or local.
- If you are using a named instance of SQL Server or need to refer to another server, create an alias on both servers by using the Client Network Utility. Use the same alias, but point it to the local <server name>\<instance name> of the server on which the package is stored.
- If you are referring to a default instance of SQL Server on the machine on which the package will run, select local from the server dropdown list.
Dynamic properties task — You can use the dynamic properties task in the DTS package to set your connection object. Using the dynamic properties task allows the DTS package to refer to the server and instance on which it is running. You do not need to modify each task or connection when the DTS package runs on a different computer.

Choose among the client redirection mechanisms discussed above to redirect client network traffic to a promoted secondary server and to allow DTS tasks to execute properly on the promoted secondary server.

Choosing Among the Technology Solutions

The cost of the high-availability solution, the complexity of its setup and administration, the ease of failover and failback, and the transparency of failover and failback to clients determine the appropriate high-availability solutions for a data center. The choice of operating system and edition of SQL Server affects which high-availability options are available. Failover clustering and built-in log shipping require SQL Server 2000 Enterprise Edition. NLB requires Windows 2000 Advanced Server or Windows 2000 Datacenter Server.

When choosing among these high-availability solutions, consider failover clustering first. Failover clustering provides simple, quick, automatic, and transparent failover, guaranteeing that no transactions are lost when a server fails. The hardware and software required to set up failover clustering is relatively expensive, however. In many environments, the direct and indirect costs resulting from data center unavailability justify the cost of this solution. Failover clustering does not protect against site destruction without the use of a third-party stretch cluster solution, which is expensive and complex. For some environments, such as e-commerce solutions, however, stretch clustering can be justified.

If you do not need automatic failover and cannot justify the cost of failover clustering, log shipping is the most appropriate solution. Log shipping does not require expensive hardware, and setting it up is relatively simple. Log shipping is designed for high availability, and changing server roles in log shipping is easier than changing server roles in transactional replication. Log shipping is also an appropriate, inexpensive solution to protect a data center against site destruction when remote failover clustering cannot be justified.

If you want to use a secondary server for reporting or routine maintenance tasks and offload these tasks from the primary server for performance, you can do one of the following:

Increase the log shipping latency to enable a log shipping standby server to be used for reporting or maintenance tasks. Increasing the latency, however,decreases the availability of the data center by increasing the time required to recover the standby database if the primary server fails.
Maintain two log shipping standby servers, using one for reporting and maintenance tasks and the other for high availability.
Use transactional replication as the high-availability solution.

To minimize the impact of a failover or a role change to a client, use failover clustering. If you cannot justify the cost of this solution, incorporate intermediate mechanisms, such as NLB, DNS, or COM+ components.

Summary

To overcome each barrier to high availability, you must perform an analysis and determine the best solution. No one solution will overcome all of the barriers to high availability. You must use a combination of well-documented policies and procedures, well-designed client applications, regular database backups, redundant hardware, and redundant servers. You must weigh the cost of overcoming a barrier against the cost of not overcoming that barrier when you determine the appropriate solution. When evaluating a solution, be sure that the solution resolves the barrier it was intended to solve. For example, implementing failover clustering does not solve downtime caused by database corruption or user error.

After you have determined the appropriate server redundancy solution for a data center, use the accompanying Solution Guide to implement the solution.

Minimizing Downtime with Redundant Servers

On This Page

Using Failover Clustering to Maintain a Redundant Server

Failover Clustering Architecture

Failover Clustering Advantages

Failover Clustering Limitations and Disadvantages

Using Log Shipping to Maintain a Redundant Server

Log Shipping Architecture

Firewall Considerations

Log Shipping Role Change

Log Shipping Advantages

Log Shipping Limitations and Disadvantages

Using Transactional Replication to Maintain a Redundant Server

Transactional Replication Architecture

Firewall Considerations

Transactional Replication Role Change

Transactional Replication Advantages

Transactional Replication Limitations and Disadvantages

Redirecting Client Network Traffic to a Promoted Secondary Server

Choosing Among the Technology Solutions

Summary

Additional resources