Site Server - Membership Directory Guidelines for Failsafe Operations

April 1999 

Related Documents:

  • Microsoft Site Server 3.0 Membership Capacity and Analysis 

  • Microsoft Site Server 3.0 Membership Directory Configuration and Tuning Guidelines 

Purpose

This document addresses procedures for reducing or eliminating downtime in a Microsoft® Site Server version 3.0 environment. Hardware failure, data corruption, and physical site destruction all pose threats to a 24 x 7 operation. The goal of this document is to help you learn to minimize downtime in the Site Server environment

Overview

Keeping the Site Server system running in a failsafe environment means addressing potential failures at each of the three tiers: application servers, LDAP servers, and servers running Microsoft® SQL Server™. In the case of the application servers, round robin Domain Name System (DNS) provides an acceptable answer. Hardware solutions are also available from a variety of vendors. These solutions, in conjunction with Dynamic Directory replication, also provide a strong solution for LDAP server redundancy. Data stored on the server running SQL Server must be shared between the LDAP servers. If any of the SQL Server components fail, all LDAP servers using that store are compromised. Since all of the application servers are dependent upon these LDAP servers, those components that use the Site Server Membership Directory, such as Site Server 3.0 Personalization & Membership, will also be compromised.

Tier 1
Fault Tolerance:

Low state, no unique data, inexpensive machines
Use round robin DNS, hardware load balancing

Tier 2
Fault Tolerance:

Low state, no data
Use round robin DNS, hardware load balancing

Tier 3
Fault Tolerance:

Holds system state, lots of unique data

Because of this critical single point of failure, this document will focus on methods that provide automatic fail-over, hot swap, or warm backup capabilities for the databases that support the Site Server Membership Directory. This document will cover four methods of protection against the single point of failure. The optimal solution is site dependent.

The recommended solutions for failsafe operations are identified in the following table. The remainder of this document provides additional information about these solutions.

Type of Failure

Failsafe

Detection

Failover

Application or application server failure

Multiple copies running with round-robin DNS or hardware

Monitoring tools

DNS: script removal from DNS
Hardware: automatic

LDAP application or LDAP server failure

Multiple copies running with round-robin DNS or hardware

Monitoring tools

DNS: script removal from DNS
Hardware: automatic

SQL Server computer

Multiple servers running SQL Server with Windows NT Clustering Services

Automatic

Automatic

SQL Server Disk

RAID5

Automatic

Automatic

Data corruption

Warm backup SQL Server on separate server

Monitoring tools

Switch to use warm backup computer

SQL Server 6.5 Replication

SQL 6.5 replication allows administrators to configure their SQL Servers such that all modifications to data in one database are automatically propagated to another, allowing a copy of the data in one database to be present in another.

This method is limited in a Site Server Membership Directory environment by:

  • Data transfer rates 

  • Replication latency 

  • Monitoring requirements of the replication operation 

  • Difficulty in setting up the replication operation 

  • Special handling that is required for image data replication 

  • Potential for replication of user-corrupted data 

SQL 7.0 Replication

SQL Server 7.0 replication is much like SQL Server 6.5 replication, but addresses many of the limitations in SQL Server 6.5. Specifically, SQL Server 7.0 greatly improves data transfer rates, simplifies replication setup through wizards, and requires no special handling for image data. In addition, SQL Server 7.0 has the capability of replicating changes made to subscribers back to the master database, offering improved hot swap capabilities. SQL Server 7.0 continues to present the potential for replication of user-corrupted data, and is further limited by the fact that it is still in beta and hence, not useful in the short term.

Microsoft Windows NT Clustering Services

Microsoft® Windows NT® Clustering Services (also known as Microsoft Cluster Server) allows two computers to act as standby computers for each other, sharing only their disk array. This system allows one computer's CPU, memory, operating system, or other critical components to fail without causing a service outage. Using RAID5 protects against physical media corruption (drive failures).

Clustering in conjunction with a failsafe storage device system represents a true fail-over solution for everything but data corruption. It is limited by expense of hardware, difficulty of configuration, shared media, controller failure potential, and the fact that it offers no protection against database corruption.

Keeping a Warm Backup SQL Server

The best data protection strategy is a combination of full backups and backups from transaction logs. If a redundant computer is continuously being restored from a master computer, service may be restored with excellent data integrity. Intelligent management of warm backup computers will lead to nominal downtimes.

SQL Server 6.5 Replication

SQL Server 6.5 replication is a poor solution for failsafe operations in the Site Server environment. It is primarily intended for geographic distribution of data and for query distribution in environments that are query driven (involving very few writes, modifies, or deletes). The following information is presented in the interest of correcting misconceptions and detailing what would be involved in using SQL Server 6.5 replication for failsafe operations.

Advantages

  • The primary advantage of SQL Server 6.5 replication is that it allows a very recent copy of the data to be maintained with little operational overhead (once it has been set up). Because SQL Server 6.5 replication is designed to act as a continuous process managed by SQL Server 6.5, it can be set up and run in a relatively unattended mode. 

  • SQL Server 6.5 replication is also useful for geographic distribution. If you need read-only copies of your directory in a Wide Area Network (WAN) environment, SQL Server 6.5 replication can provide these copies with a reasonably low administrative overhead. 

  • Finally, because SQL Server 6.5 replication is an existing technology, it requires neither custom tools nor extensive in-house development to configure and deploy. It is a workable redundancy tool that is available today. 

Limitations

  • For a site using the Membership Directory for personalization, or one that has a reasonably high update rate, the replication throughput in SQL Server 6.5 is not sufficient to keep up with the change rate in the store. Thus, SQL Server 6.5 replication is viable only for very low add, modify, and delete transaction scenarios. 

  • SQL Server 6.5 replication requires special handling for image data. In SQL Server 6.5, the replication of an image row/column pair always results in the allocation of a 2 KB page on the subscriber, even if the source column is NULL. This results in tremendous space usage (bloat) on the subscribing databases if image columns are not given special handling, such as splitting them off to another table, then using a view to join them back into the original query. Unfortunately, this split/join process is inefficient and does not yield a warm backup SQL Server. Converting a subscriber for use as a publisher involves merging the split image data back into the primary table and rebuilding indexes. This process can take several hours in a medium- to high-scale environment. 

  • SQL Server replication offers little or no protection against data corruption. If data is corrupted and the SQL Server replication process detects the corruption, the replication process will be halted. If the user corrupts the data, the corruption will be replicated. If the replication process fails when a transaction is only partly replicated, the subscriber database may be corrupted. 

Option 1: Full Replication

While SQL Server 6.5 replication is not a recommended solution for most deployments, there are a few situations where it can provide added value. The first of these is in a very high-read, low-write environment. In such an environment, replication of data for geographic distribution or query load balancing can be advantageous.

All tables are replicated in a straightforward manner, with the exception of the Object_Attributes table, whose image column should not be replicated directly, as discussed previously.

Detailed instructions about how to handle the image data are beyond the scope of this document. Briefly, however, the following steps are necessary to replicate the Object_Attributes table without incurring database bloat due to NULL image field replication.

  1. In the subscriber, the Object_Attributes table should be renamed to Object_Attributes_No_Image, and the img_Val column removed. 

  2. A new table should be created to contain just the key data and the image. Call this table Object_Attributes_Image, and give it only two columns: I_Sequence (int), and img_Val (image). Create an index on the I_Sequence column of this table. 

  3. A view should be created on these two tables. This view should be called "Object_Attributes" and consist of all the columns from Object_Attributes_No_Image joined by I_Sequence to the Object_Attributes_Image table. 

  4. Appropriate indexes should be created on the Object_Attributes view to mimic those present in the standard Object_Attributes table. 

This configuration allows you to use the replica as a read-only database while you are restoring the master database from backups or log files. If no backups are kept, or they are somehow lost or not maintained, restoring write capabilities from this database will be challenging. In addition to addressing the normal issues involved in promoting a subscriber to a publisher, the following steps must be taken:

  1. Alter the table on Object_Attributes_No_Image to add the img_val column (data type Image). 

  2. Update the Object_Attributes_No_Image table to take the data from Object_Attributes_Image as follows: "UPDATE Object_Attributes_No_Image A, Object_Attributes_Image B SET A.img_val = B.img_Val where A.i_Sequence = B.i_Sequence." 

  3. Establish identity properties on the object_lookup (I_DSID), object_attributes (I_Sequence), attributes (I_AID), and classes (I_CLSID) tables. 

  4. Build indexes on Object_Attributes_No_Image to mimic those on the normal Object_Attributes table. 

  5. Delete the Object_Attributes view. 

  6. Rename the Object_Attributes_No_Image table to Object_Attributes. 

  7. Delete the Object_Attributes_Image table. 

Option 2: Replication Without Indexes

By dropping indexes on the subscriber, the replication throughput can be increased somewhat, allowing a higher write, modify, and delete rate than is available in Option 1. Requirements and limitations are effectively the same, except that searches on the subscriber are limited by which indexes remain. For example, if you have dropped the indexes ind_dt_aid (date data index) and ind_int_aid (integer data index), any search on this information will almost certainly fail with a timeout.

If your system is used exclusively for authentication or personalization, the majority of the indexes can be dropped; only the clustered index on the Object_Attributes table is required.

Option 3: Partial Attribute Replication

A second way of improving replication throughput in a specialized system is to replicate only those attributes that are mission critical. In a typical Site Server deployment, the LDAP servers are using the SQL Server computers to fulfill many roles, including personalization, authentication, and targeted mailing. It is primarily the personalization functionality that induces the high transaction rate, causing problems for SQL Server replication.

By choosing to replicate only those attributes needed for authentication and mail resolution, you can reduce the add, modify, and delete rates so they can more readily be handled by the SQL Server replication system. These attributes change less frequently, so the replication engine has an easier time keeping up with the load.

In the event of a failure on the master, only the replicated attributes will be accessible. Attributes you did not replicate will be unrecoverable from the subscriber. Read operations, including authentication, mail routing, and mail name resolution, can be performed from the subscriber while you repair the publisher. Alternatively, the subscriber can be promoted to a publisher, as detailed above.

Other requirements and limitations are effectively the same as in Option 1, except that operations on the subscriber are limited by which attributes were replicated.

Performance Summary

 

Hardware Failure

Software Failure

Data Corruption

Read Operations

Good if pointers are swapped to the replica.

As hardware

Minimal to no protection. Corruption will either be replicated or stop the replication process.

Write Operations

None until replica is converted to master, this conversion is error prone and will take considerable time depending on amount of data in store.

As hardware

Write operations against a corrupted database will be suspect at best.

System Restoration

Complex, time consuming, prohibitive in a very high scale environment.

Fail over to subscriber for read operations while rebooting master.

No guaranteed recovery method.

SQL Server 7.0 Replication

SQL Server 7.0 replication offers significant benefits over SQL Server 6.5 replication. This allows SQL Server 7.0 to play a much more significant role in failsafe operations.

Advantages

  • SQL Server 7.0 resolves the problem with page allocation on image columns for a NULL value. For this reason, all tables may be replicated directly and there is no need for the special handling detailed earlier in SQL Server 6.5 Option 1. The fact that the Object_Attributes table can be replicated as-is makes promotion of a subscriber to publisher status much simpler and faster. 

  • Replication throughput is considerably higher in SQL Server 7.0. Data transfer rates are three to four times greater, allowing SQL Server replication to work even in Site Server personalization environments without special handling as previously detailed. 

Limitations

  • If the SQL Server 7.0 replication process detects corruption in the master database, the replication process will be stopped. 

  • If a user corrupts data in SQL Server 7.0, the corruption will be replicated to subscribers. 

Option 1: Full Replication

 

This is the simplest option. Replicate all tables to provide a warm standby copy of the data. If there is a hardware or software failure on the publisher, promote a subscriber. If there is a hardware or software failure on the subscriber, synchronize a new subscriber. Note that this option offers no protection against data corruption.

Option 2: Leverage Writable Subscribers

SQL Server 7.0 has a replication mode called Merge Replication. This allows writes to a subscriber database to be periodically propagated back to the publisher, where they can in turn be pushed out to the other subscribers. This mode can be leveraged by configuring a one-to-one mapping of LDAP server sets to SQL Server sets.

A failure on a given SQL Server subscriber set only drops the affiliated LDAP server set; this failure can be repaired after taking that LDAP server set out of DNS, or redirecting the LDAP set to a SQL Server replica that is still functional.

If a failure occurs on a publisher, an LDAP server may be redirected to a functioning subscriber while the publisher is repaired or a subscriber is promoted to a publisher. Writes to subscribers are possible in this model, and will be distributed to all subscribers once a publisher is established and re-synchronized.

Like all SQL Server replication-based solutions, this offers little or no protection from user corruption of data.

Performance Summary

 

Hardware Failure

Software Failure

Data Corruption

Read Operations

Good. Read operations only affected on those LDAP servers that leverage the failed SQL server; recovery involves switching to the replica.

As hardware

Minimal to no protection. Corruption will either be replicated or stop the replication process.

Write Operations

Poor until replica is converted to master. This conversion is error prone and will take considerable time depending on amount of data in store.

As hardware

Write operations against a corrupted database will be suspect at best.

System Restoration

Fairly simple promotion of subscriber. Most operations recovered in minutes; failure of publisher implies tens of minutes down time.

Fail over to subscriber for read operations while rebooting master.

No guaranteed recovery method.

Microsoft Windows NT Clustering Services

Microsoft Windows NT Clustering Services (clustering) offers a great degree of protection from system failures, either at the hardware or software level. This protection is automatic and "lights out" in most cases, offering a very good administration/protection tradeoff. It offers no protection from data corruption and still holds a single point of failure in the disk controller. For this reason, hardware solutions for failsafe storage devices are recommended.

Advantages

  • The primary advantage to clustering is its protection against hardware failures and catastrophic software failures, which covers virtually all problems other than data corruption. 

  • Windows NT Clustering Services will automatically detect problems and switch to the "good" computer without administrator intervention, allowing "lights out" operation. This is transparent to the Site Server Directory servers that use the clustered SQL servers, and no operator intervention is required. 

  • As detection is rapid and recovery is automated, it is possible to achieve near-zero downtime in hardware and catastrophic software failure scenarios. 

Limitations

  • Taking full advantage of clustering requires SQL SP 5, which is not currently available. 

  • Because the disk device is shared and the data on the disk is common to each SQL Server computer in the cluster, there is no protection against data corruption. 

Option 1: Clustering with Shared Disk Array

Clustering with a single shared disk array is the standard scenario. This provides protection from all catastrophic hardware and software problems. Through use of RAID 5, all other single points of failure are eliminated. Recovery from other hardware or software failures is automated and imposes no downtime on the system.

 

As previously mentioned, this solution does not offer protection from data corruption.

Performance Summary

 

Hardware Failure

Software Failure

Data Corruption

Read Operations

Excellent.

Excellent. Automatic fail-over.

Minimal to no protection. Read operation may be possible against corrupted database.

Write Operations

As-read operations.

Excellent. Automatic fail-over.

Write operations against a corrupted database will be suspect at best.

System Restoration

Fix hardware.

Reboot failed computer.

No guaranteed recovery method.

Keeping a Warm Backup SQL Server

Using this strategy, a backup computer is created from the primary SQL server by taking a database dump from the source computer, then incrementally testing the source computer for data corruption and applying incremental backups (backing up from transaction logs). If the primary computer fails or becomes corrupted, the backup computer is available as a standby.

A well thought out and executed backup strategy is unquestionably the best single solution for failsafe operations. Maintaining a hot standby server from backups allows the best compromise between minimal downtime, data integrity, and reasonable administrative overhead. It requires constant monitoring and administrator involvement, but most (if not all) of these tasks can be automated through scripting.

Advantages

  • The most important advantage to this solution is that it offers true protection from data corruption. No other solution provides this. 

  • Due to constant maintenance of a validated warm backup computer, responding to a failure in the source SQL server – hardware, software, or corruption – involves merely switching to the warm backup computer. Making a nominal administrative change and restarting the LDAP server activates the new database. 

  • This backup strategy is simple to execute, well documented, and has been the data integrity strategy of choice for a number of years. Many administrators know how to execute it, SQL-intensive data centers already have processes in place, and support personnel know how to implement it. 

  • This solution involves no special hardware, requiring only a standby SQL Server computer and a transaction processing computer to implement. 

  • If a cluster fails, then the site can be recovered by switching to the warm backup. 

Limitations

  • This solution has a low setup overhead, but a high running administrative overhead for validating data, maintaining the warm backup, and so on. Creating scripts to automate the tasks may mitigate this. 

  • If corruption is detected in a transaction cycle and the corruption is irreparable, transactions that occurred over the corruption cycle will be lost. For this reason, transaction updates should be applied as frequently as possible. 

Option 1: Restore on Demand

This solution requires low administrative overhead, and involves making an occasional full backup from the master database and storing a sequence of transaction log deltas. Once each interval, where the length of the interval is determined from the needs of the site, the full backup snapshot is taken again and the intermediate logs deleted.

In the event of corruption, software or hardware failure, deltas are incrementally tested for corruption and applied to the full backup to achieve a "known good" state. Although this approach requires a much lower administrative overhead, it results in a longer interval to system recovery.

Option 2: Restore to Warm Backup Computer

Another option that greatly reduces downtime at the expense of administrative overhead involves maintaining a backup computer in a "warm" state. This implies priming the backup with a full snapshot of the source database, then applying transaction logs as follows:

  1. Use DBCC to test source database for corruption. 

  2. Capture transaction log to backup computer. 

  3. If no corruption is detected, apply transaction log to backup computer. 

  4. If irreparable corruption is detected, discard transaction log and switch servers to use backup computer. Transactions since last good snapshot are lost if not recoverable. 

  5. If the source database is irreparably corrupted, back up the old backup computer (now the source) to the old source computer (now the backup) and restart the cycle in the opposite direction. 

    Note This option provides very low downtime at the expense of constant administrative monitoring as detailed above. It is unquestionably the best solution available for failsafe data center operations. 

Performance Summary

 

Hardware Failure

Software Failure

Data Corruption

Read Operations

Operator intervention required; minimal downtime while SQL server pointer is switched.

As hardware

Read operations will be unpredictable against a corrupt database.

Write Operations

Operator intervention required; minimal downtime while SQL server pointer is switched.

As hardware

Write operations against a corrupted database will be suspect at best.

System Restoration

Switch to warm backup computer.

Switch to warm backup computer.

Only guaranteed recovery method. Switch to warm backup; data involved in the corruption process will be lost.

The ideal approach to achieving failsafe operations in the SQL tier of the Site Server architecture is the combination of clustering technology and maintaining a warm backup.

Clustering was previously detailed under Microsoft Windows NT Clustering Services, Option 1. This will provide protection against virtually all failure scenarios in hardware and software in a failsafe, "lights-out" manner. Data corruption is the only remaining vulnerability.

If the approach detailed in "Backing Up from Transactions" is used, data corruption can be detected and, if the corruption is irreparable, the warm backup can provide an immediately available, valid data store to work from.

This solution demands a high investment in both hardware and administrative overhead, but offers true, failsafe, 24 x 7 operations in the Site Server environment.

Where to Go for More Information

SQL Server Replication

https://msdn.microsoft.com/library/default.asp?url=/nhp/Default.asp?contentid=28000409 

SQL 6.5 Replication

Microsoft SQL Server 6.0 Administrator's Companion, Part 6

SQL 7.0 Replication

Microsoft SQL Server 7.0 Administrator's Companion

https://www.sql-server-books.com/microsoft-sql-server-70-administrators-companion.html 

SQL Backup Strategies

Microsoft SQL Server 6.0 Administrator's Companion, Appendix E

Microsoft TechNet

Call 1-800-344-2121 ext. 3118, 6:30 am to 5:30 pm (PST) or order online at https://Shop.Microsoft.Com .

Summary

This document describes the options available to Site Server administrators who wish to operate their data center in a failsafe manner. This information should be applied in a way that makes the most sense for each individual environment.

For the site with sufficient hardware and administrative resources, combining clustering with the maintenance of a warm backup SQL Server is recommended, to protect against hardware, software, and data failures.

Note that the technical implementation details for the solutions outlined in this paper are only touched upon here. The administrator needing more information is strongly encouraged to refer to the supporting documentation for Microsoft Windows NT Clustering Services and Microsoft SQL Server, as well as documentation from individual hardware vendors.

The Site Serverteam is committed to providing you the best possible server products for the Internet. We welcome your feedback about how we can better meet your needs. You may contact the Microsoft support team by visiting: https://support.microsoft.com/support .