Overcoming Barriers to High Availability

Updated: November 12, 2002

To achieve high availability, you must develop solutions to each barrier listed in Chapter 1. Overcoming these barriers requires following well-established best practices for high availability. To minimize and eliminate downtime, you must develop processes for carrying out operational and emergency tasks and escalating issues within your organization. You must also anticipate and plan for all types of failures, using redundant components to prevent failure and redundant servers to minimize downtime caused by unplanned failures.

To help organizations implement the required processes, Microsoft has developed Microsoft Operations Framework (MOF). MOF is a collection of best practices, principles, and models designed to help businesses achieve reliability, availability, supportability, and manageability for mission-critical applications built using Microsoft products and technologies. MOF divides the IT life cycle into identifiable tasks and functions, such as change management, system administration, security administration, and problem management. The best practices and policies discussed in this chapter are based on the recommendations contained in MOF. For more information, see "Better Manage Your IT Systems with the Microsoft Operations Framework" on the Microsoft Web site at https://www.microsoft.com/technet/itsolutions/cits/mo/mof/default.mspx.

Overcoming Environmental Barriers

You must design the data center to ensure high availability. The term data center refers to medium and large facilities. You can increase the availability of your data center by including the following:

  • Raised floors — Raised floors provide space for the massive amounts of cable required for a data center, simplify the process of adding and moving equipment, and allow cool air to be pushed under raised floors and directed at servers and other heat-sensitive equipment. Without a raised floor, staff members can more easily trip over a cable and cause a server to fail. Also, if it is difficult for your staff to add or move equipment, these tasks can lengthen unavailability when equipment must be added or moved.

  • Fire suppression systems — Good smoke detectors and fire extinguishers are crucial. Use a gas-and-water system to suppress fires. Install smoke detectors and temperature sensors throughout the data center so that you can monitor conditions and control them in zones. Also be sure that your staff can manually start and stop the fire suppression system.

  • Temperature controls — Computer equipment reliability is better in cool conditions. Try to keep the data center at 68 degrees Fahrenheit (20 degrees Celsius). Although desktop computers and individual servers have fans to cool the CPU, fans do not cool the air enough for data-center servers and other heat-sensitive equipment. Do not rely on the centralized air conditioning in the building. Generally it is turned off in the evenings and on weekends. Use a dedicated, redundant cooling system to provide continuous cooling for the data center equipment.

  • Humidity controls — High humidity can cause condensation on equipment. Very low humidity can lead to excess static electricity. Large fluctuations in humidity can cause circuit boards to expand and contract, damaging circuitry. Try to keep the data center between 40 and 45 percent relative humidity.

  • A redundant power system — Prepare for both widespread and local power outages to avoid a lengthy service interruption. When a power outage occurs, a battery backup system can supply enough power for an orderly shutdown. If systems must continue to operate, install redundant backup generators to power critical equipment, including the cooling system.

  • Redundant power supplies — Within the data center, blown circuits or damaged wiring can cause a power outage to a rack of equipment or to individual components. Redundant power supplies to each rack of equipment can prevent a blown circuit from causing downtime. If a main circuit loses power, a redundant power supply automatically switches the rack to a second power supply.

  • Redundant data connections — If voice and high-speed data connections to the data center fail, users cannot access the data-center servers and the data center staff will have difficulty communicating with standby sites. Ensure that voice and high-speed data connections are redundant. If one communications carrier loses service or must take its system down for maintenance, a second carrier lets users continue accessing the information they need and allows staff to communicate with standby sites if needed.

    To prevent damage to multiple lines during construction or maintenance, the redundant lines should enter the data center at different locations. For optimal service and cost-effectiveness, locate the data center facility near an Internet hub.

  • Backup systems — Automated backup devices that mechanically insert and remove tapes are essential for large data centers. Ensure that your backup system can automatically perform all backups required by your disaster recovery plan without intervention by your staff.

  • Off-site storage — If database backups are stored on site, a disaster at the data center might destroy both the production system and all backups of the production system. Storing backups off-site protects against this barrier. Consider storing backups at a secondary site as well as at an off-site location near the primary site.

  • Security precautions — Intruders can access data by physically entering the data center, or they can obtain virtual access through a network connection. To stop intruders from physically entering the data center facility, require staff and visitors to provide credentials upon entry, and log all entries. Security cameras can also help. To stop virtual intruders, enforce security on each server, secure the network using a firewall, and require employees to use strong passwords that change frequently. Ensure that your staff stays current with hacking trends.

  • Space — The data center facility should provide enough room for equipment to be organized, for growth, and for staff. Relocating equipment often causes downtime. In addition, adequate room for employees is important for employee productivity, which can increase availability.

  • Redundant data center facilities — To provide protection against site-level disasters, deploy redundant servers in a secondary data center facility as well as in the primary data center. This arrangement ensures that you don’t lose the secondary server during a catastrophe at the primary site. If all servers must be in the same facility, place them on separate power grids to provide some protection against localized disasters.

Ensuring that the data center itself is properly designed is a key component to increasing the availability of the data center.
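
The temperature and humidity targets listed in this section can be expressed as a simple check on sensor readings. The following Python fragment is a minimal sketch, not a monitoring product. The function name, the reading format, and the 4-degree temperature tolerance are assumptions made for illustration; only the 68 degrees Fahrenheit target and the 40 to 45 percent humidity range come from the text.

```python
TARGET_TEMP_F = 68.0           # target from the text
TEMP_TOLERANCE_F = 4.0         # assumed tolerance, not from the text
HUMIDITY_RANGE = (40.0, 45.0)  # relative humidity range from the text


def check_environment(temp_f, humidity_pct):
    """Return a list of warnings for one sensor reading (hypothetical helper)."""
    warnings = []
    if abs(temp_f - TARGET_TEMP_F) > TEMP_TOLERANCE_F:
        warnings.append(f"temperature {temp_f:.1f} F is outside the target range")
    low, high = HUMIDITY_RANGE
    if humidity_pct < low:
        warnings.append(f"humidity {humidity_pct:.1f}% is low: static electricity risk")
    elif humidity_pct > high:
        warnings.append(f"humidity {humidity_pct:.1f}% is high: condensation risk")
    return warnings
```

A check of this kind could run per zone, in line with the recommendation to monitor and control conditions in zones.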

Overcoming Hardware Barriers

A high-availability data center requires server-class hardware. Invest in high-quality components, redundant components at every point, and hot-swappable components whenever possible. You can overcome hardware barriers by using the following in the data center:

  • Up-to-date hardware — Out-of-date components, firmware, or software drivers can cause software incompatibilities and result in availability problems. Installing up-to-date components with correct firmware and software driver revisions minimizes the risk that this barrier will reduce availability.

  • Certified computers, components, and configurations — Using uncertified computers, components, and configurations for the data center is not supported by Microsoft Product Support Services (PSS). Use only computers, components, and configurations that are listed on the Microsoft Hardware Compatibility List (HCL). To search the HCL for certified computers, components, and configurations, see the "Hardware Compatibility List" on the Microsoft Web site at https://support.microsoft.com/kb/131900.

  • Sufficient capacity — Insufficient storage system, memory, or processor resources can cause the perception of unavailability by causing the data center to respond sluggishly to client requests, resulting in time-out errors. You resolve insufficient memory resources by adding more memory. You resolve insufficient processor resources by adding more or faster processors. You resolve insufficient storage resources by adding more disks, more controllers, and controllers that support multiple channels. Many smaller disks perform better than a few larger disks; additional controllers split the I/O load across the disks; and multiple channels provide multiple I/O paths. To further increase the performance of the storage subsystem, use a storage area network (SAN). A SAN consists of multiple disks connected to one or more servers using Fibre Channel for high-speed connectivity.

  • Redundancy solutions — Use Microsoft SQL Server database, file, and transaction log backups to protect data against hardware failures. Use redundant components and redundant servers to eliminate or limit downtime caused by hardware failures.

Ensuring that only top quality hardware is used will increase the availability of the data center. Because hardware failures do occur, however, you must also employ redundancy solutions to limit or eliminate downtime caused by hardware failures. For more information on redundancy solutions, see Chapter 4, "Preventing Downtime by Using Redundant Components," and Chapter 5, "Minimizing Downtime by Using Redundant Servers."

Overcoming Communication and Connectivity Barriers

The inability of clients to connect to and communicate with the data center causes a perceived lack of availability. To avoid this problem, ensure that your network is designed for high availability. Overcome communication and connectivity barriers by implementing the following:

  • Redundant networks and network cards — Network cards and network paths can fail and prevent users from accessing the data center. Ensuring that the data center is accessible over multiple network paths using multiple network cards increases data center availability.

  • Multiple data center access paths — The failure of a router or switch can prevent users from accessing the data center and can prevent your staff from reaching the Internet when troubleshooting a hardware or software problem. Use multiple switches and routers to ensure that users have a redundant path to the data center and that your staff can access the Internet.

  • Multiple domain controllers — Users must be able to be authenticated by either SQL Server or the Microsoft Windows operating system. To ensure that Windows users can be authenticated by the Windows operating system, use multiple domain controllers in the Windows domain to provide redundancy.

  • Multiple DNS servers — Users must be able to locate the data center to establish a connection to the data center. On a Windows 2000 network, users find resources by using DNS. The failure of a DNS server can prevent users from locating the data center. Ensure that a secondary DNS server exists on the network that can direct users to the data center if the primary DNS server fails.

Redundant network paths and components ensure that network infrastructure problems do not prevent users from communicating with the data center. Multiple domain controllers ensure that Windows users can still establish a trusted connection to the data center if one domain controller fails.

Overcoming Software Barriers

You must ensure that the Windows operating system and the SQL Server service remain available to users. If either becomes unavailable, users cannot access the data center. Software barriers occur due to a variety of causes, including software failures, upgrades and maintenance, database corruption, user or application errors, viruses, and denial of service attacks. Use the following guidance to overcome each of these barriers:

  • Software failures — The failure of the SQL Server service because of application errors, memory leaks, and excessive locking can cause data center unavailability. You must ensure that the applications deployed are well-designed and tested, that custom extended stored procedures do not cause memory leaks, and that the application design does not generate excessive locking.

  • Upgrades and planned maintenance — SQL Server service pack installations and planned maintenance can require that the Windows 2000 operating system or the SQL Server service be restarted. A Windows service pack installation requires restarting the Windows operating system. The installation of a SQL Server service pack requires placing SQL Server into single user mode and then restarting SQL Server. In addition, other types of planned maintenance — such as adding additional memory, adding a hard drive, or upgrading a server application — may require a server to be taken offline. To avoid downtime caused by these software barriers, consider implementing either automatic failover clustering or automatic log shipping. For more information about failover clustering and log shipping, see Chapter 5, "Minimizing Downtime by Using Redundant Servers."

  • Database corruption — Hardware failure can cause database corruption. To limit the downtime caused by database corruption, ensure that you back up all databases regularly and ensure the security of those backups. The only way to recover from database corruption is to recover from backup.

  • User error — Accidental or malicious deletion or alteration of data can jeopardize the availability of the data center. To limit or eliminate downtime caused by user error, ensure that the database is backed up regularly. Depending on the extent of the error, you may be able to recover from it by using an alternative server to recover the data and applying it to the primary server. In a severe case, you may have to recover the primary server from backup. Without adequate backups, you must reconstruct the data, which will severely affect the availability of the data center.

  • Viruses — Viruses can cause a data-center server to cease functioning. To reduce the threats caused by viruses, install the latest security patches and deploy virus protection software on all networked computers. Ensure that the virus protection software monitors all incoming and outgoing files and data for viruses.

  • Denial of service attacks — Malicious users can attack Windows 2000 services and limit the availability of the SQL Server service by sending an overwhelming number of requests for access. To reduce the risk of denial of service attacks, restrict access to legitimate users only. You can use Windows authentication in SQL Server to restrict access to authenticated Windows users only. You can use Internet Information Services (IIS) to authenticate Internet users before they attempt to connect to SQL Server. You can also use IIS to restrict the IP addresses that can access SQL Server through the IIS server.

Eliminating software barriers before they cause a lack of availability is the best way to achieve high availability for a data center. However, software barriers are not always preventable. Application errors, user errors, and hardware subsystem failures can cause database corruption. Because these barriers cannot always be prevented, you must consider mitigation solutions. For more information on these solutions, see Chapter 3, "Recovering a Data Center by Using Database Backups"; Chapter 4, "Preventing Downtime by Using Redundant Components"; and Chapter 5, "Minimizing Downtime by Using Redundant Servers."

Overcoming Service Barriers

Achieving a highly available data center requires services from outside hardware and software vendors. Unresponsive vendors and poorly trained vendor staff can reduce the availability of the data center. Your relationships with external vendors should include the following:

  • Service Level Agreements — Negotiate a Service Level Agreement (SLA) with each of your major vendors to ensure a specific level of availability for the portion of the data center affected by their hardware or software. An SLA guarantees that a system will perform to specifications, support required growth, and be available to a given standard. Be sure that you are working with responsive vendors with well-trained staff members. Be sure that your staff is aware of the terms of each SLA. For example, many hardware vendor SLAs have clauses that require support personnel from the vendor or only specific, certified persons from your staff to open the server casing to replace defective components. Failure to comply can result in a violation of the SLA and potential nullification of any vendor warranties or liabilities. SLAs for data centers usually include (at minimum):

    • Percentage of availability (such as 99.99 percent)

    • Maximum number of concurrent users

    • Number of transactions to be supported per unit of time (such as 5,000 per second)

    • Method of contacting support personnel

    • Number of support calls allowable within a given period

    • Response time expected on support issues

      Note: Microsoft requires that an SLA be part of each Windows 2000 Datacenter Server solution.

  • Microsoft support contract — In addition to an SLA, you should also negotiate a support contract for your Microsoft software.

  • Testing vendor support — Make arrangements to periodically test escalation procedures by conducting support-request drills. Be sure you also test pagers and phone trees to ensure you have the most recent contact information.

If you are unable to solve a system issue quickly on your own, the absence of an SLA or support contract can increase the length of time the data center is unavailable. If you cannot access the vendor's support center or do not have the necessary service account codes, the length of your service outage can increase.
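
When you negotiate the percentage of availability in an SLA, it helps to translate the percentage into the downtime it actually permits. The following Python fragment is a minimal sketch of that arithmetic. The function names are hypothetical, and the calculation assumes a 365-day year with no exclusion for planned maintenance, which a real SLA may define differently.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget implied by an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_minutes * (1 - availability_pct / 100)


def measured_availability_pct(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Availability actually achieved, given total downtime in the period."""
    return 100 * (1 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% availability allows {budget:.1f} minutes of downtime per year")
```

Under these assumptions, 99.99 percent availability leaves a budget of roughly 52.6 minutes of downtime per year, which is useful context when you evaluate the response times promised for support issues.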

Overcoming Process Barriers

Proper processes eliminate unnecessary downtime and ensure the most rapid recovery possible from service outages. Proper processes also ensure respect for your staff and ensure that you have the clout necessary to make the decisions required to protect the data center.

To overcome process barriers, you must incorporate the following:

  • Process management — Develop and follow operational processes for performing routine tasks and emergency processes for responding to each type of disaster. Document the steps to follow for each task and disaster, and keep this document current. This document is called a run book. For information that the run book should contain, see Appendix A, "Contents of a Run Book."

  • Incident management — Diagnosing that a failure has occurred and responding to the failure in a timely and proper manner is essential to achieving high availability. Incident management also encompasses monitoring to anticipate disaster before it occurs and to assist in diagnosing failures when they do occur.

  • Change management — Establish a quality assurance (QA) environment to test all changes before they go into production. Maintaining a duplicate environment enables you to ensure that each alteration you make to the production environment succeeds and does not cause unnecessary downtime.

  • Maintenance management — Perform periodic maintenance on the data center to prevent problems from occurring and to detect problems before they affect the availability of the data center.

  • Configuration management — Develop a standard configuration for all data-center servers, such as standardized drive letters, mapped directories, and share names.

Process Management

Well-developed operational and emergency processes increase the availability of the data center by providing your staff with tested steps for resolving issues as they arise. Develop procedures for routine operational processes. Develop a complete disaster recovery plan consisting of the emergency procedures to be followed for each type of emergency. If your staff knows the precise steps that should be followed to repair a user error or to replace a failed disk, the risk of unnecessary downtime is diminished and necessary downtime is kept to a minimum.

Create a script library containing Transact-SQL scripts that your staff should use for routine operations and for recovery from each type of disaster. If your staff always performs the same procedure the same way, you will avoid problems and increase availability.

Ensure that the run book is easily accessible to all personnel who need it. For example, if all the blinking lights on the data-center servers suddenly go dark, an inexperienced engineer on night watch must be able to locate and use the run book to follow the correct procedures. These procedures include the initial steps that should be followed, emergency phone numbers, and escalation procedures if service cannot be immediately restored.

As maintenance and emergency procedures change, you must update the procedures in the run book. Similarly, you must update contact information in the run book. Log all problems and their resolutions, whether trivial or major, as they occur. The resolution of past problems may help expedite the resolution of a future problem. If you ever have to use your disaster recovery plan, be sure to document the entire process. To determine the right level of detail, imagine how you would explain the sequence of events and your thought process to your peers.

Keep an up-to-date, offsite printed copy or encapsulated copy of your run book. Avoid maintaining a run book that needs a separately installed graphical user interface. Use a versioning tool to track changes to the run book.

Periodically test the disaster recovery plan to ensure that it works and is sufficiently detailed. The consequences of an untested plan can be very costly. Use drills to ensure that each member of your staff knows what they need to do in an emergency. Use a test environment to avoid causing downtime to your production system for these tests. Database backups and log shipping can be used to duplicate the database environment on a test platform. Log all tests of the disaster recovery plan. Note the day, start time, end time, whether or not it was a success, and in the event of a failure, why it happened. Logging this information is crucial for tracking as well as diagnosing.
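
The fields to log for each test of the disaster recovery plan can be captured in a small record type. The following Python fragment is one possible sketch. The class name, field names, and tab-separated output format are assumptions chosen for illustration; the text specifies only which facts to record.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DrillRecord:
    """One test of the disaster recovery plan (hypothetical record layout)."""
    started: datetime
    ended: datetime
    succeeded: bool
    failure_reason: Optional[str] = None

    def __post_init__(self):
        if self.ended < self.started:
            raise ValueError("end time precedes start time")
        if not self.succeeded and not self.failure_reason:
            raise ValueError("a failed drill must record why it failed")

    def to_log_line(self):
        """Render the record as one tab-separated line for a plain-text log."""
        outcome = "SUCCESS" if self.succeeded else f"FAILURE: {self.failure_reason}"
        return "\t".join([
            self.started.strftime("%Y-%m-%d"),  # day
            self.started.strftime("%H:%M"),     # start time
            self.ended.strftime("%H:%M"),       # end time
            outcome,
        ])
```

Writing the log as plain text keeps it readable without any separately installed tool, and the validation in the sketch refuses a failed test that does not say why it failed.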

Incident Management

Incident management consists of determining the cause of a failure, minimizing the time required to respond to a failure, and monitoring to anticipate or diagnose a failure. The availability of the data center is directly affected by how your staff responds to service incidents.

Diagnosing a Failure

After you have restored a server, try to determine the cause of a failure. Although some types of failure are difficult to assess, the following measures can help you in this effort:

  • Document all of the processes that were running on the server when it failed. This will assist you in determining any dependencies that may have contributed to the failure.

  • Back up the database and the transaction log for later analysis. Restoring the database on an alternative server and then restoring the transaction logs may help determine the cause of the failure.

  • Document any error message that appears on the server console.

  • Back up the operating system logs from the failed server, and compare them with historical backups of these logs to isolate anomalies.

  • Back up and analyze the SQL Server and SQL Server Agent error logs to see what was going on at the SQL Server level at the time of the failure. Compare these logs with historical backups of these logs.

  • If C2 auditing is enabled in SQL Server, back up and analyze the C2 log to help determine who or what caused the problem.

Note: The location of all log files should be standardized, well known, and documented in the run book.

Diagnosing the cause of a failure may help you prevent a similar failure in the future and correct the failure more quickly.

Response-Time Practices

To minimize the response time of your staff to high-availability issues, configure each data-center server to send an alert when it detects a condition that might lead to an error. The following practices assist in reducing response time:

  • Send alert messages to a predefined group rather than to an individual. Doing so helps to ensure that alert messages are received. Be sure to specify a fail-safe operator in SQL Server to ensure that someone is notified if the designated individual or group is not on duty when the alert fires.

  • Forward alerts and events for each of your data-center servers to a centralized server so that your staff can view all alerts and events from there.

  • Standardize the physical appearance of all alert messages so that the appropriate response to an alert is easy to determine.

  • Clearly document the meaning, priority, and suggested resolutions (or starting points) for each alert type. Clearly define individuals who must be notified if your staff cannot resolve the issue without downtime.

  • If you are receiving alerts through e-mail, set up server-based rules in Microsoft Outlook to prioritize your alerts. Forward high-priority alerts to your cell phone or pager when you are out of the office.

  • Ensure that your staff knows whom to notify to escalate the alert and whom to contact if the first contact does not respond within a given period.

  • Use a team Web site to display quick status information for each data-center server, alert messages, and broadcasts. Permit each staff member to post alerts, messages about upcoming projects, and system notes.
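
The escalation practice described above, in which the next contact is notified when the previous one does not respond within a given period, can be sketched as follows. This Python fragment is illustrative only: the function name, the contact names, and the wait times are hypothetical, and it models only the decision of whom to notify, not the delivery of the alert.

```python
def contacts_to_notify(escalation_chain, minutes_since_alert):
    """Return everyone who should have been notified by now.

    escalation_chain is an ordered list of (contact, wait_minutes) pairs.
    wait_minutes is how long to wait for that contact to respond before
    escalating to the next one.
    """
    if not escalation_chain:
        raise ValueError("escalation chain must name at least one contact")
    notified = []
    deadline = 0  # minutes after the alert at which the next contact is added
    for contact, wait_minutes in escalation_chain:
        if minutes_since_alert < deadline:
            break
        notified.append(contact)
        deadline += wait_minutes
    return notified


# Hypothetical chain: names and wait times are placeholders.
CHAIN = [("dba-oncall-group", 15), ("senior-dba", 15), ("operations-manager", 30)]
```

With this chain, an unacknowledged alert reaches only the on-call group at first, adds the senior DBA after 15 minutes, and adds the operations manager after 30 minutes. Putting a group first follows the recommendation to send alerts to a predefined group rather than to an individual.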

Monitoring Practices

To manage incidents, monitor the data center to anticipate disaster and to provide assistance in diagnosing problems and failures. Monitor the data center using Windows 2000 Performance Monitor and SQL Server Profiler. Use the following practices to monitor the data center:

  • Establish a performance baseline to determine what is normal for each data-center server. This performance baseline shows how resource usage changes over time and gives you a point of comparison when troubleshooting.

    • Choose a large sample interval (more often than every 3 minutes) for a short duration (15 to 30 minutes) to collect information periodically.

    • Choose a smaller interval (every 15 seconds) to get a snapshot of how the system performs normally at different times of the day.

    • Save these samples in a drive other than the data or log file drives.

    • Synchronize the time on all servers to enable you to match information between servers when a problem occurs.

  • Monitor for specific performance conditions on each data-center server, and use alerts to warn of impending problems before they affect availability. You can also configure an alert to initiate a job to perform additional monitoring when a specified condition is detected.

  • Perform security audits for sensitive data access and administrative changes, and review the audit logs regularly.

  • Monitor login failures to detect attacks on SQL Server. Monitor only what will be meaningful in the long-term; eliminate monitoring that does not return value proportionate to its cost.
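
One way to use a performance baseline is to flag counter values that fall far outside what the baseline shows as normal. The following Python fragment is a minimal sketch of that idea. The function names and the threshold of three standard deviations are assumptions; the text does not prescribe a particular statistical test, and the samples would come from whatever counters you collect.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize normal behavior for one counter from historical samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to build a baseline")
    return {"mean": mean(samples), "stdev": stdev(samples)}


def is_anomalous(value, baseline, max_deviations=3.0):
    """True if value is more than max_deviations standard deviations from the mean."""
    spread = baseline["stdev"]
    if spread == 0:
        # A perfectly flat baseline: any different value is a departure from normal.
        return value != baseline["mean"]
    return abs(value - baseline["mean"]) / spread > max_deviations
```

For example, a baseline built from processor utilization samples of 40, 42, 38, 41, and 39 percent would treat a reading of 41 percent as normal and a reading of 90 percent as anomalous. Because normal load varies by time of day, a separate baseline per time window may be more useful than a single one.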

Change Management

Unmanaged change to the data center is a major cause of data center failures. You must manage the following types of changes:

  • Planned changes — Planned changes, such as hardware upgrades and changes in software versions, can be managed using a QA environment to test all changes before they are deployed to the production environment.

  • Emergency changes — Emergency changes, which are changes necessary to avoid or eliminate downtime, cannot be thoroughly tested. These changes should be fully documented and then duplicated, where appropriate, to the QA environment to ensure that the two environments remain synchronized.

Planned Changes

To manage planned changes, establish a QA environment that is an exact copy of the production environment. For example, if your production environment is using automatic failover clustering, the QA environment must use automatic failover clustering. Use database backups or automatic log shipping to duplicate the production databases to the QA environment.

Use the QA environment to test all changes before they go into production. Create test scripts to adequately represent your production load during normal use and to emulate peak usage time. Use the following guidelines when deploying planned changes:

  • Check all code, including schemas, scripts, and stored procedures, into a source control product such as Microsoft Visual SourceSafe version control software. Doing so enables you to clearly identify each version of a script.

  • Create a detailed plan for deploying completed code to the production environment, including time estimates and prerequisites such as service pack levels. Be sure that your change process is modular. You should be able to stop and restart (or roll back) the change at any point. Rely on scripts for making changes, and create a rollback script for every change.

  • Schedule the change, and staff accordingly. Before you implement any change in the production environment, notify all affected parties.

  • Back up the entire environment before implementing a change to ensure a rollback will be smooth if the change must be backed out.

  • List every change and all associated information. Save e-mail related to the project, such as any mail showing discarded design ideas that you can use if the current design does not work as planned.

  • Allow changes to be made only under the auspices of your staff.
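
The guideline to keep the change process modular, with a rollback script for every change, can be sketched as a small driver that pairs each step with its rollback. This Python fragment is illustrative only: the function name and the step structure are hypothetical, and in practice each step would run a script rather than a Python callable. It assumes that a step that fails leaves nothing behind that needs to be undone.

```python
def apply_change(steps):
    """Apply steps in order; on failure, roll back completed steps in reverse.

    steps is a list of (name, apply_fn, rollback_fn) tuples. Returns a list of
    log entries describing what happened, suitable for the change record.
    """
    log = []
    completed = []
    for name, apply_fn, rollback_fn in steps:
        try:
            apply_fn()
        except Exception as exc:
            log.append(f"FAILED {name}: {exc}")
            # Undo the steps that succeeded, most recent first.
            for done_name, done_rollback in reversed(completed):
                done_rollback()
                log.append(f"ROLLED BACK {done_name}")
            return log
        completed.append((name, rollback_fn))
        log.append(f"APPLIED {name}")
    return log
```

The returned log doubles as the list of every change and its outcome. Rehearsing both the apply path and the rollback path in the QA environment shows whether the rollback scripts actually work before they are needed in production.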

Emergency Changes

When an issue is too urgent to allow rigorous testing, you must forgo complete testing in the QA environment before implementing the change in the production environment. If time permits, perform at least limited testing in the QA environment. Never introduce a highly risky change into an unstable environment without testing carefully. Make only one change at a time, and record your observations before and after the change. Update the QA environment to ensure it remains in sync with the production environment.

To avoid the worst-case scenario where change becomes uncontrolled and changes are not tracked, follow these guidelines:

  • Even in a crisis, allow time for the most recent fix to take effect before trying an additional fix.

  • Allow only highly experienced senior staff to be involved in emergency changes.

  • In a crisis, require junior staff members to notify a senior staff member that they need assistance.

Although not all changes can be controlled, they can be managed to eliminate the risk of unnecessary downtime.

Maintenance Management

Every active database needs occasional maintenance to detect and prevent problems before they occur. You can run DBCC (CheckDB, CheckAlloc, CheckFilegroup, and CheckTable) and DBCC IndexDefrag online without blocking updates. The impact is minimal with the recommended settings. Tasks such as DBCC CheckDB with the physical_only option or DBCC IndexDefrag have little effect on performance.

However, other tasks can affect availability. You must evaluate each task to determine its effect on availability. For example, dropping and recreating an index is faster than running DBCC IndexDefrag to perform the same task. If the index is a vital index, however, dropping and recreating it degrades performance until the index is rebuilt. This may be perceived as a lack of availability on a busy system. In the case of a vital index, DBCC IndexDefrag is the better choice for availability.

Perform the tasks that have the highest impact on availability during off-peak hours or on a copy of the production database maintained on a secondary server. Performing maintenance tasks on a secondary server enables you to detect problems without affecting the availability of the production server.
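
The placement decision described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the off-peak window, the impact ratings, and the usable_on_copy flag are hypothetical values that you would replace with the results of your own evaluation.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class MaintenanceTask:
    name: str
    high_impact: bool      # outcome of your own evaluation of the task
    usable_on_copy: bool   # True if running it on a secondary copy is still useful


# Hypothetical off-peak window (01:00 to 05:00 local time).
OFF_PEAK_START = time(1, 0)
OFF_PEAK_END = time(5, 0)


def is_off_peak(now: time) -> bool:
    """Return True if the given time falls inside the off-peak window."""
    return OFF_PEAK_START <= now < OFF_PEAK_END


def placement(task: MaintenanceTask, now: time, secondary_available: bool) -> str:
    """Decide where, and whether, to run a maintenance task right now."""
    if not task.high_impact:
        return "run on production now"
    if task.usable_on_copy and secondary_available:
        return "run on secondary copy now"
    if is_off_peak(now):
        return "run on production now"
    return "defer to off-peak window"
```

For example, a task rated like DBCC IndexDefrag (low impact) runs immediately, whereas dropping and re-creating a vital index during business hours is deferred, because running it on a copy would not rebuild the production index.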

In addition to database maintenance tasks, perform regular health checkups on the data-center environment. Periodically check hard disks to determine whether they need to be defragmented.

Configuration Management

Develop a standard configuration for all servers. Set guidelines specifying which drive letters to use for particular components or applications, such as C for the operating system, S for SQL Server binaries, L for transaction logs, and D for data. Standardize how services, applications, and users access files and directories over your network, whether through mapped drives or shares. Never allow users to access a drive used for SQL Server data and log files, because user access can reduce the drive's availability by generating competing demands for disk access.
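
A drive-letter standard is easy to verify automatically. The following Python sketch checks a server's file locations against the example layout above (C, S, L, and D); the component names and the sample paths are hypothetical, and how you collect the paths from each server is left open.

```python
from pathlib import PureWindowsPath
from typing import Dict, List

# Example standard from the text: component -> required drive letter.
STANDARD_DRIVES = {
    "operating_system": "C:",
    "sql_binaries": "S:",
    "transaction_logs": "L:",
    "data": "D:",
}


def find_layout_violations(server_paths: Dict[str, str]) -> List[str]:
    """Return one message per component stored on a nonstandard drive."""
    violations = []
    for component, path in server_paths.items():
        expected = STANDARD_DRIVES.get(component)
        # PureWindowsPath parses Windows paths on any operating system.
        actual = PureWindowsPath(path).drive.upper()
        if expected is not None and actual != expected:
            violations.append(f"{component}: expected {expected}, found {actual}")
    return violations
```

Running a check like this against every server, for example as part of a regular health checkup, can reveal drift from the standard configuration before it complicates a restore or failover.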

Services

Use a well-documented domain user account, ideally the same one, for the SQL Server and SQL Server Agent services on all servers. Never run these services under a person's own user account. Create a dummy account, name it clearly, and note in its profile not to delete or tamper with it.

Ensure that the password for this dummy account does not expire. Changing this password can require considerable planning if you employ linked servers, remote procedure calls (RPC), replication, log shipping, or clustering. If you must change the password, use SQL Server Enterprise Manager.

If SQL Server Agent is not running, the data-center server will be unable to issue alerts. To overcome this barrier, configure SQL Server Agent service to restart automatically and use a centralized monitoring system, such as Microsoft Operations Manager 2000 (MOM), that checks the services on each data-center server periodically. For more information, see "Get in Control of IT Operations — Microsoft Operations Management" on the Microsoft Web site at https://www.microsoft.com/mom/.

Security

Secure each data-center server to ensure that only the people who need access to the servers can access them. Ensure that all administrator accounts use strong passwords. Allow access to critical servers only from trusted Internet Protocol (IP) addresses.

Use the same security model on all servers. For enhanced security, use Windows Authentication mode, rather than Mixed mode. Mixed mode permits users to connect using SQL Server login accounts, which are inherently less secure. Each staff member should log in to SQL Server using their Windows user account rather than the sa account. This enables auditing of all activity on each data-center server. Never use the sa account in code; you should be able to change the sa account password frequently, without notifying anyone or revising any code.

Overcoming Application Design Barriers

High-availability solutions cannot compensate for bad application design. Poorly written application queries can cause blocking and excessive locking, which reduces the perceived availability of the data-center server. At worst, such queries can cause SQL Server to fail. While these application design issues can affect perceived and actual availability, they are beyond the scope of this document. There are a number of specific application design issues, however, that you must overcome to deploy the redundant server solutions discussed in Chapter 5, "Minimizing Downtime by Using Redundant Servers." These application design issues include the following:

  • Do not hard-code server names, instance names, and IP addresses into the application. Creating n-tier applications that connect to the data center using COM+ objects enables you to create a more flexible disaster recovery plan.

  • Use the same collation across all SQL Server instances and databases in the data center. Using a common collation enables you to easily restore a database to any server in the data center. If different servers and databases use different collations, you complicate your failover options because restoring a database with one collation to a server with a different collation requires precise coding. Your disaster recovery plan must take this into account.

  • If you are using SQL Server logins, you must give application users unique names across all SQL Server instances in the data center to prevent user name conflicts when a failover occurs. Conflicts can occur if two applications share the same user name with different rights and responsibilities. These conflicts occur only when SQL Server logins are used. The uniqueness of Windows logins is enforced by the Active Directory service.

  • Do not code an application for a specific service pack level, although you might want to code for a minimum service pack level. If your disaster recovery plan requires one server to host more than one application, and if the application is not compatible with a certain service pack, the application will not be highly available.
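
The first guideline, avoiding hard-coded server names, can be met in several ways. The following Python sketch does not show the COM+ approach mentioned above; it shows a simpler alternative in which the application looks up its server by a logical name in external configuration. The JSON layout, the logical name "orders", and the server and instance names are all hypothetical.

```python
import json

# Hypothetical configuration content. In practice this would be stored
# outside the application so that it can change without a rebuild.
EXAMPLE_CONFIG = """
{
  "databases": {
    "orders": {"server": "SQLPROD01", "instance": "ORDERS"}
  }
}
"""


def resolve_data_source(config_text: str, logical_name: str) -> str:
    """Return 'server' or 'server\\instance' for a logical database name."""
    entry = json.loads(config_text)["databases"][logical_name]
    server = entry["server"]
    instance = entry.get("instance")  # omitted for a default instance
    return f"{server}\\{instance}" if instance else server
```

With this arrangement, redirecting the application to a standby server after a failover means editing one configuration entry rather than changing and redeploying code.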

You should ensure that software developers incorporate these best practices in all future development projects.
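
The collation, login, and service pack guidelines also lend themselves to an automated audit before a failover is ever needed. The following Python sketch assumes that you have already collected an inventory of each instance by other means; the InstanceInfo structure, the sample rights, and the function name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set


@dataclass
class InstanceInfo:
    collation: str
    service_pack: int
    # SQL Server login name -> rights granted to that login on this instance
    sql_logins: Dict[str, FrozenSet[str]] = field(default_factory=dict)


def audit_failover_readiness(
    instances: Dict[str, InstanceInfo], minimum_service_pack: int
) -> List[str]:
    """Report conditions that would complicate restoring a database elsewhere."""
    findings = []

    # 1. All instances should share one collation.
    collations = sorted({info.collation for info in instances.values()})
    if len(collations) > 1:
        findings.append("Mixed collations: " + ", ".join(collations))

    # 2. Every instance should meet the minimum service pack level.
    for name, info in sorted(instances.items()):
        if info.service_pack < minimum_service_pack:
            findings.append(
                f"{name}: service pack {info.service_pack} "
                f"is below minimum {minimum_service_pack}"
            )

    # 3. A SQL Server login name should not carry different rights on
    #    different instances.
    rights_by_login: Dict[str, Set[FrozenSet[str]]] = {}
    for info in instances.values():
        for login, rights in info.sql_logins.items():
            rights_by_login.setdefault(login, set()).add(rights)
    for login in sorted(rights_by_login):
        if len(rights_by_login[login]) > 1:
            findings.append(f"Login '{login}' has different rights on different instances")

    return findings
```

An empty result means that none of the three conditions was found in the inventory; it does not prove that a failover will succeed.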

Overcoming Staffing Barriers

Entrust only qualified and trained employees with operating the data center and carrying out the disaster recovery plan during a data center emergency. Unqualified or untrained employees can magnify a disaster and hamper a recovery.

You can facilitate shared ownership and collective responsibility for the data center by rotating team roles, using spare time effectively, presenting new information during meetings and training sessions, and promoting effective communication within the team.

The best practices for overcoming staffing barriers to high availability include the following:

  • Assigning roles — To minimize confusion, assign specific tasks and roles to individuals, and be sure each staff member knows what is expected. Assign a primary and at least one secondary staff member to each server.

  • Rotating roles — Rotating staff members among roles allows each staff member to learn new responsibilities and skills. You should rotate staff among projects, among classes of servers, and between production and QA environments. As each staff member becomes familiar with other systems, the number of issues that he or she can handle increases. As a result, team members do not need to call each other as frequently for answers, which can save valuable minutes when a server is unavailable.

  • Educating the staff — Use spare time to educate team members about unfamiliar projects or systems, or about any new techniques. Provide books, journals, Web sites, and newsgroups for continual learning. Hold short weekly lunchtime presentations on high-availability issues. Invest in professional training and internal cross-training for your staff.

  • Increasing communication skills — Encourage the development of both spoken and written communication skills. This might involve formal training or courses in presentation skills, technical writing, and other communication skills. By establishing a high level of communication among staff members, you improve the staff's response time. Staff members must be immediately reachable by cell phone or by e-mail during working hours, and sometimes during nonworking hours.

If your staff is well integrated, highly trained, able to perform many roles, and skilled at communication, the availability of the data center increases.