Implementing Systems for Reliability and Availability


Microsoft Consulting Services

Manufacturing and Engineering Practice

On This Page

Chapter 1 – Introduction
Chapter 2 – Best Practices
Chapter 3 – Customer Case Studies
Chapter 4 – Planning: Hardware Strategies
Chapter 5 – Planning: Software Strategies
Chapter 6 – Planning: Pitfalls To Avoid
Chapter 7 – Monitoring and Analysis
Chapter 8 – Help Desk
Chapter 9 – Recovery
Chapter 10 – Root Cause Analysis
Appendix A – Tools
Appendix B – Implementing Hardware Standardization
Appendix C – Implementing Software Standardization
Appendix D – Performance Analysis Checklist
Appendix E – Help Desk Escalation Procedure
Appendix F – Problem Resolution Checklist
Appendix G – References and Relevant Web Sites

Chapter 1 – Introduction

Executive Summary

This document discusses tools, technologies, and operational practices you can use to improve the reliability and availability of computers, networks, and application systems. This is a prescriptive guide that describes technical and operational practices found to be effective in the deployment of the Microsoft® Windows NT® operating system. Using this guide, organizations can develop a set of guidelines to help ensure planning and deployment decisions that result in high availability systems.

Audience

This guide is targeted at individuals and organizations responsible for planning, deploying, and maintaining Windows NT-based systems that must be highly available. The case studies and recommendations have been written from the vantage point of customers with large numbers of servers and clients. The information, however, should be valuable to all customers—regardless of the number of clients or servers deployed.

Technical Level

This guide is intended for individuals familiar with infrastructure components and operations practices. Infrastructure components in this context include networks, routers, name resolution services, power supplies, and other resources and systems needed to keep servers functioning. Familiarity with Windows NT is helpful, but not required.

Product Focus

The primary focus of this document is Windows NT Server 4.0, Service Pack 4 or greater. The document addresses some Microsoft BackOffice® family components and non-Microsoft server applications. The intention is to provide methods to improve the reliability and availability of any application running on Windows NT Server. In cases where the Microsoft Windows® 2000 operating system provides a substantially different or better means of addressing a particular concern, the document will point out the difference.

Why Microsoft Prepared this Document

Two trends in computing make operational procedures significant factors for achieving system availability:

  • Increasing complexity of systems and application software

  • Increasing functionality and reliability of commodity hardware

These trends are illustrated in a study1 of system failures from 1985 to 1993. During this period, computing shifted from the predominant model of stand-alone servers serving dumb terminals to a model of distributed client and server computing. Client and server computing is a more powerful and more complex model. While the client and server model became predominant, hardware components became increasingly reliable. The study concluded that the proportion of computer, network, and application system outages caused by operational management issues rose from 15 percent to 50 percent of all system crashes during this period.

Internal Microsoft studies of Windows NT reliability show similar results. These studies suggest that well-planned operational procedures are a key to creating reliable and available computers, networks, and applications using commodity (as opposed to stand-alone mainframe) hardware and the powerful client-server model. Well-planned, well-documented, and disciplined operational procedures decrease system failures.

That is why Microsoft prepared this document. The Microsoft goal was to interview customers who use non-mainframe, commodity technology in business-critical systems, and report on the practices they use to improve reliability and availability. The customers come from different industry segments that require high availability and cover a wide range of systems and applications. You can read the case studies of nine anonymous customers in Chapter 3 – Customer Case Studies.

Key Terminology

It is crucial to understand some key terms before examining the customer case studies and the chapters about methods of improving reliability and availability. This section defines the terms failure, reliability, and availability, and the categories of operational procedures covered in this document.

Failure

For the purposes of this discussion, failure is defined as a departure from expected behavior on an individual computer system or a network system of associated computers and applications.

Failures can include behavior that simply moves outside of defined performance parameters. If the specified behavior of the system includes time constraints, such as a requirement to complete processing within a specified amount of time, performance degradation beyond the specified threshold is considered a failure. For example, a system that must process a transaction within two seconds may be in a failed state if transaction processing degrades beyond this two-second window.

Software, hardware, operator, and procedural errors, along with environmental factors, can each cause a system failure. A recent survey found that while hardware component failure accounts for up to 30% of all system outages, operating system and application failures account for almost 35% of all unplanned downtime. Typical hardware components that may fail include computer cooling fans, disk drives, or power supplies.

Failure of a single component or single computer can directly influence reliability of the overall system.

Reliability

Reliability is a measure of the time between failures occurring in a system. Hardware and software components have different failure characteristics. Although formulas exist to predict hardware reliability based on historical data, it is difficult to find formulas for predicting software reliability.

Hardware components usually have what is known as an exponential failure distribution. Under normal circumstances and after an initial phase, the longer a hardware component operates, the more likely it is to fail. Therefore, if the mean time to fail (MTTF) for a device is known, it is possible to predict when the hardware component will fail.

Historical data about mechanical and electrical components fits the so-called bathtub curve illustrated in the following figure.


Figure 1: The Bathtub Curve 2

In this model, three distinguishable phases exist in a component's life cycle. The phases are:

  • Burn-in

  • Normal operation

  • Failure mode

Each phase is characterized by some signature or behavior. Failure rates are characteristically high during burn-in, but drop off rapidly as devices enter normal operation. Devices seldom fail during the normal period. As the devices age, however, the failure rates rise dramatically and predictably.

Information about the MTTF of hardware components can be used to track the actual failure rates of specific devices, and to replace them before they enter the expected failure mode. This strategy is often used when the cost of a failed component can be catastrophic.
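
To make the strategy concrete, the following Python sketch flags devices for proactive replacement once they have consumed most of their rated MTTF. The inventory, device names, and the 80 percent threshold are illustrative assumptions, not values from the study.

```python
from datetime import date

# Hypothetical inventory: device name, install date, and vendor-quoted MTTF in days.
inventory = [
    ("disk-array-01", date(1996, 3, 1), 5 * 365),
    ("power-supply-07", date(1998, 1, 15), 3 * 365),
    ("cooling-fan-12", date(1995, 6, 30), 4 * 365),
]

# Replace a device once it has consumed this fraction of its rated MTTF,
# before it is expected to enter the failure-mode phase of the bathtub curve.
REPLACEMENT_THRESHOLD = 0.80

def devices_due_for_replacement(devices, today):
    due = []
    for name, installed, mttf_days in devices:
        age_days = (today - installed).days
        if age_days >= REPLACEMENT_THRESHOLD * mttf_days:
            due.append((name, age_days, mttf_days))
    return due

for name, age, mttf in devices_due_for_replacement(inventory, today=date(1999, 1, 1)):
    print(f"{name}: {age} days in service (rated MTTF {mttf} days); schedule replacement")
```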

MTTF values for a sample of commodity hardware components are listed in the following table.

Table 1 Typical component MTTF 3

Component                  MTTF
Connectors and cables      1,000 years
Logic boards               3 – 20 years
Disks                      1 – 50 years
LAN                        3 weeks
Power (North America)      5.2 months

It is important to note that the reliability of different hardware components of the same type can vary considerably based upon a variety of influences. For example, areas with frequent summer electrical storms or other extreme environmental conditions experience different failure rates. These conditions stress components beyond normal design and operating environment specifications.

Statistics such as those shown in Table 1. Typical component MTTF are not readily available for software defects. Data-dependent errors triggered by data entry or data communication would have to be predictable in order to develop formulas that accurately predict the failure of an application.

Availability

Availability is a measure of the amount of time a system or component performs its specified function. Availability is related to, but different from, reliability. Reliability measures how frequently the system fails; availability measures the percentage of time the system is in its operational state.

To calculate availability, you need to know both the mean time to failure (MTTF) and the mean time to recovery (MTTR). The MTTR is a measure of how long, on average, it takes to restore the system to its operational state after a failure. If you know both the MTTF and the MTTR, you can calculate availability using the following formula:

Availability = MTTF / (MTTF + MTTR)

For example, if your data center takes an average of six months to fail (MTTF = six months) and it takes twenty minutes, on average, to return the data center to its operational state (MTTR = twenty minutes), then your data center availability is:

Availability = 6 months / (6 months + 20 minutes) = 99.992%.
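
The same arithmetic can be expressed as a short Python sketch. The helper function and the approximate 30-day month are illustrative choices, not part of the original text.

```python
# Availability = MTTF / (MTTF + MTTR); both values must use the same time unit.
def availability(mttf_minutes, mttr_minutes):
    return mttf_minutes / (mttf_minutes + mttr_minutes)

# The data center example above: MTTF of six months, MTTR of twenty minutes.
six_months_in_minutes = 6 * 30 * 24 * 60   # using an approximate 30-day month
print(f"{availability(six_months_in_minutes, 20):.3%}")   # prints 99.992%
```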

Notice the two ways to improve the availability of your system—increase your MTTF or reduce your MTTR. Well-run data centers typically attempt to do both. Carefully chosen hardware and software can increase your system's MTTF, whereas keeping a stock of spare parts on hand, having hot standby servers, and using clustering technology all reduce your system's MTTR.

People frequently classify system availability by measuring the percent of time in which the system is available. The following table shows these common classes and the associated availability percentages and related annual downtime.

Table 2 Availability Classes

Availability class   Availability measurement   Annual downtime
Two nines            99%                        3.7 days
Three nines          99.9%                      8.8 hours
Four nines           99.99%                     53 minutes
Five nines           99.999%                    5.3 minutes
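
The downtime figures in Table 2 follow directly from the availability percentages. The following Python sketch reproduces them; the loop over "nines" and the output format are illustrative choices.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

# Annual downtime implied by each availability class in Table 2.
for nines in (2, 3, 4, 5):
    fraction_available = 1 - 10 ** -nines        # e.g., three nines -> 0.999
    downtime_minutes = (1 - fraction_available) * MINUTES_PER_YEAR
    print(f"{nines} nines ({fraction_available * 100:g}%): "
          f"{downtime_minutes:.1f} minutes of downtime per year")
```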

Categories of Operational Procedures

Operational procedures used at different customer sites can vary considerably since each customer has a unique environment and a unique set of reliability and availability requirements. However, because each customer attempts to tackle roughly the same set of problems, this document uses the following problem-oriented classification to categorize operational procedures.

Planning and Design

Given a specific environment (for example, geographically centralized versus distributed) and specific reliability and availability requirements, how do you design a data center that most efficiently meets the requirements? Planning and design decisions that have an immense impact on all other procedures fall into this category.

Operations

This class of procedure addresses the problem of keeping the data center running under day-to-day conditions. Backups and preventative maintenance are examples of procedures that fall into this category.

Monitoring and Analysis

Customers typically want to spot potential problems before they escalate and become failures. Procedures that monitor the health of servers and analyze long-term trends fall into this category.

Help Desk

Even in the most well designed data center or network, problems arise with both server and client computers. Procedures for diagnosing and resolving these problems fall into the Help Desk category.

Recovery

If a server fails, the operations personnel will need to return it to an operational state. The procedures for recovering a failed server fall into this category.

Root Cause Analysis

Frequently, customers want to ensure they are fixing problems rather than just treating the symptoms of problems. Customer procedures for determining the root cause of failures fall into this category.

Chapter 2 – Best Practices

This chapter summarizes best practices described during interviews with data center and network customers. Each customer was selected because their business operations have demanding requirements for high availability and reliability.

Who Microsoft Interviewed

Microsoft interviewed nine customers who operate highly available and reliable data center, network, and desktop computer systems. Microsoft tried to determine, in as much detail as possible, the operational practices these customers use to improve the availability of their systems. Because many basic best practices do not depend upon a specific operating system—for example, you can use similar backup procedures for both a Unix-based server and a Windows NT-based server—Microsoft did not exclusively confine this study to environments that used only Windows NT-based systems. However, this study did focus on practices that can be used with computers running Windows NT.

Universal Best Practices

At a high level, the customers had many traits in common. The study compares these common traits across the following categories of operational procedures. (See Categories of Operational Procedures for descriptions of these categories.)

Although exact operational procedures vary greatly depending upon business needs, these case studies demonstrate that customers know the importance of having well-defined procedures. Ad hoc methods inevitably produce erratic results—and the consequences of erratic results are intolerable when the maximum allowable downtime per year is measured in minutes.

Planning and Design

All customers plan their operational procedures using clearly defined availability goals based upon the relative importance of the business functions that they support. For example, the securities trading and investment firm planned its operational procedures around ensuring the system was available for the final 15 minutes of the trading day. Customer A implements different availability strategies for its process control computers than it does for e-mail servers.

One best practice in the planning category implemented by four of nine customers was use of redundant hardware and software. Careful use of redundancy allows a data center or network system to tolerate failures of individual components and computers. Building redundancy into both hardware and software architecture is one of the most important planning decisions customers can make.

Customer strategies that provide reliable systems include use of redundant hardware and software components, such as redundant arrays of independent disks (RAID), redundant power supplies, and redundant network interfaces. This allows individual component failures to occur without affecting the overall reliability of a system. The same principle that makes a storage subsystem more reliable than individual disks can make a data center more reliable than its individual computers.

Monitoring and Analysis

All customers use some form of automated procedure to monitor the availability of their servers. Automated monitoring is a key best practice that enables identification of failure conditions and potential problems. Monitoring can also help reduce the time needed to recover from a failure.

One reason monitoring is so important is the simple fact that operators need to know a failure has occurred before they can initiate procedures to restore service.

Additionally, some customers monitor performance characteristics of their servers to spot usage trends. This is a proactive best practice that allows customers to identify the conditions that contribute to system failure and take action to prevent those conditions from occurring. Monitoring strategies implemented by each customer in this study varied depending on the type of service the customer provides.

Help Desk

In the event of a failure, most customers have a clearly defined process to obtain help and escalate failures to increasingly skilled groups of people.

Recovery

In general, each customer uses procedures to prevent failures leading to loss of data. Eight of the nine customers described procedures to quickly and correctly recover from failures. Exact recovery procedures are unique to the type of service each customer system performs; however, common recovery strategies include keeping a stock of spare hardware parts and multiple copies of critical data in read-only formats.

Root Cause Analysis

Most customers agreed that determining the root cause of failures instead of treating symptoms is important. However, only two customers described procedures for diagnosing failures.

Implementing Best Practices

This document examines methods for implementing system components and operational procedures in the following chapters:

  • Chapter 4 – Planning: Hardware Strategies

  • Chapter 5 – Planning: Software Strategies

  • Chapter 6 – Planning: Pitfalls To Avoid

  • Chapter 7 – Monitoring and Analysis

  • Chapter 8 – Help Desk

  • Chapter 9 – Recovery

  • Chapter 10 – Root Cause Analysis

You can go directly to a chapter by clicking on the chapter title in the list above. Each chapter focuses on a specific category of operational procedure and will help you understand and implement best practices to improve the availability of systems in your unique environment and organization.

Outstanding Examples of Best Practices

Detailed customer case studies are presented in Chapter 3 – Customer Case Studies. You can read each individual case study in that chapter or quickly scan examples of best practices by clicking on any reference in Table 3. Customer Best Practices.

Table 3 Customer Best Practices

Planning and Design

  • Define availability targets for system: Continuous Process Manufacturing (Customer A); Customer H – Planning and Design

  • Stage all system changes prior to releasing to production: Customer F – Planning and Design

  • Use service level agreements: Lotus Notes Hosting (Customer H); Corporate Information Technology Group (Customer I)

  • Standardize on both hardware and software: Customer B – Planning and Design; Customer I – Planning and Design

  • Use redundant hardware where possible: Customer B – Planning and Design; Customer D – Planning and Design; Customer I – Planning and Design; Customer G – Planning and Design

Operations

  • Automate common tasks: Customer A – Operations; Customer C – Operations

  • Restrict physical access: Customer H – Operations

  • Have a results reporting mechanism: Customer C – Operations; Customer D – Operations; Customer I – Operations

Monitoring and Analysis

  • Use a monitoring tool, either off-the-shelf or custom-designed: Customer B – Monitoring and Analysis; Customer A – Monitoring and Analysis; Customer H – Monitoring and Analysis

  • Monitor at the application level: Customer G – Monitoring and Analysis; Customer D – Monitoring and Analysis; Customer B – Monitoring and Analysis

  • Analyze monitoring data for trends that may signal problems: Customer D – Monitoring and Analysis; Customer C – Monitoring and Analysis; Customer B – Monitoring and Analysis

  • Use monitoring data for the help desk: Customer H – Monitoring and Analysis; Customer I – Monitoring and Analysis

  • Keep monitoring lightweight so it does not interfere with your applications: Customer D – Monitoring and Analysis

  • Choose the size of the monitoring window carefully: Customer H – Monitoring and Analysis

Help Desk

  • Create a well-defined problem escalation procedure: Customer C – Help Desk; Customer I – Help Desk; Customer A – Help Desk Operations

  • Provide good tools, training, and documentation for the first-level help desk staff: Customer H – Help Desk; Customer I – Help Desk

  • Ensure that second-level help desk staff can remotely diagnose problems or can quickly get to a problem server: Customer I – Help Desk; Customer A – Help Desk Operations

  • Staff the third-level help desk with system architects arranged in well-defined crisis management teams: Customer A – Help Desk Operations; Customer I – Help Desk; Customer C – Help Desk

Recovery

  • Store spare parts on-site: Customer I – Recovery; Customer G – Recovery; Customer B – Recovery

  • Configure active standby systems: Customer B – Recovery; Customer D – Recovery; Customer G – Recovery

  • Use a "fast load" or "cloning" process to quickly install, configure, and start a new computer: Customer B – Recovery; Customer I – Recovery

  • Establish data recovery processes before a failure occurs: Customer C – Operations and Recovery; Customer G – Recovery; Customer B – Recovery

  • Provide hard-copy recovery manuals: Customer C – Planning and Design

  • Establish catastrophic recovery procedures: Customer C – Recovery; Customer D – Recovery

  • Practice recovery procedures: See Practice in Chapter 4 – Planning: Hardware Strategies

Root Cause Analysis

  • Although restoring system availability is the first priority when dealing with a problem, the next priority must be review and evaluation to determine the cause of the failure (frequently referred to as a post-mortem): Customer I – Planning and Design and Root Cause Analysis; Customer G – Root Cause Analysis

  • Use root cause data to iteratively improve processes: Customer B – Planning and Design; Customer I – Planning and Design

Chapter 3 – Customer Case Studies

This chapter presents in-depth case studies of the operational procedures of customer data centers.

This chapter studies customers in the following industries:

  • Customer A is a continuous process manufacturing company

  • Customer B is a consumer Internet service

  • Customer C is a securities and investments company

  • Customer D is a stock trading company

  • Customer E is in the insurance industry

  • Customer F is in the financial services industry

  • Customer G performs discrete manufacturing

  • Customer H hosts Lotus Notes applications

  • Customer I is a corporate information technology group

Each customer was interviewed to obtain information about their experience and best practices using interview methods based on the Categories of Operational Procedures defined in this study.

Information for each category was not available for all customers. Instead, the customer descriptions focus on the procedures most important within each customer environment.

Continuous Process Manufacturing (Customer A)

The continuous process manufacturing customer (referred to as Customer A) has worldwide operations that include both continuous process manufacturing systems and critical business systems. There are approximately 450 sites worldwide and over 30,000 employees. All systems are managed by one of four operational centers—two in North America, one in Australia, and one in Europe. The information systems staff has been reduced to half its previous size.

The customer is entering a world of real-time business where transactions from both clients and suppliers take place at any time of the day. All process control, business, e-mail, messaging, and file and print services must be available 24 hours a day, seven days a week.

Planning and Design

The customer currently has a three-tier architecture composed of:

  • process control systems

  • process information systems

  • line of business systems

A clear distinction is made between the process information and process control systems. This includes use of different operating systems and hardware.

The architecture is a combination of proprietary software and commercial software. Redundant data collection is performed by the process information system running on DEC VMS. The business system is a SAP R/2 application running on an IBM mainframe. Business applications include central data processing and generation of invoices and schedules.

Their current systems architecture is illustrated in Figure 2. Continuous Process Manufacturing Information Systems.


Figure 2: Continuous Process Manufacturing Information Systems

The customer is in the process of migrating this architecture to use commercially available products in all systems including the proprietary systems. The new architecture must have the capacity to react to changing business needs in an evolutionary manner. Additionally, they are designing the architecture to provide a reliable environment built on commodity hardware and software.

The continuous process-manufacturing customer uses redundant hardware wherever possible. This includes installation of redundant process control devices, redundant physical networks, and redundant data collection and stores. (Note that the double parallel lines in Figure 2. Continuous Process Manufacturing Information Systems represent redundant networks and devices.)

All hardware and data are integrated with tight checks and balances in the proprietary process information system. This provides fault tolerance and supports near continuous operations in their manufacturing plants.

Other features include the ability to send system updates to the process control devices while the devices are active. This is possible because the proprietary information system software on the DEC VMS computer and the software running the process control devices are tightly integrated.

Process control devices can store multiple "manufacturing recipes" along with production schedules to produce batch orders. The schedules are sent to the process control devices from the proprietary process information system. Master schedules are created in the business system running on the mainframe (IBM APPC-based) computer.

During production, data captured by process control devices is sent to the proprietary process information system. The process information system and the process control devices share a series of acknowledgements to ensure the data is recorded for each batch (recipe) of product. Next, the proprietary information system stores the data and sends a copy of the data to the mainframe system. (The process information systems also store important environmental data for regulatory reporting.)

This architecture was designed specifically for Customer A and was developed at a time when suitable commercially available software did not exist. The architecture is being phased out as commercial software with all the features that currently exist in their internally developed proprietary systems becomes available.

The vision is to create a "data farm" environment that is independent of specific hardware and software configurations. The data farm would be able to coexist with multiple versions of the same operating system and applications, and rolling updates of software would be tolerated.

Customer A would like to buy a single-processor computer and grow it without having to take it down. Typically, they use two computers—one in production and one in back-up mode. This is less expensive than other approaches, but their optimum configuration is a disk farm accessed by multiple processors. Customer A would like to see a single address for a system regardless of what components make up the system.

The current state of technology does not allow this to take place. Digital VMS-based clusters come the closest to meeting their requirement, and they have deployed them in business-critical applications.

The customer maintains approved versions of their operating systems, including Windows NT Workstation and Windows NT Server. All vendor setup and installation scripts are re-packaged by the customer to fit their environment. Service packs are reviewed for compatibility with production applications before being released, and the customer prefers that operating system fixes be released separately from enhancements.

Customer A limits major changes to minimize the impact on operations—it is difficult to do large updates electronically due to the limited bandwidth at their remote sites and due to conflicts with the business use of the network.

Operations

The customer believes in abstracting the steps necessary to perform business processes into automated procedures. Automation can include using a program or set of scripts that performs a business or production process. Ideally, a system operator has permission to run either the program or the scripts, but does not have permission to perform the operations done by the scripts. The customer wishes to avoid operators using command-level interfaces to systems.

This provides an operational consistency that helps to prevent possible human errors. Predictable results are a requirement for systems with higher availability.

Operators are given system permission to run the programs and scripts, but are not given permission to run the operating system commands used by the scripts and programs. This keeps operators from running commands on the system that have not been approved in a script. Creating this type of a secure environment is a challenge for the customer, since this is done primarily with scripts.

When a problem in a process is encountered, escalation procedures ensure the problem will be handled in a consistent manner. Automation of procedures has allowed the customer to cut their IS staff in half.

The customer's high availability manufacturing and line-of-business systems (SAP) must have a failsafe design, as any failure would result in loss of production. The SAP business system is linked to the process control system. Multiple copies of the production schedules are sent across the link from SAP to the proprietary information system to allow the plant to continue running if the link to the SAP system is unavailable.

In addition, the proprietary process information system monitors the product flow and sends this information to the SAP system.

There are nine stages involved in getting an application from development into the SAP production environment. The customer has improved SAP availability by improving operation processes (people processes).

Environmental monitoring is critical for maintaining continuous production. Failure in the environmental monitoring processes can stop the production line. The process control systems will shut down if a failure in environmental monitoring exceeds the maximum acceptable interval of 15 minutes.

Environmental monitoring processes capture critical data points on a regularly scheduled basis. The data scanner device scans the process control system to obtain data points. Captured data is transferred to the process information system, which acknowledges the capture by sending a time stamp to the process control systems.

This allows each system to know exactly when the last snapshot of environmental data was successfully captured.
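
A minimal Python sketch of this acknowledgement and timeout logic is shown below. The class name and the example time stamps are assumptions for illustration; only the 15-minute limit comes from the description above.

```python
from datetime import datetime, timedelta

# Maximum acceptable interval between acknowledged environmental snapshots;
# beyond this limit the process control systems shut down.
MAX_SNAPSHOT_INTERVAL = timedelta(minutes=15)

class EnvironmentalWatchdog:
    """Tracks the time stamp of the last acknowledged environmental snapshot."""

    def __init__(self, started_at):
        self.last_acknowledged = started_at

    def record_acknowledgement(self, timestamp):
        # Called when the process information system returns a time stamp
        # confirming that a captured snapshot was recorded.
        self.last_acknowledged = timestamp

    def must_shut_down(self, now):
        return now - self.last_acknowledged > MAX_SNAPSHOT_INTERVAL

# Example: the last acknowledged snapshot is 20 minutes old, so the line stops.
watchdog = EnvironmentalWatchdog(started_at=datetime(1999, 1, 1, 8, 0))
watchdog.record_acknowledgement(datetime(1999, 1, 1, 8, 40))
print(watchdog.must_shut_down(now=datetime(1999, 1, 1, 9, 0)))   # True
```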

Monitoring and Analysis

The customer uses Tivoli software to monitor their systems. There is a core set of services monitored through event logs. The customer uses Tivoli to monitor Microsoft Exchange. The customer develops custom applications to monitor other applications like the process information system. The customer provides status information on an internal Web site.

Customer A has experienced problems using the Tivoli software, but as described in their planning and help desk practices, they work through the Tivoli vendor to resolve the problems.

Help Desk Operations

The first-level help desk is outsourced to the company that manufactured the computer or developed the commercial software system or application. Customer A does this instead of obtaining support from a third-party service or vendor.

The customer believes they can increase system availability by buying their computers from a single manufacturer and setting a standard for computers, operating systems, and application systems. They have defined a standard customer workstation. This standard includes requirements for any applications installed on the computer. If a new application cannot use standard .dll files, the application must load its version of a .dll file into a special directory.

Standardization makes computers and software easier to support, lowers training costs, and provides a common environment to users. However, maintaining this environment is a challenge as the customer merges with, and acquires other companies that have different operating and application systems.

To manage support and problem resolution, Help Desk services are divided into the following three categories:

  • Problem Resolution – Users contact the help desk for first level assistance provided by a problem resolution team.

  • Vendor Support – The problem resolution team contacts the vendor for help according to service agreements made with vendors.

  • Design Support – The problem resolution team contacts the design team for help regarding software developments.

Recovery

The customer does not believe that all systems require hot backup. Hardware redundancy and the proprietary software allow rolling upgrades and rollbacks in the process control system.

However, the customer prefers not to use redundant hardware in the line of business systems. (See Figure 2. Continuous Process Manufacturing Information Systems.) Most of the systems are designed with a degree of latency to prevent the failure of one system from affecting another.

Service level agreements help determine requirements for hardware and system redundancy. The service level agreements also influence whether a system is tightly coupled with, or isolated from, different systems. The business customer determines the need. In this customer's environment, hardware options such as uninterruptible power supplies (UPS) and redundant arrays of independent disks (RAID) can achieve 99% availability. This is sufficient for many of the business systems.

Using a method similar to other customers interviewed for this study, Customer A has developed a cloning procedure that allows them to easily replace a failed server with a new, identically configured computer.

This customer considers it best practice to compartmentalize disks on a computer. This is done by installing the operating system on one disk, data on a second disk, and applications on a third disk.

Data disks are backed up nightly. The customer uses a site backup server with a tape stacker. The customer is able to restore a system disk in less than 8 hours.

Consumer Internet Service (Customer B)

The consumer Internet service customer (referred to as customer B) provides a subscription service through the public Internet. Customer B's business involves serving large quantities of read-only data to its customers and maintaining an infrequently changing database of customer information and preferences. Because consumers visit Customer B's Web site on a 24 x 7 basis, Customer B has no maintenance window in which they can take the service offline.

Planning and Design

Customer B designed its data center to run on standardized commodity hardware. This practice provides hardware of known quality that can be easily interchanged. In addition, standardizing hardware helps establish common procedures for installation and replacement.

The business processes that Customer B uses in its data center fit naturally into a three-tier model. Therefore, Customer B divides its computers into multiple tiers and tailors availability strategy for each tier.

The first tier maintains the connection and service to consumers on the Internet. The first tier includes the network servers that provide connection to the Internet and the Customer B Web site. The network servers take incoming requests from the Internet and route the requests to the appropriate application server in the second tier.

The second tier performs any business logic required to complete a consumer request for service or information. Application servers also capture user information. Redundant hardware is used to keep multiple servers available to provide application services.

The third tier reads and updates the permanent data store. This includes logging pages a consumer uses on the Web server, and recording any permanent information needed about the consumer visit to the Web site. Customer B's database and file servers are referred to as a permanent data store.

The architecture implemented by Customer B is illustrated in Figure 3. Consumer Internet Service.


Figure 3: Consumer Internet Service

Customer B considers the two key requirements for minimum system availability to be the Internet connection and the permanent data store. The system is considered available if Customer B maintains the connection to the Internet and can log consumer activity in the permanent data store.

Not all application servers must be running to meet the minimum availability requirement. Customer B further defines full service availability as the availability of all line of business applications as well as the Internet connection and the permanent data store. Application servers can function independently of the network and permanent data store servers, and a failure in one line of business application will not alter the status of a different line of business application, the Internet connection, or the permanent data store.

As illustrated in Figure 3. Consumer Internet Service, data from the real-time production data servers is replicated to secondary data servers. The secondary servers provide data used by business applications such as billing. This ensures that business applications do not affect the permanent data store in real-time production mode.

Customer B designed their architecture using a three-tier model because computers in each tier can fail in different ways and different procedures are necessary to handle failures in each tier. They use redundant network servers to maintain the Internet connection, pools of identically configured application servers to run business applications, and the highest quality hardware and environment to support the permanent data store. (These techniques are described in further detail in the Recovery section.)

Operations

Customer B's high availability practices include operational procedures that determine which new applications, or versions of an existing application, are deployed. Customer B uses a link feature on Unix-based computers to allow multiple versions of the same application to exist on a system at the same time. This allows operations personnel to deploy a new version of an application while an existing version is still in use. Customer B can easily revert to an older version of the application if problems are discovered with the new version.

Monitoring and Analysis

System monitoring and analysis of collected data plays an important part in daily operations for Customer B. As with the operational planning, the type of monitoring differs depending on whether the monitored server is a network, application, or data store server. Monitoring is thought of as a data collection process that collects information and triggers alarms when a break in normal conditions is detected.

Network Servers

Customer B monitors the network servers for a condition of pass or fail. In the case of a failure, a newly configured server replaces the failed server. System health is checked by sending a TCP/IP ping message to each server, and monitoring errors generated through SNMP using a proprietary data process developed by Customer B.

The process is a set of Perl scripts and a graphic engine that analyze the results of ping messages and SNMP traps on a scheduled basis. The process uses an operating system utility for scheduling and SNMP_Session, a Perl module that implements SNMP. Depending on the server, the scheduling utility can be set to fetch data and start a script to trigger an alarm.
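
Customer B's actual implementation is Perl driven by a scheduler, with SNMP_Session handling SNMP. As an illustration only, the following Python sketch shows the pass or fail ping check in a comparable form; the host names and the reliance on the operating system's ping command are assumptions.

```python
import subprocess

# Hypothetical server list; in production this would come from configuration data.
SERVERS = ["web01.example.com", "web02.example.com", "db01.example.com"]

def is_reachable(host):
    # Rely on the operating system's ping command to send one echo request.
    # The "-c 1" flag is the Unix form; on Windows NT the equivalent is "-n 1".
    result = subprocess.run(
        ["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def check_servers():
    # A pass or fail check per server; a failed server is replaced rather than repaired.
    for host in SERVERS:
        status = "pass" if is_reachable(host) else "FAIL - raise alarm"
        print(f"{host}: {status}")

if __name__ == "__main__":
    check_servers()
```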

Computer replacement is an effective strategy to remedy a failure and to provide continuous operation because Customer B uses redundant network servers. This allows all failures except natural disasters to be treated as hardware failures.

Application Servers

Application servers run programs that monitor the performance and status of applications and servers. Both application and server are monitored because applications may cause errors, such as memory leaks, that over time degrade and finally stop the server.

If the operating system on an application server supports the ping operation, or SNMP, the server is monitored in a fashion similar to network servers. Scripts are started by the scheduling utility to capture data points for each application and server. Scripts can be started and run remotely. The scripts can also be manually started by an operator when necessary.

Data points monitored on application servers include items such as current processor utilization or memory utilization. This information is compared with information captured over time. An alarm is sent to the operations console if a corrective action is required.

If trends indicate that certain failures occur in a predictable time, an alarm is also sent to the operator. Corrective actions include stopping and restarting the application, or the server. Which action must be performed is determined by analysis of the collected data points.

Line of business application servers are subject to changes in the application. New versions of an application are rolled out in a controlled manner to ensure continuation of service. It is important to determine whether current errors are related to a new version of the application or to some other factor. Additional data points are collected, and the frequency of collection is increased, to monitor newly deployed applications.

The monitoring procedures that are run from the scheduler consist of scripts that are remotely executed. The operators can manually run the same scripts if a failure is suspected.

Permanent Data Store

Data systems normally provide the best integration with the monitoring and diagnostic services. This is true for the data systems used by Customer B.

Log information is captured on a regular basis to help keep the permanent data store operational. The objective is to prevent failure conditions, not correct failures when they occur. A database that is filling up, log files that are filling up, and database inconsistency check errors all indicate possible failure conditions.

Actual failures are also recorded. Failures require a prompt response since they represent a discontinuation of service. Operations personnel are on call full time to react to a failure of this magnitude.

Recovery

This section discusses the procedures Customer B uses to recover from failures in the network, application, and permanent data store systems.

Network servers

Hardware faults, followed by operating system faults, are the most common types of failure on network servers. The stateless nature of network services allows use of redundant systems. Customer B has sufficient excess capacity so that individual server failures are handled without a noticeable loss in service. This provides near continuous service of their network and connection to the Internet.

The technique of using redundant servers is extended to the environment. Redundant network servers can be placed in different physical locations and use different power sources. This strategy helps ensure continuous operations, even in the event of a natural disaster.

Customer B replaces a failed network server with a new computer. This is done for several reasons. First, the servers represent redundant services. Second, a commodity computer is easily restored to a known configuration in a short period.

Customer B uses a remote installation technique to first install the operating system, followed by network service installation and configuration. This can be done in forty minutes or less, minimizing the effect of the failure on the network as a whole. The general procedures Customer B follows to replace network servers are listed below.

  1. Hardware in the server racks is replaced. The hardware comes from an inventory of commodity components maintained at Customer B facilities. A service technician must install the components in the server racks and restore network connections.

  2. The operating system and applications are installed and configured using scripts remotely activated by Customer B operations personnel.

  3. Normal monitoring tests are started to validate that the newly installed server is functioning properly.

These procedures are effective and allow a new network server to be installed and configured in approximately forty minutes.

Application Servers

Similar to the network strategy, redundant application servers are deployed to maintain availability. However, the configuration of the application servers makes them less easily duplicated and remotely installed.

Customer B installs pools of application servers to keep the line of business applications available. All the servers in a pool are identically configured, so any one server in the pool can handle the work of a failed server.

If an application server fails, the network reroutes work to servers in the pool that are available. The consumer notices no change other than response being slightly slower while the system detects a failure and fails over to another server. Technicians recover the failed application server by swapping hardware, reloading application software, and reconfiguring. The recovered server is added back to the pool of available application servers.

However, analysis shows that failures in the application servers are more often caused by errors in an application rather than hardware failures. Customer B uses a strategy to detect conditions when an application is unstable and a procedure to stop and restart the unstable application. Applications are stopped and restarted on a regular basis to reduce potential failures. Operators and remote monitoring stations have the necessary authority to stop and restart applications.

Permanent Data Store

The primary strategy to ensure high availability in the data store is to keep each data server operational. This includes providing the highest quality hardware and operational environment. Additionally, planning is required before installing new software or attempting to recover from failure conditions. Efforts to restore the permanent data store focus on ensuring data integrity and performing a rapid changeover to an available standby server.

Servers running data store operations are configured using disk mirroring techniques, redundant power supplies, and frequent backups. Standby data servers are configured and ready to take on the work of a failed production data server. (Customer B does not use clusters.)

Online operations that use the data store are primarily read operations, but there is significant write activity. Data and file servers are configured to fail gracefully if a data server cannot write to the data store. The server degrades to what it can process without writing, and the network and application servers insulate the end users from the failure. This condition cannot last for an extended period.

Planning a smooth and rapid transition from one instance of the data server to another is the goal when a failure must be resolved by switching to a standby data server. (This is different from the recovery techniques used on network and application servers, which can function independently and do not represent a disruption of service if a single server fails.)

To switch from a failed data store server to a standby server, the following steps are performed:

  1. The integrity of the data is verified. If the data is in a stable state, the standby server is started using the saved data. If the data is not stable, data recovery procedures are implemented.

  2. Data and file applications are started, and the standby server takes control of the data on a shared data device such as a RAID controller.

Root Cause Analysis

Customer B learned that not all commodity hardware has the same level of quality by studying the root causes of system failures. They observed a large number of hardware failures and used this information during planning sessions. Customer B created a requirement that their systems must standardize on higher-quality commodity hardware. This has greatly reduced the number of failures they observe in their data center.

Analysis of software failures helps Customer B determine when application restarts are appropriate strategies. Certain applications that access the central database routinely leak memory and regularly require restarts. Operations personnel monitor the results on a scheduled basis. If a condition exists that can cause a server to become unstable, the application is restarted. Additionally, when a specific time elapses between restarts, the operations personnel restart the unstable application to prevent failures. Customer B only restarts the server when an application failure causes the operating system to become unstable.
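
A hedged sketch of such a restart policy appears below. The memory limit and the maximum time between restarts are hypothetical values; only the policy shape (restart on an unstable condition or after too much time has elapsed) comes from the paragraph above.

```python
import time

# Hypothetical thresholds: restart an application that leaks memory when it
# grows past a limit, or when too much time has passed since its last restart.
MEMORY_LIMIT_MB = 1500
MAX_SECONDS_BETWEEN_RESTARTS = 24 * 60 * 60

def should_restart(current_memory_mb, last_restart_epoch, now=None):
    now = time.time() if now is None else now
    unstable = current_memory_mb > MEMORY_LIMIT_MB
    overdue = (now - last_restart_epoch) > MAX_SECONDS_BETWEEN_RESTARTS
    return unstable or overdue

# Example: 1,600 MB resident and 30 hours since the last restart -> restart it.
print(should_restart(1600, last_restart_epoch=time.time() - 30 * 60 * 60))
```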

Securities Trading and Investments (Customer C)

The securities trading and investments customer (referred to as Customer C) is an independent securities trading firm that engages in the management of individual, pension, foundation, and endowment accounts. The company manages assets in the United States and abroad. The assets they manage are valued in the tens of billions of dollars.

To manage these assets, Customer C employs a variety of applications that must be highly available. The applications include trade order management, portfolio management and customer management.

Planning and Design

Customer C uses application isolation as a high availability strategy. Each application Customer C runs has minimal external dependencies and is primarily two-tier in nature. This is done so that the failure of one application has minimal (if any) impact on another application. Application systems are loosely coupled using queue technology to create bridges between the systems. Many of the systems maintain data snapshots and replicate the data storage on local disk drives to further reduce inter-application dependencies.

Applications and systems are geared to smoothly process peak periods during the day. The final 15 minutes of the trading day are the most critical part of the business day. Applications have a batch cycle where the transactions from a business day are reconciled, and the applications are prepared for the next business day. The users of the applications are customer service personnel. Their business practices insulate their customers from all but severe system delays and failures.

Customer C chooses to minimize the human factor (operations are automated) in all system design. In addition, crisis management procedures are developed and highly documented for each system with high availability requirements.

Customer C considers a system available if it can perform its intended business functions. An outage is a quality condition that makes it impractical to use a system for its business purpose. This expands upon most typical definitions to include performance degradation outside the requirement's quality parameters (real-time criteria).

Systems are considered available based on a user's perception. There are queues between the applications, and asynchronous communications are used so the client's application does not stall waiting on another application.

During critical periods of the day, extreme steps are taken to limit the possibility of failures. This includes no "fire fights," additional system diagnostics, or application changes.

Applications are continually enhanced with new business functionality. The nature of the application and composition of the development team determine how often application changes take place. The highest frequency of change the end users can effectively manage is a new version every four to six weeks. Some systems only allow changes every six months.

Certification processes, like system monitoring, are done through surrogate business processes. During certification, root causes are treated rather than symptoms.

Operations

Customer C has defined operational procedures that match the types of application processing they run—online and batch. They have procedures that handle transaction processing during the online day and procedures for running batch jobs that reconcile the business transactions performed during the online day.

Operations during normal business hours focus on reacting to conditions that either limit or stop the daily use of the system by end users. Few, if any, jobs run on the online servers during the business day. This is especially true during the last 15 minutes of the trading day when the online servers require maximum availability.

The business needs require daily reconcilement and individual account processing. A balance is required between information availability for multiple time zones and markets, and the need to reconcile each account on a daily basis. Best practices for account processing have multiple functions processed on a per-account basis, rather than individual functions run for all accounts. This allows a better determination of the account's state, and allows for better recovery.

The batch process must successfully complete for the online processing to be available the next business day. A daily period of idle time is planned. This allows the operations staff to detect, recover from, and restart failures in the batch process without affecting the daily online operations.

Batch processes are automated. The job of the operators is to "watch the disk lights" and react to problems, if they occur. Event log entries are made for start-up and completion of batch jobs. The operations staff maintains a Web site where they post the batch job logs for parties interested in the status of the batch jobs.

Monitoring and Analysis

Customer C uses surrogate customer transactions to actively monitor their systems. They gather historical data to monitor long-term trends.

Customer C considers itself good at gathering statistics. Automated monitoring takes place through applications writing events to the system log, combined with applications that read the events. Statistics are gathered on all parts of their systems (operating system, network, and application). Trend analysis is done for sizing future load, capacity planning, and service levels. There are tradeoffs between the cost and the benefit of building highly available systems.

Customer C's monitoring tools are a combination of purchased software and applications that were developed in-house. These tools consist of data collection, automated tests that simulate business transactions and the periodic manual entry of business transactions.

Since systems are considered available based on end user perception, the simulated user transactions and manual user transactions are used as the primary indicators of how well the system is performing.

The best monitoring tools act as surrogates for a user of the business system. The surrogate monitoring systems are checked against performance thresholds. Error conditions can be detected from a decrease in performance before a system failure.
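
The following Python sketch illustrates one way a surrogate transaction probe could work: time a scripted business transaction and raise a flag when the response time crosses a threshold. The probe function, the threshold value, and the messages are illustrative assumptions; Customer C's actual tools are a mix of purchased and in-house software.

```python
import time

# Response-time threshold for the surrogate transaction, in seconds. The
# two-second figure echoes the failure definition in Chapter 1 and is illustrative.
RESPONSE_THRESHOLD_SECONDS = 2.0

def surrogate_transaction():
    # Placeholder for a scripted business transaction, such as entering and
    # then cancelling a test order against a non-production account.
    time.sleep(0.1)

def run_probe():
    start = time.monotonic()
    try:
        surrogate_transaction()
    except Exception as error:
        return f"FAIL: surrogate transaction raised {error!r}"
    elapsed = time.monotonic() - start
    if elapsed > RESPONSE_THRESHOLD_SECONDS:
        return f"DEGRADED: {elapsed:.2f}s exceeds the {RESPONSE_THRESHOLD_SECONDS}s threshold"
    return f"OK: completed in {elapsed:.2f}s"

print(run_probe())
```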

Trend analysis is done through a manual process. Too many factors affect a system's performance for automated trending to be a cost-effective strategy.

Help Desk

Customer C operates a Help Desk to provide user support and crisis management teams. The normal action when a failure occurs is to send a notification to the Help Desk. The Help Desk will try to remedy the problem, or will turn the problem over to the appropriate crisis management team.

Using a combination of purchased software and systems developed in-house, the Help Desk gets feedback on key performance and availability metrics.

Crisis management teams are organized to handle problems that cannot be resolved by the Help Desk. Representatives from the respective application teams work to solve problems in real-time. They have formal procedures and guidelines based on the experience of the team's personnel.

Recovery

The corporation maintains multiple operations sites that are capable of "standing in" for each other in case of a failure. This is done using DNS routing, but this type of change is considered a severe remedy.

Warm standby servers are used for critical applications. The warm standby server is required to have exactly the same configuration as the primary server, and may not be used for background applications (like development and testing).

Application recovery requires restoration of both the central data server, and the client workstations. In the case of a natural disaster, the system is not functional unless end users have workstations where the application can be used.

Root Cause Analysis

Restoration of service may use procedures that do not necessarily fix the cause of the problem, but which provide a remedy for the results of the problem. Customer C works to find the cause of the problem after a system is recovered.

Stock Trading (Customer D)

The stock trading customer (referred to as Customer D) uses Internet-based online transaction systems to buy and sell stocks. The online transaction systems must provide near continuous availability during the trading day so Customer D can operate their business. The online transaction environment is maintained as a fault tolerant system.

Customer D also provides report and inquiry services via their Web servers for their business customers. The availability requirements for report and inquiry services are not as demanding as the online transaction systems and are therefore maintained by using redundant hardware and software.

Planning and Design

Customer D's Internet systems are composed of 25 servers supporting a stock trading information Web site. Each server has a database that requires a nightly refresh and synchronization. A combined database using cluster hardware is being considered as an alternative. Daily backup of the Web servers is not required, since the data is replaced daily. In the case of a hardware failure, the data from the previous morning is reloaded on the server.

Customer D considers application isolation an important feature of high availability applications. Functions are designed as components in the Web system, using sub-domains for various components. This lessens the impact of an outage. When a failure occurs, only the affected component is lost, not the whole system.

Constructing a test environment for new applications is a challenge because many variables exist that cannot be easily duplicated in a test environment. Customer D uses online tests and pilot deployments to provide the best indication of whether an application will perform well under stress.

There is not an easy way to simulate the load that they get in normal operation. Even replaying a capture from a billion-transaction day does not produce the same error conditions found in production, because too many factors are involved.

The testing method is to drive an application until it breaks, then use that point as a watermark and operate systems below that point.
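
The sketch below illustrates this watermark approach under stated assumptions: run_transaction() is a hypothetical stand-in for one business transaction, and the concurrency limits and error-rate ceiling are example values, not Customer D's actual test harness.

    # Illustrative load-ramp sketch: increase concurrency until errors exceed
    # a limit, then record the last "good" level as the watermark.
    import concurrent.futures
    import time

    def run_transaction():
        time.sleep(0.05)          # placeholder for real work
        return True               # return False to signal an application error

    def measure(concurrency, iterations=200):
        errors = 0
        start = time.monotonic()
        with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
            for ok in pool.map(lambda _: run_transaction(), range(iterations)):
                if not ok:
                    errors += 1
        elapsed = time.monotonic() - start
        return errors / iterations, iterations / elapsed   # error rate, throughput

    def find_watermark(max_concurrency=64, max_error_rate=0.01):
        watermark = 0
        for concurrency in range(1, max_concurrency + 1):
            error_rate, _ = measure(concurrency)
            if error_rate > max_error_rate:
                break
            watermark = concurrency
        return watermark

    if __name__ == "__main__":
        print("watermark concurrency:", find_watermark())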

Servers and applications are balanced so that an individual server is not overloaded. Applications are shifted to another server if a given server starts to get too loaded.

Operations

The online transaction system consists of fault tolerant Tandem systems. It is maintained in a mainframe-type environment and has rigorous run instructions. Every business day these systems run an online transaction day and a batch reconciliation run.

The Web servers are operated in a separate facility. Operational procedures call for data to be replicated to the various distributed application servers as part of the daily system refresh. The Web servers operate continuously.

Monitoring and Analysis

Customer D's monitoring solution is an evolving process. Using a monitoring system or application that has a minimal impact on a monitored computer is preferred. They consider monitoring systems known as fat clients to reduce application efficiency. Customer D is currently standardizing on Tivoli as the primary network management tool.

It is important to detect application errors, not just system errors. They are able to monitor applications both locally and from the Internet. It is important to be able to monitor from the Internet to get a true reading on the availability of an application.

System logs are used extensively as sources of information on system performance and to identify potential problem areas. Low-level polls are used to detect the health of operating systems.
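
A low-level poll can be as simple as a TCP connection attempt to a well-known port, used as a cheap proxy for "the host and its network stack are alive." The host names and port in the sketch below are hypothetical examples, not Customer D's configuration.

    # Minimal low-level poll: a TCP connect to a well-known service port is
    # used as a cheap indicator that the host is up and reachable.
    import socket

    HOSTS = ["webserver01.example.com", "webserver02.example.com"]  # hypothetical
    PORT = 80
    TIMEOUT_SECONDS = 5

    def host_is_up(host):
        try:
            with socket.create_connection((host, PORT), timeout=TIMEOUT_SECONDS):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        for host in HOSTS:
            print(host, "UP" if host_is_up(host) else "DOWN")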

Application availability is taken from the end user's point of view. Sample user queries are used to test the health of Web applications. Automated systems test the running of user transactions. In addition, monitoring personnel manually test system response during a business day.

Help Desk

Help Desk functions are performed by a three-level system.

Level 1 handles the initial incident and works with the end user to remedy the problem.

Level 2 has technicians solve the problem either through remote diagnostic tools or by dispatching a service technician.

Level 3 involves application team leads and includes service level agreements with vendors that supply the systems and applications.

Recovery

For the online transaction system, minimal recovery times are required. The goal is to have the system back up before the user can hit two carriage returns at the client application. Operating system recovery time needs to be measured in seconds, not minutes, and preferably stay under 10 seconds.

There are two physical locations for the online transaction system and the Web system. These locations can back each other up in case of a natural disaster.

Since the Web systems primarily use read only databases, recovery from a hardware failure consists of copying the image of a system from backup to new hardware. Data from the daily refresh is then loaded in the newly created system.

Customer D is currently investigating the use of Microsoft Cluster Server to consolidate some of the read only databases. This will provide automated failover for the shared database.

Root Cause Analysis

Customer D is constantly working to improve system availability. They consider resolution of the root cause as an important strategy to increase system availability.

Insurance Company (Customer E)

The insurance company customer (referred to as Customer E) is currently rolling out a new high availability system for their insurance agents located across the country. The new system is a multi-city application with a central hub, along with operations centers in the Southwest U.S., Southeast U.S., Texas and Canada.

The system architecture includes a Web front-end using Dynamic HTML (DHTML), Microsoft Transaction Server (MTS) and Java on Windows NT-based servers. The new system will eventually host 60,000-70,000 users nationwide with about one to two dozen applications running on the servers. The system must be available twenty-four hours a day, seven days a week (24 x 7). Maintenance times are scheduled based on service level agreements negotiated with internal staff and application vendors.

Planning and Design

Customer E's new system will be built on Windows NT-based computers running Microsoft SQL Server 6.5 with Microsoft clustering technology, Internet Information Server, Cisco Local Director for load balancing and several applications.

Customer E has implemented a performance tuning lab. The performance-tuning lab uses an application called Strategizer to make decisions that determine computer requirements such as memory, CPU(s), and so on. Customer E is also developing application requirements for their new environment including:

  • Performance testing an application to meet a performance goal of 10 transactions per second (TPS); a minimal test sketch follows this list.

  • An application architecture that can be reviewed by a technical architecture review team for recommendations on computer, network bandwidth, and installation requirements.

  • A "good housekeeping seal of approval" from Customer E's network team, confirming that specific testing requirements are met before application changes are deployed in the production environment.

  • Test scripts for running application regression tests.
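
The following sketch shows one way such a throughput test might be structured. The submit_transaction() call is a hypothetical stand-in for the application under test, and the measurement window is an example value, not Customer E's actual test procedure.

    # Illustrative throughput check against a 10 transactions-per-second goal.
    import time

    TPS_GOAL = 10.0

    def submit_transaction():
        time.sleep(0.02)   # placeholder for a real business transaction

    def measure_tps(duration_seconds=30):
        count = 0
        start = time.monotonic()
        while time.monotonic() - start < duration_seconds:
            submit_transaction()
            count += 1
        return count / (time.monotonic() - start)

    if __name__ == "__main__":
        tps = measure_tps()
        print("measured TPS:", round(tps, 1), "goal met:", tps >= TPS_GOAL)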

Customer E uses a staging server that contains the latest build to run tests that look for adverse actions caused by any new applications.

The network operating system team handles upgrades. Any upgrade must successfully complete testing on the staging server before the network team will sign off on it. Their goal is to be able to perform rolling upgrades.

Operations

At the time of this writing, the new application is being prepared for the initial pilot phase. Operational procedures are being developed.

Monitoring and Analysis

Customer E uses monitoring tools provided by the system integrator that supplies their hardware. The monitoring strategy of the system integrator includes performing the following steps:

  • Watching predetermined thresholds

  • Watching application events in the event log

  • Developing peak load information

  • Analysis/Trending (currently being implemented)

Help Desk

Customer E has a service level agreement with a major system integrator to manage their Help Desk. The procedures provided by the system integrator include backup schedules.

Recovery

Recovery is based on the service level agreement with the system integrator. This includes system restoration within 2 hours and regular preventative maintenance.

Root Cause Analysis

Customer E is in the initial phase of implementing error analysis procedures and did not have information to report for this study.

Financial Services Customer (Customer F)

The financial services customer (referred to as Customer F) maintains both business and Web systems that require high availability. The customer applications have been predominantly developed in-house.

Customer F has an information technology (IT) department and several business departments that provide this service. The IT department uses service level agreements to support 700 servers and works with the business and user departments to help determine the most cost effective ways to meet their availability requirements. Approximately 40-50 percent of Customer F's applications require continuous availability twenty-four hours a day, seven days a week (generally referred to as 24 x 7).

Planning and Design

Customer F uses a combination of hardware redundancy and application design to achieve high availability. Applications are developed to operate properly in the Customer F environment. Changes such as applying a service pack (fix) must have a justifiable business benefit before the change is released to production. The customer maintains a test domain environment to certify system changes. A new environment must pass testing in a test domain and a test production domain before it is allowed into the production environment.

System maintenance is limited to periods when there is the least amount of user activity. This is typically one weekend a month. All fixes are concentrated into this one session. Work is performed on Saturday to allow a full day on Sunday to recover if something goes wrong.

Customer F has an enterprise environment where applications advertise the services they support. In addition, applications were designed to look for servers that offer needed services. The applications use this information to load balance work between the servers. This also provides multiple redundant servers that can stand in for failures that may occur on a particular server.

The customer uses the Cisco Local Director to help achieve application redundancy and high availability for Web servers. The customer is currently deploying Microsoft Cluster Server for applications and data that are not easily replicated.

The customer uses hardware redundancy with redundant disk drives (RAID), disk controllers, network interface cards, and power supplies.

Monitoring and Analysis

The customer has chosen Tivoli to monitor some systems. This is not currently in production mode. A general roll out is planned. Microsoft Systems Management Server is used for some inventory management and software distribution.

Customer F uses a standard form for collecting data in order to carry out performance analysis on a system. The form provides a checklist of information to be gathered for the analysis. Information is collected on memory and process resource usages. (For an example of this checklist, see Appendix D – Performance Analysis Checklist.)

Help Desk

Customer F has developed a structured set of response instructions for their Help Desk operations. These instructions are provided in a format that details a type of message received from a Compaq Insight Manager application paired with the appropriate response from the support team. Typical problems identified by these messages result from device and network problems. (For an example, see Appendix E – Help Desk Escalation Procedure.)

A Problem Resolution Checklist is used to track problems that are reported to the Help Desk. This checklist ensures that consistent data is collected for each incident. Data collected on the checklist includes information on the server hardware, the operating system, pertinent log file entries, system dumps, as well as areas to record root cause, actions taken, and exposure. (For an example of this checklist, see Appendix F – Problem Resolution Checklist.)

Customer F's Help Desk uses a combination of in-house personnel and staff from vendors contracted through service level agreements (SLAs). This allows the customer to provide an escalation procedure that includes the use of vendor personnel. A channel is provided to the vendor's development staff for problems that cannot be adequately resolved at the local level.

Discrete Manufacturing (Customer G)

The discrete manufacturing customer (referred to as Customer G) runs several assembly plants that produce engines, engine parts, and operating chassis. These facilities are run on a 24 x 7 basis and rely upon distributed computer applications to run the manufacturing processes. Minimizing production downtime is critical to this customer.

The production-line computing environment includes a mainframe computer, approximately 65 servers, and 200 distributed workstations. The majority of servers and workstations currently run the IBM OS/2 operating system. The customer is in the process of replacing these systems with Windows NT-based computers to run their next generation, distributed systems.

Planning and Design

Operations at the assembly plants depend on the coordination between the central mainframe, the servers located at each plant and the workstations located in the production line. The mainframe initiates the process by sending the day's schedule to the local server. The local server then maintains the schedule/build information at the plant for the daily processing. Each workstation receives schedule information pertinent to its task. As assembly steps are completed, build information is sent back to the plant server. The plant server then forwards the build data back to the mainframe.

The production environment at Customer G has also been designed around independent processes. This approach logically segments the entire process so that downstream tasks may continue regardless of the state of upstream components.

In the current architecture, IBM Distributed Application Environment (DAE) handles data transfer between the plant server and workstations. In the new architecture, IBM MQSeries, Microsoft SQL Server Data Transformation Service (DTS), and replication will link the plant server to the mainframe. A combination of SQL Server Replication and Microsoft Message Queue Server (MSMQ) will perform data transfer between the workstations and a local server. MSMQ to MQSeries Bridge from Level 8 Systems will connect the MSMQ and MQSeries queues.

The queuing mechanism between the mainframe, server, and workstation processes allows each process to run independently for up to 4 hours. In addition, the workstations can perform their functions independently of each other.
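
The sketch below illustrates only the store-and-forward idea behind this decoupling: results are written to a local queue and drained when the upstream system is reachable. It is not Customer G's MSMQ/MQSeries code; the directory name and placeholder functions are hypothetical.

    # Store-and-forward sketch: work results are written to a local queue and a
    # separate sender drains the queue when the upstream server is reachable.
    import json
    import os
    import time
    import uuid

    QUEUE_DIR = "outbound_queue"   # hypothetical local spool directory

    def enqueue(message):
        os.makedirs(QUEUE_DIR, exist_ok=True)
        path = os.path.join(QUEUE_DIR, uuid.uuid4().hex + ".json")
        with open(path, "w") as f:
            json.dump(message, f)

    def upstream_available():
        return False               # placeholder for a real connectivity check

    def send_to_upstream(message):
        pass                       # placeholder for the real transmit call

    def drain_queue():
        if not upstream_available():
            return
        for name in sorted(os.listdir(QUEUE_DIR)):
            path = os.path.join(QUEUE_DIR, name)
            with open(path) as f:
                send_to_upstream(json.load(f))
            os.remove(path)

    if __name__ == "__main__":
        enqueue({"station": "engine-05", "step": "torque-check", "time": time.time()})
        drain_queue()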

Application Controlled Redundancy

Given the need for production to withstand machine or network failures, this customer relies heavily upon replication of data and system redundancy to ensure high availability. Most of the systems have been developed in house, and conform to standards for retries and redundancy. Clusters and vendor-supplied hardware redundancy systems are not considered robust enough or cost efficient.

Standards within the organization have led to application consistency. Applications are programmed to look for alternate resources if a resource fails. If no resources are available, the application will queue output until a resource can be contacted.
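
A minimal sketch of this look-for-an-alternate-resource behavior follows. The server names and the send() call are hypothetical; a real application would queue the output, as described above, when no candidate responds.

    # Failover sketch: try each candidate server in turn and use the first one
    # that responds. If none respond, the caller queues output for later.
    CANDIDATE_SERVERS = ["plantsrv01", "plantsrv02", "plantsrv03"]  # hypothetical

    class NoResourceAvailable(Exception):
        pass

    def send(server, payload):
        raise ConnectionError("placeholder for a real send to " + server)

    def send_with_failover(payload):
        for server in CANDIDATE_SERVERS:
            try:
                send(server, payload)
                return server
            except ConnectionError:
                continue
        # No resource available: caller would queue the output for later delivery.
        raise NoResourceAvailable("all candidate servers unreachable")

    if __name__ == "__main__":
        try:
            used = send_with_failover({"part": "crankshaft", "status": "complete"})
            print("sent via", used)
        except NoResourceAvailable:
            print("queued for later delivery")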

Through data replication and application redundancy, Customer G has established a window of time during which production can continue, regardless of network or machine failures. This window allows their personnel to correct any issues before the failure leads to production downtime.

Application Configurations

In order to minimize the amount of spare parts kept on hand, and to allow for efficient system swapping, Customer G defines specific system configurations based upon the applications that will run on the system. By minimizing the number of combinations, they are able to minimize their spare parts inventory. This also simplifies the process of defining standard software builds for each type of machine. They also base spare parts inventory upon the number of systems of that type that they use, as well as the frequency of failure for a particular system component.

Approximately 50 application configurations must be maintained. This is accomplished by keeping a master version of the operating system loaded on a machine. The individual applications are then installed.

Once a system is placed in production, it typically runs unmodified until a failure occurs. The result is that computers running different versions of an operating system can be present at any given time. This presents a challenge as software applications and hardware platforms become outdated.

Monitoring and Analysis

Customer G performs basic system monitoring through automated server pings, as well as scheduled checks of system logs.

In-house applications have also been developed to respond to a set of status requests that are issued from a central station. These responses give indications of availability and current performance levels. Applications have a response of green (normal operations), yellow (responding, but experiencing delays or retries) or red (application not responding).
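
The following sketch illustrates how such a status response might be derived from recent metrics. The thresholds and metric names are hypothetical examples, not Customer G's actual criteria.

    # Illustrative status-request handler: classify application health as green,
    # yellow, or red from recent response times and retry counts.
    def classify(avg_response_seconds, retries_last_interval, responding=True):
        if not responding:
            return "red"       # application not responding
        if avg_response_seconds > 2.0 or retries_last_interval > 0:
            return "yellow"    # responding, but with delays or retries
        return "green"         # normal operations

    if __name__ == "__main__":
        print(classify(0.3, 0))                      # green
        print(classify(3.5, 2))                      # yellow
        print(classify(0.0, 0, responding=False))    # red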

Recovery

In order to keep the independent workstations operational, spare systems are kept in case of failures. In the production facilities, when a failure is encountered, a backup system is swapped with the production system. They do not attempt system repairs while a system is running on the production line, but instead simply swap in a known-healthy backup system. This allows the production line to continue running, and allows the opportunity to perform in-depth diagnosis of the failed component.

Depending upon the particular failure, a repaired system may be swapped back into the production line in place of the spare, or the spare may simply become the new production system.

Root Cause Analysis

By replacing failed systems immediately and then analyzing failed components, Customer G support personnel have the opportunity to attempt to trace back to a root cause for a failure. Given physical realities, Customer G does not maintain a full replica of their production system for testing purposes. Because of this, some failures will be difficult to reproduce and diagnose. Customer G does however, maintain a scaled-down network in which they can test failed components, in an attempt to discern the root cause of a failure.

Lotus Notes Hosting (Customer H)

The Lotus Notes customer (referred to as Customer H) provides Lotus Notes application services to outside enterprises. Customer H establishes a service level agreement with each of their customers to support the customer Lotus Notes environment. Terms in these agreements include availability, continuous operations, and system redundancy.

Customer H monitors and tracks all problems related to hardware, networking, operating system software, and the Lotus Notes application itself. Customer H's responsibilities exclude any customer specific applications and related problems which are not part of the Lotus Notes environment.

Planning and Design

Customer H has a dedicated server for each customer for hosting Lotus Notes. This is typically a Compaq 1600 single processor system with a RAID 5 storage array for all Lotus Notes data. The internal network is switched 100 base T across six segments. Customers are connected through the Internet, modem, or the use of dedicated lines. The most common connection is through X.25 over a private line.

Customer H offers a continuous operations solution that includes a second Compaq 1600 and the use of Domino Level Clustering for load balancing and automated failover.

The servers are kept in a secure environment with limited access. In addition to physical security, the systems use security features built into Windows NT to control system access permissions and Lotus Notes security to control application access permissions. Customers are allowed access only to their Notes applications; they do not have direct access to the underlying operating system software. Customer H personnel are the only users with access rights directly to the Windows NT-based system.

All computer systems are powered by a pair of large, redundant Liebert UPS systems that receive their power from separate power grids.

Customer H maintains operating systems at current levels for service packs and hot fixes that affect their environment. Of special concern is any fix that enhances customer security, which is particularly important in this environment.

Customer servers are typically restarted during a weekly maintenance schedule. Operations personnel may also use a scheduled restart of the server to apply any system updates.

Customers may also specify that their systems not be restarted during this maintenance phase. These systems can run continuously for six months or more. The service relationship between Customer H and the customer determines the length of time between system restarts.

Operations

Microsoft Systems Management Server is used for software distribution. This is combined with attended installs to minimize the impact of new software versions to customers. Customer H uses Microsoft Systems Management Server options to offer a software update if desired, require updates by a certain date, or require an immediate update.

A common software update practice used by Customer H service technicians is to check for any pending software updates. The technician will install all the pending updates, restart the computer, and check the results.

Some customers do not wish to have updates applied to their systems. Customer H honors this type of customer request, but informs the customer of the potential reduction in service they might receive by not keeping their system current.

Monitoring and Analysis

Customer H developed their own system for monitoring mid-level systems and applications that takes input from event logs and SNMP traps. The mid-level manager reports to a vendor-supplied system monitoring application. Customer H uses HP OpenView; however, the mid-level manager is designed for use with any network management software.

Customer H can detect potential failures by monitoring the system logs, and can detect possible error conditions such as running out of disk space. Error conditions are updated approximately every 2 minutes. Unless an error condition is acute enough to cause a failure in that timeframe, corrective action can normally be taken to avoid the failure.
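
The sketch below illustrates the general idea of such a periodic check for one condition, low disk space: poll on a fixed interval and raise an alert well before the disk actually fills. The volumes, threshold, and polling interval are hypothetical examples, not the actual mid-level manager described above.

    # Sketch of a periodic free-space check that alerts before a disk fills.
    import shutil
    import time

    VOLUMES = ["C:\\", "D:\\"]        # hypothetical volumes to watch
    MIN_FREE_FRACTION = 0.10          # alert when less than 10% free
    POLL_SECONDS = 120                # check roughly every two minutes

    def check_volumes():
        alerts = []
        for volume in VOLUMES:
            usage = shutil.disk_usage(volume)
            if usage.free / usage.total < MIN_FREE_FRACTION:
                alerts.append((volume, usage.free))
        return alerts

    if __name__ == "__main__":
        while True:
            for volume, free in check_volumes():
                print("ALERT: low disk space on", volume, free, "bytes free")
            time.sleep(POLL_SECONDS)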

In addition to the internally developed mid-level manager, Compaq's monitoring agents, with the exception of the thermal agent, are also used to monitor the health of the server.

This method of monitoring and error detection has proved effective in handling the various error conditions that can take place during a day. It is estimated that about two of these conditions are detected and handled during a normal business day.

Help Desk

Customer H bases its support strategy on effective use of the mid-level monitoring software. A support group is assigned the responsibility for the first level of monitoring. They determine whether a technician needs to be paged, or whether they can handle the problem with the operations personnel or through remote control. Microsoft Systems Management Server is used for remote control. Additionally, some of the monitoring support groups are allowed access to the data center to fix problems.

When a problem is detected and dispatched to a technician, the technician has access to a specific error identification number that gives the technician information on how to resolve the problem. There are approximately 130 possible error conditions documented in this manner. The error conditions are primarily for operating system errors. However, root cause analysis will often find that a customer application was actually responsible for the failure in the operating system. They estimate that a large majority of failures can be traced to customer applications.

Recovery

At Customer H, a team of technicians assesses a problem and dispatches the appropriate resource. Resolutions may include application restarts, computer stop and restart, or replacing the computer. As previously mentioned, root cause analysis often points to a user application as the cause. Since Customer H does not alter the customer's application code, a common remedy is the scheduled stop and restart of all the customer servers. This is normally done at off-peak hours on a weekend.

There are times when hardware failure requires a new system to be loaded. The entire process usually takes less than 2 hours and has the following steps.

  • Load the customer certified version of the operating system. This normally takes about 40 minutes.

  • Next, load the customer's applications from backup. The time for this varies by customer, but is usually minimal.

  • Finally, the customer's data is reloaded from backup. This time will vary depending on the size of the Lotus Notes database. Backups are normally done once a day, so the customer will be required to re-enter any work since the last end of day backup.

Typical system backups are conducted using the Legato package. Specific customer applications may also be backed up directly using the Lotus Notes API.

Corporate Information Technology Group (Customer I)

The Corporate Information Technology Group (referred to as Customer I) faces the challenge of increasing system availability in a constantly changing environment. Customer I uses the latest releases of operating systems and applications and must test the new technology in business critical systems.

To provide high availability under these conditions, the corporate information technology group uses a strategy to certify platform builds. The certified platform build consists of a computer operating system with the latest service pack, selected hot fixes, and core applications such as Microsoft SQL Server, Internet Information Server, and Site Server.

Applications running on a certified platform receive full support from the corporate information technology group. Non-compliant platforms receive reduced support that varies in scope depending on how far the non-compliant platform deviates from the certified platform. Using this strategy, the corporate information technology group provides a range of support, from the highest level for certified builds to varying levels for non-certified builds. This strategy allows the user communities to choose a platform that suits their application needs.

Planning and Design

Customer I provides network services by using servers configured with Windows NT Server and network service options such as DNS, Dfs, DHCP, Microsoft Netshow™ server, Proxy Server, RAS, WINS, and Microsoft Systems Management Server. In addition, they also maintain a variety of applications in the corporate data center.

Hardware Configurations

The corporate information technology group standardizes on hardware platforms based upon availability and application needs. Server configurations are rigorously tested and certified for data center use for high availability applications. Certification ensures each platform is fully supported by the corporate information technology group operations staff and meets acceptance criteria put forth by their hardware and software vendors. Platforms certified by the corporate information technology group are leaders in price, performance, and availability.

Platforms are categorized as shown in the following table.

Table 4 Platform Categories

Platform category   Platform uses
high-end            Meet scalability and performance requirements for critical business applications such as Microsoft SQL Server and Microsoft Exchange Server.
mid-range           Serve as departmental servers and group application, file and Intranet servers.
low-end             Meet infrastructure and workgroup requirements, such as web clusters and group file servers.

See Appendix B – Implementing Hardware Standardization, for more information about criteria that can be used to create standard and certified hardware platforms.

Types of Applications that Require High Availability

The corporate information technology group supports a broad variety of applications: for example, Internet sites, business applications (such as SAP R/3), and shared file systems.

Each type of application has different requirements for high availability. Some common practices can be applied to each type of application; however, there are also differences that require procedures unique to the requirement of the applications.

External Internet Applications

The corporate information technology group uses redundant systems and load balancing to help achieve high availability for external Internet systems. Web sites maintained by the corporate information technology group are independent application clusters using a common activity log method and network connections. Windows Load Balancing Services (WLBS) provide redundancy services in the application clusters.

The application clusters function independently to provide redundancy for a common service. A failure in one application cluster does not affect the performance of a different cluster. Within a cluster, each server can pick up from where another left off.

The strategies of redundancy and application isolation provide for near continuous operation with the proper network connectivity.

SAP R/3

This implementation of SAP R/3 is a business critical application but does not have a requirement for continuous operation, as the external Internet systems do. Like most business systems, SAP R/3 has peak periods and periodic batch operations. Availability results from having the systems ready when needed, by using the non-peak times for preventative maintenance and by prompt recovery when failures occur.

Preventative maintenance includes the treatment of symptoms rather than problems. This includes restarting the applications and restarting the servers. Root cause analysis of the frequency of failures, combined with analysis of the application schedule, must be done to determine what further steps to take and when to take them.

If an application is known to require restarting a server every 10 days, and the application has a known idle time on a weekly basis, then a good preventative maintenance practice would be to restart the server once a week during the application idle time. The corporate information technology group uses this type of practice with its implementation of SAP R/3. This allows a potential problem to be dealt with when the resolution will have the least effect on the application's availability.

File Share Servers

The corporate information technology group supports servers that act as a central repository for products and documents. Their personnel throughout the world use these servers at all hours of the day. The data on the servers is fairly static and easy to replicate. They use a distributed file system and DNS routing to provide redundancy for this service.

Operations

Customer I primarily reacts to conditions on systems monitored in their central monitoring center. They actively perform system maintenance to keep current releases of operating systems and service packs available for all applications. This practice has been proven to increase reliability and availability. For example, frequency of computer restarts was reduced by 50% when Windows NT 4.0, service pack 4 was installed, as compared to Windows NT 4.0, service pack 3.

Customer I uses separation of function and redundancy as a common practice to achieve high availability for systems. This may include strategies such as hardware clusters, application clusters, multiple logical DNS servers and data mirroring.

System monitoring also plays an important part in attaining high availability. Individual systems are pinged on a regular basis, and the event logs are scanned for notifications and instrumentation information. The more information an application can provide about its general well being, the better Customer I support personnel can respond to potential problems before they become failures.

Monitoring and Analysis

The corporate information technology group uses Microsoft Systems Management Server for monitoring and system maintenance because it offers a core set of management functions in their data center.

Dynamic Hardware Inventory

Inventory collection yields a consolidated, current and comprehensive information database. This database contains detailed information about each computer including installed hardware components, and operating system configuration and version. The gathered inventory information is the foundation used by other Microsoft Systems Management Server features.

Inventory data is stored in a central SQL database. It is used to perform the following tasks (a sample query sketch follows the list):

  1. Count the number of computers running a specific operating system with greater confidence - The Microsoft Systems Management Server database contains information regarding the operating system version (including hot fixes) and service pack level. This version information is made available on a dynamic basis for the entire data center.

  2. Efficiently target software upgrades, hot fixes, and service advisories - Hardware configuration and version information facilitates the task of managing which servers require operating system updates, OEM drivers, and Flash ROMs. Hot fix installations are tracked by system using the Hotfix.exe application in conjunction with built in Microsoft Systems Management Server features.

  3. Make proactive purchasing decisions - The level of detail contained within a Microsoft Systems Management Server inventory database allows views into the current hardware configuration of a participating server. This facilitates the hardware upgrade process because the exact configuration is known and the correct hardware can be ordered.
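
The queries below sketch the kinds of questions asked of the inventory database, for example counting operating system versions and finding upgrade targets. The table and column names are hypothetical stand-ins, not the actual Microsoft Systems Management Server schema.

    # Illustrative inventory queries against a hypothetical table layout.
    OS_COUNT_QUERY = """
        SELECT os_version, service_pack, COUNT(*) AS machines
        FROM server_inventory
        GROUP BY os_version, service_pack
    """

    UPGRADE_TARGETS_QUERY = """
        SELECT machine_name
        FROM server_inventory
        WHERE os_version = 'Windows NT 4.0' AND service_pack < 4
    """

    def run(query):
        # Placeholder: in practice the query would be run against the central
        # SQL database, for example through an ODBC connection.
        print(query.strip())

    if __name__ == "__main__":
        run(OS_COUNT_QUERY)
        run(UPGRADE_TARGETS_QUERY)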

Event Monitoring

The corporate information technology group uses Microsoft Systems Management Server to log various conditions to the Windows NT Event log. These include capturing SNMP alerts. The events are forwarded to the SeNTry monitoring console. The SeNTry architecture consists of a Windows NT Server that collects information (the gatherer) from other Windows NT-based servers or workstations running the sender service (the sender). Notifications and responses for events can be handled through email, paging, or batch scripts. The sender compares information contained in the Event log against pre-defined filters and applies alert criteria before sending this information to a gatherer server.

The types of events monitored depend on the services installed on a server. Servers with SQL Server monitor events such as device full warnings; WINS servers monitor events such as replication problems.
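
The sketch below illustrates the sender-side filtering idea: compare event log entries against pre-defined filter rules and forward only the matches to a gatherer. The filter rules, event fields, and event IDs are hypothetical examples, not the actual SeNTry implementation.

    # Sketch of event filtering: forward only events that match filter rules.
    FILTERS = [
        {"source": "MSSQLServer", "event_id": 1105},   # hypothetical: device full
        {"source": "Wins",        "event_id": 4243},   # hypothetical: replication problem
    ]

    def matches(event, rule):
        return all(event.get(field) == value for field, value in rule.items())

    def events_to_forward(events):
        return [e for e in events if any(matches(e, rule) for rule in FILTERS)]

    if __name__ == "__main__":
        sample = [
            {"source": "MSSQLServer", "event_id": 1105, "message": "device full"},
            {"source": "Print",       "event_id": 10,   "message": "job spooled"},
        ]
        for event in events_to_forward(sample):
            print("forward to gatherer:", event)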

Automated Software Deployments

Automated Software Deployments allow for unattended software installation to targeted systems based on inventory information gathered. Software deployments would therefore be consistent across the data center in a much shorter period, requiring less labor. Since Microsoft Systems Management Server lacks the ability to schedule installation and system restarts at a granular server level, an intermediate scheduling system may be required to leverage the automated deployment feature in the data center. Automated software deployments are used in the following ways:

  1. Hot fixes are installed and tracked on targeted systems from a single management console.

  2. Service Packs are installed and tracked on targeted systems from a single management console.

  3. New or upgraded operating systems and applications are installed using new or modified installation scripts to deliver the operating system and application packages to target servers.

Remote Console Control

Remote console control allows for remote server console control via the Microsoft Systems Management Server administration UI. Remote console control will be used for administration of infrastructure servers outside of the data center (for example, WINS, DHCP, and DOMAIN) that would otherwise require the administrator to be present at the console.

Help Desk

The corporate information technology group maintains help desk functions for a wide variety of applications, hardware and operating systems.

Customer I provides diagnostic assistance and runs debug sessions. They use an escalation requirement document to obtain the information needed to begin an investigation. Guidelines for using support provided by the corporate information technology group are shown below.

Diagnostic assistance is provided when:

  1. An ongoing support problem involves the operating system

  2. An operating system development issue cannot be resolved through normal administrative means

  3. A problem is either acute (is not caused by a failure) or chronic

Debugging is provided for:

  1. All groups.

  2. All operating system components and services.

A hot list is maintained for systems that require after-hours support to maintain high availability.

Escalation Methodology

First Occurrence

On the first occurrence of a given issue, the most important thing to do is gather all pertinent information. Since a pattern is yet to be established, the corporate information technology group does not generally escalate first occurrences to the second-tier support. Instead, they gather all information, including the user or kernel mode dump file, if available, do an initial analysis, track the occurrence, and if it reoccurs they will then escalate the issue as appropriate.

Crash Dump

When a first occurrence results in a system crash, or if a system crashed and cannot be left down, they perform an analysis of the dump file once the system is back on-line. This same methodology applies to user mode crashes. The analysis includes searching the bug databases and Support Knowledge Base for reports of similar crashes. If they find the issue is already known and a patch is available, they can request regression testing by the server owners. If the dump does not match a known instance, they will define an action plan for resolving the issue.

Defining an Action Plan

Once they can reproduce the problem, or have analyzed the issue and the components involved, they will establish an appropriate action plan. An action plan will vary according to the nature of the problem, but will typically include who the parties are that will be involved, who will need status updates, hours of availability, and a contingency plan. Action items can include reproducing the problem, analyzing logs or dumps, and performing run-time debugging. Depending on the system, and the scope of the problem, they may escalate issues to Support and Development groups. If an ongoing issue requires immediate live debugging in a limited amount of time, or needs off-hours support, Customer I "hot-lists" the system.

Escalation Requirements

The corporate information technology group gathers as much information as possible about a server before escalating problems to the second-tier support. The following list is used as a guideline for the amount and type of information to gather:

  1. Machine name and location

  2. Dump Information (Stop Codes and Parameters)

  3. "Srvinfo -v" Output

  4. "Hotlist -l" Output

  5. "TLIST" Output

    Memory Configuration

    1. Physical RAM - NTMem or MEMAVAIL output

    2. Page File - NTMem or MEMAVAIL output

    3. Process utilization - Memsnap log file

  6. Problem description (be as detailed as necessary; if there is an error message, copy it exactly or provide a screen shot)

  7. Was any pertinent system information noted? (Event viewer, log files, etc.)

    Copy any crash dumps or Drwtsn32.logs to the appropriate location, named in the following manner, where MMDD is the day and month of occurrence:

    • Kernel mode dumps: machinename.MMDD.dmp

    • User mode dumps: machinename.MMDD.U.dmp

    • Drwtsn32 logs: machinename.MMDD.log

    Who is reporting the problem?

    • Is this person an administrator on the system?

  8. What applications does the system run?

    Has this been seen on this machine before?

    • When?

    • Under what conditions?

  9. What was the last change to be made to the system? (Please be as specific as possible).

Recovery

Usually the corporate information technology group will try to fix the problem, rather than reloading the server. However, there are common recovery methods to restore a system using a new installation. A typical sequence of events follows.

The corporate information technology group performs the following steps to reinstall the Windows NT operating system on a computer:

  1. Determine if the failure requires a newly installed system. Operations personnel at the remote location determine the cause of the failure. Loading a new hardware unit is the best practice used by this customer when a failure is hardware related or can delay restoration of service for an extended time. Loading a new unit in the case of severe application problems offers the technicians the opportunity to analyze the system off-line, while restoring service as fast as possible. Analyzing the system off-line improves the likelihood that the root cause of the failure will be found.

  2. Once a failure is determined to warrant a system replacement, a service technician first replaces the hardware.

  3. Next, a disk with a known operating system and, optionally, application configuration is installed in the new system. This might be accomplished by installing a pre-configured disk in the new system, or by using a copy utility to move a configuration from one system to another. This type of configuration is referred to as cloning. The cloning procedure requires that the Security Identifier (SID) and several other security parameters be altered to make the machine unique. This is accomplished using a tool such as SYSPREP, available in the Windows NT Resource Kit.

  4. Next, the application is configured on the new machine. This may be done as part of the cloning process, or may be a separate procedure.

  5. After the application is restored, any data required for the application is reloaded. Steps 4 and 5 can be combined and remotely performed. This depends on the capabilities of the applications being installed on the systems.

  6. Finally, normal operational tests are performed on the computer to test the installation and operation of the operating system and applications. Additionally, some computers are stopped and restarted to ensure normal operations are restored after the computer is restarted.

The corporate information technology group commonly uses both application and server restarts to prevent potential failures. A scheduled restart of a server can be part of normal operations and not cause a loss of service, when redundant services are available. An unscheduled system failure, however, could potentially cause an interruption in the application.

The corporate information technology group has an additional procedure that mandates restarting a server as part of normal operations. This happens whenever the configuration of a server changes due to the installation of a new application.

Root Cause Analysis

The corporate information technology group uses established and common procedures to diagnose failures. These procedures include loading symbol tables for deployed applications and the operating system. A process of continuous improvement using information gained from failures helps reduce future down time.

The corporate information technology group recognizes that restarting applications and restarting servers is a symptom treatment process and has instituted a continuous improvement process to evaluate the failures for each application. Working with the application developers and support personnel, the group's objective is to correct the root cause and eliminate the need for stop-and-restart preventative maintenance.

Chapter 4 -- Planning: Hardware Strategies

The purpose of this chapter is to describe strategies that may be effective in improving the availability of a system. Hardware strategies are usually considered during system planning and implemented during deployment. These strategies can range from common sense practices to using expensive fault tolerant equipment. Topics covered in this chapter are listed below.

  • Commodity Hardware

  • System Preparation

  • Fault tolerant components

  • Environmental Concerns

  • Backup

Implementing a well-planned hardware strategy helps increase system availability while also reducing the support costs and failure recovery times.

Commodity Hardware

In all of these strategies, the emphasis is on using commodity hardware solutions when possible to achieve the best possible price/performance. Using commodity hardware also provides the advantages of better vendor support for interoperability, drivers and repairs. Customer B had the following comment about commodity hardware:

"The secret to commodity hardware is picking good commodity hardware – our failure rate has reduced dramatically when a quality commodity hardware is used"

The other common thread that ran through all of the interviews, and was emphasized by Customer G, was the need to adopt one hardware standard and adhere to it as much as possible. This typically consisted of picking one type of computer, with a standard network card, disk controller, graphics card, etc. This computer type is used for all applications, even if it is overkill for some. The only parameters typically modified are the amount of memory, the number of CPUs, and the hard drive configurations.

Standardizing on one type of hardware has the following advantages:

  1. Having only one platform reduces the amount of testing of the hardware for certification.

  2. Testing driver and application software updates is performed one time.

  3. With only one system type, fewer spare parts are required.

  4. Experience with only one system reduces the training of the support personnel.

The standardization of the hardware also coincides well with the concept of a reference platform as described in the interview of Customer I. Customer I also referred to this as "certified platform builds". Standardization also has an effect on the planning of standby equipment and spares.

Information Technology Service Pack (IPAK)

The concept of an IPAK is to set standards for hardware and software to use across a wide range of applications. The IPAK is revised and tested at scheduled intervals of typically six months. The support group will provide full support for the previous, current and next IPAK. Older IPAK versions will receive reduced support levels.

Obviously, the hardware revisions for the IPAK are not as simple as software. It is impractical, as well as expensive to replace hardware every six months. A longer refresh cycle on hardware is acceptable for environments where the number of systems is relatively fixed. In these cases, a somewhat longer hardware refresh cycle is practical, perhaps 24 or 36 months, perhaps even longer.

However, in an environment where additional servers come on line to increase capacity or to add new functionality, a long refresh cycle may cause problems. This is due to the high rate of change in computers, disk subsystems and other components. Machines bought at the beginning of the cycle will likely not be available at the end of the cycle.

To accommodate the differences in hardware, a slightly different approach to IPAK revisions is used. At some interval of IPAK revisions, typically annually, the hardware specifications are refreshed. The specifications cover the CPU speed, number of CPUs, network cards, graphics cards, UPS units, power supplies, and so on. The number of permutations depends on the number of supported application variances. Not all application variances must be tested. The base platform can be tested with a general configuration. Applications varying from this standard are tested separately or supported under a variance.

As an example, consider a base platform that is a dual processor machine with a RAID 1 boot drive, a RAID 5 data drive with 5 spindles, and a set amount of memory. Two examples of applications that are not appropriate for this standard platform are:

  • Microsoft PPTP Server – This machine needs very little hard drive space for storing information because it acts as a gateway for virtual private networks. Consequently, the deviation from the standard platform may be to remove the RAID 5 drives. This is typically viewed as a minor deviation that could be supported.

  • Exchange Server – The messaging group has determined that they wish to build a server with four processors, additional drive capacity and more memory. This server deviates significantly from the standard platform. The deviation is mitigated by using the same types of controllers and drives as used in the standard platform. The application group becomes responsible for performing the testing to validate the configuration. The support group will provide limited support to this platform.

Spares

One of the advantages of using a standard configuration is the reduced number of spares that must be kept on site. As an example, if all of the hard drives are of the same type, fewer drives are needed for spares. This reduces the cost and complexity associated with providing spares.

The number of spares that need to be kept on hand varies by the configuration and the failure conditions that can be tolerated by users and operations personnel. As an example, overnight replacement of components may be adequate if all of your drive arrays are RAID 1 or RAID 5. However, if the component is a network card and there is no secondary network to fail over to, then having a spare network card on hand is crucial.
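
One rough way to reason about the quantity is to estimate the expected number of failures during the replacement lead time, as in the sketch below. The component counts, failure rates, lead times, and safety factor are hypothetical examples, not recommendations from the customers studied.

    # Rough spares-on-hand estimate: expected failures during the replacement
    # lead time, by component type, multiplied by a safety factor.
    import math

    COMPONENTS = {
        # name: (units in service, annual failure rate, replacement lead time in days)
        "hard drive":   (300, 0.03, 5),
        "network card": (120, 0.01, 10),
        "power supply": (120, 0.02, 5),
    }

    def spares_needed(units, annual_failure_rate, lead_time_days, safety_factor=2):
        expected = units * annual_failure_rate * (lead_time_days / 365.0)
        return max(1, math.ceil(expected * safety_factor))

    if __name__ == "__main__":
        for name, (units, rate, lead) in COMPONENTS.items():
            print(name, spares_needed(units, rate, lead))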

Another concern is availability of replacement parts. Some parts are easy to find years later, such as memory and CPUs. Other parts, such as hard drives, are often difficult to locate after only a few years. For parts that may be hard to find, and where exact matches must be used, plan to buy spares when you buy the equipment. Consider using service companies or contracts with a vendor to delegate the responsibility. Consider keeping one or two of each of the critical components in a central location. If a component fails and your vendor is not able to deliver a replacement component in a timely fashion, your own supply of replacement components may be critical to returning your systems to operational condition. Maintaining an adequate supply of replacement parts will allow you to respond more quickly to emergencies while allowing you to restock spare components on the vendor schedule.

Standby

Standby equipment is another approach to getting equipment back on line. The standby system is used to quickly replace a failed system, or in some cases as an ultimate source of spare parts. The standby system can also be used to certify IPAK releases. Customers A, B, C, and G, all used a hardware strategy that involved the use of standby systems.

Should a system have a catastrophic failure, it may be possible to remove the drives from the failed system or use backup tapes to restore operations in a relatively short time period. This scenario does not happen very frequently, but it does happen, in particular with CPU or motherboard component failures.

One advantage to using standby equipment to recover from an outage is that the failed unit is available for a leisurely diagnostic to determine what failed. Getting to the root cause of the failure is extremely important to prevent repeated failures.

Standby equipment should be certified and running on a 24x7 basis, just like the production equipment. Monitor the equipment to make sure it is always operational. Keeping the equipment running is important. If it were not running, there are no guarantees that it will be available when it is needed.

Standby equipment is primarily used in data center operations, where it has the highest return on its investment. However, where the costs of downtime are very high and clustering is not a viable answer, standby systems can be used to provide reasonably fast recovery times. This is particularly true of process control, where loss of a computer can cause very expensive or dangerous conditions.

Preparation

Before a computer is put into production, at least in normal operations, it must be prepared. This of course consists of installing hardware, setting interrupts and I/O, installing and configuring software, and other tasks. Preparation should also include several other aspects not always considered. Burn-in, a recovery manual, change control, and a checklist are all vital to successful operation and recovery of a server.

To keep the servers prepared, consider a computer survey and change control. These should be started at the same time as preparation of a computer. That way they will be in place when the unit goes into production.

Checklist

The checklist is a complete list of all operations that need to be performed for the server to become operational. Some customers will even create a checklist before the server is ordered. The order and receipt of the components then become checklist items. At a minimum, the checklist should show the steps needed to get the server up and running for production.

To improve the quality of the checklist and to provide a way to back track the operations, a signature or initials and a date should be required for each item. It is important to stress to the individuals performing these operations that they follow the checklist very carefully, especially if they have done it many times before. The reason is that the checklist may have changed in a subtle but important way.

Generate a checklist for new units as well as for any major and perhaps minor upgrades to the servers. Ideally, the checklist should be stored in the recovery manual and/or put in change control. The checklist can be either electronic or paper. Most customers have chosen to use paper as a way of providing handoff with less complexity. However, that may cause time delays and could cause other problems. The choice will depend on the environment.

Burn-in

Before putting a computer into service, most customers will verify that the computer is functioning properly by starting and running the computer. This is referred to as the burn-in period. Ideally, the burn-in is performed by running programs designed to test and stress the performance of hardware and software components in the computer. The time used for the burn-in varies by customer from several days to several weeks. The primary purpose of burn-in is to get past the infant mortality that affects computer systems. Infant mortality in this context is the tendency of computer components, particularly solid-state components, to fail early or not at all.

Several strategies are used to apply a load during burn-in. One customer starts a batch file that endlessly copies files around; the intention is to fill the disk and exercise it at near full capacity. Another customer wrote a custom application that stresses a number of components, including memory and disk. The idea is to try to break the computer: if you can break it, it probably would not have survived long in production.
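As a rough illustration of the file-copying approach, the following sketch repeatedly writes, reads back, and verifies large files to keep a disk busy during burn-in. The paths, file size, and duration are assumptions, and a real burn-in suite would also exercise memory, CPU, and I/O controllers.

```python
# A minimal disk-exercise sketch for burn-in; adjust TARGET_DIR, FILE_SIZE,
# and DURATION (all assumptions) to the drive under test and the burn-in plan.
import os
import time

TARGET_DIR = r"D:\burnin"        # assumption: scratch area on the drive under test
FILE_SIZE = 64 * 1024 * 1024     # 64 MB per file
DURATION = 8 * 60 * 60           # run for eight hours

os.makedirs(TARGET_DIR, exist_ok=True)
pattern = bytes(range(256)) * (FILE_SIZE // 256)
deadline = time.time() + DURATION
cycle = 0

while time.time() < deadline:
    path = os.path.join(TARGET_DIR, f"burn{cycle % 8}.dat")
    with open(path, "wb") as f:           # write a known pattern and force it to disk
        f.write(pattern)
        f.flush()
        os.fsync(f.fileno())
    with open(path, "rb") as f:           # read it back and verify every byte
        if f.read() != pattern:
            raise RuntimeError(f"verify failed on {path}; suspect disk or controller")
    cycle += 1

print(f"completed {cycle} write/verify cycles with no errors")
```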

Another aspect of the burn-in period is to bring the computer into monitoring, and to make sure that it is being backed up and its event logs are being watched. While nothing of consequence is likely to be found in the backups or the monitoring data, making sure the system is in the normal schedule of operations is very important. Proof of a set number of days of successful monitoring and backups should be a required check-off item before putting a new computer into production.

Labels

One simple practice can greatly reduce the number of errors and greatly speed the reassembly of a computer: label everything. This is vital in several areas, such as hard drives in an array, SCSI and network cables, and cabinets. It may also be important to label cards in the computer, both internally and externally, particularly when multiple network interface cards (NICs) or disk controllers of the same type are used.

Color coding can also be useful with external cables. A number of consumer computers are color coded to make it easy for consumers to hook up peripherals. Use of color coding is not limited to consumer computers. Color-coding also aids technicians working with components placed in poorly lit rack cabinets.

Recovery Manual

One of the most overlooked points of preparation is the recovery manual. A recovery manual should contain everything known about the computer and the standard procedures for repairing it. The manual should exist as a printed copy as well as an electronic copy. The printed version should be located near the server, ideally chained to it so that it does not get lost. The electronic copy should be available to operators and technicians at the operations center, which also gives remote operators access to the information.

The recovery manual should contain all of the configuration information for the hardware and software. This includes the IP address, computer name, interrupts and I/O settings for any network adapter cards, disk controller configuration settings, the amount of memory, the number and types of cards and what slots they are in. In other words, this is essentially all of the information required to rebuild the computer if necessary.
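Much of this configuration information can be captured in a machine-readable snapshot and pasted into the recovery manual. The sketch below is illustrative only: it records what the operating system can report directly and leaves placeholder fields for hardware details (card slots, interrupts, controller settings) that a technician or an inventory tool such as Systems Management Server would supply.

```python
# A minimal sketch for capturing basic configuration facts for the recovery
# manual; fields marked "TBD" are placeholders to be completed by hand or by
# an inventory tool.
import json
import platform
import socket
from datetime import date

def snapshot():
    hostname = socket.gethostname()
    try:
        addresses = socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        addresses = []
    return {
        "recorded": date.today().isoformat(),
        "computer_name": hostname,
        "ip_addresses": addresses,
        "os": f"{platform.system()} {platform.release()} ({platform.version()})",
        "processor": platform.processor(),
        # The fields below cannot be read reliably from software alone.
        "memory": "TBD",
        "adapter_cards": "TBD (type, slot, IRQ, I/O settings)",
        "disk_controllers": "TBD (model, firmware, RAID configuration)",
    }

if __name__ == "__main__":
    with open("recovery-manual-config.json", "w") as out:
        json.dump(snapshot(), out, indent=2)
```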

The recovery manual should ideally include a set of CDs or floppy disks with the drivers, applications, and operating system needed to reload the computer. If preferred, these can be replaced with links to network servers that hold copies of the necessary drivers and applications. However, at least one printed copy must be available locally for times when the network servers are not available, and printed copies must also be kept at remote locations in case the network servers cannot be reached.

Keeping copies of all hardware component manuals is recommended. If a hardware manual covers more than one component model or type, highlight or note which model is installed in your computers. (Installation notes in the margin may also be useful.)

The paper copy should ideally be updated whenever a change is made. Realistically, however, it should be part of a regular survey of the computers, preferably monthly.

Change Control

Change control has two aspects: first, reaching consensus on when to make a change; second, recording what changes have been made. In many situations, a bad problem can be made worse when too many people make changes unaware of one another's changes.

Change control operates at two levels: planned changes and emergencies. In planned changes, one or more computers are changed at a scheduled interval. In emergencies, the tendency is to do whatever it takes to get the system running.

Planned Changes

For planned changes, most customers have a procedure in which the support team meets at regular intervals to plan changes. The changes are then assigned to an individual or team to plan and implement, typically with a schedule and a list of servers affected. It is vital that no one make changes the rest of the team is unaware of; this causes confusion and, at worst, outages. Decisions on changes should be put in writing and sent to all relevant individuals. The change process should also update the recovery manual if appropriate.

Emergency Changes

For emergencies, "Desperate times require desperate measures" should never be the slogan to use to correct problems. A better answer is to use a logical systematic approach. Designate a change control individual to make all changes or to control the changes made by others. Make no changes without the change control individual's knowledge and concurrence. The change control individual is responsible for recording all changes made to the system and the side effects, if any.

By controlling the changes and making them one at a time in an orderly manner, the chances of success are greater, and the likelihood of making the situation worse is reduced. Ideally, the number of people working on the problem should be kept as small as possible. This facilitates communication, and reduces confusion. This is especially important with distributed control and monitoring.

At the end of the emergency, do a post mortem to discover what went wrong and what could have been done better. If possible, identify the root cause of the problem and institute corrective or preventive procedures. At a minimum, record all changes made to the system in the change log.

Change Log

The change log is a vital piece of information that should be maintained for every server. It should show what was changed, why it was changed, who changed it, and when. A change log works best as an electronic document that can be reached from any computer. Often the best change log is a plain text file, which can be edited and viewed with the base tools of the operating system on any computer. The file should be stored in a well-known location on a server accessible from anywhere. Some customers store the file in the root directory of the boot drive; others use a file server. Custom web applications are also possible, but require a browser and more effort.
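A change log entry needs nothing more elaborate than an append to that text file. The following sketch assumes a hypothetical well-known path (C:\CHANGELOG.TXT) and a simple pipe-delimited layout; both are illustrations, not prescriptions.

```python
# A minimal sketch for appending an entry to a plain-text change log; the
# path and field layout are assumptions to be adapted to your environment.
import getpass
import socket
from datetime import datetime

LOG_PATH = r"C:\CHANGELOG.TXT"   # assumption: the well-known log location

def record_change(what, why):
    entry = " | ".join([
        datetime.now().isoformat(timespec="minutes"),  # when
        getpass.getuser(),                             # who
        socket.gethostname(),                          # which server
        what,                                          # what was changed
        why,                                           # why it was changed
    ])
    with open(LOG_PATH, "a") as log:
        log.write(entry + "\n")

record_change("Applied Service Pack 4 to Windows NT 4.0",
              "Scheduled maintenance approved by change control board")
```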

Computer Survey

Customer I performs a scheduled inventory survey of each computer. The primary purpose of the survey is to identify the current configuration and inventory of each computer. The results should validate the software and hardware installed and configured on the computer.

If differences are found between the current configuration and past or expected inventory records, the recovery manual can be updated and corrective action can be taken. This is important because changes to the computer may not have been logged, or unauthorized changes may have been made. Should a failure occur while the recovery information is inaccurate, recovery will be more difficult and lengthy.

System tools such as Microsoft Systems Management Server or other third party tools can be used to gather much of the hardware and some of the software information. This information can then be easily compared with the previous inventory to find changes.
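However the inventory is gathered, the comparison itself is simple. The sketch below assumes the survey results are reduced to a flat name-to-value mapping (an illustrative format) and flags every difference for follow-up.

```python
# A minimal sketch of comparing the current survey against the previous
# inventory record to flag undocumented or unauthorized changes.
def diff_inventory(previous, current):
    changes = []
    for key in sorted(set(previous) | set(current)):
        old, new = previous.get(key), current.get(key)
        if old != new:
            changes.append((key, old, new))
    return changes

# Example records; real surveys would come from an inventory tool or script.
previous = {"memory_mb": 512, "nic_0": "3Com 3C905B", "service_pack": "SP4"}
current  = {"memory_mb": 768, "nic_0": "3Com 3C905B", "service_pack": "SP5"}

for key, old, new in diff_inventory(previous, current):
    print(f"{key}: recorded {old!r}, found {new!r} -- update the recovery "
          f"manual or investigate an unauthorized change")
```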

Fault Tolerant Components

This section describes some of the hardware choices currently in use that help to improve availability, and suggests areas where availability and performance can be improved at the same time. The solutions discussed focus on relatively generic techniques to avoid recommending a specific technology. For more information about specific technologies, contact your hardware vendor, or consult research organizations, magazines, and trade journals that evaluate them.

Storage Strategies

Storage strategies are based on the type and quantity of information that must be stored and the cost of equipment. If a particular computer is not used to store data, the storage solution can be very simple and inexpensive. However, if the computer will store large amounts of data and perform frequent database reads and writes, the storage strategy is more complicated. Consider the cost of any storage components when developing a strategy for storing the data your organization needs. It does not make sense to spend more on the storage system than the expected cost of the time and data you would lose without it.

Another aspect to consider is whether preventing data loss is good enough. Alternatively, do you need to maximize the amount of time the server is available for use? If preventing data loss is enough, a simple RAID 1 (mirror) or RAID 5 arrangement may be sufficient. If the application must be available at all times, strategies such as attaching multiple disk controllers or clusters to a RAID 5 disk array can be a better solution. RAID technology was used by Customers A, B, F, and H.

Another issue to take into account is the fact that MTBF (Mean Time Between Failures) for hard drives is increasing each year. In theory, this should reduce the need for strategies like RAID. In reality, what it means is that you are less likely to need to replace a drive that has failed in your RAID array. All customers can report incidents where a drive has failed, despite high MTBF ratings.

One other issue that is not covered in detail in this discussion is the use of multiple disk controllers. Multiple disk controllers make it possible to remove the single point of failure in the RAID array, the controller itself. While disk controllers are very reliable, and are seldom the cause of an outage, they are still a single point of failure. The idea behind using multiple controllers is that they both connect to the drive arrays to provide redundancy. In SCSI, this typically costs the loss of one of the SCSI addresses, which in most configurations is acceptable. Check with your computer and RAID vendor for availability and options on using dual controllers.

Hot Swappable Drives

A number of vendors offer hot swappable drives. However, a number of customers have found to their grief that this does not always work as expected. The natural inclination is to pull out the bad drive and insert the new drive. However, in all too many cases the customer has lost all of the data on that RAID partition. The recommendation before trying a hot swap on a production server is to practice on a test server first, preferably one under load. Read all of the instructions from the vendor on hot swaps. The time to prepare for swapping out a hard drive is not when one fails, but long before it fails and in practice situations. After practicing the procedure, document it and put the information in the recovery manual.

If possible, the safest process is to do an offline backup of the data, repair the device, and then bring the application back on line. If something goes wrong, the application can be stopped, the drive space fixed or prepared and the data restored. This procedure should also be used any time the RAID drive configuration is changed, since the possibility of accidentally destroying the contents of the drives is high.

NTFS vs. FAT

One aspect of drive configuration that causes a great deal of confusion is which file system to use on each drive. The two choices are NTFS and FAT. If your space requirements exceed the 4 GB limit of FAT, the choice is simple: use NTFS. If not, the choice is a bit more complex. NTFS offers a great many fault tolerance advantages over FAT, including transaction logging and recovery techniques not available to FAT. The confusion is primarily evident in two areas: the boot drive and drives used to store log files.

For the operating system drive, the traditional logic has been that if anything fails, you can boot from DOS and fix it. In reality, booting to DOS and repairing the system rarely happens. The only area where this has been particularly useful is where development code is run and files are replaced one at a time with new versions; should a new file fail, an older version can be restored. That is not a production scenario, and the logic does not apply here. NTFS typically provides better availability than a FAT partition and, in most cases, better performance because the file system is more effectively indexed.

The other area of contention is log files. FAT was markedly faster for sequentially written files than the comparable version of IBM OS/2 with HPFS, so the practice of using FAT for log file drives was adopted. NTFS, introduced with the release of Windows NT, is far faster than HPFS and today is comparable to, if not faster than, FAT. Because log files now commonly exceed the 4 GB limit of FAT, the move to NTFS is happening anyway. Given that performance is no longer an issue, the higher availability of NTFS over FAT makes it the logical choice for log file drives.

RAID

The table below shows the currently commercially available RAID strategies4 (https://www.raid-advisory.com, by Joe Molina).

Table 5 RAID Levels

| Level | Configuration | Fault Tolerance | Advantage/Disadvantage |
| --- | --- | --- | --- |
| RAID 0 | Data is striped across two or more drives. | None. | High performance. |
| RAID 1 | Each drive has an identical mirrored drive. | Tolerates the failure of one drive. | High performance; requires double the number of drives. |
| RAID 0+1 | Portions of data elements are written to separate disks; all data is written in its entirety to multiple disks. | Tolerates the loss of one drive. | High performance; can be expensive to implement. |
| RAID 3 | Data striping of single records across all disks; parity is written to a single drive. | Tolerates the loss of one drive. | High transfer rate; high bandwidth (large block size). |
| RAID 4 | Data is interleaved between disks at the sector level using larger stripes. | Tolerates the loss of one drive. | Very good read performance; poor write performance. |
| RAID 5 | Data and parity information are distributed per sector across all disks. | Tolerates the loss of one drive. | Very good at reading and writing small blocks of data at random locations. |
| RAID 6 | Two sections of each disk are set aside for parity, providing the highest degree of fault tolerance. | Tolerates the loss of two drives. | Similar to RAID 5, but lower utilization due to twice the parity information. |

Drive Layout

When designing a computer, the first step is to determine the amount of disk space that will be used and the performance levels that need to be achieved. Given the rapid change in storage technologies it is beyond the scope of this document to cover the size and performance issues. This document will cover the next step.

Drive space is typically carved up into the five categories as described in the following table.

Table 6 Drive Space Categories

Operating System

This category covers the operating system, any drivers, and typically any utilities. In most cases it also covers DLLs used by the applications. The operating system is typically installed on the boot partition, the C: drive. Obviously, the operating system should be on a fault tolerant drive; failure of the operating system drive will stop the computer.

Swap Space

Windows NT uses a file called the swap file to provide virtual memory, enabling the operating system to use more memory than is actually installed in the computer. Some customers create a separate partition, or in some cases separate physical drives, for the swap file. A separate swap drive can provide minor improvements in performance, but typically not enough to justify the extra cost. If the swap file is put on a separate drive, make sure that drive is at least RAID 1; failure of the swap drive will cause the operating system to stop. Many customers keep the swap space on the same partition as the operating system, with good performance and reliability.

Applications

Many customers install applications on the operating system drive, but some install them in a separate partition for logical separation. This is purely an aesthetic issue: a separate application partition may be neater and easier to work with. As with the other categories, if the application drive space fails, there is a good chance the application will fail as well. This is not true of all applications; some may load and then run without further disk access. Even so, it is not guaranteed: if the operating system needs to swap one of the application's code segments out of memory, the drive containing the executables or DLLs must be available when it reloads them.

Data

Data can be represented in many forms; some of the most common are databases, log files, configuration information, flat files, and images. It is very important to understand how the data is used by the application. Some applications may store part or all of the data in the same directory as the application.
Database applications (including SQL Server, Oracle, Lotus Notes, and Exchange) store the data in very large files. These files may exist on one or more partitions. In some customer configurations, each database file is located on a separate partition, with the index for the database on a separate drive spindle. The separation of index and database is purely for performance reasons. Failure of either data source will cause the database to fail.

Log Files

Log files are used by database, communications, and transaction applications to store a history of the operations the application has performed. Frequently the log files are used to replay the transactions if the main data source has been damaged or lost: starting from a tape backup of the database, the log files can usually be replayed to restore it. Log files are typically stored on a separate drive for performance reasons. Because log files are written sequentially, high performance can be achieved with a single drive, since the drive head does not need to seek. While the data in the log files may be disposable (assuming, of course, that the main database files are intact), the ability to write to the log files is a requirement of most applications; the application will quickly stop if the log files cannot be written. For this reason, fault tolerant drives are important not to protect the integrity of the log files, but to make sure that the log file drive space can always be accessed.

When deciding how these categories will be partitioned, one best practice is to use logical drive partitioning to assign the same drive letter to each type of drive space on every computer. For instance, the following mapping could be used on all computers:

C Operating System
D CD-ROM
L Log Files
P Applications
S Swap Drive
T Data Drive

Recommendations on how to partition these spaces are often included with the application, as well as in technical articles. Please consult them before you start designing a drive and partition layout. The table below shows some typical configurations that provide high availability using different RAID combinations.

Table 7 Disk Drive Layout

RAID 1 Only
This configuration consists of a pair of drives in a RAID 1 configuration. All components are installed on this drive array. Partitions are used to separate the logical functions.
  • This configuration is excellent for applications not storing large quantities of data, such as gateways, DNS, domain controllers, or print servers.
  • This configuration is excellent for remote locations with a small number of users. The only limit is the size of a single disk. The server could be used for mail, database, file, and print serving.
  • This configuration is excellent for middle-tier servers running MTS or MSMQ line-of-business applications.

RAID 5 Only
This configuration is similar to the RAID 1 configuration above. The difference is that it offers potentially much higher disk capacity.
  • This configuration is excellent for remote locations with a small number of users.
  • This configuration is excellent for departmental file servers.
  • This configuration is excellent for web servers.

RAID 1 and RAID 5
This configuration uses a RAID 1 drive array for the operating system, swap space, and the applications. Data is stored on the RAID 5 array. If log space is required, it should be stored on the RAID 1 drive.
  • This configuration is excellent for applications needing large amounts of hard disk space. Good examples are web servers and file servers.
  • This configuration is acceptable for intermediate scale applications using databases. The lack of a separate drive for log files limits the performance and scale of this configuration.

Two RAID 1 and one RAID 5
This configuration uses a RAID 1 drive array for the operating system, swap space, and applications. The second RAID 1 drive array is used for log files. The RAID 5 array is used to store data.
  • This configuration is excellent for applications using databases. The separate log file drive provides high performance and faster recovery from failure.

Three RAID 1 and one or more RAID 5
This configuration uses a RAID 1 drive array for the operating system and applications. The second RAID 1 drive array is used for log files. The third RAID 1 array is used for swap space for virtual memory. The RAID 5 array is used to store data and may consist of one or more arrays. This configuration uses Windows NT 4.0 Enterprise Edition and very large amounts of memory.
  • This configuration is designed for very large databases.

Network Interfaces

The network interface card (NIC) in most servers is a single point of failure. Fortunately, the NIC is typically very reliable and failures are rare. However, the NIC is not the end of the story on redundancy of the network connection. Other components outside of the computer can fail and have the same effect as the loss of the NIC. These include the network cable to the computer, the switch or hub, router, DNS/WINS and the domain controller. Any one of these components can fail and cause the failure of one or more servers, potentially of all of the servers.

"Exchange is the best network monitor ever invented… Any failure anywhere will cause some user to call the help desk saying that E-mail is down within minutes of the failure"

One strategy to contend with failures is through redundancy. Many components lend themselves to backup or load-sharing strategies.

HUB/Switch, NIC and wiring

While all of these components are very reliable, if service must be guaranteed, redundancy is very important. When installing two NICs, make sure that you run cabling from two separate hubs or switches. Make sure the network cables are color coded or marked in some other way to signify network A and network B; this prevents the cables from being plugged into the wrong NIC.

Always use fixed IP addresses on servers; do not use DHCP. This prevents an outage caused by a failure of the DHCP server, and it can improve address resolution by DNS servers that do not handle the dynamic address assignments provided by DHCP.

Network hubs and switches are very reliable, but they do fail. Each of the two segments should be on a separate hub or switch. These switches should each be connected to the main communications line leaving the data center if possible. If not, to avoid a single point of failure make sure the devices they are connected to are redundant.

Since this configuration provides two paths of network connectivity, plan to use both of them. This doubles the network throughput to the server and confirms that both paths are working. It may take some clever addressing schemes in DNS/WINS or in the routers, but it can be done; there are a variety of ways to accomplish this depending on your architecture.

Routers

Routers do fail, fortunately not frequently. Nevertheless, when they do, entire computer centers can go down, affecting an entire company. Having redundant routing capability in the computer center is very critical. Detailing how to accomplish this is beyond the scope of this document. You should contact your router vendor for recommendations on how to protect against router failures.

DNS/WINS

Fortunately, DNS and WINS are two of the easiest infrastructure components to replicate. Both were designed to support replication of their name tables and other information. One important suggestion is to make sure that these servers are not on the same segment; for ultimate reliability, preferably not even in the same building.

The Windows NT Server 4.0 Resource Kit has very extensive information on WINS and Microsoft's DNS server and ways to replicate the information. There are also extensive books on both the Microsoft technologies and conventional DNS found on most UNIX systems.

If multiple NICs are in each server, it may be advisable to use static mappings in DNS and WINS to help control the division of work to each server.

Domain Controller

Domain controllers are very critical to any application that needs to validate a user's credentials against a domain managed by a Windows NT-based server. Most of the Microsoft BackOffice components rely on Windows NT authentication for allowing user access. If the server(s) providing this authentication should fail, then the clients will be denied access. Keep in mind it is the Windows NT 4.0-based domain controller (DC) that relays authentication requests for the resource domain to the account domains. (Windows 2000 domain strategies are different.) If the server cannot reach a domain controller in the resource domain, then authentication will fail.

Consult the Microsoft Windows NT Server 4.0 Resource Kit for more information about domain strategies. There are also excellent books on the subject of domain design available.

Memory

Memory chips used in today's computers are incredibly reliable; consumer PCs commonly use memory without parity checks. In servers, however, it is imperative to use error checking and correcting (ECC) memory. ECC memory uses a parity-check scheme to guarantee that the failure of any one bit in a protected unit of information is corrected. A two-bit failure will still cause an error, but that is very, very rare.
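To illustrate the principle, the sketch below uses a toy Hamming(7,4) code to detect and correct a single flipped bit. Real ECC memory applies the same idea in hardware with a wider code across each memory word, so this is a demonstration of the technique rather than of an actual DIMM implementation.

```python
# A minimal sketch of single-bit error correction with a Hamming(7,4) code;
# real server ECC uses a wider code (typically SECDED over 64-bit words).

def hamming74_encode(data_bits):
    """data_bits: four bits (d1..d4). Returns a 7-bit codeword, positions 1..7."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4          # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(code):
    """Recompute parity; the syndrome gives the 1-based position of a single error."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the offending bit back
    return c, syndrome

codeword = hamming74_encode([1, 0, 1, 1])
corrupted = list(codeword)
corrupted[5] ^= 1              # simulate a single-bit memory error
fixed, position = hamming74_correct(corrupted)
assert fixed == codeword and position == 6
print("single-bit error at position", position, "was corrected")
```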

Be aware that even with ECC, memory chips do fail. Try to keep on hand enough memory to replace the entire memory of a computer. If memory check errors become frequent, or the machine will not boot, replace all memory chips rather than spend time trying to figure out which memory chip is bad. You can figure out which one is bad at your leisure.

Cooling

Cooling is one of the most overlooked elements of a server. If a cooling fan fails, the processors, hard drives, or controller cards will overheat and fail. If the computer feels extremely warm when you open the chassis, a fan may have failed. Most servers have two or more fans to protect against this, and some also have thermal sensors to detect abnormal temperatures.

Another aspect of cooling is room temperature. A long rack of computers can generate a huge amount of heat. For this reason, almost all computer rooms have some form of cooling or air conditioning. When adding servers to a computer room, be careful to make sure you will not exceed the cooling capacity of the room. A good rule of thumb is that computers are rated for cooling at about 70º F. If the environment exceeds that temperature significantly, you could have problems.

Power Supplies

Most middle to high-end servers offer the option of a second power supply. If one of the power supplies should fail, then the other will continue to provide power. Again, power supplies are very reliable, but they do fail.

When using dual power supplies, it is a good idea to use two separate power feeds. Using two power feeds allows protection from a circuit breaker tripping or someone unplugging something they should not.

Also, do not forget about external cabinets for RAID arrays or modem banks. If they have power supplies, check and see if dual supplies are available.

Environmental Concerns

One of the most neglected items is the environment the computers must run in. Most servers are incredibly reliable despite the abuse they get. They continue to run. In most computer rooms, this is not much of a concern, because they were designed to be friendly to computers, but elsewhere this can be a serious issue.

Temperature, Humidity, and Cleanliness

As discussed earlier, computers perform best in (approximately) 70º F temperatures. If the computer is installed in an office, temperature is not much of an issue in most cases, but avoid long holiday weekends in the summer with the air conditioning turned off.

Humidity is not very important up to the point where condensation forms or static becomes an issue. Obviously, water condensing in a computer would be very bad. In addition, you do not want mold forming on the computers that could affect cooling or cause a short. Dry air can present a problem as well. People near the computer can develop static. A good static jolt can damage internal components at worst, or just cause the computer to restart.

Cleanliness is very important for computers; dust and dirt can cause shorts and even in extreme conditions fires. For computer room computers, whenever the case is opened for any reason, a quick check should be made to determine if the unit needs cleaning. If it does, then all of the units in the area should be checked.

For computers in office areas, the computer should probably be checked quarterly, or more often if it is in a dirty area. For plant-floor computers or other hazardous areas, an enclosure with air filtration and climate control is a necessity. The air filters on the cabinet should be cleaned per the manufacturer's recommendation; at the same time, the computer and its cabinet should be checked and cleaned if necessary.

Power

Obviously, without power a computer will not run. Unfortunately, power grids are not always that reliable. Consequently, backup power may be a necessity, at the worst to allow the computer to shut down in a controlled manner. There are two scopes of outages; the first is building or computer room failure, the second a regional outage.

In building power failures, particularly in a corporate computer center, it may be necessary to continue providing service to other buildings in the area or to areas geographically remote from the computer center. In this instance, short outages can be survived using UPS units. Longer duration outages can be handled using standby generators. There are two strategies for UPS, one big unit or many small units.

Using a very large UPS, or a series of UPS units covering the entire computer room, has the advantage of usually being easier to maintain and monitor. The disadvantage is a very big problem if it does not work: stories circulate about companies that scheduled a test of their battery backup and took down the entire computer center. The moral: be very careful with a large UPS system and make sure it works, preferably testing during weekends or holidays. Customer I used a pair of UPS systems that received their power from separate power grids.

The other strategy is for every computer to have its own UPS. This tends to be more practical for computers outside a computer room. The upside of these units is that they can interface directly with the computer to signal a shutdown warning when battery power drops to a set point, and a breaker trip or other isolated power outage will not shut down the computer. The downside is that maintenance is more involved because of the sheer number of units, both in record keeping and in physically replacing batteries and testing each unit.

The other dreaded power outage is the regional power outage, similar to what happened to New York City. In these cases, the UPS and generators may work fine, but your telecommunications links may fail. A regional failure can be very expensive if your company has distributed locations, or your business is actively involved with e-Commerce or the Internet. The best alternative in this case is to have another facility in a geographically separate location. This facility should duplicate as many server resources as practical. This is easy with web systems, but databases and other resources, particularly with volatile information are more difficult. Replication and in some cases long fiber lines can be used to mitigate some of this with a form of offsite storage.

The fiber replication option is most useful for protecting against a building disaster (for instance terrorism, fire, or even a burst water pipe). Since the data is at an offsite facility, backup equipment can be switched in.

The important part of such disaster recovery is to make and implement a plan for disasters. This includes identifying what resources truly are critical to the operation of the company.

Cables

The discussion on labeling earlier in this chapter pertains heavily to cabling. In addition, the cables themselves deserve specific attention. Here are a few do's and don'ts for cables.

Do's

  • Do make sure cables are neat and orderly, either with a cable management system or tie wraps. Cables should never be loose in a cabinet; this leads to accidental disconnections.

  • Do use strain relief when possible to secure the cables to whatever the computer is connected to, particularly with pull-out rack-mounted equipment. This way a tug on the cable will not pull it out of its socket.

  • Make sure all cables are securely attached at both ends where possible.

  • If multiple sources of power or network communications are used, try to route the cables feeding the cabinets from different points. This way if one is severed, the other will likely still be functional.

  • Label all cables at both ends if possible. Color-coding tape or labels helps as well.

  • Make sure rack mounted pull out equipment has enough slack in the cables, and that the cables will not bind or be pinched or scraped.

Don'ts

  • Don't plug dual power supplies into the same power strip; use separate power sources.

  • Don't leave loose cables in cabinets.

  • Don't leave cables where they can be snagged by someone walking by or by a cart. All cables should be inside the cabinet.

Backup

The ultimate recovery method is to restore from backup at Customers A, D, and H, as it is at most other installations. To guarantee that backup tapes are available when needed, creating a backup plan is essential. To make sure the backup plan will work, practice doing restores on a regular basis for a variety of disaster scenarios. To make sure the tapes are good, monitor the backup process. Also, be aware that some applications have extensions to allow online backups. Another consideration is whether to back up with a tape unit on the computer or over the network. Consideration should also be given to where the tapes will be stored and how long they remain usable.

Backup Plan

The backup plan should consist of a document describing the servers to be backed up and the information on each server to back up. Typically, backup plans are drawn up for each type or class of server to accommodate the various applications and the data stored on each. Some of this information can come from a master backup plan that details the general procedures for backups. The backup plan should include the following information:

  • The tape software used to perform the backup.

  • The amount of data to be backed up and the number of tapes to be used.

  • Should compression software be used?

  • How frequently will backup tapes be made?

    Will the tapes be incremental, differential, or full backups? If incremental or differential backups are used, how often should full backups be done?

    • Incremental Backups – Back up all of the changes since the last backup, whether full or incremental. These are usually the fastest backups to perform. To restore, the last full backup is restored and then each incremental tape is restored in order, which gives incremental backups the longest restore time. If the full backup is bad, the incremental tapes are of little use.

    • Differential Backups – Back up all of the changes since the last full backup. These are usually considerably faster than full backups, although each differential grows as changes accumulate. To restore, the last full backup is restored and then only the most recent differential tape, so differential backups restore faster than incremental backups.

    • Full Backup - A full backup will copy all data to tape, and usually resets logs and other tracking information. Typically, a full backup is considered the most reliable, but the time considerations can be excessive.

  • Will the tapes back up just the data, or also the operating system, applications, and configuration information? One strategy is to store the operating system, applications, and configuration on one tape and then back up only the data; whenever a change to the operating system or applications is made, a new system tape must be made. This strategy is rarely used, however, because of the risk of not capturing a configuration change and the disruption to normal backup operations.

    Will the tape be an online or offline backup? This is particularly important to applications that are continuously modifying data files, and in many cases may never actually close the data files.

    • Online – Online backups generally require extensions from the application to permit access to the databases. The application retrieves the data from the data storage media and sends it to the tape software for storage. This implies that the backup software must cooperate with the application for backing up online. Online backups are also typically much slower than offline backups and can generate a significant load on the server. The best time to do backups online is usually during slack times if possible.

    • Offline – Offline backups imply that the application is not writing to the data files at the same time as the backup is running. This is not a problem for servers that are usually read-only, like web servers or file servers. Files that are open are typically retried later, and if missed are logged. Offline backups are much faster than online backups and do not require interaction with the application software.

      Offline backups are useful for backing up database applications before repair or replacement of disk systems. During scheduled down time, the applications are stopped, the data is backed up, the drives are repaired or replaced, and if necessary, the data is restored. Doing an offline backup is an excellent idea before major hardware or software changes. If something should fail, offline backups are the fastest way to recover.

  • Develop a plan for labeling and recording when the tapes were made. Many companies use a numbering system and tracking logs to locate the tapes. This allows rapid retrieval of the correct tape, eliminates the clutter and confusion of writing on tape labels or cases, and tends to be more practical when tape loaders are used.

  • How long will it take to backup the system? If the time to backup the system is longer than the available time window, different strategies will need to be employed, or perhaps faster backup systems will have to be used.

  • How long will it take to restore the tape should disaster occur? This information is important for providing downtime estimates, and for determining when it is time to improve the reliability of the computer system or upgrade to faster tape units. Another approach is to immediately start restoring the tapes to a standby unit when a disaster occurs. If the failed unit can be repaired, nothing is lost; if it cannot, time has been saved because the restore is already under way.
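A rough back-of-the-envelope calculation answers the last two questions early in planning. The data size, tape throughput, window, and restore factor below are illustrative assumptions; substitute measured figures from your own environment.

```python
# A minimal sketch for estimating the backup window and restore time;
# all figures are assumptions, not measurements.
data_gb = 120                      # data to back up
tape_mb_per_sec = 5.0              # sustained throughput of the tape unit
window_hours = 6                   # nightly window available for backups

backup_hours = data_gb * 1024 / tape_mb_per_sec / 3600
print(f"Estimated full backup: {backup_hours:.1f} h "
      f"({'fits' if backup_hours <= window_hours else 'does NOT fit'} "
      f"in a {window_hours} h window)")

# Restores are often slower than backups; a conservative planning factor
# helps when quoting downtime estimates.
restore_factor = 1.5               # assumption: restore runs at ~2/3 backup speed
print(f"Estimated restore time: {backup_hours * restore_factor:.1f} h")
```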

Practice

The old saying, "Practice makes perfect" should be gospel to operations staffs. Practice backup procedures routinely, and budget time for the practicing. Some recommendations have suggested quarterly backup practices for each backup plan. This is particularly important for operations staffs with high turnover.

When doing a practice restore, use a new unit before it is loaded, a test unit, or a standby unit as the target. Choose the backup of a production unit to restore, preferably one of the units with the highest utilization, to get real-world experience. During the practice run, keep the practice unit off the network to prevent address or name collisions with the actual production unit.

Also, consider varying the type of disaster: fail different components and practice restoring them. The process for recovering the operating system can be different from the process used when a data drive fails.

When the practice is complete, do a post mortem (also a good idea for real situations). Identify what went wrong, what procedures need changing, and how the operation can be done faster. Learn from it to improve the procedure.

Monitoring

One essential element is ensuring that the available backups are in good condition. All commercial backup software produces logs, and in some cases writes event log entries as well. Check the logs daily to verify that the backup operation was successful. Any abnormality may signal a problem with the tape unit, the server, or the application; diagnose such failures immediately, because a failed tape backup may prevent a restoration should the server fail.

In some cases, backup failure can signal a corrupt database, or other problems with the application. These faults frequently will not be noted in any other event log or application behavior. Left uncorrected, these faults may eventually cause an unscheduled outage.
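The daily log check can be partially automated with a simple scan for failure keywords. The log path and keyword list in the sketch below are assumptions; adjust them to the backup product in use, and treat a clean scan as a prompt for, not a replacement of, the operator's review.

```python
# A minimal sketch that scans a backup log for suspicious lines; the path
# and keywords are assumptions to be adapted to your backup software.
LOG_PATH = r"C:\Backup\logs\lastnight.log"      # assumption
KEYWORDS = ("error", "fail", "skipped", "corrupt", "abort")

def check_backup_log(path):
    problems = []
    with open(path, encoding="utf-8", errors="replace") as log:
        for lineno, line in enumerate(log, start=1):
            if any(word in line.lower() for word in KEYWORDS):
                problems.append((lineno, line.rstrip()))
    return problems

issues = check_backup_log(LOG_PATH)
if issues:
    print(f"{len(issues)} suspicious line(s) -- investigate before relying on this tape:")
    for lineno, text in issues:
        print(f"  line {lineno}: {text}")
else:
    print("Backup log clean; record the check in the operations log.")
```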

Network vs. Local Tape Units

Many customers back up servers over the network. This is an economical way of backing up many servers with only one tape unit; typically, the tape unit has a tape feed mechanism to switch tapes automatically. The advantages and disadvantages of local and network tape units are compared below.

| Item | Local | Network |
| --- | --- | --- |
| Performance | A local tape unit is very fast because of the direct connection. For very large backup operations where time is an issue, this is the preferred solution. | Network traffic can slow the tape transfer rate considerably. A fast network, or a separate network for backups, can mitigate the performance issues somewhat. |
| Cost | Higher, because each server has its own tape unit. | More cost effective; multiple servers can be backed up with the same tape unit. Multiple tape units can be installed on one server for parallel operation. |
| Recovery time | Having a local tape unit may be slightly faster. | If a restore on a server is required during normal backup cycles, the backup schedule may be disrupted. |
| Operator efficiency | Poor; tape drives scattered all over the computer room make changing tapes time consuming, and monitoring and scheduling of tape operations may be more complex. | Tape units can be grouped in one area, making it easier for operators to change tapes. Having fewer servers to schedule jobs on and monitor reduces effort and complexity. |
| Remote locations | For areas with only one or two servers connected over the WAN, local backup is typically much more practical. | Network backup is typically not practical over the WAN. |

Offsite Storage

Offsite storage is a long-standing practice of sending a copy of the backup off site. The concept is to make sure that if a disaster happens at the building housing the servers, most of the data can eventually be recovered.

Which tape is sent offsite depends on how quickly the tape can be retrieved and the level of protection required. In some instances, the last full backup and any incremental tapes are stored offsite; in other cases, the previous full backup is all that needs to be stored offsite. Business practices will determine the need in these cases.

In extreme cases, some companies have opted to use fiber to mirror the data drive at a remote location. This provides recovery of virtually all transactions or data changes.

Physical Media Life

Contrary to popular belief, no media is permanent. Tapes lose information at a certain rate as the magnetic signal degrades. Writeable CDs will eventually deteriorate as well, and even hard disks will lose their information over time. Check with the manufacturer of the media to determine how long it will store information with a high degree of reliability.

Another aspect to consider on physical media is the wear and tear. Tapes and even rewriteable CDs have a fixed number of cycles that they can be used for. Check with the manufacturer for recommendations on replacement frequency.

When recycling tapes, discard any tape that shows errors. It is far better to discard and replace a tape than to have it fail when a restore is needed.

Other Approaches

One customer came up with an innovative approach to high-speed recovery, not typically from media failure, but from database corruption. Three disks are used to hold the data. Two disks at any given time have mirror images of the current production data. The remaining drive has the data for the previous day. At the end of the day, the following sequence occurs:

  1. The application is stopped.

  2. Drive A is removed from the mirror set and drive C is added to the mirror set.

  3. The application is restarted.

  4. Data from B is synchronized with C.

  5. Drive A is backed up to tape along with the log files.

Pre-recovery and recovered phases are illustrated in the following figure.


Figure 4: High-speed data recovery

This procedure and configuration provide a recovery path should the data become corrupt: the previous day's drive is brought back into service and the transaction logs are replayed. The disadvantage of this procedure and configuration is that the application must be stopped nightly for about 15 minutes. The advantage is that data can be recovered up to the previous day in about 30 minutes.
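The rotation can be modeled as a small state machine, which makes the nightly procedure easier to document and review. The sketch below only tracks which drive plays which role; the actual mirror break, add, and resynchronization would be performed with the disk administration tools, not by this script.

```python
# A minimal sketch modeling the nightly three-disk rotation described above
# as role changes; drive names follow the text, but the real mirror
# operations happen in the disk administration tools.
mirror_set = {"A", "B"}        # drives holding today's production data
standby = "C"                  # drive holding the previous day's data

def nightly_rotation(mirror_set, standby):
    # 1. Stop the application (performed outside this sketch).
    # 2. Remove one mirror member and add the standby drive in its place.
    retired = sorted(mirror_set)[0]
    mirror_set = (mirror_set - {retired}) | {standby}
    # 3. Restart the application; 4. the new member resynchronizes from its partner.
    # 5. The retired drive holds the end-of-day snapshot and goes to tape with the logs.
    return mirror_set, retired

mirror_set, standby = nightly_rotation(mirror_set, standby)
print("Active mirror:", sorted(mirror_set), "| frozen copy for tape/rollback:", standby)
```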

Chapter 5 -- Planning: Software Strategies

Like hardware strategies for high-availability deployment, highly available software deployments involve work in several phases. These phases break down roughly into:

  • software selection

  • integration testing

  • training

  • software deployment

  • software support

  • application isolation

  • error detection

  • recovery

Each one of these phases is critical to the overall success of the effort. Software strategies involve the architects of the system from development, testing, deployment, operations, system monitoring and support.

Software Selection

It is important to understand the availability requirements of the user community when selecting the software to meet their needs. This is when the "number of nines" becomes important. Different strategies are needed depending on how many nines the application is required to support. An application that requires 99% uptime may not require special load balancing or fail-over support. An application that requires 99.9% or greater availability probably does require features that support redundancy, load balancing, and automated fail over.
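A quick calculation shows what each additional nine allows in unplanned downtime per year, assuming 24x7 operation over a 365-day year.

```python
# Allowed downtime per year for each availability target (24x7 operation).
HOURS_PER_YEAR = 24 * 365

for availability in (0.99, 0.999, 0.9999):
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{availability:.2%} availability -> "
          f"{downtime_hours:.1f} h ({downtime_hours * 60:.0f} min) downtime per year")
```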

When selecting software it is important to understand the inter-relationship between components in a system. If component A never fails, but depends on the availability of component B which frequently fails to complete its functions, then the work performed by component A is subject to the availability of component B. Listed below are strategies for improving component A's availability with regard to its relationship with component B.

  • Improve the availability of component B using redundant systems and automated fail over. This type of strategy is appropriate when component B represents a database that must participate in a transaction for the work to complete. This action usually does not require a change to component A.

  • Change the relationship between components A and component B from synchronous to asynchronous. This would allow component A to function independently from component B.

  • Allow for a period of latency between component A and component B. This would allow temporary failures in component B to be gracefully handled by component A.
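The asynchronous and latency strategies above can be illustrated with a simple queue between the two components. In the sketch below the queue is an in-process object and the component names are placeholders; in production the same pattern would typically use a durable queuing service such as MSMQ so that work survives a restart.

```python
# A minimal sketch of decoupling component A from component B with a queue
# so that A keeps accepting work while B is slow or briefly unavailable.
import queue
import threading
import time

work_queue = queue.Queue()

class TransientError(Exception):
    """Placeholder for a recoverable failure in component B."""

def process_in_component_b(item):
    print("B processed", item)

def component_a_submit(item):
    # Component A returns immediately; it no longer waits on B's availability.
    work_queue.put(item)

def component_b_worker():
    while True:
        item = work_queue.get()
        if item is None:
            break
        try:
            process_in_component_b(item)
        except TransientError:
            time.sleep(5)              # tolerate a short outage, then retry
            work_queue.put(item)
        finally:
            work_queue.task_done()

threading.Thread(target=component_b_worker, daemon=True).start()
for order in ("order-1", "order-2"):
    component_a_submit(order)
work_queue.join()
```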

Not all of these options will be available for every software solution that meets a certain business requirement. Listed below are items to consider when selecting software for a high availability application:

  • Is the software cluster aware?

  • Does the software support load balancing?

  • Does the software support data replication?

  • What type of instrumentation does the software support?

  • Are there hardware compatibility requirements for the software?

  • Is the software certified for the environment (for example, does it carry the Designed for BackOffice logo)?

  • Was the software tested and certified for the environment where it will be deployed?

System Integration Testing

Application isolation is a technique to minimize dependencies between applications. Invariably, however, real-world installations will involve several applications running together in an environment. This coexistence of applications can result in unexpected system outages that are difficult to diagnose and may be sporadic in nature.

One technique to attempt to avoid system conflicts between applications involves systems integration testing. To accomplish this level of testing, IT personnel will need to stage an application in a replica of the real-world system. An important concept is that this test environment may involve several machines, a network, and a variety of users actively using the system, all in order to effectively duplicate real-world usage scenarios. Customers C, D and E have set up test labs where applications are certified.

The requirements for certification vary by application; however, simulated customer transactions are used for testing. Items to test include compatibility of system components (DLLs), database deadlocks, system performance issues, and system capacity.

Training

A common theme for all the customer interviews is to have properly trained resources handling the various tasks and issues that might occur during a normal day's processing. This includes training for the following.

  • Operations personnel

  • Help desk personnel

  • Service technicians

  • Crisis management teams

It is important for operations personnel to understand the responsibilities and the limits for performing their tasks. A best practice indicated by customers C, F and H is to automate the daily process as much as possible. The operations personnel in the data center are normally trained to watch the daily operations processes, and report the status of the various jobs. Additionally, operations personnel may act to resolve system issues based upon run-guide documentation for each specific, known system event. Typically, the operations personnel perform the same type of tasks in all facilities.

Help desk personnel are required to diagnose problems and either assist the operations personnel in fixing the problem, or escalate the problem to the correct resource for resolution. This requires the help desk personnel to have both general operating system knowledge and knowledge of the applications being used. Help desk personnel usually are trained in both user supplied and vendor supplied courses.

Service technicians are assigned to a problem when the help desk and operations personnel cannot resolve it. Service technicians typically specialize in one or more specific areas and have the in-depth knowledge required to fix a problem once the help desk determines which area of the system requires attention. They are typically trained in advanced courses offered by the vendors or by the in-house development staff.

Software Deployment

As with software selection, it is important to understand the availability requirements for an application when considering software deployment. First, environmental issues should be addressed.

If the system design calls for isolated, uninterrupted power, a UPS system should be available. If the design calls for a warm standby server, the configuration of each system should be identical. The standby server may need to be in a separate facility on a different utility power grid. Customers C, D and H all have their standby servers in separate facilities, using different power grids. If the system has a hardware compatibility list, all the hardware components need to be verified to be on the list.

If there are special redundancy requirements, such as DNS routing, this needs to be included. Customers D and I both use Windows Load Balancing Service for DNS routing. Customers E and F use Cisco Local Director.

When deploying a system, capacity planning plays an important part in the success of a highly available system. A good capacity plan can limit avoidable failures. A capacity plan should consider current and future anticipated system usage. This would include CPU usage, network usage and data storage requirements. A good capacity plan will include future scalability and growth requirements. Customer I provides certified configurations for both hardware and operating systems to assist in this type of planning.

Another important consideration for system deployment is software updates. This includes hot fixes and service packs to the operating system along with fixes and upgrades. Customer F uses Microsoft Systems Management Server as an automated distribution tool, and plans a monthly block of time for updates. Customer A repackages vendor setup routines to fit their environment.

An important part of the software update process is certification. Certification helps ensure the quality of the updates, verifies component compatibility with other applications on a server, and provides stress testing. Certification usually includes a combination of automated tests and end users testing the software. Customer E has a set of performance and conformance tests that each application must pass before deployment. Customers C and D both believe the most effective tests simulate actual business functions. However, Customer D finds that even the most exhaustive tests do not find all the potential issues, and uses a pilot installation to test real-life situations.

The deployment of a system also needs to take into account how the system will be monitored and supported. This may include the deployment of management and monitoring tools such as Microsoft Systems Management Server, HP OpenView and Tivoli. This may also include the use of functions like regular ping operations or running remote scripts. This will largely depend on the features and capabilities of the system being deployed.

Cluster Considerations

Customers D, E and F are in the process of testing their systems for Microsoft Cluster Server (MSCS). Specific considerations for deploying cluster aware applications are discussed in this section.

A two-server cluster improves data and application availability by allowing two servers to trade ownership of the same hard disks within a cluster. When a system in the cluster fails, the cluster software recovers the resources and transfers the work from the failed system to the other server within the cluster. Consequently, the failure of one system in the cluster does not affect the other systems, and in many cases, the client applications are completely unaware of the failure. This results in high server availability for the user. In addition, this two-server cluster system can be used for manual load balancing and to unload servers for planned maintenance, without downtime.

The Microsoft BackOffice family of applications, such as Microsoft SQL Server Enterprise Edition 7.0 and Microsoft Exchange Server Enterprise Edition 5.5, support the fail-over capability of the cluster. Microsoft SQL Server is planned to support a partitioned data model and parallel execution to take full advantage of the shared nothing environment.

Windows NT Server Enterprise Edition clustering provides basic benefits for non-cluster-aware server applications. Developers who wish to take direct advantage of cluster services for easier setup, faster fail-over, or detection of more subtle error conditions can use the cluster APIs and tools in the Windows SDK. Over time, Microsoft development tools will be enhanced to support the easy creation of cluster-aware applications that more fully exploit the cluster for faster recovery, easier manageability, and higher scalability. It is also important to note that not all server applications will need to be cluster-aware to take advantage of cluster scalability benefits. Applications that build on top of cluster-aware core applications, such as large commercial database packages (for example, an accounting or financial database application built on top of SQL Server) will benefit automatically from cluster enhancements made to the underlying application (for example, SQL Server). Many server applications that use database services, client/server connection interaction, and file and print services will benefit from clustering technology, without requiring application changes.

Windows Load Balancing Service

Both Customers D and I use Microsoft Windows NT Load Balancing Service (WLBS), a feature of Windows NT Server 4.0, Enterprise Edition, to provide load balancing and clustering for mission critical Internet applications. WLBS dynamically distributes IP traffic across multiple cluster nodes, and provides automatic fail over in case of node failure. WLBS also provides multihomed server and rolling upgrade support, ease of use and controllability. WLBS runs on economical, industry standard PC platforms and interoperates with both Microsoft and third-party clustering products to provide a complete three-tier solution.

Software Support and Monitoring

This section will discuss software issues to be aware of for monitoring and support. Specific issues regarding system monitoring and support are covered in Chapter 7 – Monitoring and Analysis.

The essential element for software support strategies is to understand what causes the end user to experience a disruption in service. This could include a system failure, performance degradation or application timeouts.

Both Customers I and D include regular tests of actual business transactions as part of their system monitoring. These tests supplement any ping, SNMP trap, or scripts that might be run as part of the automated network management.

If the business functions do not complete successfully within the allotted time, then support personnel are notified.

Another important consideration with software support strategies is understanding the condition of the application. The server may be running, and the application active, but the application may not be responding to input. A database deadlock, worker threads aborting or a priority problem could cause this. Customer G pings the application on a regular basis to determine whether this type of condition exists. The application responds with a status of "green" if conditions are normal, a status of "yellow" if the application is retrying recoverable errors, and a status of "red" if the application does not respond.
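The sketch below illustrates how such a green/yellow/red application-level check might be structured. It is a minimal, hypothetical example: the probe function, status names, and thresholds are assumptions for illustration and do not represent Customer G's actual implementation.

    // Hypothetical sketch of a green/yellow/red application health check.
    // The probe logic and thresholds are assumptions for illustration only.
    #include <stdio.h>

    enum HealthStatus { STATUS_GREEN, STATUS_YELLOW, STATUS_RED };

    struct ProbeResult {
        int responded;        // did the application answer the ping at all?
        int recoveredRetries; // recoverable errors the application retried
    };

    // Placeholder for the real application-level ping (for example, a
    // lightweight transaction submitted through the application's interface).
    static ProbeResult PingApplication(void)
    {
        ProbeResult r = { 1, 0 };
        return r;
    }

    static HealthStatus EvaluateHealth(const ProbeResult& r)
    {
        if (!r.responded)           return STATUS_RED;    // no response at all
        if (r.recoveredRetries > 0) return STATUS_YELLOW; // working, but retrying
        return STATUS_GREEN;                              // normal operation
    }

    int main(void)
    {
        switch (EvaluateHealth(PingApplication())) {
            case STATUS_GREEN:  printf("green\n");  break;
            case STATUS_YELLOW: printf("yellow\n"); break;
            case STATUS_RED:    printf("red\n");    break;
        }
        return 0;
    }

In practice the "yellow" state is the one that saves outages: it gives support personnel warning while the application is still serving users.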

Application Isolation

A running system may be constantly performing tasks and generating events. Examples of this may include accessing network resources, interacting with a local data store, or alerting an operator that a printer is out of paper. Each of these tasks could affect other applications within your systems or could cause an error condition within your systems. A high availability system must not only withstand these events, but should also alert other systems when error conditions occur so that recovery procedures may be called into action. Key concepts will include:

  • isolating a high availability application from interference by other systems

  • including high availability applications in standard error detection

  • restoring a high availability application if an error does occur

Application Isolation Techniques

The Windows NT environment supports the isolation of applications with processes and threads. When an application is started under Windows NT, it is assigned to its own process. Each process provides the resources needed to execute a program. A process has a virtual address space, executable code, data, object handles, environment variables, a base priority, and minimum and maximum working set sizes. Each process is started with a single thread, often called the primary thread, but can create additional threads from any of its threads.

All threads of a process share its virtual address space and system resources. In addition, each thread maintains exception handlers, a scheduling priority, and a set of structures the system will use to save the thread context until it is scheduled. The thread context includes the thread's set of machine registers, the kernel stack, a thread environment block, and a user stack in the address space of the thread's process.

Since each application resides in a separate process, the effect on the system and other applications on the system is minimized.
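As a minimal sketch of this isolation model, the fragment below starts another application in its own process with the Win32 CreateProcess call and then inspects its exit code. The executable path is a hypothetical placeholder; the point is that a crash in the child process cannot corrupt the parent's address space.

    // Sketch: launch an application in its own process so that a failure in
    // that application cannot corrupt this process's address space.
    // The executable path below is a hypothetical placeholder.
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        STARTUPINFOA si;
        PROCESS_INFORMATION pi;
        ZeroMemory(&si, sizeof(si));
        si.cb = sizeof(si);
        ZeroMemory(&pi, sizeof(pi));

        char cmdLine[] = "C:\\Apps\\ReportService.exe";   // hypothetical application

        if (!CreateProcessA(NULL, cmdLine, NULL, NULL, FALSE,
                            CREATE_NEW_CONSOLE, NULL, NULL, &si, &pi)) {
            printf("CreateProcess failed, error %lu\n", GetLastError());
            return 1;
        }

        // Wait for the child to finish and examine how it ended. A crash in
        // the child is reported here as an exit code; it does not bring down
        // this process.
        WaitForSingleObject(pi.hProcess, INFINITE);

        DWORD exitCode = 0;
        GetExitCodeProcess(pi.hProcess, &exitCode);
        printf("Child process exited with code %lu\n", exitCode);

        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
        return 0;
    }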

Application System Isolation

Over and above the features Windows NT provides for application isolation, it is recommended that each system within a high availability environment be reviewed for data and system dependencies that might exist with other active systems.

Each application should be able to operate completely independently, or have latency built into interaction between the systems.

Examples of an independent application are the Web servers used by Customers D, F, and I. These customers use either the Cisco Local Director or Windows Load Balancing Service to run multiple, independent copies of the same application. Each application may have redundant servers providing the same service. Each application uses its own database.

Another example of an independent application is Customer C's decision to replicate read only customer data from one application when that data is used by a separate application. This keeps each application immune from each other's failures and service interruptions.

Customer A provides an example of where latency between loosely coupled systems provides isolation from potential failures. The day's production schedule is loaded from the Enterprise Resource Planning system to the Manufacturing Execution System. The production for the day can continue even if the Enterprise Resource Planning system becomes unavailable for that day. This type of loose coupling with latency allows operations to continue for a period of time determined by the amount of data downloaded at the start of the day.

Use of Queues for Inter-System Communication

A common technique to provide application isolation and limit the dependency on separate systems is through the use of queues. Queues can be implemented using queuing software such as Microsoft Message Queue Server (MSMQ), IBM MQSeries, or Compaq DEC Message Queue. Customer G uses a queuing strategy so that workstations can operate independently for a period of time in the case of network failures.

Use of queues is highly desirable. Some of the benefits of using queuing technology are listed below.

  • Queues do not require the sender and receiver to be available at the same time.

  • Queues provide a mechanism to guarantee message delivery

  • Queues can be configured as durable, with their messaging able to survive network and hardware failures.

  • Queues are generally lightweight and do not add a lot of additional overhead.

A common technique to implement guaranteed message delivery between the Windows NT environment and a mainframe is to use MSMQ on Windows NT and IBM's MQSeries on the mainframe. The MSMQSeries Bridge from Level 8 Systems is used to connect the two queuing systems. The MSMQSeries Bridge is available with Windows NT 4.0, Enterprise Edition.
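To make the queuing pattern concrete, the following sketch sends a recoverable (durable) message with the MSMQ C API. It is a hedged illustration, not a prescribed implementation: the queue format name and message payload are hypothetical placeholders, and error handling is abbreviated.

    // Sketch: send a recoverable message to an MSMQ queue using the MSMQ C API.
    // The queue format name and message text are hypothetical placeholders.
    // Link with mqrt.lib.
    #include <windows.h>
    #include <mq.h>
    #include <stdio.h>

    int main(void)
    {
        const WCHAR formatName[] = L"DIRECT=OS:appserver01\\private$\\orders"; // hypothetical queue
        WCHAR label[] = L"Order update";
        char  body[]  = "ORDER-12345;QTY=10";                                  // hypothetical payload

        MSGPROPID     propId[3];
        MQPROPVARIANT propVar[3];

        propId[0] = PROPID_M_LABEL;            // human-readable message label
        propVar[0].vt = VT_LPWSTR;
        propVar[0].pwszVal = label;

        propId[1] = PROPID_M_BODY;             // message body as a byte array
        propVar[1].vt = VT_VECTOR | VT_UI1;
        propVar[1].caub.pElems = (UCHAR*)body;
        propVar[1].caub.cElems = sizeof(body);

        propId[2] = PROPID_M_DELIVERY;         // recoverable: written to disk so the
        propVar[2].vt = VT_UI1;                // message survives a machine restart
        propVar[2].bVal = MQMSG_DELIVERY_RECOVERABLE;

        MQMSGPROPS msgProps;
        msgProps.cProp = 3;
        msgProps.aPropID = propId;
        msgProps.aPropVar = propVar;
        msgProps.aStatus = NULL;

        QUEUEHANDLE hQueue = NULL;
        HRESULT hr = MQOpenQueue(formatName, MQ_SEND_ACCESS, MQ_DENY_NONE, &hQueue);
        if (FAILED(hr)) {
            printf("MQOpenQueue failed: 0x%08lx\n", (unsigned long)hr);
            return 1;
        }

        hr = MQSendMessage(hQueue, &msgProps, MQ_NO_TRANSACTION);
        if (FAILED(hr))
            printf("MQSendMessage failed: 0x%08lx\n", (unsigned long)hr);

        MQCloseQueue(hQueue);
        return FAILED(hr) ? 1 : 0;
    }

The PROPID_M_DELIVERY property is what provides the durability benefit listed above; the sender does not need the receiving system to be online when the message is sent.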

Isolation of Legacy Code

There are times when a Win32®-based version of an application is not available. Under these conditions, a 16-bit application may be the only option. Windows NT provides the ability to run 16-bit applications together under a single process, or with each 16-bit application under its own process.

To provide application isolation it is recommended that each 16-bit application run in a separate process. This may not always be practical since some 16-bit applications only function properly if applications are allowed to write in each other's process space. This was a common practice with 16-bit applications. It requires the applications affected to run in the same process space, at the expense of application isolation.

COM provides a technique to isolate and inter-operate with 16-bit applications. If the 16-bit application can run as an out of process COM server, 32-bit applications can interact with it as COM clients. COM provides the marshalling necessary to move data between 16- and 32-bit applications. Should the 16-bit application crash, it could be restarted without affecting the 32-bit clients. A word of caution: the 32-bit client must be developed to detect a failed COM server, and restart the server if required.
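The sketch below illustrates that caution: a client that checks the HRESULT of each call for a disconnected or unavailable server and re-creates the out-of-process server when needed. This is a hedged example only; the ProgID and method name are hypothetical, and a real client would wrap this retry logic around every call.

    // Sketch: detect a failed out-of-process COM server and restart it by
    // re-creating the object. The ProgID and method name are hypothetical.
    #include <windows.h>
    #include <stdio.h>

    // Returns true when the HRESULT suggests the server process has died.
    static bool ServerHasFailed(HRESULT hr)
    {
        return hr == RPC_E_DISCONNECTED ||
               hr == HRESULT_FROM_WIN32(RPC_S_SERVER_UNAVAILABLE) ||
               hr == HRESULT_FROM_WIN32(RPC_S_CALL_FAILED);
    }

    static HRESULT CreateServer(IDispatch** ppDisp)
    {
        CLSID clsid;
        HRESULT hr = CLSIDFromProgID(L"Legacy.OrderServer", &clsid);   // hypothetical ProgID
        if (FAILED(hr)) return hr;
        return CoCreateInstance(clsid, NULL, CLSCTX_LOCAL_SERVER,
                                IID_IDispatch, (void**)ppDisp);
    }

    int main(void)
    {
        CoInitialize(NULL);

        IDispatch* pDisp = NULL;
        HRESULT hr = CreateServer(&pDisp);
        if (FAILED(hr)) { printf("Could not start server: 0x%08lx\n", (unsigned long)hr); return 1; }

        OLECHAR methodNameBuf[] = L"ProcessOrders";                    // hypothetical method
        LPOLESTR methodName = methodNameBuf;
        DISPID dispid = 0;
        hr = pDisp->GetIDsOfNames(IID_NULL, &methodName, 1, LOCALE_USER_DEFAULT, &dispid);

        DISPPARAMS noArgs = { NULL, NULL, 0, 0 };
        if (SUCCEEDED(hr))
            hr = pDisp->Invoke(dispid, IID_NULL, LOCALE_USER_DEFAULT,
                               DISPATCH_METHOD, &noArgs, NULL, NULL, NULL);

        if (ServerHasFailed(hr)) {
            // The legacy server crashed; release the dead proxy and start it again.
            pDisp->Release();
            pDisp = NULL;
            hr = CreateServer(&pDisp);
            printf("Server restarted: 0x%08lx\n", (unsigned long)hr);
        }

        if (pDisp) pDisp->Release();
        CoUninitialize();
        return 0;
    }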

Error Detection

All the customers interviewed made a point of the importance of error detection. An error detection strategy works best when an error is detected, resolved, and recovered from without the knowledge of the user community.

Error detection in a Windows NT environment will enhance a system's reliability and availability. Early detection and handling of application and system errors can help avoid a system shutdown, or at least allow for an orderly shutdown. It can also increase availability by allowing the system to continue operating in a degraded state.

There are three common mechanisms available in Windows NT-based systems that allow for error detection. The SNMP protocol captures, or traps, configuration and status information from a Windows NT server. Event logs provide a system for capturing and reviewing significant application and system events. Exception trapping capabilities allow a program to reliably control its response to exceptions that occur during execution.

SNMP

This protocol captures, or traps, configuration and status information on the systems in a network running an SNMP agent (the SNMP client software). SNMP can be configured to send this trapped information to a designated machine for event monitoring. The SNMP service is installed on the server and a second system is designated to receive the trapped messages. This designated system must be running an SNMP monitoring application. For more information on SNMP monitoring applications, please refer to Appendix A – Tools.

Event Logs

The Windows NT-based Event Log provides a standard, centralized mechanism for applications and the operating system to record important software and hardware events. The event-logging service stores events in a single collection called the event log. Windows NT also provides the Event Viewer for displaying the information from the event log, as well as a programmatic interface for examining the log file.

Four types of log files are available:

  • application

  • security

  • system

  • custom registered log file

These log files can be manually reviewed by using the Event Viewer application provided with Windows NT, or they can be scanned for specific information through custom applications developed for that purpose. An application may be developed that periodically searches the event log files for a specific event. The occurrence of this event would be an indication that an error has occurred on the system and that appropriate action should be taken to handle the error.

A mission critical application would create its own log file. This log file would be examined periodically by a custom application that would search the file for critical application events. If a critical event were detected, operations personnel would be notified.

There are five types of events defined by Windows NT. Each event type has well-defined common data and supports optional event-specific data. An application must specify the event type when it reports an event. The five event types are: Information, Warning, Error, Success audit, and Failure audit.
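As a sketch of how an application might record one of these event types, the fragment below writes an error event to the application log with the Win32 event-logging API. The event source name and message text are hypothetical; a production application would also register a message file for its source so that Event Viewer can format the message.

    // Sketch: report an error event to the Windows NT application event log.
    // The event source name and message text are hypothetical; a real
    // application would register a message file for its source in the registry.
    #include <windows.h>

    int main(void)
    {
        HANDLE hEventLog = RegisterEventSourceA(NULL, "OrderProcessing"); // hypothetical source
        if (hEventLog == NULL)
            return 1;

        LPCSTR messages[1];
        messages[0] = "Unable to connect to the pricing database; retrying.";

        // EVENTLOG_ERROR_TYPE is one of the five event types described above.
        ReportEventA(hEventLog,
                     EVENTLOG_ERROR_TYPE,  // event type
                     0,                    // category
                     0,                    // event identifier (normally from a message file)
                     NULL,                 // no user security identifier
                     1,                    // one insertion string
                     0,                    // no binary data
                     messages,             // insertion strings
                     NULL);                // no binary data pointer

        DeregisterEventSource(hEventLog);
        return 0;
    }

A monitoring application, or an operator using Event Viewer, can then filter on the source name and event type to decide when support personnel should be notified.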

Examples of how event logging can be helpful are:

  • Resource problems. An application gets into a low-memory situation caused by a code defect or inadequate memory.

  • Hardware problems. A device driver encounters a disk controller time-out.

  • Bad disk sectors. A disk driver encounters a bad sector; it may be able to continue after retrying the disk operation, but the sector will eventually go bad.

  • Information events. A server logs an event indicating that it cannot access a file.

Exception Trapping

Both Customers F and G internally develop applications for their high availability systems. This allows them to standardize on how errors are trapped and handled in programs. This section describes methods for exception handling in programs.

An "exception" is an unexpected event that prevents a process from following its normal execution path. Both hardware and software can detect exceptions. Examples of hardware exceptions are dividing by zero and the overflow of a numeric type. Software exceptions include those detected in your application, which are then signaled to the system through a special function call.

Exception handling allows you to write code that is more reliable. You can ensure that resources, such as allocated memory and open files, are properly closed in the case of an unexpected termination of your application. You can also handle specific problems, such as insufficient memory, with code that does not have to rely on the elaborate testing of return codes.

Exception trapping in the C programming environment is best handled using a technique called structured exception handling. This technique involves the writing of exception handlers and termination handlers. The C++ language provides built-in support for handling exceptions. In C++, the process of raising an exception is called "throwing" an exception. A designated exception handler then "catches" the thrown exception. While structured exception handling works in C++ programs, you can ensure that your code is more portable by using C++ exception handling.
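A brief illustration of the C++ approach follows. It is a minimal sketch: the configuration file path and the failure conditions are contrived for the example. The handlers report the problem and let stack unwinding release resources rather than allowing the process to terminate unexpectedly.

    // Sketch: C++ exception handling used to keep a failure localized and to
    // release resources cleanly. The file name and error conditions are contrived.
    #include <cstdio>
    #include <fstream>
    #include <iterator>
    #include <stdexcept>
    #include <string>
    #include <vector>

    static std::vector<char> LoadConfiguration(const std::string& path)
    {
        std::ifstream in(path.c_str(), std::ios::binary);
        if (!in)
            throw std::runtime_error("cannot open configuration file: " + path);

        return std::vector<char>((std::istreambuf_iterator<char>(in)),
                                 std::istreambuf_iterator<char>());
    }

    int main()
    {
        try {
            std::vector<char> config = LoadConfiguration("C:\\Apps\\service.cfg"); // hypothetical path
            std::printf("Loaded %u bytes of configuration.\n",
                        static_cast<unsigned>(config.size()));
        }
        catch (const std::bad_alloc&) {
            // Insufficient memory: degrade gracefully instead of crashing.
            std::printf("Out of memory while loading configuration.\n");
            return 2;
        }
        catch (const std::exception& e) {
            // Any other failure is reported; open files are closed automatically
            // as their objects go out of scope during stack unwinding.
            std::printf("Configuration error: %s\n", e.what());
            return 1;
        }
        return 0;
    }

In a high availability application, each catch block is also a natural place to write an event to the event log, which connects programmatic error handling to the operational error detection discussed above.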

While exception trapping is a good programming practice, it does not solve the problem of error detection at the operational level. Unless the application is designed to write to the event log when significant exceptions occur, operations personnel cannot detect that an application is struggling with a system problem. An application may be handling exceptions resulting from reading or writing to a faulty disk. Since no application or system crashes result, there is no outward indication of any problem. For this case, the monitoring of disk retries would provide operations with an indication of a problem on a system. Even with well behaved applications, it is still critical that operations monitor essential system performance parameters like processor and memory utilization, network or disk retries, and so on. This will allow for the detection of potential problems before they become critical problems.

Poorly Instrumented Applications

Customers B and I both use system logs to assist with error detection, recovery and root cause analysis. Not all applications meet their requirements for instrumentation. This section discusses strategies for handling poorly instrumented applications.

Two schemes are typically used to address the situation where poorly instrumented applications exist in a production environment. The first scheme is quite simple. It involves moving the poorly instrumented applications to non-mission critical servers. This minimizes the impact these applications can have on production servers. Sometimes it is impractical to move these applications to another server. Then the best policy is to ensure that sufficient system monitoring is occurring on the server where these applications reside. This would require taking snapshots of important system performance measurements at regular intervals, as discussed in Chapter 7 – Monitoring and Analysis. Operations personnel would then monitor these measurements for any anomalies.

Recovery

There are three basic techniques available to recover information following a system failure. Restoring information from backup media (tape) and replaying log files enable past information to be reconstructed. The use of a fail-over system limits data loss to a minimal amount and provides a copy that can be used to reconstruct the data on the primary system.

Restore from Tape

The periodic storage of information from a system to off-line media is a technique used to ensure that a system can be brought back to a known state following a failure. The frequency at which a system's state is recorded to an off-line storage media determines how closely the restored system will resemble the system before the failure. A typical scenario for recording information from a system involves the capture of the entire system once a week, followed by incremental captures of the changes that occur each day.

Replay Log Files

System components such as relational databases and message queues maintain log files that record the activity for that component. These log files can be replayed to recover activity that was lost due to a system failure. Both Microsoft SQL Server and Message Queue (MSMQ) provide log file capability.

SQL Server uses the transaction log of each database to recover transactions. The transaction log is a serial record of all modifications that have occurred in the database as well as the transaction that performed each modification. This log file is used to restore a SQL Server database following a system failure.

MSMQ enables applications to send messages with delivery guarantees. These guarantees can be applied on a message-by-message basis. When networks go down between applications, or receiving applications are offline, or when machines containing message queues fail, MSMQ will ensure that messages are delivered as soon as connections are restored or applications and machines are restarted. MSMQ implements these guarantees using disk-based storage and log-based recovery techniques.

Fail Over

A common system architecture that provides a method for recovery of information involves the use of two similar systems. One system is designated as the master system and the other as the back-up or fail-over system. Duplicate copies of all critical information are kept on both systems. If the master system fails, the fail-over system maintains a copy of the information. Once the master system comes back on-line, it is updated with the information from the fail-over system so that there are again duplicate copies of the critical information.

Conclusions and Best Practices

An essential part of ensuring system reliability and high availability is ensuring that applications are designed and implemented in a way that they will not interfere with each other on the same system. A proactive approach to error detection and system monitoring will help avoid critical system problems before they occur. Having a well thought out recovery plan would ensure that a system could be successfully recovered when critical system problems cannot be avoided.

Chapter 6 – Planning: Pitfalls To Avoid

Certain operational practices can limit the availability of applications and computers running on Windows NT. These practices include:

  • Using early versions of the operating system

  • Installing incompatible hardware

  • Failing to plan for future capacity requirements

  • Failing to monitor computer, network, and application performance and availability

  • Failing to determine the nature of the problem before reacting

  • Treating symptoms instead of root cause

  • Performing out of date procedures

  • Stopping and restarting to end error conditions

Using Early Versions of the Operating System

Maintaining computers running early versions of the operating system is a practice that you should avoid. Even though using previous versions of an operating system can reduce short-term expenses, customers interviewed for this document discovered long-term costs associated with using early versions of an operating system.

A requirement to maintain a previous version of the operating system in order to support existing applications is a warning sign that resources need to be devoted to replacing or upgrading those applications. Supporting applications that have not been upgraded or redesigned for current versions of the operating system can limit server availability as discovered by customers B and I. Both customers have servers running early versions of the operating system and have learned to simply live with outages they encounter on these servers. Additionally, the servers can only run applications designed for the earlier operating system and must be isolated from the rest of the network because their security features are not as strong as those provided by servers running the current operating system.

Some factors you should consider if you have not yet implemented a plan to upgrade hardware and software:

  • Older hardware parts can cause computers and applications to become unavailable. Hardware does become difficult to replace as vendors discontinue production and support of older parts. Older parts and their associated software drivers become impractical to support as new technology is utilized. As an example, many hardware vendors no longer provide or support parts designed for computers running Windows NT version 3.51 or earlier because of radical new technology implemented in current commodity computer hardware.

  • Early versions of an operating system can cause computers and applications to become unavailable. Upgrading or debugging early versions of an operating system can become impossible or impractical. For example, Microsoft only makes best effort to correct bugs found in very early versions of Microsoft operating systems and this is also true of vendors supplying software applications and drivers.

  • Older networking or communications technology can cause computers and applications to become unavailable. As network and communication devices implement newer communication protocols or techniques, integrating older network or communications technology in earlier versions of an operating system with current network and communication devices increases operations support costs, as well as creating potential security problems that may be exploited.

Postponing operating system upgrades can cost money, in terms of higher support costs and the loss of opportunities to do business better. Avoid dedicating 100 percent of your operation time and budget to maintenance of older operating systems and applications. Do schedule time for planning, testing, and installation of new applications and operating systems.

Solving this issue is not trivial. Budget, resource constraints, and user concerns about training or application loss can influence how and when operating system upgrades are performed. Strategies that can help resolve user and budget issues include:

  • Discussing operating system upgrade requirements with application users. Point out the issues and possible alternatives to using an application that is only available on an earlier version of the operating system. You may be surprised to find that application users are willing to support an operating system upgrade, in particular those users who are unhappy with the older application.

  • Charging costs associated with supporting an early version of the operating system to the application users who require it. Support costs rise in direct correlation to the age of the application and operating system.

  • Establishing an annual budget plan for applications during the application development phase. The budget should provide for maintenance and upgrade costs to allow applications and operating systems to be upgraded in the future.

  • Establishing an end date for the application lifetime. At the end of the application lifetime, review the benefits and costs of either updating or replacing the application. It is highly likely that the cost of future technology that can replace the application will be less than the cost of maintaining it and an earlier version of the operating system.

After you resolve user and budget issues and begin planned application and operating system upgrades, you will most likely find that technology advances make it possible to install and configure new software in a fraction of the time previously required, and that improved online wizards and help documentation decrease the amount of time users need to learn and use new software.

Customer I has developed a best practice of installing current service packs and hot fixes. Their results have proven the benefits of implementing this operating system upgrade practice. They have been able to reduce computer stops and restarts by up to 50 percent simply by upgrading computers from Windows NT version 4.0 Service Pack 3 to Windows NT version 4.0 Service Pack 4.

Installing Incompatible Hardware

Installing hardware without first verifying that the hardware is compatible with the operating system and planned usage is a practice that you should avoid.

A best practice to avoid these problems is creating a hardware strategy as part of planning and deployment. This can easily be done using the hardware compatibility documentation provided by operating system vendors and by maintaining a ready supply of replacement parts. When hardware failures require replacement parts, it is important to install only spare parts that are compatible with the computer and the operating system. Restoration of service may be delayed when hardware compatibility issues are encountered during parts replacement. See Chapter 4 -- Planning: Hardware Strategies for additional information.

Failing to Plan for Future Capacity Requirements

Capacity planning is critical for the success of any high availability application. Without proper capacity planning, an application can be running successfully one minute, then the next run out of disk space, have unacceptable network contention, or encounter database deadlocks. These types of interruptions in service can usually be avoided with proper planning.

An important time to consider capacity planning is during usage analysis, when a particular event will potentially cause peak loads. Customer C has this type of event occur every business day, where peak transaction periods occur during the final fifteen minutes of the business day. For Customer C, the application is required to meet the load of the peak period in order to run the business. Serious business consequences occur if the application does not perform properly during this peak period. Other examples of where a business might be affected by not properly planning for peak network periods include special promotions that cause peak usage, or introducing new programs prior to analyzing the impact on the existing network.

System monitoring is the first level of defense against capacity-induced system failures. Without adequate monitoring, support personnel will not be aware when avoidable failures begin to occur. It is important to know when systems begin to become either unstable, have excessive retries, or reach capacities. With sufficient warning, these types of failures can generally be avoided.

Finally, hardware compatibility issues can also limit availability. Hardware compatibility issues usually surface when new systems are deployed, or when replacement systems or parts are used when a hardware failure occurs. Using commodity hardware without an assessment and certification can cause this type of compatibility problem. A best practice for commodity hardware is to standardize on one or a few vendors of commodity hardware. This ensures consistent quality, and helps with the certification process.

Failing to Monitor Computer, Network, and Application Performance and Availability

Failure to use monitoring software and tools limits your ability to anticipate and resolve problems before they become critical and cause an actual failure. An application or server failure may be the first and only notification you receive of a problem, if you do not use monitoring tools. End users become the de facto monitoring tool as they encounter and report performance and availability problems.

Monitoring software can provide advance warning of potential failures on a computer or a network. This advance warning allows support and operations personnel to remedy the root cause of a problem before the problem becomes a failure experienced by end users. Customers C, D, F and H implement extensive network and server monitoring. They do so primarily to identify problems as soon as possible and to resolve the problems before the user community becomes aware of the potential failure.

Monitoring tools provide information about a wide-range of components and behavior, from notification about disk drives reaching capacity to notification about excessive network logon traffic. Monitoring should include detecting failure conditions such as power loss or operating system lockups.

Monitoring power loss conditions and operating system lockups requires special attention. During a power loss or lockup, it may not be possible for the system to generate an alert for the monitoring systems. The result may be a failure that goes undetected for an extended period. If the failure occurs on a server running a long batch job, failing to detect the condition can affect the availability of other systems that depend on the successful completion of the batch job. A best practice for detecting error conditions on servers is using automated procedures that periodically contact the server. The ping command, a TCP/IP diagnostic command, can be used to contact and verify that a server is still connected to the network. This command can be used at the MS-DOS prompt or as part of an automated procedure or script.
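One hedged sketch of such an automated check uses the Win32 ICMP helper functions to ping a server and report whether a reply was received within a timeout. The IP address is a placeholder; a scheduled version of this check would raise an alert or write an event rather than print.

    // Sketch: programmatic "ping" of a server using the Win32 ICMP helper API.
    // The IP address below is a placeholder. Link with iphlpapi.lib and ws2_32.lib.
    #include <winsock2.h>
    #include <windows.h>
    #include <iphlpapi.h>
    #include <icmpapi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        unsigned long target = inet_addr("192.168.1.25");   // placeholder server address
        HANDLE hIcmp = IcmpCreateFile();
        if (hIcmp == INVALID_HANDLE_VALUE)
            return 1;

        char sendData[] = "availability probe";
        DWORD replySize = sizeof(ICMP_ECHO_REPLY) + sizeof(sendData) + 8;
        void* replyBuffer = malloc(replySize);
        if (replyBuffer == NULL) { IcmpCloseHandle(hIcmp); return 1; }

        // Send one echo request and wait up to two seconds for a reply.
        DWORD replies = IcmpSendEcho(hIcmp, target, sendData, sizeof(sendData),
                                     NULL, replyBuffer, replySize, 2000);

        if (replies == 0)
            printf("ALERT: server did not respond; notify operations.\n");
        else
            printf("Server responded; %lu reply(ies) received.\n", replies);

        free(replyBuffer);
        IcmpCloseHandle(hIcmp);
        return (replies == 0) ? 1 : 0;
    }

Note that a successful ping only proves network connectivity; it should be combined with the application-level checks described in Chapter 7 – Monitoring and Analysis.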

Failing to Determine Nature of Problem Before Reacting

Failure to determine the exact cause of a failure or problem symptom can lead to excessive time spent resolving the wrong problem, or to an actual failure in spite of receiving advance warning of the problem from monitoring tools.

Effective problem management depends on determining the cause of the problem and using the proper resources to correct the problem. Proper training and empowerment of support personnel can assist in this process.

Customer H uses a mid-level system management system that monitors and routes problems to coordinate responses to error conditions. This ensures that support personnel are working efficiently and on the correct problems. The coordination process starts when an error condition is identified. The next step is contacting a support technician who can respond, and the third step is providing the technician with the reference information that identifies the error condition and the procedures to correct the problem.

Treating Symptoms Instead of Root Cause

Symptom treatment can prove an effective strategy to restore service when an unexpected failure occurs or when performing short-term preventative maintenance. However, symptom treatments that are added to standard operating procedures can become unmanageable. Support personnel can be overwhelmed with symptom treatment and cannot properly react to new failures. This happened to Customers D and F. Now, both rigorously use root cause analysis to avoid this problem in the future.

Long-term solutions require root cause analysis to determine and fix the actual problem. If the root cause cannot be determined internally, do not hesitate to bring in the vendor of the suspect application or component to determine the cause. It is likely that your vendor is eager to find and correct any problem associated with their hardware or software. Do escalate problems that cause down time until the problem is resolved.

Performing Out of Date Procedures

An operational procedure developed solely to prevent an error condition may become a permanent addition to the operations schedule. Instead, do ensure that you remove any outdated procedures from operation and support schedules when a root problem in application, operating system, or hardware is fixed.

It can be difficult to remove out-of-date procedures from operational schedules. Customer I discovered this in the planned weekly stops and restarts of application servers that had become part of the normal operating schedule. The stops and restarts had been implemented to treat a memory leak problem in their SAP R/3 application installation. The procedure eliminated the problem symptoms, and operations staff are unwilling to discontinue it. They fear the problem may reappear.

To ensure removal of the outdated procedure, Customer I is implementing a phased reduction in scheduled server stops and restarts. Management staff with change control authority are monitoring the change in procedures. If the reduction progresses without a recurrence of the problem, the old procedure can be permanently removed; if the problem reappears, management can reinstate the old procedure.

Stopping and Restarting to End Error Conditions

Stopping and restarting a computer may be necessary for a variety of reasons including preventative maintenance on application and operating system bugs, installation of new software and hardware components, system bug checks, system stability problems and system locks.

Memory leaks and improperly handled error conditions can escalate into serious stability problems if left unchecked. Stopping and starting the computer experiencing the problem may temporarily fix the problem, however, a less severe approach, such as restarting the application instead of the computer, may be all that is needed to fix the problem. Root cause analysis can identify conditions that cause operating system stability problems and help determine alternative procedures.

Chapter 7 – Monitoring and Analysis

A common theme among all the customers interviewed for this guide was the importance of monitoring system operation and health. This includes the use of commercially available products such as Microsoft Systems Management Server, Tivoli, Net Manage and OpenView. Most of the customers felt that certain components from these tools were useful starting points. None of the customers used a single package as their only monitoring tool. Most of the customers chose to implement parts of these tools, but not the entire tool. Many of these tools were tied together or extended using customer written scripts or programs.

Successful monitoring of any operating system involves a methodical approach for measuring essential system operating characteristics. By analyzing the resulting data for deviations from a baseline or benchmark, personnel can trend results for analysis to address pending system problems or bottlenecks. This information can also be useful for capacity planning purposes, in particular with Web servers, where the number of hits may increase over time.

Monitoring also typically includes the capture of events in the Event Log or other logging resource to spot fatal errors or warning conditions that may signal the start of a problem. Most of the customers employed third party tools that monitor events written to the event log. Events of particular significance are then alarmed to operations for resolution. Some customers even have procedures tied specifically to certain events that have been seen in the past to reduce recovery time.

Monitoring typically resolves down to three main areas:

  • Performance Monitoring and Trend Analysis – Typically monitoring using tools supporting the performance monitor APIs.

  • Event Log Monitoring – This involves using tools that monitor the event log for particular events or for failure events.

  • Application Health and Performance – This typically involves executing an application operation to determine its response time and a successful function return.

Performance Monitoring and Trend Analysis

Most of the customers interviewed used some form of monitoring to measure resource utilization and performance. This typically involves the measurement of CPU, memory and disk space as well as particular resources of interest to the application.

With performance monitoring, one aspect must be carefully considered. Measuring a server or application can change the behavior of the server or the application. The act of measuring or recording information from a server will consume CPU and memory. If the sample rate is too high, the act of monitoring can exceed the resources used for the application and jeopardize the performance of the system.

Resource Threshold Monitoring

Most companies try to monitor a few important resources for particular thresholds. A number of tools can be used to accomplish this. Two of the most common are Performance Monitor using the alerts mode, and Tivoli (used by Customers A, D and F). The resources of particular interest are:

Hard Drives

Many applications and even the operating system in some cases may stop working if the hard drives become full. Consequently, most customers set the monitoring to alert at 80% of capacity. This number of course depends on the nature of the drive and the time necessary to release drive space.
One situation that developed, showing the importance of setting the value appropriately, concerned a log file drive that was set to an 80% alarm value. During a normal day, about 30 percent of the drive fills. However, the process that removed the log files as part of the backup operation had failed for several days, which kept the logs from being cleared. By the time the alarm went off for the logs, the drive was getting full. By the time someone could react to the situation, the drive had become full and the application had terminated, causing an outage. The moral is to make sure the alarms allow time for the support staff to react.
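A sketch of a simple threshold check of this kind, using the Win32 GetDiskFreeSpaceEx call, is shown below. The drive letter is a placeholder and the 80 percent threshold follows the example above; in production the alert would go to the monitoring console or event log rather than to standard output.

    // Sketch: alert when a drive passes a capacity threshold (80% used here,
    // matching the example above). The drive letter is a placeholder.
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        const char* drive = "D:\\";            // placeholder: the log file drive
        const double alarmThreshold = 0.80;    // alarm when 80% of capacity is used

        ULARGE_INTEGER freeToCaller, total, totalFree;
        if (!GetDiskFreeSpaceExA(drive, &freeToCaller, &total, &totalFree)) {
            printf("GetDiskFreeSpaceEx failed, error %lu\n", GetLastError());
            return 1;
        }

        double used = 1.0 - ((double)totalFree.QuadPart / (double)total.QuadPart);
        printf("%s is %.1f%% full.\n", drive, used * 100.0);

        if (used >= alarmThreshold) {
            // In production this would raise an alert to the monitoring console
            // or write an event to the event log instead of printing.
            printf("ALERT: %s has exceeded the %.0f%% capacity threshold.\n",
                   drive, alarmThreshold * 100.0);
            return 2;
        }
        return 0;
    }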

CPU

Most customers do not monitor thresholds on the CPU, because a CPU may legitimately run at 100% utilization for short periods. A more meaningful alarm would trigger only when utilization stays at 100% for a set period of time, but unfortunately most tools do not allow setting a CPU alarm in this way.

Memory

Two related alarms are the amount of available physical and virtual memory. In some applications, the amount of physical memory is crucial for performance reasons. In most other situations, the amount of virtual memory is the only metric that is important.

Threads

In some applications the number of threads currently running may indicate a problem with the application or a capacity problem.

Performance Monitoring

Most customers perform at least a current running snapshot of the performance of their servers. Typically, this is done using Performance Monitor to show the CPU, memory and other application counters over 8 or 24 hours, using long sample intervals. One Performance Monitor session may have 20 or 30 servers monitored at one time. The intention is that at a glance, a server that is having problems will stand out on the monitor, in particular if the problem is related to a gradual build up.

The other issue is long term performance monitoring. A number of attempts have been made to use Performance Monitor to record long-term performance using log storage. A couple of problems arise from this technique.

  • The resulting data files can be extremely large, and must be parsed or condensed to make sense of the information.

  • If the log files are stored locally, particularly on the boot partition, the size of the files may cause problems.

  • Finding particular segments of information may also be an issue.

Customers doing long term trending have tended to write or develop their own procedures to synthesize the information or to store it in a proactive way to a central database.
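The kind of custom collection procedure described above can be built on the Performance Data Helper (PDH) library. The sketch below samples one counter twice and prints the formatted value; in a trending system the value would be written to a central database with a timestamp instead of printed. The counter path is the standard total processor utilization counter, and the one-second interval is an arbitrary assumption.

    // Sketch: collect a performance counter value with the PDH library, as a
    // building block for custom long-term trending. Link with pdh.lib.
    #include <windows.h>
    #include <pdh.h>
    #include <stdio.h>

    int main(void)
    {
        PDH_HQUERY   query   = NULL;
        PDH_HCOUNTER counter = NULL;

        if (PdhOpenQuery(NULL, 0, &query) != ERROR_SUCCESS)
            return 1;

        // Total processor utilization; rate counters need two samples.
        if (PdhAddCounterA(query, "\\Processor(_Total)\\% Processor Time", 0, &counter)
                != ERROR_SUCCESS)
            return 1;

        PdhCollectQueryData(query);   // first sample
        Sleep(1000);                  // arbitrary one-second interval
        PdhCollectQueryData(query);   // second sample

        PDH_FMT_COUNTERVALUE value;
        if (PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &value)
                == ERROR_SUCCESS) {
            // In a trending system this value would be inserted into a central
            // database along with a timestamp and the server name.
            printf("Processor utilization: %.1f%%\n", value.doubleValue);
        }

        PdhCloseQuery(query);
        return 0;
    }

Storing condensed samples like this, rather than raw Performance Monitor log files, avoids the file size and parsing problems listed above.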

Trend Analysis

Trend analysis becomes practical when long term monitoring is in place. Customers B, C, and E all use some form of trend analysis in their operations. The positive aspects of long-term analysis are of course the ability to predict when expansion of the system will become necessary, long before the system starts having problems. Long-term analysis will also aid in the trouble shooting of problems such as memory leaks or abnormal disk consumption. Long-term analysis may also indicate strategies for load balancing.

The down side of trend analysis is that it is very easy to have more data than can be understood or processed. To use trend analysis, it is important to be able to understand the data, and to have strategies and methods to gather the information, store it, and set base levels. Base levels are very useful for comparing how a system is doing on a snapshot basis. Base levels are difficult to determine without long-term trend analysis.

Event Log Monitoring

The operating system and many applications use the Event Log to store information concerning events that occur. Customers C, D, E, G, and H all made use of the Event Logs as part of their monitoring scheme. These customers usually take one of two approaches to the event log. The first is to look at the event log only when something goes wrong, typically as part of a diagnostics operation. The second approach is to actively capture events and filter out the serious or significant events.

Obviously, the first approach is reactive, and the second approach is proactive. Customers that capture events and react to them typically use Net Manage to capture and filter the events. One customer also used custom code to capture the events, which were then sent to HP OpenView, where scripts were used to trigger procedural actions for the operations group.

Monitoring for Failure

Obviously, monitoring for failures is important. In particular, events at error level 15 indicate that a serious error has caused the service or application to terminate. These errors need to be flagged quickly and operations notified to correct the situation. There are also error levels that may indicate corruption, loss of connections, dropped information, and other conditions that may not cause the service to terminate. These conditions may not be urgent, but left unchecked could cause loss of data or functionality.

Consequently, specific events that are known to be symptomatic of problem conditions should be trapped by the monitoring system and operations notified. Ideally, this same trap procedure would also start a procedure to correct the situation.

The event log also has a Security log that can record security access failures, and other security related events. Monitoring this log can be very useful for detecting attempts to breach security for the server. Setting this up is very application specific because of the different nature of the events. For instance, Exchange will write an event when a delegate accesses another mailbox. This event is not significant and should be ignored. Events showing Access Denied repeatedly on the same resource may be a different situation.

Monitoring for Success

A number of applications write events to the event log to signal successful operation. The most important of these are usually backup operations. Knowing that the operation was successful is often more important than knowing the backup failed. Typically, the group responsible for backups does the monitoring of backup status, but it does need to happen.

Alarming

Several customers have devised ways to alarm or notify the operations staff based on event log monitors. Customer H actually could notify the operations group of the exact problem, the server involved, and the procedure that should be followed. This relied on Open View scripts and consisted of approximately 130 different procedures for recovery or repair. Tivoli and Net Manage have similar capabilities as well.

Application Health and Performance

Instrumenting applications to provide performance and error logging is a best practice. In some cases, this is not practical or possible. In those cases, another approach is to use a ping technique. The ping operation will make a call to a server, measuring the content returned as well as the time it takes to respond. Customers B, D, and G used this technique as part of their monitoring strategy.

Several customers have written routines that make calls to the web servers to measure the amount of time for a response, and validate that the content is as expected. Since the actual production operations are being tested, this provides good indications on the level of performance that a customer may expect to get. This can be expanded to do the monitoring from a location on the Internet, thereby testing the connectivity as well as the application. Customer D used this strategy as an important part of system monitoring.
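A hedged sketch of such a routine, using the WinINet API, follows: it requests a page, measures the elapsed time, and checks the returned content for an expected marker string. The URL and the marker text are placeholders, not any customer's actual check.

    // Sketch: time an HTTP request and verify expected content, in the style of
    // the web "ping" routines described above. The URL and the expected marker
    // string are placeholders. Link with wininet.lib.
    #include <windows.h>
    #include <wininet.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char* url    = "http://appserver01/status.htm";  // placeholder URL
        const char* marker = "Order entry available";          // placeholder expected text

        HINTERNET hSession = InternetOpenA("AvailabilityProbe",
                                           INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
        if (!hSession) return 1;

        DWORD start = GetTickCount();
        HINTERNET hUrl = InternetOpenUrlA(hSession, url, NULL, 0,
                                          INTERNET_FLAG_RELOAD, 0);
        if (!hUrl) {
            printf("ALERT: request failed, error %lu\n", GetLastError());
            InternetCloseHandle(hSession);
            return 1;
        }

        char buffer[8192];
        DWORD bytesRead = 0;
        BOOL  ok = InternetReadFile(hUrl, buffer, sizeof(buffer) - 1, &bytesRead);
        buffer[bytesRead] = '\0';
        DWORD elapsedMs = GetTickCount() - start;

        if (!ok || strstr(buffer, marker) == NULL)
            printf("ALERT: unexpected content after %lu ms\n", elapsedMs);
        else
            printf("OK: expected content returned in %lu ms\n", elapsedMs);

        InternetCloseHandle(hUrl);
        InternetCloseHandle(hSession);
        return 0;
    }

Running the same routine from a location on the Internet, as Customer D does, exercises the external connectivity path as well as the application itself.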

This technique can be used for other applications as well; in particular applications that have operations that do not change critical data. Databases are an example that can be tested with read operations. Microsoft Exchange has built in testing that sends messages between servers and sites to validate that those servers are still working.

HealthMon

The HealthMon management console is a component of Microsoft Systems Management Server 2.0 that provides a central, graphical, real-time status view of Windows NT 4.0 Service Pack 4 or higher and BackOffice servers. Based on color-coded severity levels, operators can view operational conditions, including both normal and exception status. As events are received, HealthMon organizes and displays them according to predefined monitored object groups (Microsoft Management Console nodes).

HealthMon can monitor systems that are Microsoft Systems Management Server site servers or any other computers that have the HealthMon Agent installed. HealthMon monitors each system for the components that you specify. You can either use the default thresholds or configure the thresholds specific to your environment. HealthMon creates alerts that identify network and system problem areas so that you can resolve them. Types of components that can be specified in HealthMon are listed in the table below.

Table 8 HealthMon Components and Counters

  • Processor – Interrupts Per Second; Percent Total System Time

  • Memory – Available Memory Bytes; Page Reads Per Second; Pages Per Second; Percent Committed Bytes to Limit; Pool Non-Paged Bytes

  • Paging File – Percent Peak Usage; Percentage Usage

  • Logical Disk – Percent Free Disk Space

  • Physical Disk – Disk Queue Length; Diskperf Driver Started; Percent Disk Time

  • Network Interface – Bytes Total/Sec

  • Server Work Queues – Context Blocks Queued/Sec; Processor Queue Length

  • Security – Errors Access Permissions; Errors Logon

  • Fault – Pool Non-Paged Failures; Pool Paged Failures; Sessions Errored Out

  • SQL Server – MSDTC Service Started; MSSQLServer Service Started

  • IIS – IIS Service Started

  • Exchange – MSEXCHANGEDS Service Started; MSEXCHANGEIS Service Started; MSEXCHANGEMTA Service Started; MSEXCHANGESA Service Started

  • SNA Server – Host Connection Status; SNABASE Service Started

  • Microsoft Systems Management Server – SMS_Executive Service Started; SMS_Site_Component_Manager Service Started; SMS_SQL_Monitor Service Started

References on Performance Monitor

Microsoft Windows NT 4.0 Workstation Resource Kit.

Provides a detailed section covering performance measurement, tuning, and optimization provided by Performance Monitor and Task Manager. There is also discussion of additional tools.

Windows NT 3.5/3.51 Resource Kit

Contains an entire volume on Performance Monitor, "Optimizing Windows NT." The first part is an introduction to Performance Monitor, followed by a section on bottleneck detection. The second part of the book is for programmers and shows how one can add counters to Performance Monitor. This programming information did not make it into the Windows NT 4.0 Resource Kit, so Volume 4 of the NT 3.5 Resource Kit makes a great companion to the Windows NT 4.0 Resource Kit.

You can find Performance Monitor Counter Definitions in a help file, \common\perf tool\cntrtool\counters.hlp, on the CD-ROM that is provided with the Windows NT 4.0 Workstation and Server resource kits.

Chapter 8 – Help Desk

This chapter describes methods you can use to define Help Desk goals and objectives and support high availability systems. Using methods described in this chapter, you can:

  • Define service goals by using a service level agreement (SLA)

  • Define the role of the Help Desk

  • Establish problem resolution procedures

  • Obtain vendor support

Service Level Agreements

The availability of a system is most typically defined as part of a Service Level Agreement (SLA). This agreement acts as a contract between the service provider and the service consumer. This document may contain performance metrics for the system, bonus clauses for exceeding targets, as well as penalty clauses for failing to meet goals.

Within the customers surveyed, SLAs were used to accomplish a variety of tasks. For example, Customer A utilizes service-level agreements to help determine both the hardware and software that will be used for a particular application. This is unique in that there will be a wide variation in systems chosen based upon whether there are safety concerns, production concerns, or business-level concerns. Customer H utilizes the service-level agreement more as a portion of the business-level agreement between themselves and external customers. Because of this, the service-level agreement specifies items such as scheduled machine downtime, machine redundancy, and operating system maintenance.

Another important concept to note is that a service provider who enters into a SLA with a consumer may also be a consumer with respect to SLAs initiated with other providers. In other words, there may be a many-level chain of SLAs connecting service providers and service consumers.

Customer F is a good example in which a high-availability data center is itself dependent on SLAs established with outside vendors. In this case, the availability of applications may be directly affected by the performance of these outside vendors.

Help Desk

Help Desk organizations generally provide support for users and applications within an organization. This support will include 1st level support for direct user questions and application issues, as well as in-depth support for new or difficult issues. Both types of support are critical with high availability applications, particularly as operations staff numbers decrease. Best practices for help desk organizations will include multiple levels of support and a well-documented escalation procedure for exercising each level of support within the organization.

Customers I, C, and H use help desk personnel to provide 1st level support for issues that arise in the data center. In each case, help desk personnel are responsible for gathering specific sets of information regarding a problem, assessing whether this problem is previously known, and then planning a resolution to the problem. Customer I also explicitly specifies the minimum set of information that will be gathered for each instance. To help assess whether a problem is previously known, Customer I also maintains a list of previously seen application-specific error codes.

In each case, help desk personnel are also responsible for initiating an escalation for any unknown or critical errors. Each customer has a unique infrastructure for handling escalation, but they all maintain the concept of multi-level support for issue resolution.

The IT organization has several sources for staffing the support group, which include the following.

  • Remote support technicians

  • Support engineers that are dispatched to the source of the problem

  • Crisis Resolution teams

  • Vendor supplied remote support

  • Vendor supplied onsite support

Generally, the help desk functions are tiered into three layers, and are represented by the first three types of support personnel listed above. Vendor supplied support personnel may work at any of the three levels, but are primarily involved with the crisis resolution teams.

Customer A takes a unique approach for vendor-supplied support. Their entire 1st level help desk force is actually outsourced from one of their principal vendors. In this situation, the customer doesn't so much leverage the fact that these personnel have extra knowledge of their equipment, but this does solidify the relationship between the customer and their vendor. Solid vendor relationships become increasingly crucial within high availability installations, so this relationship is of the utmost importance.

According to Dataquest5 research, the types of services provided by vendors include the following.

  • Multi-vendor support across hardware, software, and networks

  • Bundled mission-critical service offerings with disaster recovery/business continuation services

  • Guarantees for restore times, backed up by penalties with "teeth"

  • Documented and consistent problem escalation and resolution procedures

  • More proactive and preventive support

  • Up-front and ongoing planning and assessment offerings

  • Business-critical application support (for example, SAP/R3) tied to systems infrastructure support

  • Quality and frequency of communications with customers

Problem Resolution

Problem identification is the key to quick problem resolution and assignment of the appropriate resources. Identifying the root cause in high availability systems can be complicated. In fact, one of the best practices—automating common operations tasks— can make problem identification more difficult. While automation of tasks prevents common user-induced errors, this practice complicates the task of detecting, reporting, and resolving problems.

The Help Desk can perform several key functions during problem resolutions. The functions include supporting users and system infrastructure components, and identifying, diagnosing and resolving problems when they occur.

Help Desk operations are generally categorized by specific levels of support and escalation procedures. These levels of support and escalation procedures allow the Help Desk to provide the most appropriate assistance during problem assessment and resolution.

Design Help Desk procedures to determine whether the operations personnel reporting the problem can resolve the problem, or need assistance. This usually occurs with level 1 support. The majority of help desk support should be handled in this manner.

Once the correct personnel are involved with an issue, the next key concept is to maximize the amount of time that those personnel have available to work on the system without affecting the rest of the enterprise. Customer C accomplishes this task by providing warm standby equipment, and even goes so far as to isolate this equipment in a separate physical location, to allow for more catastrophic failures. Customer A has chosen to design windows of recovery directly into their business processes, so that personnel have defined windows of time to work on a system before a failure in that system will begin to affect other systems within the enterprise.

Level 2 support usually enlists engineers and technicians that specialize in parts of the system such as network administration, database administration or particular application skills (SAP R/3). Level 2 technicians usually work with either remote diagnostics tools, or dispatch service technicians to the location of problem.

Customer I addresses problem assessment through extensive use of remote debugging capabilities. This extends from online debugging of application performance, to analysis of machine crash information. Customer H also addresses this issue using standard remote control software, to allow help-desk personnel to directly interact with servers that may be experiencing error conditions.

Level 3 support usually involves the group responsible for either developing or architecting the system that has a problem. If a problem is escalated to this level, a crisis resolution team will take responsibility for restoration of service.

Crisis Resolution Teams

A common customer best practice is the creation of crisis resolution teams. These teams are responsible for critical applications and systems in a high availability environment. The team will typically consist of an application architect or development lead, an operations supervisor, a representative from the user community and a representative from the support staff. The team may also include a vendor's support personnel.

Having individuals pre-assigned to these teams greatly enhances resolution response to the most critical issues. The team has both the responsibility and the authority to react to and resolve critical failures in a high availability system.

Vendor Support

Effective vendor support is an essential component in maintaining any high availability solution. Obtaining support, however, is not enough: high availability solutions generally require the best support available. Items to look for in selecting a support offering include customer prioritization of issues, guaranteed response times, fast turnaround of questions, direct access to skilled resources, and onsite vendor personnel.

As an example, Microsoft offers the Premier Support program. This program targets several specific deliverables to help maintain mission-critical installations:

  • Account Management to build a persistent relationship, to understand your systems and application business goals, and ensure Premier Support helps you meet those goals.

  • Proactive Services to help reduce your exposure to problems, increase system availability and supportability, accelerate your development cycle, and ensure Microsoft products are properly adapted to the environments where they are deployed.

  • Information Services to provide your staff with timely, relevant information that can improve productivity, system reliability, and diagnostic techniques through access to the Premier Support Service Desk Internet site, special technical briefings, regular support technical newsletters, and a variety of CD-ROM subscriptions.

  • Technical Support to provide fast, accurate solutions to technology issues around the clock.

For more information on Premier support, please refer to https://support.microsoft.com/default.aspx#premier.

As a further step for high availability installations, Microsoft has begun offering the Premier Availability Support (PAS) program. This program falls under the umbrella of Premier Support, with the charter of delivering specific product-tested information on high-availability solutions. PAS works directly with technical experts within client companies by providing workshops, support, and technical documents to help customers maintain their high availability systems.

Specific deliverables provided by the PAS team:

  • Knowledge transfer of tools and practices to improve customers' high availability solutions

    In-depth, product-specific workshops:

    • Windows NT Critical Problem Management Tools and Techniques

    • Windows NT Critical Network Monitoring Tools and Techniques

    • Advanced Troubleshooting - Performance Monitoring

    • MSNA (Microsoft Networking Architecture [Windows NT])

    • Exchange Server (four separate workshops for various aspects of Exchange)

  • Customer liaison to the Reliability groups within each product development organization

  • Technical best practices documents

For more information on the PAS services, please contact your Microsoft Premier Support representative or the Microsoft Premier Support Web site ( https://support.microsoft.com/default.aspx#premier ).

Chapter 9 – Recovery

Recovery takes several forms, depending on the type of failure. Recovery may require replacing hardware, rebuilding the operating system and applications, restoring data, or bringing hot backup systems online to stand in for a failed system.

Many of the customers have disaster recovery plans that assure restoration and continuation of service, perhaps at reduced capacity and capability. This may require contractual relationships to provide systems in the event of a natural disaster. These types of plans are not discussed in detail in this document; the focus here is recovery that covers full restoration of service.

Like most topics in a high availability system, recovery starts with planning. Some of the major criteria used by customers to plan for recovery include:

  • Is the system a single point of failure? Customers C and D operate online transaction systems that must be available during peak periods for their business cycle to complete. Customer A can experience extreme conditions if a control system fails.

  • Is there a latency period between the failure of one system and its effect on another? Customers A and G have a period of time before the scheduling system affects the production system.

  • Is there redundancy inherent in the system design? Customers B, D and I all have Internet solutions where multiple systems handle the same tasks.

  • How volatile is the data used in the system? Customers C and D both have systems that record online transactions, and have systems that use data that represents a snapshot from an earlier day's processing.

  • What is the business cost while a system is down? Customers A, D and I all run systems that require continuous operations.

Providing answers to these questions while planning your system can help identify methods to restore service and recover a system. Service restoration and recovery are discussed in the following sections. Note that some of the methods are preventative and can be performed while the system is running or before restoring a failed computer.

Restoring Service from Backup Systems

One of the most common methods for restoring service is the use of standby (backup) systems. This can consist of using a hot standby with automated failover, or of swapping the failed system with spare systems already configured for use.

Hot Stand By

In situations where prolonged outages cause severe problems, hot standby systems provide a way to recover quickly. Customers B and D have systems where changes are updated to multiple servers; changes are only committed if all servers are updated. A primary system is used to process the day's work, but the secondary system can take over if the primary fails. In the case of Customer B, the failover is a manual process. Constant monitoring, and the ability to temporarily operate in a degraded manner, allow the manual failover process to work effectively. Customer D performs redundant updates with an automated failure detection and failover system. The standby system becomes the primary system within 30 seconds.

Hot stand by systems are very expensive and complicated to manage, but their worth is measured by the reduced loss of service.
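To make the automated path concrete, the following minimal sketch shows one way failure detection and failover promotion can be wired together: a monitor probes the primary system at a fixed interval and promotes the standby after a sustained loss of contact. The host name, port, probe interval, and promote_standby routine are illustrative assumptions and are not drawn from Customer D's implementation.

```python
# Hypothetical failover monitor, loosely modeled on the automated failure
# detection described above. Assumes the primary exposes a TCP port that can
# be probed; promote_standby() stands in for whatever redirects work to the
# standby system.
import socket
import time

PRIMARY = ("primary.example.com", 1433)   # hypothetical host and port
CHECK_INTERVAL = 5                        # seconds between probes
FAILOVER_AFTER = 30                       # promote the standby after 30 seconds of silence

def primary_is_alive(address, timeout=3):
    """Return True if a TCP connection to the primary succeeds."""
    try:
        with socket.create_connection(address, timeout=timeout):
            return True
    except OSError:
        return False

def promote_standby():
    """Placeholder: switch clients over to the standby system."""
    print("Promoting standby system to primary")

def monitor():
    last_seen = time.time()
    while True:
        if primary_is_alive(PRIMARY):
            last_seen = time.time()
        elif time.time() - last_seen >= FAILOVER_AFTER:
            promote_standby()
            return
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    monitor()
```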

Spare Systems

Using spare systems to replace failed systems is another technique for rapidly restoring service. Customer G uses this approach to replace failed systems in their production environment. The failed system is repaired in the service area. In some cases the replacement system becomes the primary system; in others, the failed system, once repaired, is returned to operation as the primary system. The individual departments determine this policy.

The success of spare systems depends on a cost-effective procedure for keeping an adequate supply of spare systems, and on the use of standard configurations. Customers G and I have hardware standards and configuration management standards to help accomplish this.

System Recovery

During their interview, Customer B indicated that failures can happen for a variety of reasons, but that system recovery is a common procedure. For most of the customers, restoring service was not directly dependent on system recovery. System recovery depends on efficiently restoring a failed system to a known configuration and state.

Customers G and I copy saved disk images to restore configurations to failed systems once the hardware is determined to be stable. This process is often referred to as cloning. When a computer is restored using cloning, it is important that the uniqueness of the computer's security identifier (SID) is maintained. With Windows NT 4.0, Service Pack 4, a Resource Kit utility called SysPrep is available to assist with this process.

SysPrep

The new System Preparation tool for Windows NT 4.0, in conjunction with third-party disk-image copying utilities, provides an easy and efficient way to deploy Windows NT Workstation 4.0 throughout the organization. Using disk duplication instead of running complex installation programs, administrators can deploy nearly every aspect of a system, including Windows NT, business applications, templates, and links to company resources.

After installing and configuring Windows NT 4.0 and any applications, administrators can use the system preparation tool to create a master hard-disk image that can be distributed using third-party disk-image copying tools.

After the copying process takes place, a wizard prompts users for computer name, administrator password, user name, and organization to configure unique settings. Administrators can simplify the setup process further by scripting the wizard.

The system preparation tool for Windows NT 4.0 is a simple utility that prepares a PC's hard disk for duplication. Once the disk is prepared, administrators can use a variety of third-party utilities to restore Windows NT 4.0 on multiple systems for use as spare systems.

Data Recovery

The requirements for data recovery are primarily dependent on the application. However, some best practices can be of assistance in this process.

First, if the application does not require access to real-time updates, a copy of the data can be used. The benefits of application isolation are covered in Chapter 5 -- Planning: Software Strategies. One benefit of using a copy of the data is that a known data state can be used for recovery. Customers C and D use this technique to recover data to a read-only server without interfering with the real-time database.

When it is not practical to operate from read-only copies of the data, maintaining a fault-resilient copy of the data is an important strategy. This can be done by a variety of methods, and selecting the proper method depends on the application:

  • The data could be stored on RAID disks.

  • Application logs can be stored on separate disks, and frequently backed up.

  • Recovery points or checkpoints can be made frequently.

Minimizing the possibility of data corruption and data loss is the goal. Customer A uses a combination of RAID to minimize loss of data and frequent checkpoints in data logs to protect data in business systems and business processes.
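To illustrate the separate-log-plus-checkpoint approach, the following minimal sketch appends every change to a log kept on a separate disk and periodically writes a checkpoint that supersedes the older log entries; recovery replays the log on top of the last checkpoint. The paths and the JSON record format are hypothetical, and a production application would normally rely on its database engine's own logging and checkpoint facilities.

```python
# Minimal write-ahead-log and checkpoint sketch. The drive letters below are
# assumptions that place the log and the data on separate disks.
import json
import os

LOG_PATH = r"E:\applogs\app.log"              # application log on a separate disk
CHECKPOINT_PATH = r"D:\data\checkpoint.json"  # periodic known-good snapshot

def append_to_log(record):
    """Write each change to the log and force it to disk before acknowledging."""
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(record) + "\n")
        log.flush()
        os.fsync(log.fileno())

def write_checkpoint(state):
    """Persist a consistent snapshot, then truncate the log it supersedes."""
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, CHECKPOINT_PATH)          # atomic swap of the checkpoint file
    open(LOG_PATH, "w").close()               # earlier log entries are now redundant

def recover():
    """Rebuild state from the last checkpoint plus any logged changes."""
    state = {}
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            state = json.load(f)
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as f:
            for line in f:
                record = json.loads(line)
                state[record["key"]] = record["value"]
    return state
```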

Chapter 10 – Root Cause Analysis

In maintaining any high availability system, failures will occur within and around that system. Appropriate procedures, personnel, and tools can help minimize the effects of failures, but failures are inevitable. Given the relative complexity of these systems, failures may manifest themselves in a variety of ways. For example, dirty power can manifest itself as a high mortality rate for machine power supplies, or a faulty device driver could easily appear to be a core operating system issue. In each of these cases, personnel may quickly treat the symptoms of the failures, but this symptom treatment does nothing to help prevent the failure from recurring. Root cause analysis is the practice of actively working to determine the cause behind a failure.

Symptom treatment can prove an effective strategy to restore service when an unexpected failure occurs. Sometimes the symptom treatment, such as periodic restarting, is also effective in the short term for preventative maintenance. However, a long-term solution should include root cause analysis to determine the actual problem.

Within some environments, restoration of service takes priority over all else and must be carried out immediately. This makes it difficult to perform root cause analysis. Customers G and I regularly hold post-mortem analysis sessions for significant failures. These sessions involve personnel close to the failure, and they attempt to isolate a root cause, or at least to improve their procedures for responding to the failure in the future.

In other cases, business processes will tolerate short-lived system failures. For example, Customer A has designed some of their processes such that information is buffered throughout the systems. This allows a system to fail for a short period of time without adversely affecting the entire operation. These windows give Help Desk and operations personnel a known amount of time to gather resources, analyze the failure, and recover from it.

Another technique seen within the customer interviews was the idea of replacing any failed component, and then performing off-line analysis to help determine root cause. Customer G does not attempt to repair failed systems within their production line. Any system that fails will simply be physically replaced. The failed system is then sent to a lab that allows personnel to rigorously analyze the failure and attempt to determine the actual cause.

As noted for Customers B, G, and I, root cause analysis forms a feedback loop in planning for high availability. Each of these customers uses the information from each significant failure to drive improvements in their other processes. Product choices, operating procedures, monitoring tools, and Help Desk practices may all be affected by the knowledge gained through root cause analysis.

If the root cause cannot be determined internally, do not hesitate to bring in the vendor of the suspect application or component to determine the cause. If the problem lies in the software or hardware, the vendor may be eager to find and correct it. If the problem is causing downtime for any reason, it should be escalated until it is resolved.

Planning for Root Cause Analysis

Optimal procedures for root cause analysis, like most operational practices, will vary from site to site. However, if you want to implement a policy to diagnose the causes of system failures, the following guidelines can help.

A successful policy of root cause analysis will likely involve an investment in trained personnel and equipment. To help your personnel, Microsoft offers classes on critical problem management through its Premier Support program. These classes offer technical, hands-on training about how to diagnose failures on Windows NT systems. For more information on this and other Microsoft support offerings, see https://support.microsoft.com/ .

Investments in equipment include a repository server: a server on your network with ample available storage. The repository server stores memory dumps (snapshots of the memory of failed systems and applications) for all servers in your system. Memory dumps are an invaluable aid in debugging system failures. A successful root cause analysis policy must include a plan to manage memory dumps, and saving them all on a central server simplifies post-mortem debugging and helps diagnose chronic problems.
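One simple way to populate such a repository is a scheduled script that sweeps each server's default dump location and copies any memory.dmp it finds to the central share. The server names, the repository share, and the use of the administrative admin$ share in this sketch are assumptions made for illustration; the document does not prescribe a particular collection mechanism.

```python
# Hypothetical memory dump collection script for a repository server.
import os
import shutil
import time

SERVERS = ["SRV01", "SRV02", "SRV03"]        # hypothetical server list
REPOSITORY = r"\\DUMPSRV\dumps"              # hypothetical repository share

def collect_dumps():
    for server in SERVERS:
        # admin$ maps to %SystemRoot%, the default location for memory.dmp.
        source = rf"\\{server}\admin$\memory.dmp"
        if not os.path.exists(source):
            continue
        stamp = time.strftime("%Y%m%d-%H%M%S")
        target = os.path.join(REPOSITORY, f"{server}-{stamp}-memory.dmp")
        shutil.copy2(source, target)         # keep the original in place for live analysis
        print(f"Copied {source} to {target}")

if __name__ == "__main__":
    collect_dumps()
```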

You should also consider investing in a crash cart, a laptop computer equipped with the necessary utilities, debugging files, and cables that are needed to perform a live debug. A live debugging session is an opportunity for support personnel to examine the state of a failed machine before it is reset. Although a live debugging session lengthens the time to recover for the failed server, it is one of the most effective ways to diagnose the causes of failure.

Specific advice on setting up a repository server and crash cart for Windows NT systems can be found in Microsoft Premier Support's critical problem management class.

Appendix A – Tools

This appendix is a compilation of operating system features or application products mentioned in the customer case studies as helpful in deploying highly available computers, networks, and application systems.

Monitoring

Microsoft Systems Management Server

Integrated inventory, distribution, installation, and remote troubleshooting tools for centralized management of hardware and software. Microsoft Systems Management Server can be used in medium to large multi-site Windows–based environments to reduce the cost of change and configuration management of Windows based desktop and server computers. Details available at: https://www.microsoft.com/backoffice

Microsoft Performance Monitor (Perfmon)

Windows NT administrative tool that enables viewing behavior of processors, memory, cache, threads, and process objects. Each object has an associated set of counters that provide information about device usage, queue length, delays, and other data that measures throughput and internal congestion. Details available at: https://www.microsoft.com/ntserver

Microsoft Windows NT Resource Kit, version 3.51
Microsoft Windows NT Server 4.0 Resource Kit
Microsoft Windows NT Workstation 4.0 Resource Kit

Microsoft Press® kits contain both technical documentation and a CD-ROM with useful utilities and accessory programs to help install, configure, and troubleshoot Microsoft Windows NT. Details available at: https://www.mspress.microsoft.com

Tivoli Management Software

Family of products with a single management framework integrating disparate IBM systems management applications. Details available at: https://www.tivoli.com

Microsoft HTTPMon

Multithreaded Windows NT service that monitors web server performance by measuring how quickly the web server responds to requests from client browsers. Details available at: https://www.microsoft.com/ntserver

HP OpenView

Hewlett Packard family of products designed to manage distributed computer systems and networks from computers running Windows or UNIX operating systems. Details available at: https://www.hp.com

NetManage

Single-source PC-to-host connectivity solutions from NetManage. The company develops integrated applications, servers, and development tools for Microsoft Windows, Windows® 95 and Windows NT operating systems. Details available at: https://www.netmanage.com

PerlEx

Utility for Web servers running under Windows NT that improves the performance of Perl scripts. Details available at: https://www.activestate.com

SeNTry

An SNMP-based monitoring tool. Details available at: https://www.missioncritical.com

Software Distribution

Microsoft Systems Management Server

Integrated inventory, distribution, installation, and remote troubleshooting tools for centralized management of hardware and software. Microsoft Systems Management Server can be used in medium to large multi-site Windows–based environments to reduce the cost of change and configuration management of Windows based desktop and server computers. Details available at: https://www.microsoft.com/backoffice

WinInstall

Automates installation and software deployment. Details available at: https://www.ondemand.com.

Installshield

Software distribution tool for creating Windows NT logo compliant application installation packages. Details available at: https://www.installshield.com.

Remote Diagnostics

Microsoft Systems Management Server

Integrated inventory, distribution, installation, and remote troubleshooting tools for centralized management of hardware and software. Microsoft Systems Management Server can be used in medium to large multi-site Windows–based environments to reduce the cost of change and configuration management of Windows based desktop and server computers. Details available at: https://www.microsoft.com/backoffice

Applications

Microsoft Transaction Server (MTS)

Microsoft component-based transaction processing technology that supports development, deployment, and management of high-performance, scalable, and robust server applications. MTS supports applications based on IBM CICS and IMS when installed and used with Microsoft SNA Server.

Microsoft Message Queue Server (MSMQ)

Microsoft message queuing technology of choice for Windows-based applications. MSMQ supports reliable message delivery, cost-based message routing, and full support for transactions over unreliable but cost-effective networks. MSMQ can be bridged to applications based on IBM MQSeries using MSMQ Bridge licensed for use with Microsoft SNA Server.

Microsoft SNA Server

Microsoft BackOffice gateway and applications integration solution that connects Windows networks with AS/400- and mainframe-based networks. MSMQ—MQSeries Bridge (previously known as Falcon MQ Bridge) licensed from Level 8 Systems and included in SNA Server 4.0 SP2. It enables messaging between IBM MQSeries and Microsoft Message Queue Server environments using the native format of each messaging technology.

Microsoft Windows NT Load Balancing Service (WLBS)

Distributed clustering feature of Windows NT Server 4.0, Enterprise Edition, that employs a distributed clustering design to create highly available and scalable Web, virtual private networking, streaming media, and proxy services. WLBS runs on economical, industry standard PC platforms and can work with Microsoft and third-party clustering products to provide a complete three-tier solution.

IBM MQSeries

IBM family of application services that provide message queue and delivery services for applications.

Microsoft Windows NT Resource Kit, version 3.51
Microsoft Windows NT Server 4.0 Resource Kit
Microsoft Windows NT Workstation 4.0 Resource Kit

Microsoft Press kits that contain both technical documentation and a CD-ROM with useful utilities and accessory programs to help install, configure, and troubleshoot Microsoft Windows NT. Details available at: https://www.microsoft.com/mspress/

Remote Access

Microsoft Systems Management Server

Integrated inventory, distribution, installation, and remote troubleshooting tools for centralized management of hardware and software. Microsoft Systems Management Server can be used in medium to large multi-site Windows–based environments to reduce the cost of change and configuration management of Windows based desktop and server computers. Details available at: https://www.microsoft.com/backoffice

pcANYWHERE

Provides remote access and control of a client computer by establishing a remote window on the client with display, mouse, and keyboard connections. Details available at https://www.symantec.com.

Testing

PerlScript

Perl scripting engine for Microsoft ActiveX and Microsoft Windows Script Host environments. Details available at: https://www.activestate.com

Microsoft Windows Script Host (WSH)

Language-independent scripting environment for computers running 32-bit Windows NT. Microsoft provides both VBScript and JScript scripting engines with the Windows Script Host. Third-party companies provide ActiveX scripting engines for other languages such as Perl. Details available at: https://www.microsoft.com/ntserver

Strategizer

An infrastructure modeling tool. Details available at: http://www.ses.com.

Appendix B – Implementing Hardware Standardization

Implementing standards for computer hardware is a crucial planning and maintenance practice in deploying high availability systems. This presents key benefits in serviceability, failure analysis, and spare-parts inventory. Hardware standardization also allows a customer to replace hardware for quick recovery from machine failures.

The following hardware criteria are excerpted from procedures developed and followed by a particular customer interviewed for this study. This example is not meant to be completely comprehensive - it simply illustrates one customer's approach to hardware standardization.

In addition, while the specific hardware platforms are dated, the more important information is the levels of detail in the plan, as well as the references to general technologies used by this customer.

Server Hardware Evaluation Criteria

Footprint

Rack Requirement - the system must install into the standard 19" racks used in the data center. The rack manufacturer is Rittal, and the racks stand 42U in height.

Chassis - the system should be a true rack-mount solution built to an industrial standard, not a tower system placed on its side.

Aesthetics - power switches, reset buttons, and similar controls should be located on the front panel of the unit, properly labeled, and protected by a mechanism to mitigate accidental power-down situations. There should be no CMOS/NVRAM reset buttons on the exterior of the unit. LEDs for power, disk access, and error states should be located on the front panel of the server and corresponding drive bay(s).

Dimensions - the current high-end server standard is seven U for the CPU (five if no internal storage is offered) and four U for the drive bay. Density is key; the system under evaluation should match or beat the current size/performance ratio. The mid-range standard consumes three U.

Serviceability

Accessibility - each sub-system should be accessible and serviceable while the system is mounted in a rack. This includes the I/O, memory, and processor sub-systems, the drive cage, and the power supplies.

Cable Management - internal cables should be cleanly routed within the system. The vendor should also offer an external cable management solution, addressing power, keyboard, mouse, video and external SCSI cables. There should not be any instances of free-floating cables.

Service Hardware - there should not be a need for proprietary tools to service the unit (e.g. board pullers, non-standard torx screws, etc.).

Fault Tolerance

Storage - a fault tolerant disk array including a fault tolerant cache module is a requirement. RAID levels include distributed data guarding and distributed mirroring. The system should provide hot swap and on-line spare drives. The drive slot should determine the SCSI ID of the physical disk. There should not be a need to manage jumpers.

Memory - the system must be able to correct dual-bit memory errors. The preferred method is via a secondary corrective memory sub-system, which allows for the use of standard industry memory modules. Secondary methods like ECC memory modules, though not optimal, are acceptable.

Power Supply - the system should offer the option of redundant, hot-plug power supplies.

Cooling - the system should offer redundant, hot-plug fan modules for the overall system cooling.

Network Adapter - systems should offer a dual-port redundant NIC that meets the customer's current network infrastructure requirement.

Scalability

Storage - the system should be engineered to scale over multiple instances of their RAID controller, and support multiple external SCSI device bays.

Processor - systems should support symmetrical multiprocessing (SMP), with low-end servers supporting two processors and high-end servers supporting four or more. Low-end systems configured in a single processor mode should not require a significant enhancement to upgrade to an SMP configuration.

Processor futures – systems should support an upgrade path to new versions of Intel and Alpha (3-year processor technology window) within the existing chassis.

Support

On-site - on-site support is a requirement during regular business hours, with pager support offered off-hours. Extended support is provided on a 24-hour a day, seven day per week (including holidays) basis for key technicians and on-site personnel.

Off-line Diagnostic Tool - a comprehensive diagnostic package, which can query the status of a specific server's sub-system, is a necessity. The tool should have the ability to record results of diagnostic sessions to a file.

On-line Diagnostic Tools – a solution that can simultaneously monitor server and client applications while they are running is a requirement. The solution should be compliant with the SNMP management information base version 2 (MIB2), and integrate with enterprise applications such as Microsoft Systems Management Server and CA UniCenter. The tool should also allow on-line access to firmware revisions and component utilization statistics, as well as flag instances of degraded hardware.

Certification - the computer vendor and software vendor must offer training to support staff in a timely fashion. Certification must include hardware, diagnostic, and monitoring solutions.

Operating System

OEM Software - continual production-level testing of beta operating system releases. This testing is the precursor to the requirement that OEM beta software stay in step with the beta milestones for the operating system. No more than a two-week lag is acceptable.

Operating system Compatibility - the system must support current operating systems used to connect to the data center services.

Compatibility

Power Requirements - power requirements vary by site. Systems must provide auto-switching power supplies supporting 110VAC and 220VAC environments.

Video and Keyboard Switching - the current standard is the Apex eight-port video/keyboard/mouse concentrator. All systems must be compatible with this standard, and function in a diverse environment.

Current Standards

Processor - the following chart outlines the current standards.

| CPU | MHz - File Server | MHz - SQL Server | Cache | Number of supported processors |
| --- | --- | --- | --- | --- |
| Pentium Pro | 200 MHz | 200 MHz | 512 KB | 4 |
| ALPHA EV5 | 466 MHz | 466 MHz | 4 MB | 4 |

Storage is defined as physical and logical. Physical storage defines the maximum number of drives a system can manage. Logical storage relates to the maximum size of a contiguous volume. The array controller, not the Windows NT fault tolerant disk driver, must manage a contiguous volume. The following chart outlines the minimum physical storage requirement.

| | Qty. Physical Drives | Physical Drive Size | Total |
| --- | --- | --- | --- |
| Physical Storage | 84 | 9 GB | |
| Logical Storage | 14 | 9 GB | |

SCSI Protocol – Wide-Ultra SCSI

Memory - current requirement is 4GB.

I/O Sub-system - systems with dual PCI buses must offer peer architecture.

Network Interface Cards – auto-sensing 10/100Mbit adapter.

Operating system standards – Windows NT version 4.0 with Service Pack 4 and hot-fixes.

Server standards matrix

Listed in the table below are server standards for Compaq servers. Standards also exist for Dell systems covering the same range of servers.

| Server Model | Proliant 850R | Proliant 2500R | Proliant 5000R | Proliant 6500R | ALPHA 4100 |
| --- | --- | --- | --- | --- | --- |
| Number of processors | Up to 2 | Up to 2 | Up to 4 | Up to 4 | Up to 4 |
| CPU clock speed | 200 MHz | 200 MHz | 200 MHz | 200 MHz | 533 MHz |
| CPU L2 cache size | 256 KB | 256 KB or 512 KB | 512 KB or 1 MB | 512 KB or 1 MB | 4 MB |
| Networking | 100 Mbps Fast Ethernet | 100 Mbps Fast Ethernet | 100 Mbps Fast Ethernet | 100 Mbps Fast Ethernet | 100 Mbps Fast Ethernet |
| Integrated SCSI Wide-Ultra | Single Channel | Single Channel | Single Channel | Single Channel | Single Channel |
| Disk Array Controller | | | | | |
| Controller | SMART-2DH | SMART-2DH | SMART-2DH | SMART-2DH | HSZ70 |
| RAID levels | 0, 1, 5 | 0, 1, 5 | 0, 1, 5 | 0, 1, 5 | 0, 1, 5 |
| Controller cache | 16 MB | 16 MB | 16 MB | 16 MB | 64 MB to 256 MB |
| Battery-backed cache | Yes | Yes | Yes | Yes | Yes |
| Maximum configurations | | | | | |
| Maximum memory | 2 GB | 2 GB | 4 GB | 4 GB | 8 GB |
| *Maximum disk | 335 GB RAID 5 | 559 GB RAID 5 | 777 GB RAID 5 | 777 GB RAID 5 | 2452 GB RAID 5 |
| **Max Windows NT volume size | 112 GB | 112 GB | 112 GB | 112 GB | 120 GB |
| I/O support | PCI - 3 slots | PCI - 5 slots | Peer PCI - 7 slots | Peer PCI - 8 slots | Peer PCI - 8 slots |
| High availability features | | | | | |
| Microsoft Cluster Server capable | Yes | Yes | Yes | Yes | Yes |
| RAID disk controller supporting RAID 0, 1, 5 | Yes | Yes | Yes | Yes | Yes |
| ECC memory | Yes | Yes | Yes | Yes | Yes |
| SNMP monitoring | Yes | Yes | Yes | Yes | Yes |
| System Health Logs | Yes | Yes | Yes | Yes | No |
| Performance metrics | | | | | |
| Transaction processing benchmark result | Not available | 4,040 (2 x Pentium Pro) | 8,070 (4 x Pentium Pro) | Not available | 6,429 (2 x EV5 at 466 MHz) |

Transaction processing benchmark results are available; information from the Transaction Processing Performance Council can be obtained at https://www.tpc.org

* Using 9 GB SCSI drives

** Using 9 GB SCSI drives configured with RAID 5

Appendix C – Implementing Software Standardization

Implementing standards for software products is a crucial planning and maintenance practice in deploying high availability systems. Standardization reduces operations costs, improves help desk effectiveness, and strengthens fail-over and application redundancy capabilities. Software standardization allows a customer to create well-known builds of software and to certify these builds as reliable.

The example provided below is excerpted from procedures developed and followed by a particular customer interviewed for this study. This example is not meant to be completely comprehensive - it simply illustrates one customer's approach to software standardization.

Also, while the specific software versions are dated, the more important information is the level of detail in the plan, as well as the references to general technologies used by this customer.

Windows NT Server Configurations

This example is based on standard changes made to a default configuration of a Windows NT-based computer. For simplicity's sake, this document will continue to look at these changes in a layered model, outlining changes embraced by all data centers, then moving to describe mandatory changes made in specific data centers and to specific server types.

A standard set of tools is installed in the "LOCALBIN" directory. Tools can be added to a central location, and servers are updated with additional tools each time the installation script runs. Currently, about 50 Windows NT tools and another dozen DOS tools, totaling about 5 MB, are installed.
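A minimal sketch of that update step appears below: it copies any new or changed tools from a central share into the local LOCALBIN directory. The share name and destination path are assumptions made for illustration; the customer's actual installation script is not reproduced in this document.

```python
# Sketch of synchronizing the LOCALBIN tool set from a central share.
import os
import shutil

CENTRAL_TOOLS = r"\\TOOLSRV\localbin"   # hypothetical central tool share
LOCALBIN = r"C:\LOCALBIN"

def update_localbin():
    os.makedirs(LOCALBIN, exist_ok=True)
    for name in os.listdir(CENTRAL_TOOLS):
        source = os.path.join(CENTRAL_TOOLS, name)
        target = os.path.join(LOCALBIN, name)
        # Copy tools that are new locally, or that have changed centrally.
        if not os.path.exists(target) or os.path.getmtime(source) > os.path.getmtime(target):
            shutil.copy2(source, target)

if __name__ == "__main__":
    update_localbin()
```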

Services

  1. The license logging service is turned off by default. This may be turned back on by a specific application if necessary.

  2. The scheduler service is set to start automatically under the system account.

  3. The SNMP service is set to "AUTOSTART".

  4. Event logging. All event logs are set to 20 MB. System and Application logs are set to overwrite as needed; the Security log is set to never overwrite.
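The event log sizes and retention behavior in item 4 correspond to the standard MaxSize and Retention values under each event log's registry key. The sketch below shows one way to apply them; a build script would more commonly use a registry (.reg) file, so treat this as an illustration of the mapping rather than the customer's actual mechanism.

```python
# Apply the event log settings described above through the registry.
# Requires administrative rights; run on the target server.
import winreg

TWENTY_MB = 20 * 1024 * 1024
OVERWRITE_AS_NEEDED = 0          # Retention value meaning "overwrite as needed"
NEVER_OVERWRITE = 0xFFFFFFFF     # Retention value meaning "never overwrite"

SETTINGS = {
    "Application": OVERWRITE_AS_NEEDED,
    "System": OVERWRITE_AS_NEEDED,
    "Security": NEVER_OVERWRITE,
}

for log_name, retention in SETTINGS.items():
    path = rf"SYSTEM\CurrentControlSet\Services\EventLog\{log_name}"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path, 0, winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "MaxSize", 0, winreg.REG_DWORD, TWENTY_MB)
        winreg.SetValueEx(key, "Retention", 0, winreg.REG_DWORD, retention)
```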

Monitoring

Compaq Software Support Disk and Compaq Insight Manager software are installed on all x86-based systems as a standard SNMP-based monitoring platform.

Ease of use

CD AUTORUN is set to "OFF". This allows a CD to be inserted without the system automatically running it. The technician must start a CD installation manually, but the CD never runs when it is not expected to.

Tasking is set to "Best Foreground Response Time".

Page File

Page files are set to reside on the system drive and are set at the greater of memory + 50 MB or 10%.
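The source does not state what the 10 percent figure refers to; the sketch below assumes it means 10 percent of the system drive's capacity, which is only one possible reading and is offered purely as an illustration of the sizing rule.

```python
# Illustrative page file sizing calculation, assuming the "10%" in the rule
# above refers to the capacity of the system drive.
def page_file_size_mb(physical_memory_mb, system_drive_capacity_mb):
    """Return the larger of (memory + 50 MB) and 10% of the system drive."""
    return max(physical_memory_mb + 50, system_drive_capacity_mb * 0.10)

# Example: a server with 512 MB of RAM and a 9,216 MB (9 GB) system drive.
print(page_file_size_mb(512, 9216))   # 921.6 MB, so the 10% term governs here
```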

Time Sync

All systems are to pull time from a central source at regular intervals.
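One way to do this on Windows NT is the built-in net time command, run at regular intervals by the scheduler service. The time server name below is a hypothetical placeholder; the document does not name the central source.

```python
# Pull the local clock from a central time source using the built-in
# "net time" command (equivalent to: net time \\TIMESRV /set /yes).
import subprocess

TIME_SOURCE = r"\\TIMESRV"   # hypothetical central time server

subprocess.run(["net", "time", TIME_SOURCE, "/set", "/yes"], check=True)
```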

Debug/Crash dump

All systems across the enterprise are set up to produce a crash dump file and to be debugged live. This allows the flexibility to work from a dump file or a live crash, as applicable. The systems are also set up for user-mode debugging, both live and from dump files.

Kernel Mode:

  • Crash dump is enabled.

  • System is set to save memory.dmp to %systemroot% by default.

  • Auto-restart after crash is disabled, allowing for live debugging when necessary.
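The kernel-mode settings above correspond to values under the standard CrashControl registry key. The sketch below shows that mapping as an illustration; it assumes the default value names apply, requires administrative rights, and takes effect only after a restart.

```python
# Apply the kernel-mode crash dump settings described above.
import winreg

path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path, 0, winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "CrashDumpEnabled", 0, winreg.REG_DWORD, 1)   # write a memory dump
    winreg.SetValueEx(key, "DumpFile", 0, winreg.REG_EXPAND_SZ,
                      r"%SystemRoot%\memory.dmp")                        # default dump location
    winreg.SetValueEx(key, "AutoReboot", 0, winreg.REG_DWORD, 0)         # stay down for live debugging
```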

User Mode:

  • Dr. Watson is verified as the default debugger, but set for user intervention before a user.dmp is written.

  • The log file and user.dmp are both configured to write to %windir%.

  • Dr. Watson is set to create a crash dump.

  • Dr. Watson is set to append to the log file.

  • Dr. Watson is set to dump all threads.

  • The number of dumps created is set to the maximum.

Service Pack

The current Windows NT Service Pack is installed as part of the standard build. As new service packs are made available, they are tested and included.

Hot fixes

All currently approved Windows NT hot fixes are part of the standard build. As new fixes are picked up, they are added to the build and servers receive them each time the script is run.

Symbols

A symbol subset is installed to allow immediate user-mode debugging before application-specific symbols are loaded. While taking up much less space than a full symbol set (approximately 10 MB vs. 100 MB or more), the subset still allows the debug team to get valuable information before a full-scale debug is initiated. Actual disk space used is 8 MB, including symbols and debug executables.

1st tier Domain Controllers

Security and paged pool usage are configured on all 1st tier Domain Controllers. This includes setting strong passwords, setting the paged pool to the maximum allowable level (currently 192 MB), and setting the registry size limit to its maximum of 80% of paged pool (152 MB).
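Those two limits correspond to the PagedPoolSize and RegistrySizeLimit registry values. The sketch below applies the sizes quoted above; it is an illustration of the mapping rather than the customer's script, and the settings take effect only after a restart.

```python
# Apply the paged pool and registry size limits described above.
import winreg

MB = 1024 * 1024
hklm = winreg.HKEY_LOCAL_MACHINE

memory_mgmt = r"SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"
with winreg.OpenKey(hklm, memory_mgmt, 0, winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "PagedPoolSize", 0, winreg.REG_DWORD, 192 * MB)      # maximum paged pool

with winreg.OpenKey(hklm, r"SYSTEM\CurrentControlSet\Control", 0, winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "RegistrySizeLimit", 0, winreg.REG_DWORD, 152 * MB)  # 80% of paged pool
```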

TCP/IP Settings

A number of TCP/IP and AFD settings are modified.

Black Hole Router Detection

  • Enable PMTUBH Detect is turned on.

  • Maximum TCP retransmissions is set to 4 decimal.

SYN-ATTACK fixes

  • Dynamic Backlog is enabled.

  • Dynamic Backlog growth delta is set to 10 decimal.

  • Maximum Dynamic backlog is set to 14 decimal.

  • Minimum Dynamic backlog is set to 14 decimal.

Other

  • Default TTL is set to 40.

  • TCP timed wait delay is set to 60 decimal.
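The settings above map onto standard value names under the Tcpip and AFD Parameters keys (EnablePMTUBHDetect, TcpMaxDataRetransmissions, DefaultTTL, TcpTimedWaitDelay, and the dynamic backlog values). The sketch below applies the exact numbers listed in the text; it assumes that mapping is the intended one and should be tested before use, with a restart for the changes to take effect.

```python
# Apply the TCP/IP and AFD registry settings described above.
import winreg

hklm = winreg.HKEY_LOCAL_MACHINE

tcpip = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
with winreg.OpenKey(hklm, tcpip, 0, winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "EnablePMTUBHDetect", 0, winreg.REG_DWORD, 1)        # black hole router detection
    winreg.SetValueEx(key, "TcpMaxDataRetransmissions", 0, winreg.REG_DWORD, 4)
    winreg.SetValueEx(key, "DefaultTTL", 0, winreg.REG_DWORD, 40)
    winreg.SetValueEx(key, "TcpTimedWaitDelay", 0, winreg.REG_DWORD, 60)

afd = r"SYSTEM\CurrentControlSet\Services\AFD\Parameters"
with winreg.CreateKeyEx(hklm, afd, 0, winreg.KEY_WRITE) as key:                 # create the key if missing
    winreg.SetValueEx(key, "EnableDynamicBacklog", 0, winreg.REG_DWORD, 1)      # SYN attack mitigation
    winreg.SetValueEx(key, "DynamicBacklogGrowthDelta", 0, winreg.REG_DWORD, 10)
    winreg.SetValueEx(key, "MinimumDynamicBacklog", 0, winreg.REG_DWORD, 14)
    winreg.SetValueEx(key, "MaximumDynamicBacklog", 0, winreg.REG_DWORD, 14)
```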

IPAKS - Multiple Supported System Configuration

IPAKS are used to solve the challenge of using beta software along with multiple configurations of the released versions of the operating system. Two IPAKS are created—one for released versions of Windows NT, and one for beta versions of Windows NT.

IPAKs encourage the use of the latest combination of Windows NT service packs and hot fixes and help lower the number of required computer restarts by up to 50%. Fewer restarts increase overall system availability.

Typical IPAK Specification

| Component | Revision |
| --- | --- |
| Windows NT Server | 4.00 |
| Current Service Pack | 4 |
| Compaq Software Support Disk | 2.10a |
| Compaq Insight Manager | 4.2a |

| Bug # | Patch | Requested | File Version |
| --- | --- | --- | --- |
| 238037 | NDISWAN.SYS | 10/28/98 | 4.0:1381.133 |
| 240059 | SCHANNEL.DLL | 11/10/9 | 4.84:1901.1877 |
| 250663 | SNMP.EXE | 11/16/98 | 4.0:1381.133 |
| INSIGHT AGENTS | CPQMIDA.DLL | 10/28/9 | Not Compiled |

Recommended Fixes (Gold)

| Bug # | Patch | Requested | File Version |
| --- | --- | --- | --- |
| 254289 | WINS.EXE | 12/17/98 | 4.0.1381.135 |
| 243887 | DNS.EXE | 12/17/98 | 4.0.1381.133 |
| 2.1 | MDAC | 1/22/99 | 2.1 |
| 250635 | RPCLTS1.DLL | 11/13/98 | 4.0:1381.133 |
| 117 | WINLOGON.EXE | 12/15/98 | 4.0.1381.141 |

These fixes have been tested and approved by ASDDT, but are not part of the standard installation. Microsoft recommends applying these fixes to your data center servers after careful testing and evaluation.

Fixes Under Evaluation (Silver)

| Bug # | Patch | Requested | File Version |
| --- | --- | --- | --- |
| 658 | NDIS.SYS | 2/5/99 | 4.0.1381.133 |
| 535 | NTDLL.DLL | 2/5/99 | 4.0:1381.158 |
| 277964 | SNMPELEA.DLL | 1/15/99 | 1.2:0.0 |
| 44 (ASDDT) | CPQ32FS2.SYS | 1/9/99 | 3.21:1.1 |

Fixes Under Test (Bronze)

| Bug # | Patch | Requested | File Version |
| --- | --- | --- | --- |
| 541 | RASPPPEN.DLL | 2/5/99 | 4.1:1.100 |
| Jet Private | JET 4.0 RC4 | 2/15/99 | 4.0:2521.4 |
| 742 | IIS 4.0 Build 683 | 2/16/99 | 4.2:683.1 |

Tested and Approved Operating System and Patches

In addition to IPAKs, ratings are provided to help customers determine their business need for a specific patch.

Hot fix Prioritization Categories

  • Must have. Should be on all systems as soon as possible (for example, security fixes, stack issues, and other pervasive problems).

  • Highly recommended. Considered part of the standard platform. Needs to be present before ANY debugging takes place.

  • Recommended. Considered part of the twice-yearly IPAK rollup. While not absolutely essential, this definitely aids in stabilizing the platform.

  • May be necessary. Needed only in certain cases. Probably not included in Customer B Service Pack.

  • Probably not necessary. Only needed in very special cases. Never included in Customer I Service Pack.

Support Policies

An escalation requirement document helps define support policies. Guidelines for support are shown below.

Diagnostic assistance -

If there is an ongoing support problem appearing to involve the operating system, Customer I should be brought into the loop.

If there is a Windows NT issue that cannot be resolved through normal administrative means, Customer I should be brought in.

If a problem is either acute (the system cannot come up after a failure) or chronic, Customer I should be involved.

Debugging -

Facilitate debugging for all groups.

Debug all Windows NT core components and services.

Maintain a hot-list of systems that require after-hours support to maintain high availability.

Appendix D – Performance Analysis Checklist

The following example checklist is excerpted from procedures developed by a particular customer interviewed for this study. This example is not meant to be completely comprehensive - it simply illustrates one customer's approach to gathering data about computer components and processes to perform monitoring and performance analysis.

| Monitored Component and Process | Data |
| --- | --- |
| Memory: Available bytes | |
| Memory: Cache bytes | |
| Memory: Pool non-paged bytes | |
| Memory: Pool paged bytes | |
| Process: Handle count (all processes) | |
| Process: Pool non-paged bytes (all processes) | |
| Process: Pool paged bytes (all processes) | |
| Process: Private bytes (all processes) | |
| Process: Working set (all processes) | |
| Processor: % Processor time (for all processors) | |
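The counters in this checklist can also be sampled programmatically. The sketch below uses the pywin32 win32pdh module, which is one possible tool rather than something this checklist prescribes; rate counters such as % Processor Time need two collections a short interval apart.

```python
# Sample a few of the checklist counters through the Performance Data Helper
# interface (requires the pywin32 package).
import time
import win32pdh

COUNTER_PATHS = [
    r"\Memory\Available Bytes",
    r"\Memory\Cache Bytes",
    r"\Memory\Pool Nonpaged Bytes",
    r"\Memory\Pool Paged Bytes",
    r"\Processor(_Total)\% Processor Time",
]

query = win32pdh.OpenQuery()
counters = {path: win32pdh.AddCounter(query, path) for path in COUNTER_PATHS}

win32pdh.CollectQueryData(query)   # first sample (needed for rate counters)
time.sleep(1)
win32pdh.CollectQueryData(query)   # second sample

for path, handle in counters.items():
    _, value = win32pdh.GetFormattedCounterValue(handle, win32pdh.PDH_FMT_DOUBLE)
    print(f"{path}: {value:,.0f}")

win32pdh.CloseQuery(query)
```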

 

Appendix E – Help Desk Escalation Procedure

The following example table is excerpted from procedures developed by a particular customer interviewed for this study. This example is not meant to be completely comprehensive - it simply illustrates one customer's approach to defining Help Desk escalation procedures.

Type of Message: server-name: device inaccessible

Help Desk Response: Go to a command-line prompt. Type the command ping server-name and wait for a response. If there is a valid response, type the command net view \\server-name and wait for a response. If either of these commands does not produce a valid response, contact on-call server technical response. (A scripted version of this check appears at the end of this appendix.)

Type of Message: Consecutive device inaccessible messages from regional office servers

Help Desk Response: These indicate communication problems between the regional office and the home office; both network support and on-call server technical response should be alerted.

Type of Message: More than 4 consecutive "device inaccessible" messages from different home office servers

Help Desk Response: These indicate a possible network problem; both on-call server technical response and network support should be alerted.

Type of Message: Drive status change

Help Desk Response: Look for the following messages for the same server:

"DEVICE server-name: Physical drive status change. Controller slot #? Drive #? channel #?. Physical drive status: Failed"

"DEVICE server-name: Logical drive status change. Controller slot #? Drive #? channel #?. Logical drive status: OK"

This means the fault tolerance spare drive has kicked in to replace the bad hard drive. After the bad drive has been changed out, these messages will appear:

"DEVICE server-name: Logical drive status change. Controller slot #? Drive #? channel #?. Logical drive status: Rebuilding"

"DEVICE server-name: Spare drive status change. Controller slot #? Drive #? channel #?. Spare drive status: Inactive"

If server technical response does not contact data center operations within 30 minutes, contact on-call server technical response.

Type of Message: server-name: application exception

Help Desk Response: If server technical response does not contact data center operations within 30 minutes, contact on-call server technical response.

Type of Message: Power on error

Help Desk Response: If server technical response does not contact data center operations within 30 minutes, contact on-call server technical response.

Type of Message: ASR Recovery complete

Help Desk Response: The server has rebooted itself. If there are more messages for the same server, or if server technical response does not contact data center operations within 30 minutes, contact on-call server technical response.

Type of Message: Correctable memory module

Help Desk Response: If server technical response does not contact data center operations within 30 minutes, contact on-call server technical response.

Type of Message: Storage System Fan Status Change

Help Desk Response: If server technical response does not contact data center operations within 30 minutes, contact on-call server technical response.

Type of Message: Data Error: Device server-name User Threshold-Rising Alarm: File System n Value xxx Threshold xxx.

Help Desk Response: If server technical response does not contact data center operations within 30 minutes, contact on-call server technical response.

Type of Message: Device server-name: Network Interface Card Failed. Controller Slot #x Port #y, or Device server-name: Network Interface Card OK. Controller Slot #x Port #y

Help Desk Response: These two messages usually come one after the other, with the failed message coming first. This has been happening on two particular servers, and the owner is aware of the problem, which is related to an EtherLink card. If server technical response does not contact data center operations within 30 minutes, contact on-call server technical response.
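As referenced in the first entry above, the device-inaccessible check lends itself to scripting. The sketch below runs the same ping and net view commands the procedure already uses; the server name and the alerting step are placeholders for illustration.

```python
# Scripted version of the "device inaccessible" check: ping the server, then
# confirm it answers "net view". Both commands are the built-in Windows tools
# named in the procedure above.
import subprocess

def server_responds(server_name):
    ping = subprocess.run(["ping", "-n", "1", server_name],
                          capture_output=True, text=True)
    if ping.returncode != 0:
        return False                             # no ICMP response
    view = subprocess.run(["net", "view", rf"\\{server_name}"],
                          capture_output=True, text=True)
    return view.returncode == 0                  # share listing succeeded

if __name__ == "__main__":
    server = "SRV01"                             # hypothetical server name
    if not server_responds(server):
        print(f"{server}: contact on-call server technical response")
```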

Appendix F – Problem Resolution Checklist

The following checklist is excerpted from procedures developed by a particular customer interviewed for this study. This example is not meant to be completely comprehensive - it simply illustrates one customer's approach to creating a form to use while resolving problems.

[Figure: Problem resolution checklist form]

Appendix G – References and Relevant Web Sites

1 Brendan Murphy and Ted Gent. Measuring system and software reliability using an automated data collection process. Quality and Reliability Engineering International, page 13, 1995. CCC 0748-8017/95/050341
2 This curve is the combination of three separate equations. An initial burn-in phase first identified in [Duane, 1964] as the "learning curve", is combined with near linear and exponential rates observed in the second and third phases of a component's lifecycle [See Lyu, 1996].
3 Jim Gray and Andreas Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco, 1993.
4 A good description of RAID may be found at
5 "Mission-Critical Services Part Two: Definitions, Trends, and Competitive Offers", Dataquest Perspective, Eric Rocco, December 1997.