Deploying MS Windows NT Server for High Availability


Abstract

Many variables affect system availability, such as hardware, system software, and data. Environmental dependencies and application software technologies can also influence system uptime. This paper outlines the infrastructure technologies, procedural guidelines, and service and support practices that customers should use to deploy reliable systems with Microsoft® Windows NT® Server.

On This Page

Introduction
Availability Metrics
Causes of Downtime
Deploying Reliable Systems with Windows NT Server
Best Practices Summary
Conclusion
Additional Resources

Introduction

Today customers rely on the Microsoft® Windows NT® Server operating system to deliver highly available solutions in a variety of critical environments. For example, in the report Distributed Computing Platforms: Requirements for High-Availability Environments, Cahners In-Stat Group found that the majority of respondents surveyed believe that Windows NT Server has sufficient capabilities to meet their high-availability needs.1

Companies like Columbia/HCA Healthcare and Merrill Lynch are among those who have successfully deployed Windows NT Server-based solutions for high availability. These organizations, as well as others, have acquired extensive experience with production deployments of Windows NT Server, including departmental file servers, corporate branch offices, and regional data centers supporting thousands of concurrent users.

Planning for high availability deployments takes time and discipline. Successful deployments are invariably based on well-informed decision making, which relates as much to how one goes about a deployment as to the specific technologies being deployed. Many variables affect the availability of any system, such as hardware, system software, data, environmental dependencies, and application software technologies. Furthermore, many of the factors critical to a successful deployment are managerial, organizational, and procedural. Each must come together in carefully thought out and correctly applied policies and procedures.

This paper describes the availability metrics most frequently used, explains the principle causes of downtime, and outlines the infrastructure technologies, procedural guidelines, and service and support practices that customers should use to deploy reliable systems with Microsoft Windows NT Server.

Availability Metrics

Building for high availability requires systems to perform required functions under stated conditions for specified periods. These systems need to have predictable, reliable behavior on which customers can build operating plans for the systems critical to the functioning of their businesses. To end users and to managers of information systems, application availability—not system availability—is paramount. Although there is no industry-wide standard for calculating availability, mean time to failure and mean time to recover are the metrics most often cited.

Mean Time to Failure and Mean Time to Recover

Software and hardware components have different failure characteristics, which make it difficult to manage or predict software failures. Hardware components usually have what is known as an exponential failure distribution. Under normal circumstances, and after an initial phase, the longer a hardware component operates, the more frequently it will fail. Therefore, if the mean time to failure (MTTF) for a device is known, it may be possible to predict when it will enter its failure mode. Historical statistical data for mechanical and electrical components fit the so-called bathtub curve shown in Figure 1 below:


Figure 1: The Bathtub Curve2

In this model, there are three distinguishable phases in a component's life cycle: burn-in, normal operation, and failure mode. Each phase is characterized by some signature or behavior, which varies from domain to domain. Failure rates are characteristically high during burn-in, but drop off rapidly. Devices seldom fail during the normal aging period. As the devices age, however, the failure rates rise dramatically, albeit predictably.

This observation can be used in two ways. First, we can observe the characteristics of devices in the normal aging phase and look for indications that correlate with observed failure rates. Some hardware vendors do this, and have devised mechanisms to measure the characteristics of devices that can be used with some success as failure predictors.

Second, we can keep track of the actual failure rates of specific devices, and replace them before they enter the expected failure mode. This strategy is most often used when the cost of a failed component can be catastrophic. It also requires relatively long sample periods in order to capture statistically significant MTTF data. Unfortunately, MTTF statistics are normally only useful predictors for components with an exponential failure distribution. As a result, MTTF applies only to certain categories of software defects. For example, data dependent errors such as those triggered by an anomalous input stream require the input stream itself to be predictable in order for previous failure rates to act as predictors of future failure rates.

For systems that are fully recoverable, MTTR is an equally important quantity, since it correlates directly with system downtime. Overall, system downtime can be reduced on a one-for-one basis with the mean time to recover. A recoverable system with a nominal failure rate–but a zero recovery time–is indistinguishable from one with zero failures. Figure 2 below shows the relationship of the MTTF with the mean time to recover (MTTR), which is a measure of the average duration of an outage.


Figure 2: Relationship between MTTF and MTTR3

This concept is worth emphasizing for several reasons. Since the proportion of downtime is given by the ratio MTTR/MTTF, an exclusive focus on increasing MTTF without also taking MTTR into account will not achieve the desired result of maximizing availability. This is due in part to the exponential costs associated with designing hardware and software systems that never fail. Such systems tax current design methodologies and hardware technologies, and in all cases are extremely expensive to deploy and maintain.

Customers with high availability requirements should maximize MTTF by carefully designing and thoroughly testing both hardware and software, and reduce MTTR by using failover mechanisms, such as cluster server services in Windows NT Server 4.0, Enterprise Edition.
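To make the MTTF/MTTR relationship concrete, the short calculation below (an illustrative sketch with assumed sample figures, not data from this paper) computes steady-state availability as MTTF/(MTTF+MTTR) and shows how cutting recovery time directly cuts expected downtime.

```python
# Illustrative only: steady-state availability from MTTF and MTTR.
# The sample figures below are assumptions chosen for demonstration.

HOURS_PER_YEAR = 24 * 365

def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

mttf = 1000.0   # assumed mean time to failure, in hours
mttr = 1.0      # assumed mean time to recover, in hours

a = availability(mttf, mttr)
print(f"Availability: {a:.2%}")                                     # ~99.90%
print(f"Expected downtime: {(1 - a) * HOURS_PER_YEAR:.1f} h/year")  # ~8.8 h/year

# Halving MTTR roughly halves expected downtime, just as doubling MTTF would.
a2 = availability(mttf, mttr / 2)
print(f"With MTTR halved:  {(1 - a2) * HOURS_PER_YEAR:.1f} h/year")  # ~4.4 h/year
```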

Causes of Downtime

The high cost of downtime makes planning essential in environments with high availability requirements. The simplest model of downtime cost is based on the assumption that employees are made completely idle by outages, whether due to hardware, network, server, or application failure. In such a model, the cost of a service interruption is given by the sum of the labor costs of the idled employees, combined with an estimate of the business lost due to the lack of service.

A typical outage of a business-critical application can cost $10,000 or more per hour, based on averages derived from a survey of 400 companies in the Annual Disaster Impact Research.4 For financial sector companies, such as banking and brokerage firms, the numbers can be staggering, reaching millions of dollars per hour.
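The simple cost model just described can be illustrated with a short calculation. The headcount, loaded labor rate, and lost-revenue figures below are hypothetical assumptions used only to show how such an estimate is assembled.

```python
# Hypothetical example of the simple downtime-cost model described above:
# cost = labor cost of idled employees + estimated business lost.
# All figures are assumptions, not survey data.

idled_employees = 50            # assumed headcount made idle by the outage
loaded_labor_rate = 60.0        # assumed fully loaded cost per employee-hour ($)
lost_revenue_per_hour = 7000.0  # assumed business lost per hour of outage ($)
outage_hours = 4.0

labor_cost = idled_employees * loaded_labor_rate * outage_hours
business_lost = lost_revenue_per_hour * outage_hours
total_cost = labor_cost + business_lost

print(f"Labor cost of idle staff: ${labor_cost:,.0f}")    # $12,000
print(f"Estimated business lost:  ${business_lost:,.0f}")  # $28,000
print(f"Total cost of the outage: ${total_cost:,.0f}")     # $40,000 for 4 hours
```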

Several factors cause system outages. Cahners In-Stat Group recently surveyed Information System managers and executives, who identified the following root causes of downtime: software failure, hardware failure, operator or procedural error, and environmental failures. This survey found that while hardware failure accounts for up to 30% of all system outages, operating system and application failures combined account for slightly less than 35% of all unplanned downtime.5

Software Failures

Identifying the root cause of an outage can be complicated. For example, Cahners In-Stat Group observed that about 20 percent of all outages reported were attributed to operating system failures and 20 percent were attributed to application failures.

However, a separate internal study of calls to Microsoft Product Service and Support found that most calls reporting apparent operating system failure turned out to be due to improper system configuration, or to defects in third-party device drivers or other system software.

It should also be noted that defects in virus protection software can cause Windows NT Server system outages, because such software is implemented as kernel-level filter drivers. The section Deploying Reliable Systems with Windows NT Server will help customers select anti-virus products from only the most reliable vendors. Many system outages attributed to software problems can be avoided through better operational procedures, such as carefully selecting the software that you allow to run on your servers.

Hardware Failures

Hardware failures occur most frequently in mechanical parts such as fans, disks, or removable storage media. Failure in one component may induce failure in another. For example, defective or insufficient cooling may induce memory failures, or shorten the time to failure for a disk drive.

The mechanical and electromechanical components inside disk drives are among the most critical of these moving parts. New storage techniques have dramatically enhanced the reliability of today's disk drives, and market pressures have driven storage vendors to provide these improvements at commodity prices.

In 1988, when the first RAID paper was written, a commodity 100MB disk carried an MTTF rating of around 30,000 hours. Today, the MTBF specification for a commodity 2GB drive is 300,000 hours–about 34 years.6 Even so, hardware in demanding environments requires sufficient airflow and cooling equipment. System administrators are advised to use platforms capable of monitoring internal temperatures and generating Simple Network Management Protocol (SNMP) alarms when conditions exceed recommended ranges.

Random access memories can use parity bits to detect errors, or error-correcting codes (ECC) to detect and correct single-bit errors and to detect two-bit errors. The use of ECC memories for conventional and cache memories is as important to overall system reliability as the use of RAID.
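To illustrate the difference between simple parity (detection only) and ECC (detection plus correction), here is a minimal sketch of even parity over an 8-bit word. It is a toy model of the principle, not of any particular memory controller: a single flipped bit is detected but cannot be located, and a double-bit error slips through.

```python
# Toy illustration of even parity on an 8-bit word: a single flipped bit is
# detected (the stored parity no longer matches), but it cannot be located or
# corrected, and a double-bit error cancels out -- that is what ECC adds.

def parity_bit(word):
    """Even parity: 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2

stored_word = 0b1011_0010
stored_parity = parity_bit(stored_word)

single_fault = stored_word ^ (1 << 5)      # one bit flips
double_fault = stored_word ^ 0b0000_0011   # two bits flip

print("single-bit error detected:", parity_bit(single_fault) != stored_parity)  # True
print("double-bit error detected:", parity_bit(double_fault) != stored_parity)  # False
```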

Network Failures

With distributed systems, it is very important to realize that performance and reliability of the underlying network is a significant contributor to the overall performance and reliability of the system.

Changes in the topology and design of any layer of the protocol stack can impact the whole. To ensure robustness, it is necessary to evaluate all layers. This holistic approach, however, is rarely used. Instead, many businesses view the network as a black box with a well-defined interface and service level, ignoring the evidence of any coupling between system and network. A similar divide can exist between application programmers and operating systems staff.

Operations Failures

Reliable deployments of Windows NT Server-based systems require some procedures that may not be obvious to those rightsizing from more elaborate mainframe or minicomputer-based data centers, or migrating their IT operations from more informal personal computer environments. Customers can minimize or altogether avoid many of the problems identified so far through disciplined operational procedures, such as regular and complete backups, and avoidance of unnecessary changes to configuration and environment.

The Windows NT Resource Kit documents many features of Windows NT that are essential for the reliability-conscious IT manager to understand.7 For example, the Resource Kit describes how certain changes to the system configuration, including the addition or removal of system software or hardware, change the registry. This information, combined with some essential initialization information, can most easily be backed up using the RDISK utility, which should be re-run after each major configuration change is made to the server system.

System engineers tasked with analyzing deployments of Windows NT for high availability and performance should reconstruct their emergency repair diskettes when the system hardware configuration changes, and particularly with the addition of new hardware. Administrators should completely back up the system before and after making any such significant configuration changes. This includes making sure that the registry is backed up as part of normal back up procedures.

Deploying proper data protection mechanisms is essential, with the capacity of database systems often in the 100 GB range and increasing. In particular, RAID systems can enhance the scalability and performance of disk subsystems while simultaneously enhancing data integrity. Redundant storage subsystems are even more attractive now than when they were first introduced, due to the decreasing cost of disks compared to the increasing cost of downtime. RAID is discussed further in the next section.

Consistent, detailed monitoring procedures are critical to deploying systems for high availability. First, logical and physical access to servers should be restricted. Second, the system event log should be monitored regularly so that failures and potential failures do not languish undetected. Devices on the Hardware Compatibility List are required to use the event log to record problems. Many systems designed for maximum reliability can continue operating after a single failure, such as a failed disk in a redundant RAID 5 volume; a subsequent failure, however, will cause an outage and possibly loss of data. Automated procedures for alarm notification should be set up, such as pager notification of SNMP alarms.
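One way to keep event-log review from being skipped is to automate the first pass. The sketch below assumes the log has already been exported to a plain-text file (one entry per line) and that a site-specific notify_operator hook exists for paging or raising an SNMP trap; both the file name and the hook are hypothetical, shown only to illustrate the monitoring loop.

```python
# Illustrative monitoring sketch: scan an exported event-log text file for
# warnings and errors and hand them to a notification hook. The export file
# name and the notify_operator() hook are assumptions, not a documented
# Windows NT interface.

from pathlib import Path

SEVERITIES_OF_INTEREST = ("ERROR", "WARNING")

def notify_operator(entry):
    """Placeholder for a real pager gateway or SNMP trap sender."""
    print(f"ALERT: {entry}")

def scan_event_log(export_path):
    """Return the number of suspicious entries found in the exported log."""
    hits = 0
    for line in Path(export_path).read_text(encoding="utf-8").splitlines():
        if any(sev in line.upper() for sev in SEVERITIES_OF_INTEREST):
            notify_operator(line)
            hits += 1
    return hits

if __name__ == "__main__":
    found = scan_event_log("system_log_export.txt")   # hypothetical export file
    print(f"{found} entries need operator attention")
```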

Another way to avoid problems is to keep up with and understand the risks and benefits of system upgrades and service packs. Most large organizations establish their own testing organizations to qualify service packs and define baselines for their organizations. Procedures for doing so most efficiently and with the highest quality assurance are described in the following sections.

Environmental Failures

Statistics from one disaster-recovery study found that 27 percent of declared data-center disasters are recorded as due to power loss.8 In this study, a declared disaster was defined as an incident in which there was actual loss of data in addition to loss of service. This figure includes power outages due to environmental disasters such as snowstorms, tornadoes, and hurricanes.

Investing in an uninterruptible power supply (UPS) is an essential first step. Windows NT Server has built-in support for UPS hardware.

Deploying Reliable Systems with Windows NT Server

Windows NT Server offers a variety of operating system technologies that can enhance overall system reliability. This section contains guidelines to help system and network administrators configure Windows NT-based server systems for maximum reliability.

Deploying Windows NT Server for High Availability

Chicago Stock Exchange-Second Largest Stock Exchange in the U.S.

In September 1997, the Chicago Stock Exchange became the first U.S. exchange to standardize its trading systems on Windows NT 4.0. This decision was severely tested when the Dow Jones Industrial Average fell 554 points during the week of Oct. 27. The new system handled the record volume flawlessly, with capacity to spare, demonstrating that Windows NT would serve them well now and in the future. Says Steve Randich, CIO of the Chicago Stock Exchange, "We believe the Windows NT OS is exceedingly stable and reliable-we've proven that."

Uninterruptible Power

The Uninterruptible Power Supply (UPS) service is a system software component of Windows NT that can be configured to detect and warn of impending power failure. The UPS device itself has built-in electronics that constantly monitor line voltages. If the line voltage fluctuates above or below pre-set limits, or fails entirely, the UPS supplies power to the computer system from built-in batteries. UPS systems provide a hardware interface that can be connected to the computer. Using appropriate software, this interface enables an orderly handling of the power failure, including performing a system shutdown before the UPS batteries are depleted.

A UPS offers significant benefits, considering that power loss accounts for almost 27% of all unplanned outages. In some locations, and at certain times of the year, power outages can occur as often as once a day.9 Operators should also use redundant power supplies for maximum reliability.

Windows NT has built-in UPS functionality that takes advantage of the special features that many UPS systems provide. These features ensure the integrity of data on the system and allow the computer system and UPS to be shut down in a controlled manner if a power failure outlasts the UPS batteries. In addition, users connected to a computer running Windows NT Server can be notified that a shutdown will occur, and new users are prevented from connecting to the computer. Finally, damage to the hardware from a sudden, uncontrolled shutdown can be prevented. The most important consideration in selecting a UPS product is to use only hardware that is listed on the Windows NT 4.0 Hardware Compatibility List (HCL).
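The decision logic behind an orderly power-failure response can be pictured with a small sketch. The battery figure and threshold below are placeholders standing in for whatever signaling a given UPS exposes; this is a conceptual model only, not the interface of the Windows NT UPS service.

```python
# Conceptual model of an orderly power-failure response, along the lines the
# built-in UPS service follows: warn users, block new connections, then shut
# down before the batteries run out. The battery value and threshold are
# placeholders/assumptions, not a real UPS interface.

SHUTDOWN_MARGIN_MINUTES = 5      # assumed battery margin held in reserve

battery_minutes_remaining = 12   # placeholder: would be read from the UPS

def power_failure_response():
    global battery_minutes_remaining
    print("Power failure detected; running on battery.")
    print("Warning connected users and refusing new connections.")
    while battery_minutes_remaining > SHUTDOWN_MARGIN_MINUTES:
        battery_minutes_remaining -= 1          # simulate the battery draining
    print(f"{battery_minutes_remaining} minutes of battery left; "
          "beginning controlled shutdown.")

power_failure_response()   # ends with a controlled shutdown at the margin
```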

To learn more about Uninterruptible Power Supply, including how to configure and use the UPS service, read the Windows NT Server 4.0 Resource Kit.

Fault Tolerant Storage

The term Redundant Array of Inexpensive Disks (RAID) was first coined by Chen, Gibson, Katz, and Patterson of the University of California at Berkeley in their 1988 paper, although the general concept had been around longer. The RAID Advisory Board (RAB) has since renamed the term, replacing inexpensive with independent.10

Essentially, RAID technology minimizes loss of data caused by problems with accessing data on a hard disk. RAID is a fault-tolerant disk configuration in which part of the physical storage capacity contains redundant information about data stored on the disks. The redundant information enables regeneration of the data if one of the disks or the access path to it fails, or a sector on the disk cannot be read. Some vendors sell disk subsystems that implement RAID technology completely within the hardware. Some of these hardware implementations support hot swapping of disks, which enables you to replace a failed disk while the computer is still running Windows NT Server.

Normally a RAID set appears to applications and the operating system as a single large disk drive, although it is actually an array of drives with equal capacity. RAID terminology is standardized by level, as indicated in Table 1.

Table 1 RAID Levels

Level | Description            | Implementation
0     | Striping               | Portions of data elements are written to separate disks.
1     | Mirroring              | All data is written in its entirety to multiple disks.
0+1   | Striping and Mirroring | Portions of data elements are written to separate disks, and all data is written in its entirety to multiple disks.
3     | Byte-Level Parity      | Data is striped across separate disks, with parity (computed byte by byte) written to a separate drive.
5     | Block-Level Parity     | Data is striped across separate disks, with parity (computed block by block) written to the next available drive.

Windows NT Workstation and Windows NT Server both provide a software implementation of disk striping at RAID level 0. Windows NT Server also provides an implementation of RAID level 5. Both can take advantage of advanced RAID capabilities provided by disk array controllers. Cluster server services in Windows NT Server 4.0, Enterprise Edition use RAID subsystems exclusively.
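Block-level parity (RAID 5) can be illustrated with XOR arithmetic: the parity block of a stripe is the XOR of its data blocks, so any single lost block can be rebuilt from the survivors. The sketch below is a toy model of that idea, not of any particular controller or of the Windows NT fault-tolerant driver.

```python
# Toy illustration of RAID 5 block-level parity: the parity block is the XOR
# of the data blocks in a stripe, so any one missing block can be rebuilt by
# XOR-ing the remaining blocks together. Illustrative model only.

from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe spread across three data disks (toy 8-byte blocks).
data_blocks = [b"NT4.0 HA", b"RAID-5!!", b"parity.."]
parity_block = xor_blocks(data_blocks)

# Simulate losing the second disk: rebuild its block from the others + parity.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity_block])

assert rebuilt == data_blocks[1]
print("rebuilt block:", rebuilt)   # b'RAID-5!!'
```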

  • Use RAID 0+1 (mirroring and striping) for data disks, in order to maximize performance at the cost of half of your storage space. Use RAID 1 (mirroring) for log disks holding the recoverable file system redo and undo information, in order to maximize fault tolerance, and RAID 1 for the boot disk. Administrators will need to disable mirroring in order to install or upgrade the operating system, and re-enable mirroring after installation.

  • Evaluate the use of separate disk controllers for each of the two mirror drives in a RAID 1 configuration to maximize throughput and avoid a single point of failure.

The Windows NT File System (NTFS) allows for recovery from errors using a transactional technique. A journal is kept of changes made to the file system, and in case of a failure, these changes can be redone or undone as necessary in order to maintain the integrity of the file system. NTFS also allows for hot fixing bad disk sectors, by automatically moving data from suspect sectors to known good sectors. This operation is transparent to applications.

Chapters 3 and 4 in the Windows NT Server 4.0 Resource Kit contain valuable information on disk management basics and reliable configurations using RAID. For information about RAID arrays that are compatible with Windows NT, see the Windows NT Hardware Compatibility List (HCL).

Deploying Windows NT Server for High Availability

First Union Capital Markets-Handling billions of dollars each business day

Few businesses rely on the mission-critical availability of networks and computer data more heavily than banks. In a recent update of its network, First Union Capital Markets chose Microsoft Windows NT Server, Enterprise Edition for its backbone. The new system delivers a high degree of fault tolerance for critical operations and data. "Fault tolerance for the network was at the top of our list, because our traders cannot have computers that go down. Up time is key to our business," says Sushil Vyas, assistant vice president in charge of First Union Capital Market's Windows NT Server infrastructure.

Recoverable File Systems

The Windows NT File System (NTFS) is a recoverable file system that uses intelligent caching and allows recovery from a disk failure. It helps guarantee the consistency of the volume by using standard transaction logging and recovery techniques, although it doesn't guarantee the protection of user data. All data is accessed via the file cache. While the user searches folders or reads files, data to be written to disk accumulates in the file cache. If the same data is modified several times, all those modifications are captured in the file cache. The result is that the file system needs to write to disk only once to update the data.

When a disk error occurs during a write, NTFS can automatically remap the bad sector and allocate a new cluster for the data. This occurs even on a conventional (i.e., non-fault-tolerant) volume. On a fault-tolerant volume, Windows NT Server can automatically recover even from a read error, by retrieving the lost data from a mirror drive or re-computing the lost data from a stripe set with parity.
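The recovery idea behind a journaled file system (log the intended change before applying it, then redo or undo on restart) can be sketched in a few lines. This is a deliberately simplified write-ahead-log model used to illustrate the technique; it is not NTFS's actual on-disk log format.

```python
# Simplified write-ahead-log model of transactional metadata updates: the
# change is logged before it is applied, so recovery can undo updates from
# transactions that never committed. Not NTFS's real log format.

volume = {"fileA": "old"}      # toy stand-in for on-disk metadata
log = []                       # append-only journal

def begin(tx):
    log.append(("BEGIN", tx, None, None))

def record(tx, key, new_value):
    log.append(("UPDATE", tx, key, (volume[key], new_value)))  # log first
    volume[key] = new_value                                    # then apply

def commit(tx):
    log.append(("COMMIT", tx, None, None))

def recover():
    """Roll back updates belonging to transactions that never committed."""
    committed = {tx for kind, tx, _, _ in log if kind == "COMMIT"}
    for kind, tx, key, values in reversed(log):
        if kind == "UPDATE" and tx not in committed:
            volume[key] = values[0]   # restore the old value

# A crash occurs after the update is applied but before the commit is logged.
begin("tx1")
record("tx1", "fileA", "new")
# -- crash here: commit("tx1") never runs --
recover()
print(volume)   # {'fileA': 'old'}  -> the volume is consistent again
```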

The Windows NT Server 4.0 Resource Kit contains detailed information on recovering data. For more information on NTFS and reliable management services in general, visit https://www.microsoft.com/ntserver/techresources/management/Reliable.asp .

Distributed File System

Microsoft Distributed File System (Dfs) for Microsoft Windows NT Server operating system is a network server component that makes it easier to find and manage data on a network. Using Dfs, administrators can create a single, hierarchical view of multiple file servers and file server shares on their networks. Instead of seeing a physical network of dozens of file servers–each with a separate directory structure–users will now see a few logical directories that include all of the important file servers and file server shares. Each share appears in the most logical place in the directory, no matter what server it is actually on.

Dfs has many features and benefits. For example, Dfs provides better data availability and load balancing. Multiple copies of read-only shares can be mounted under the same logical Dfs name to provide alternate locations for accessing data. In the event that one of the copies becomes unavailable, an alternate will automatically be selected. This ensures that important business data is always available, even if a server, disk drive, or file occasionally fails.

Load balancing is another advantage of using the Distributed File System. For example, multiple copies of read-only shares on separate disk drives or servers can be mounted under the same logical Dfs name, thereby permitting limited load balancing between drives or servers. As users request files from the Dfs volume, they are transparently referred to one of the network shares comprising the Dfs volume. This improves response time during peak usage periods.
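The availability and load-balancing behavior just described boils down to replica selection: pick one of the shares backing a Dfs name, and skip any replica that is unreachable. The sketch below is a conceptual model with a hypothetical is_reachable probe; the actual referral logic lives in the Dfs client and server components.

```python
# Conceptual model of replica selection for a read-only Dfs volume: requests
# are spread across the shares backing one logical name, and an unreachable
# replica is skipped. The is_reachable() probe is a stand-in, not the real
# Dfs referral mechanism.

import random

REPLICAS = {
    r"\\corp\public\reports": [
        r"\\server1\reports",
        r"\\server2\reports",
        r"\\server3\reports",
    ]
}

def is_reachable(share):
    """Placeholder health probe; pretend one server is currently down."""
    return share != r"\\server2\reports"

def resolve(dfs_path):
    """Pick a live replica at random, skipping any that fail the probe."""
    candidates = list(REPLICAS[dfs_path])
    random.shuffle(candidates)              # crude load spreading
    for share in candidates:
        if is_reachable(share):
            return share
    raise RuntimeError(f"no replica of {dfs_path} is available")

print(resolve(r"\\corp\public\reports"))    # \\server1\reports or \\server3\reports
```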

The technical white paper, Distributed File System: A Logical View of Physical Storage, further describes Dfs technology.

Multi-homed Network Servers

Installing multiple network interface cards (NICs) can enhance the reliability of critical network servers. A host configured with more than one network interface in this way is called a multi-homed host.

In some cases, each NIC can be configured with a unique IP address. Configuring multiple interface cards onto separate subnets can enhance both performance and reliability. First, shortening the network routes between clients and servers can enhance performance. Second, this can boost reliability because clients may be able to find alternate routes to critical network services in the event of a failed network adapter. Care should be taken to properly configure servers in this way, because some network services do not operate as expected on multi-homed hosts. For more information, see article 181774 in the Microsoft Knowledge Base.

For maximum reliability, the secondary interface card should be configured as a backup in case of failures on the primary, including software, hardware, and even some network failures.

In this case, the secondary interface is kept in a hot standby mode. If the primary adapter fails, the standby adapter takes over the addresses of the failed device. This sort of network level redundancy is a feature of the particular network device and its associated device driver. For more information and examples of devices and software supporting hot standby, see https://www.alteon.com/cross1.htm and https://www.adaptec.com/products/ .
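Hot standby at the adapter level amounts to a small monitoring decision: watch the primary interface, and move its address to the standby when it stops responding. The sketch below models only that decision; the health probe and the address takeover are hypothetical hooks, since the real mechanism lives in the vendor's adapter driver, as noted above.

```python
# Conceptual model of NIC hot-standby failover: the standby adapter takes over
# the primary's IP address when the primary stops passing a health check.
# primary_healthy() and assume_address() are placeholder hooks; the real work
# is done by the adapter vendor's driver and software.

def primary_healthy():
    """Placeholder link/health check on the primary adapter."""
    return False   # simulate a failed primary for this example

def assume_address(adapter, ip_address):
    """Placeholder for the driver-level address takeover."""
    print(f"{adapter} now answering for {ip_address}")

def choose_active(primary, standby, service_ip):
    """Return the adapter that should currently own the service address."""
    if primary_healthy():
        return primary
    assume_address(standby, service_ip)
    return standby

active = choose_active("NIC-A", "NIC-B", "10.0.0.5")
print("active adapter:", active)   # NIC-B, because the primary check failed
```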

Deploying Windows NT Server for High Availability

ConXion-A top 10 Internet service provider based in San Jose, CA

To compete against larger, established ISPs, ConXion had to deploy a highly compelling combination of technologies that could provide 7-by-24 availability, great expansion capabilities, and pricing that would create a true competitive advantage. "We've heard about people who doubt that Intel and Windows NT systems are reliable for the kind of mission-critical services we offer," said Antonio Salerno, CEO of ConXioN. "But we have audited numbers that can back it up. We have complete confidence in this platform."

Reliable System Services

Windows NT Server 4.0 and Windows NT Server 4.0, Enterprise Edition include support for two reliable system services that applications can use to deliver high availability to end-users—Microsoft Transaction Server and Microsoft Message Queue Server. Microsoft Transaction Server is a robust runtime environment for deploying high-performance, on-line transaction-processing applications. Microsoft Message Queue Server provides a high-performance, asynchronous messaging infrastructure.

Transaction Server

Through Internet browsers and intranet client applications, users can access custom and standard business applications and data. Microsoft Transaction Server supports a component design that fosters reusability of business services.

Many applications create and interact with components running in the client process. Microsoft Transaction Server makes it possible to create and access components running in a server process. This improves process isolation–a component failure cannot take down the calling process. In addition, components can be placed on several machines. If one machine fails, components on other machines remain available to support client requests.

Microsoft Transaction Server automatically restarts server processes if they fail. Windows NT automatically restarts the Microsoft Transaction Server executive if it fails. Thus, on-line applications will only be down for seconds, unless Windows NT Server fails.

Updates that span components, databases, and networks are committed using a two-phase commit protocol. During the first protocol phase, updates are readied–databases write new records to storage, but do not commit the updates. During the second phase, the updates are committed and old data is purged.
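A stripped-down model of two-phase commit makes the prepare/commit split concrete: the coordinator asks every participant to get ready, and only if all of them vote yes does it tell them to commit. This is a teaching sketch of the protocol idea, not the Distributed Transaction Coordinator used by Microsoft Transaction Server.

```python
# Teaching sketch of two-phase commit: phase 1 asks every participant to
# prepare (make the update durable without committing); phase 2 commits only
# if every participant voted yes, otherwise everyone rolls back. This models
# the protocol idea only; it is not the MS DTC implementation.

class Participant:
    def __init__(self, name, will_prepare=True):
        self.name = name
        self.will_prepare = will_prepare
        self.state = "active"

    def prepare(self):
        self.state = "prepared" if self.will_prepare else "aborted"
        return self.will_prepare

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"

def two_phase_commit(participants):
    if all(p.prepare() for p in participants):   # phase 1: prepare
        for p in participants:                   # phase 2: commit
            p.commit()
        return True
    for p in participants:                       # phase 2: abort
        p.rollback()
    return False

dbs = [Participant("orders-db"), Participant("inventory-db")]
print(two_phase_commit(dbs), [p.state for p in dbs])   # True ['committed', 'committed']

dbs = [Participant("orders-db"), Participant("inventory-db", will_prepare=False)]
print(two_phase_commit(dbs), [p.state for p in dbs])   # False ['aborted', 'aborted']
```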

Finally, Microsoft Transaction Server can be configured for clustered servers when running under Windows NT Server 4.0, Enterprise Edition. In this scenario, if a server running Microsoft Transaction Server fails or if it is taken offline, then another server can take over processing requests for the failed server.

For more information, go to Microsoft Transaction Server - A Guide to Reviewing Microsoft Transaction Server (MTS) Release 2.0.

Deploying Windows NT Server for High Availability

Merrill Lynch- Worldwide financial, investment, and insurance services

To address their growing need to manage electronic communication from clients, Merrill Lynch decided to build a new application called the Compliance, Archive, and Retrieval System (CAR). Ultimately, this application is part of a greater infrastructure that will affect 35,000 desktops, 2,500 Windows NT Server-based systems, and IBM mainframes. The solution depends on the Microsoft Transaction Server and Microsoft Message Queue Server technologies in Windows NT Server to provide an extremely reliable mechanism for store-and-forward communication between the branch and the central office, and enables the CAR system to naturally handle the peaks and valleys of demand.

Message Queue Server

Applications communicate by placing messages in queues and reading messages from queues managed by queue managers. A message can be sent to a queue even when the receiver application is not on-line. The message remains in the queue until the receiver is ready to process it.

Messages are routed from a source machine to a destination queue using a store-and-forward protocol. Queue managers are responsible for forwarding messages to their destination queues. If a queue manager is unable to forward a message to another queue manager, it is kept in an internal queue until it can be forwarded.

Microsoft Message Queue Server supports express, recoverable, and transactional messages. Express messages provide extremely fast communications–thousands and even tens of thousands of messages can be sent per second–but they are stored in memory and will be lost when systems shut down. Recoverable messages are written to disk and survive shutdowns. Transactional messages are recoverable and can be sent or received in a transaction. The success of transactional messaging operations can be coordinated with database and Microsoft Transaction Server component updates. These operations all share the same transactional context, so the benefits of all-or-nothing apply.
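The difference between express and recoverable delivery comes down to where a queued message lives before it is forwarded: in memory only, or also on disk so that it survives a restart. The sketch below models a single queue making that choice; it is a conceptual illustration of the trade-off, not the Microsoft Message Queue Server storage engine.

```python
# Conceptual model of express vs. recoverable queuing: express messages live
# only in memory (fast, lost on shutdown); recoverable messages are also
# written to disk so they survive a restart. This illustrates the trade-off
# described above; it is not the MSMQ engine.

import json
from pathlib import Path

class Queue:
    def __init__(self, journal_path):
        self.journal = Path(journal_path)
        self.journal.write_text("", encoding="utf-8")   # empty journal for the demo
        self.memory = []

    def send(self, body, recoverable=False):
        self.memory.append(body)
        if recoverable:                                 # persist before acknowledging
            with self.journal.open("a", encoding="utf-8") as f:
                f.write(json.dumps(body) + "\n")

    def restart(self):
        """Simulate a shutdown: in-memory (express) messages are lost."""
        lines = self.journal.read_text(encoding="utf-8").splitlines()
        self.memory = [json.loads(line) for line in lines]

q = Queue("queue.journal")
q.send("express status ping")                   # fast, memory only
q.send("customer order #42", recoverable=True)  # survives restart
q.restart()
print(q.memory)   # ['customer order #42']
```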

By default, a system's queue manager is started when the Windows NT-based server boots. If the queue manager fails, Windows NT will detect the failure and automatically restart it. Thus, operators don't have to manually intervene before communication can continue.

Microsoft Message Queue Server can be configured for clustered servers when running under Windows NT Server 4.0, Enterprise Edition. If a server in a cluster running Microsoft Message Queue Server fails or it is taken offline, then another server can take over processing requests to send or receive messages.

For more information, visit Microsoft Message Queue Server.

Deploying Windows NT Server for High Availability

PulsePoint Communications-Integrated v-mail, fax, and e-mail solution

PulsePoint Communications develops carrier-grade solutions for progressive and competitive telecommunications service providers. The PulsePoint Enhanced Application Platform is the world's first carrier-grade, Internet-ready, open system using Microsoft Windows NT Server 4.0, Enterprise Edition. Users enjoy anytime, anywhere, any device access to messages and a full range of features. "Windows NT Server 4.0, Enterprise Edition is an integrated, mission-critical component of PulsePoint's carrier-grade messaging solution. Using Windows NT, we deliver 99.996% availability at a fraction of the cost of a comparable Solaris system." -Mark Ozur, CEO

Clustering Services

A cluster is a group of two or more computers interconnected in such a way that they can operate as a single system. Although composed of multiple nodes, clusters appear to clients, users, and administrators as a single unit. Clusters can extend overall system availability through redundancy. A service running on a failed node can be restarted on the remaining node with minimal, or in some cases, no measurable down time. Windows NT Server 4.0, Enterprise Edition supports 2-node clustering.

Cluster server services are capable of automatically detecting a failed service and restarting that service on a remaining cluster member. These services offer many other features crucial to enhancing overall system manageability and availability:

  • Enhanced manageability through the presentation of a single system image–cluster members are managed as a unit.

  • Support for rolling updates so that programs and data can be updated without taking the system offline.
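The failure detection described above can be reduced to a simple timing rule: each node expects a periodic heartbeat from its partner, and when heartbeats stop arriving within a timeout, the surviving node restarts the failed node's services. The sketch below illustrates only that rule; it is not the Microsoft Cluster Server heartbeat protocol or its failover policy engine.

```python
# Toy model of heartbeat-based failure detection in a two-node cluster: if no
# heartbeat has arrived from the partner within the timeout, the surviving
# node takes over the partner's services. The timeout and service names are
# arbitrary assumptions for illustration.

HEARTBEAT_TIMEOUT_SECONDS = 5.0   # assumed silence threshold

def partner_failed(last_heartbeat, now):
    return (now - last_heartbeat) > HEARTBEAT_TIMEOUT_SECONDS

def failover(services):
    for svc in services:
        print(f"restarting {svc} on the surviving node")

# Node B last sent a heartbeat at t=100s; it is now t=107s.
if partner_failed(last_heartbeat=100.0, now=107.0):
    failover(["file share", "print spooler", "database virtual server"])
```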

Clustering is an extremely important element in Microsoft's high availability strategy. Because configurations are tested as single systems, hardware and software validated for Microsoft Cluster Server have been demonstrated to pass some of the industry's most comprehensive compatibility and stress tests.

In MSCS-Can This Wolf Lead The Pack, DataQuest concludes, "The Windows NT clustering solutions in today's marketplace deliver higher availability to the network. Through failover, recovery, and management features, Windows NT clusters are able to offer up to 99.9 percent availability. Through the maturity of the Windows NT operating system and the industry-wide acceptance of Microsoft's clustering standard, Windows NT cluster solutions will continue to add features needed for mission-critical enterprise computing."

In August, Microsoft acquired Valence Research Inc., developer of TCP/IP load-balancing and fault tolerance software for Microsoft Windows NT Server. This technology makes it possible to build Internet web farms with up to 32 cluster nodes. Originally called Convoy Cluster Software, the product is available today; Microsoft has renamed it Windows NT Load Balancing Service (WLBS) and integrated it into Windows NT Server, Enterprise Edition. Microsoft currently uses this technology on MSN.com, MSNBC.com, and Microsoft.com. At Microsoft.com, for example, service availability now exceeds 99%.

For more information on cluster server services, download Cluster Strategy: High Availability and Scalability with Industry-Standard Hardware. Additional details can be found in Microsoft TechNet or at https://www.microsoft.com/ntserver/ProductInfo/Enterprise/clustering/cluster2.asp.

The Hardware Component

Businesses should avoid the temptation to reduce costs by using untested hardware or skimping on quality assurance procedures for hardware deployments. For example, consider the following questions before deciding to reduce costs in this manner:

  • How easy might it be to accidentally power down or reset the device? When considering system enclosures, note where the reset and power switches are located.

  • Are the power supplies adequate to the tasks to which they are being put? Power-related outages can be difficult–and expensive–to track down. Sometimes intermittent power problems even manifest themselves as apparent software failures.

  • Does the system have adequate fans and cooling features? Along with the storage devices, fans are the most likely points of mechanical failure.

Some facilities have been developed to monitor and identify potential hardware problems, such as excessive heat within an enclosure, noise, or high failure rates of disk drives.

For example, Compaq's Insight Manager XE uses management agents to collect data such as detailed status and fault statistics supplied from specially instrumented subsystems. This data is then interpreted and displayed through a single management interface for an entire installation, so that problems can be identified and in some cases, even predicted.

LANDesk Client Manager is another example. This software product from Intel is capable of monitoring client systems for a variety of potential problems, including memory errors and internal device temperatures. LANDesk is based on the Desktop Management Interface (DMI), and Insight Manager XE supports both DMI and Simple Network Management Protocol (SNMP).

Consider following these guidelines when deploying hardware in environments with high availability demands:

  • Test the most critical systems. Are the devices firmly socketed and fastened in place? One hardware maintenance engineer reports that more than 60% of all reported hardware failures can be corrected by reseating the card in the socket.

  • Use redundant data paths and positive locking when interconnecting devices.

  • Safeguard the physical server. Is physical access to the device too easily compromised? Will the device be deployed in a temperature-controlled environment?

As more and more operations are being identified as business critical, the servers on which these services depend are more often found in closets or in moderate-to-high traffic areas.

The Operational Component

Most successful deployments are characterized by relatively mature and disciplined procedures. Before listing specific recommendations for successful deployments of Windows NT Server with high availability requirements, it's important to highlight key points focused more on processes and procedures than on technologies. A successful technology deployment follows a structured process, and many organizations have developed structured process models for their own deployments. For example, Microsoft's own infrastructure deployment process model, described in Managing Infrastructure Deployment Projects, is based on the widely used spiral development model, with four major phases: evaluation, planning, analysis, and deployment.11 The essential aspect of the spiral model is that it is iterative and open-ended.

Many organizations, however, do not iterate. Their operational practices lack a process for periodically taking stock and for identifying and implementing improvements. Such organizations can fall into a reactive, event-driven (or failure-driven) mode of operations.

The importance of understanding the implications of operational preparedness when investing in new equipment should not be underestimated. Every new user, every new computer, and every new network element has an operational cost associated with it.

Deploying Windows NT Server for High Availability

Columbia/HCA Healthcare-Building a comprehensive network of affiliates to deliver quality healthcare services with maximum efficiency

As one of the nation's largest and fastest growing healthcare organizations, Columbia is building a comprehensive network of affiliates to deliver quality healthcare services with maximum efficiency. Today, Columbia achieves 99.5% overall availability from an end-user perspective for their Windows NT Server-based deployments of Microsoft Exchange, Internet Information Server, and file and print servers, according to Chris Costello, vice president of operations at Columbia/HCA Healthcare.

Procedural Guidelines

These guidelines are generally accepted best practices, as well as recommendations specific to deploying Windows NT-based systems for high availability.

Follow a Formal Capacity Planning Process

Errors made in the planning phase, such as deploying the wrong technology at the wrong time or in the wrong way, can be very expensive to correct. For example, one of the busiest sites on the Web, Nasdaq.com, is based on Windows NT Server systems and the Microsoft BackOffice® product family. The web site was developed using Microsoft development tools in close consultation with Microsoft and accredited solution providers. This site routinely counts hits in millions per day, with the industry's highest standards for availability.

To get started, read the technical paper, Managing Capacity in a Distributed Microsoft Windows NT-based System and visit the Windows NT Server planning web site. Microsoft also collaborated with Bluecurve to produce Microsoft Infrastructure Capacity and Reliability Management.

Adhere to the Hardware Compatibility List

Administrators should only deploy devices and system drivers that have been thoroughly tested. The Windows NT Hardware Compatibility List (HCL) is an extensive document, representing the enormous commitment Microsoft has made to ensure the quality of software for the Microsoft Windows® operating system platforms.

Many devices have drivers that work on Windows 95 and Windows NT operating systems, but have not undergone the extensive testing necessary for certification on the HCL. Others have been certified compatible for Windows 95 operating system, but not for Windows NT operating system. If an organization's desktops and servers warrant the security of Windows NT Workstation and Server operating systems, then they deserve the reliability of certified drivers and devices.

Current information about hardware compatibility can be found at https://support.microsoft.com/kb/131900 .

Understand and Use Microsoft Logo Programs

The Windows 95 and Windows NT Logo Program and the Designed for BackOffice Logo Program are the two most important system software compatibility programs.

A Designed for Windows NT logo means that a particular product has been independently tested, that it conforms to widely accepted usability criteria, and that it has minimal conflicts with other applications and devices when deployed on a Windows NT-based system. More information on Microsoft's logo programs is available at https://www.microsoft.com/winlogo/default.mspx.

The Designed for BackOffice Logo program has more demanding test requirements, which are focused on customer benefit. In addition to single-product testing and desktop integration, this program includes expanded testing for proper integration with the BackOffice family of products. Every requirement of the Designed for BackOffice Logo program is motivated by a focus on the customer's needs, with explicit rationale to enhance robustness and integration of the product with other Windows applications.

More information on the Designed for BackOffice Logo program is at https://www.microsoft.com/backofficeserver/ .

Test Before Going Operational

Quality does not come without cost, but the cost of poor quality is even higher. Software quality statistics estimate the cost of finding and correcting a defect found in operation as more than ten times the cost of defects found during system test.12

Even so, some organizations find it difficult to justify dedicating adequate resources for offline testing. The cost of allocating fully configured server machines, adequate time for testing, and in some cases isolated network environments including simulated client workloads can seem quite high. When compared with the costs of online failures, however, the costs of a test bed server are minimal.

Offline integration testing is necessary even if a deployment consists solely of commercially vended software. Every deployment is unique and software and hardware updates can sometimes have unanticipated interactions. Test and evaluate service packs offline before deployment on your live system. Good sources for information related to Microsoft service packs are the news://msnews.microsoft.com/microsoft.public.windowsnt.apps and news://msnews.microsoft.com/microsoft.public.windowsnt.setup newsgroups. Other newsgroups of interest reside under the comp.os.ms-windows.nt hierarchy.

Careful in-place testing is also encouraged. Lab environments are rarely able to reflect the real-world network and operational environments into which the system will eventually be deployed.

Control Access to Servers

Administrators should establish and maintain formal procedures to control access and configuration changes to critical system servers. This includes both physical access and logical access. They should also minimize the number of individuals who have administrative authorities, and use limited administrative rights where appropriate.

This is standard practice for large data centers, but many organizations are in transition from more informal environments to more mission-critical environments. Keeping track of who has access to which servers and what changes they make is essential. Even well-intentioned changes made without careful planning can cause severe service interruptions. Changes made with malicious intent are also possible, and may cause irreversible damage if not anticipated.

Organizations can also benefit significantly from periodic security reviews. Administrators should review both the security architecture that the system was designed for and the current, in-place configuration of ACLs, ownership, etc. For more information on security, go to https://www.microsoft.com/security .

Analyze Failures and Keep System Metrics

Many problems that may appear to be operating system errors turn out to be defects in third-party drivers or applications, or configuration errors. There should be an operations manual for every mission-critical environment that documents the configuration, suggests troubleshooting steps, and gives problem-isolation procedures. If possible, the steps should be non-invasive in order to avoid changing the state of the environment. Ideally, the manual will be updated periodically with information that helps the operator isolate a problem source.

If a system fails for some reason, spend the time to understand the root cause of the problem. In many cases, these defects have already been identified and corrected. Diagnosis and problem isolation are necessary to prevent similar events from recurring. Managers should put mechanisms in place to report problems to the appropriate hardware and software vendors. They should also periodically test to make sure that the operations staff is following change, monitoring, and troubleshooting procedures. For example, operators should turn event logs and incident reports into charts so that they can more easily follow trends, diagnose problems, and fix errant behaviors that affect availability.

Microsoft provides no-charge self-service support for Windows NT at https://support.microsoft.com/support .

Develop Backup Procedures

Backup programs vary in their specifics and procedures. As a result, it is important to identify thoroughly who should be doing certain tasks, what needs to be backed up, how to determine whether the backups were done correctly, and how often they should take place.

Be sure to test backups for integrity before moving them to off-site storage. The best way to do this is to periodically reconstruct your test environment from your active backups, perhaps as part of a larger disaster-recovery fire drill.

Backing up more than 60 Gigabytes of data poses significant challenges. To overcome these challenges, operators could configure the disks that require daily backup separately from the Windows NT Server system disks by using Redundant Array of Independent Disks (RAID) arrays. For critical computers, operators can implement a software mirror of two separate hardware-controlled RAID arrays. With this configuration, if either a disk or an entire array fails, operations can continue. If a component such as a network interface card, video device, IDE adapter, or power supply fails, it can be easily replaced. If the computer running Windows NT Server itself fails, you should have a spare computer with Windows NT Server already installed to which the data disks can be moved.

The Windows NT Server 4.0 Resource Kit contains detailed information on backing up data.

Use Updates and Service Packs

Administrators need to keep abreast of updates to applications and system software. Updates to existing systems have benefits and risks, so managers should apply them under controlled circumstances. Customers should also subscribe to premium support services from the system vendor, application vendors, and Microsoft. Lastly, Microsoft TechNet is an excellent source for the latest information from Microsoft on applications and operating systems. For information on the currently available service pack, check Microsoft TechNet.

Released to manufacturing (RTM) in 1998, Service Pack 4 improves system reliability because it corrects important customer-reported problems, such as memory leaks. It also includes over 30 support, diagnostic, and repair tools from the Windows NT 4.0 Resource Kit.

Avoid Any Single Point of Failure

In a complex system, it is often possible to provide redundancy for individual components. In other cases, administrators have established some redundancy in their infrastructure (by installing uninterruptible power supplies, for example), but they may have overlooked some other critical system components. The next few paragraphs elaborate on this practice.

Use Clustering for Application Failover

Clustered systems are specifically designed for enhanced reliability, and provide other advantages for production environments. If a particular system cannot afford downtimes of more than five minutes, then use a cluster. This is a rule common to Unix and Windows NT.

Look for applications Certified for Microsoft Cluster Server. This is an important certification procedure, similar to the Microsoft Logo programs, and provides some of the most extensive full-system integration testing available anywhere.

For frequently asked questions about Microsoft Cluster Server, go to https://www.microsoft.com/hcl/ .

Use Multiple NICs on Separate Network Segments

By installing multiple NICs on separate network segments, administrators can significantly reduce downtime due to network outages of any single segment. In a clustered configuration, it is also a good idea to configure two network segments–one for normal network traffic, and a second dedicated to the heartbeat signal used by cluster members to monitor the health of the cluster.

Use Replication to Add Redundancy at the Information Level

Replication is the process of keeping multiple databases synchronized. The databases underpinning domain controllers and domain name servers are essential to the correct functioning of a network. Configuring backup domain controllers and multiple name servers in a network increases availability through redundancy and simultaneously enhances performance through load balancing. Domain controllers and name servers maintain consistency by synchronizing their databases periodically.

One significant feature of Windows NT 5.0 is the improved replication capabilities offered by Active Directory. Replication in Active Directory occurs within groups after specific intervals or at specific times, so that synchronization of databases can be more effectively controlled. For example, synchronization between directory servers separated by relatively slow links can be scheduled during off-peak hours, while those connected via a high-speed local area network can be synchronized more frequently.

In addition, Active Directory synchronizes at the object property level, so that if an object in two databases differs in only one property, only that property is exchanged, rather than the entire object.
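Property-level replication can be pictured as diffing two copies of the same directory object and shipping only the properties that differ, rather than the whole object. The sketch below is a conceptual illustration of that idea only; the real Active Directory protocol additionally tracks update sequence numbers and resolves conflicts.

```python
# Conceptual illustration of property-level replication: compare two replicas
# of the same object and send only the properties that changed, instead of
# the entire object. Active Directory's real protocol also tracks update
# sequence numbers and resolves conflicts; that is omitted here.

def property_diff(source, target):
    """Properties whose values differ between the two replicas."""
    return {k: v for k, v in source.items() if target.get(k) != v}

replica_a = {"cn": "jsmith", "telephone": "555-0100", "title": "Engineer"}
replica_b = {"cn": "jsmith", "telephone": "555-0199", "title": "Engineer"}

changes = property_diff(replica_a, replica_b)
print("properties to replicate:", changes)    # {'telephone': '555-0100'}

replica_b.update(changes)                     # apply only the changed property
print("replicas converged:", replica_a == replica_b)   # True
```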

Continually Monitor and Tune Performance

The spiral model described earlier is open-ended. After deployment, administrators should continue to evaluate their infrastructure needs and the appropriateness of their solution to the tasks to which it is being applied. For most business-critical applications, planning and deploying reliable systems makes good economic sense. Nevertheless, doing so does not come without some associated cost. The best practices summarized in Table 2 at the end of this paper are distilled from customer surveys and interviews with experienced service personnel.

To get started, read the technical paper, Using Windows NT Performance Monitor.

Deploying Windows NT Server for High Availability

Dell Computer-Internet-based direct-to-consumer selling

Internet-based revenue is becoming a major sales channel for Dell Computer. The computer giant currently receives more than two million visits each week at https://www.dell.com , where the company maintains more than 40 country-specific sites. The on-line "Dell Store," which opened in July 1996, sells more than $4 million in merchandise each day. Dell relies on Windows NT Server, Internet Information Server, SQL Server, and Site Server, Commerce Edition in order to keep https://www.dell.com highly available in this demanding environment where the costs of downtime can escalate to hundreds of thousands, even millions of dollars.

Service and Support Offerings

Major system vendors have either introduced or are planning to introduce service and support programs that guarantee minimum end-to-end levels of system availability. For example, Hewlett-Packard recently announced a server suite offering a 99.9 percent uptime commitment for selected Windows NT Server-based systems.

Such offerings demonstrate how Microsoft works collaboratively to deliver highly available systems that customers can confidently deploy in the most demanding mission-critical environments.

For more information on Hewlett-Packard's recent announcement, go to https://www.hp.com/pressrel/sep98/15sep98d.htm .

Best Practices Summary

This table summarizes important recommendations and best practices for reliable and highly available deployments.

Table 2 Best Practices Summary

Practice | Good (99%) | Better (99.9%) | Best (99.99%)
Hardware Selection | Use only Windows NT HCL certified hardware | Use only BackOffice Logo hardware with high availability features | Use only MSCS cluster-validated configurations
Software Selection | Use only Designed for Windows NT Logo software | Use only BackOffice Logo software | Use BackOffice Logo software and Microsoft Cluster-aware applications
Storage Solution | Use RAID 0+1 for data disks and RAID 1 for log disks | Use hardware solutions for enhanced performance and recoverability of data disks (e.g., RAID 5) | In addition, use multiple disk controllers with redundant data paths
Random Access Memory | Use only error-detecting memories with parity | Use only memories with error-correcting codes (ECC) for enhanced memory error detection and correction | In addition, use only systems that support ECC memories for L2 cache
Configuration Management | Make fewer than three changes per month | Make fewer than three changes per quarter | Make fewer than three changes per year
Notification | Monitor system logs regularly (e.g., weekly) or when problems occur | Monitor all system logs daily | Monitor all system logs at least daily, and institute procedures for automatic notification when warnings or error conditions occur (e.g., using SNMP alarms)
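The availability tiers in the column headings of Table 2 translate into concrete annual downtime budgets, as the short calculation below shows.

```python
# Translating the availability tiers in Table 2 into annual downtime budgets.

HOURS_PER_YEAR = 24 * 365

for target in (0.99, 0.999, 0.9999):
    downtime_hours = (1 - target) * HOURS_PER_YEAR
    print(f"{target:.2%} availability -> {downtime_hours:.1f} hours "
          f"({downtime_hours * 60:.0f} minutes) of downtime per year")

# 99.00% -> about 87.6 hours per year
# 99.90% -> about 8.8 hours per year
# 99.99% -> about 53 minutes per year
```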

Conclusion

The number and diversity of applications critical to the effective operation of today's businesses are increasing dramatically. The availability demands placed on these applications are also growing tremendously as businesses place greater reliance on the Internet and intranets, remote computing, telecommuting, global competition, and cooperation.

Customers who have successfully deployed Windows NT-based systems for high availability understand that it is a complex task requiring the right combination of technology infrastructure, best practices, and system vendor participation. Success depends on applying procedures and processes comparable to those developed by administrators of mainframes and other high-end systems. It also requires that users adhere to recommendations specific to deploying Windows NT-based systems, such as hardware and software selection.

These high-level guidelines have been developed by many people through extensive experience with production deployments of Windows NT Server, including departmental file servers, branch offices, and regional data centers supporting thousands of concurrent users. Adhering to these principles will help customers re-create for themselves the successes that ConXion, Chicago Stock Exchange, Columbia/HCA Healthcare, Dell Computer, First Union Capital Markets, Merrill Lynch, and PulsePoint Communications have had deploying Windows NT Server-based systems for high availability.

For more information on Windows NT Server, check out Microsoft TechNet or go to https://www.microsoft.com/ntserver .

Additional Resources

Handbook of Software Reliability Engineering, Lyu, M. R., Ed. McGraw Hill, 1996

Windows NT Automated Deployment and Customization, Richard Puckett, Macmillan Technical Publishing, 1998

Windows NT Shell Scripting, Tim Hill, Macmillan Technical Publishing, 1998

Microsoft Windows NT Server 4.0 Resource Kit, Microsoft Corporation, Microsoft Press, 1996

Microsoft Windows NT Workstation 4.0 Resource Kit, Microsoft Corporation, Microsoft Press, 1996

A Discipline for Software Engineering, Humphrey, W.S., Addison-Wesley, 1997

Applied Software Measurement, 2nd Ed. Jones, C. McGraw-Hill, 1991, 1996

IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries. Institute of Electrical and Electronics Engineers. New York, NY: 1990.

Cahners Instat Group/BRG, Proprietary Survey, 1998

Annual Disaster Impact Research, Contingency Planning Research Inc. 4 W. Red Oak Lane White Plains, NY 10604

Managing Infrastructure Deployment Projects, (available on Microsoft TechNet as Net 304.doc).

The Windows Internet Naming Service Architecture and Capacity Planning, Microsoft TechNet

BackOffice Server 4.0 Performance Characterization White Paper, Microsoft Corporation

Chen, P., Gibson, G., Katz, R. H., Patterson, D. A., Schulze, M., "Introduction to Redundant Arrays of Inexpensive Disks", UC Berkeley ( https://cs-r.cs.berkeley.edu/Dienst/Repository/2.0/Body/ncstrl.ucb%2fCSD-88-479/ocr )

Duane, J.T. "Learning curve approach to reliability monitoring", IEEE Trans. Aerospace, 2(2), (1964) pp. 563-566

Gray, J., "Why Do Computers Stop and What Can Be Done About It?" Proc. IEEE Symp. Reliability in Distributed Software and Database Systems, 1986

Kronenberg, N., Levy, H., and Strecker, W.,"VAXclusters: A Closely Coupled Distributed System," ACM Transactions on Computer Systems, vol. 4, no. 2 (May 1986)

Russell, G.W., "Experience with Inspections in Ultralarge-scale Developments," IEEE Software (January 1991)

Semeria, C., "Understanding IP Addressing: Everything You Ever Wanted To Know", https://www.3com.com/nsc/501302.html

Short, R., Gamache, R., Vert, J., and Massa, M., "Windows NT Clusters for Availability and Scalability", Microsoft Corporation


1 For more information, visit https://www.instat.com/mktalrts/mds-high.htm
2 This curve is the combination of three separate equations. An initial burn-in phase, first identified in [Duane, 1964] as the "learning curve", is combined with near linear and exponential rates observed in the second and third phases of a component's lifecycle [See Lyu, 1996].
3 IEEE Standard Computer Dictionary
4 Contingency Planning Research, Inc.
5 These statistics correlate for both SunSoft Solaris-hosted systems and Windows NT-based systems.
6 Western Digital (https://www.wdc.com/products/drives/drivers-ed/mtbf.html)
7 Windows NT Server Resource Kit 4.0
8 Contingency Planning Research
9 APC (https://www.apcc.com/products/)
10 RAID Advisory Board (https://www.raid-advisory.com/)
11 Available on Microsoft TechNet as Net 304.doc.
12 A Discipline for Software Engineering, Humphrey, W.S.