Strategies for Fault-Tolerant Computing

Abstract

This document examines the role that fault-tolerant servers can play in solutions that require superior levels of availability. The unique benefits of fault-tolerant servers running the Windows operating system are also discussed, followed by recommendations on when to consider fault-tolerant servers for Windows-based solutions.


Introduction

Availability Defined

In the IT community, availability is defined as the percentage of time that a system is capable of serving its intended function. Unlike reliability metrics, which are best used to measure the probability of failure for a single solution component, a solution’s availability level measures the percentage of time that it remains “up and running” in support of an end-user or IT-enabled business process. Therefore, the reliability of all solution components (server hardware, operating system, application software, networking, and so on) can affect a solution’s availability.

As illustrated in Table 1, availability is typically measured in “nines”. For example, a solution with an availability level of “three nines” is capable of supporting its intended function 99.9 percent of the time, equivalent to 8.76 hours of downtime per year on a 24x7x365 basis.

Availability      Annual Downtime
99%               87.6 hours
99.9%             8.76 hours
99.99%            52.5 minutes
99.999%           5.25 minutes

Table 1: Correlation between availability and annual downtime.
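
The figures in Table 1 follow directly from the definition of availability as the fraction of time a system is up. The short Python sketch below (illustrative helper names, not part of the original paper) reproduces the table's downtime values, to within rounding, from a given availability percentage.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours on a 24x7x365 basis

def annual_downtime_hours(availability_percent: float) -> float:
    """Convert an availability level (e.g., 99.9) into hours of downtime per year."""
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)

for nines in (99.0, 99.9, 99.99, 99.999):
    hours = annual_downtime_hours(nines)
    if hours >= 1.0:
        print(f"{nines}% -> {hours:.2f} hours of downtime per year")
    else:
        print(f"{nines}% -> {hours * 60:.1f} minutes of downtime per year")
```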

Importance of Availability

Availability becomes increasingly important as businesses continue to increase their reliance on information technology. As such, the availability of mission-critical information systems is often tied directly to business performance or revenue. Depending on a system’s role in the enterprise, downtime can also lead to other negative consequences such as loss of life, customer dissatisfaction, lost productivity, bad press, or an inability to meet regulatory requirements.

Industry Sector                      Hourly Cost of Downtime
Manufacturing                        $28,000
Transportation                       $90,000
Retail, Catalog Sales                $90,000
Retail, Home Shopping                $113,000
Media, Pay Per View                  $1,100,000
Banking Datacenter                   $2,500,000
Financial, Credit Card Processing    $2,600,000
Brokerage                            $6,500,000

Table 2: Average cost of unplanned downtime for various industries.1

However, not all downtime is equally costly; the greatest expense comes from unplanned downtime. Outside of a system’s core service hours, downtime (and the corresponding overall availability level) may have little to no impact on a business, whereas a crash during core service hours can have a significant financial impact. Because unplanned downtime is rarely predictable and can occur at any time, companies looking to minimize risk should evaluate the cost of unplanned downtime during core service hours.

For example, in projecting the consequences of planned vs. unplanned downtime, consider a manufacturing scenario where a system is only used during plant hours. Within these intervals, each hour of unplanned downtime costs the company an average of $28,000. However, on evenings and weekends, the system can be taken offline for maintenance or application upgrades with no impact to the company’s operations. Therefore, while the system’s overall availability level may only be two nines, the corresponding 87.6 hours of annual downtime will not have a financial impact (other than possibly paying an IT staffer to work the occasional evening or weekend).
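
To make the distinction between core-hours and off-hours downtime concrete, the following sketch (hypothetical function names, using the $28,000/hour manufacturing figure from Table 2) charges only the portion of an outage that overlaps the core service window.

```python
def core_hours_lost(outage_start_hour: float, outage_end_hour: float,
                    core_start: float = 8.0, core_end: float = 16.0) -> float:
    """Hours of an outage that overlap the core service window (same day)."""
    return max(0.0, min(outage_end_hour, core_end) - max(outage_start_hour, core_start))

def outage_cost(outage_start_hour: float, outage_end_hour: float,
                hourly_cost: float = 28_000) -> float:
    """Business cost of an outage, counting only hours inside the core window."""
    return core_hours_lost(outage_start_hour, outage_end_hour) * hourly_cost

# A 2-hour crash at 10:00 during plant hours vs. 2 hours of planned work at 20:00.
print(outage_cost(10.0, 12.0))  # 56000.0 -> unplanned downtime during core hours
print(outage_cost(20.0, 22.0))  # 0.0     -> evening maintenance window, no business impact
```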

On the other hand, systems supporting functions such as telecommunications switching or a city’s 911 emergency police/fire/ambulance dispatch operations require 24x7x365 availability—often called continuous availability. In these situations, there are no off-hours; every second a system is down leads to an interruption in phone service for thousands of people, or, even more damaging, delayed response to a life-threatening situation. System administrators in these scenarios face an additional challenge: how to perform periodic hardware maintenance or install software upgrades without compromising availability.

The Availability Equation: People, Process, and Technology

There are three “pillars” of a highly available system: people, process, and technology. Deficiencies in any one of these areas can compromise system availability, just as the weakest link in a chain determines its overall strength.

Figure 1: People, process, and technology are the three “pillars” of a highly available solution.

  • People: Proper training and skills certification ensure that the people who are managing mission-critical systems and applications have the knowledge and experience required to do so. Strengthening this area requires more than technical know-how; IT administrators must also be knowledgeable in process-related areas.

  • Process: An organization must develop and enforce a well-defined set of processes covering all phases of a solution’s life cycle. Improvements in this area can be achieved by examining industry best practices and modifying them to address each solution’s unique requirements.

  • Technology: The technology component of a highly available solution comprises many of the areas mentioned above: server hardware, the operating system, device drivers, applications, networking, and many others. As with the people/process/technology dependency chain, the contribution that technology as a whole can make toward achieving a highly available solution is only as strong as its weakest component.

Within the technology pillar, there are several ways to improve availability. This topic is discussed further in the remainder of this document, with a focus on the role that fault-tolerant servers can play in delivering highly available solutions.

Fault-tolerant Servers

Fault Tolerance Defined

As operating systems, applications, drivers, and other software-based solution components become more reliable, hardware-related issues and failures play a larger relative role in determining a solution’s total downtime. One approach to minimizing these causes of downtime is through the use of fault-tolerant servers, combined with software that supports them.

Put simply, fault-tolerant servers are those that have complete redundancy across all hardware components. If a primary component fails, the secondary component takes over in a process that is seamless to the application running on the server. As such, fault-tolerant systems “operate through” a component failure without loss of data or application state. This differs from software-based failover clustering, in which a hardware or software failure on one server causes the workload to be shifted to a second server.

Although a system may have some redundant components, this does not necessarily make it a fault-tolerant server. Most high-end servers employ at least some redundant components to eliminate common points of failure (e.g., hot-swappable power supplies, ECC memory, or multipath I/O adapters), but will still fail when a non-redundant component such as a microprocessor or memory controller fails. True fault-tolerant servers, however, employ complete redundancy across all system components, ensuring that no single point of failure can compromise system availability. Some fault-tolerant servers extend this level of redundancy across datacenter boundaries by allowing the server’s redundant subsystems to be installed in separate yet connected locations.

Note: While servers with only selected redundant components cannot deliver the same level of hardware reliability as a fully fault-tolerant system, they do offer greater reliability than servers without redundant components. As such, they may present a cost-effective way to decrease unplanned downtime in situations where full fault-tolerance is not economically justified.
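
The effect of redundancy on availability can be illustrated with standard reliability arithmetic, assuming independent component failures; this is a textbook model, not a figure published by any of the vendors discussed here.

```python
def redundant_pair_availability(component_availability: float) -> float:
    """Availability of two independent, redundant components (either one keeps the system up)."""
    return 1.0 - (1.0 - component_availability) ** 2

def series_availability(*component_availabilities: float) -> float:
    """Availability of non-redundant components in series (all must be up)."""
    result = 1.0
    for a in component_availabilities:
        result *= a
    return result

# A redundant pair of 99.9%-available components vs. three such components in series.
print(redundant_pair_availability(0.999))        # 0.999999 -> "six nines" for the pair
print(series_availability(0.999, 0.999, 0.999))  # ~0.997   -> below three nines
```

The same arithmetic shows why a single non-redundant component, such as a lone memory controller, can dominate a server's overall hardware availability.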

Traditional Barriers to Adoption

Fault-tolerant servers have been used in limited capacity for a number of years, delivering virtually uninterrupted hardware availability in a variety of high-end computing scenarios: life safety, industrial process control, telecommunications, financial transactions, and many other business-critical applications in which uninterrupted computing is an absolute requirement. However, several factors have prevented fault-tolerant systems from achieving broader adoption:

  • Extremely high hardware costs. Use of fault-tolerant servers has traditionally been limited to niche markets, forcing hardware vendors to amortize engineering and manufacturing costs over a small number of units. Prior to 2000, the typical cost for an entry-level fault-tolerant server running a proprietary operating system was $250,000.

  • Complexity and expense of writing software. In the past, many fault-tolerant platforms dictated a unique application programming interface (API), one closely tied to the underlying hardware. Writing programs for these systems required a deep understanding of transactional semantics and manual “checkpointing” at the application level, leading to significantly higher initial and long-term costs (e.g., increased application development timelines and expenses, increased opportunity cost due to longer time to market, downtime costs while the application is developed or ported to the proprietary platform, and higher long-term software maintenance costs).

Fault Tolerance on Windows

Combined with the time to market, productivity, integration, and cost benefits of the Microsoft platform, reliability improvements in the Windows 2000 Server family of operating systems are compelling more and more businesses to deploy Windows-based solutions for their mission-critical computing needs. In these situations, companies often use software-based high-availability technologies included in Windows (e.g., message queuing, distributed transactions, failover clustering, and software load balancing) to minimize unplanned downtime.

However, there are scenarios when using these technologies may not be feasible or appropriate—or when an enterprise needs to further improve system availability by eliminating the potential for downtime due to hardware failure. To help address these situations, Microsoft designed the Windows 2000 Advanced Server operating system to fully support fault-tolerant servers.

The following companies offer integrated solutions for fault-tolerant computing based on Windows 2000 Advanced Server. Microsoft is working closely with these companies to ensure the delivery of integrated hardware, software, and service offerings that make fault-tolerance on Windows a cost-effective alternative to more expensive, proprietary solutions.

  • Marathon Technologies

  • NEC

  • Stratus Technologies

Early Adoption of Fault-Tolerance on Windows

There are two primary scenarios where organizations are running Windows-based solutions on fault-tolerant servers:

  • To increase availability for traditional Windows-based solutions. As Windows-based solutions continue to become more mission-critical, some companies are eliminating the potential for downtime due to hardware failure by moving these applications to fault-tolerant servers. Scenarios leading this trend include back-end messaging and database servers as well as middle-tier application servers.

  • As a cost-effective alternative to proprietary platforms. Companies are realizing lower costs by deploying fault-tolerant servers running Windows for solutions that have traditionally resided on clustered UNIX servers, mainframes, or proprietary fault-tolerant systems. Industry segments leading this trend include public safety (e.g., computer-aided emergency dispatch), financial transaction processing (e.g., stock trading and banking), and telecommunications (e.g., inline routers).

Regardless of the reasons for deploying fault-tolerant servers running Windows in any given situation, the platform’s rapid adoption across a broad range of scenarios—and in conservative industry sectors—is compelling evidence of its ability to deliver uncompromised availability. According to Stratus Technologies, which first began shipping fault-tolerant servers running a proprietary operating system in 1982 and added a UNIX-based offering in 1995, the company’s two-year-old Windows-based ftServer product line already accounts for 79 percent of all new business.2 NEC Corporation, which began shipping mainframes running proprietary operating systems in 1965, reports similar findings since the introduction of its FT Series fault-tolerant servers for Windows in early 2001.

Complete Solutions

Downtime for Windows 2000-based solutions is typically due to hardware failures, bad device drivers, user error, poor change control processes, and so on, with a very small percentage attributable to the core operating system. In addition to delivering the full “technology stack” required to minimize downtime (e.g., fault-tolerant hardware, a highly stable operating system, hardened device drivers, etc.), vendors of fault-tolerant servers for Windows also offer comprehensive training and service offerings, covering all three components of the people, process, and technology equation required to maintain a highly available system.

Several fault-tolerant system vendors go a step further in delivering availability-related services through continuous server monitoring. Every Stratus server continually monitors itself for component and operating system failure, and can be set to immediately call into the company’s Customer Assistance Center to report a failure or other important event. NEC offers a similar service through a strategic alliance with Unisys.

Measurable Results

Stratus monitors the availability level of every server connected to its service network. According to Stratus, the average availability level for all Windows-based ftServer systems monitored through the network is 99.9998 percent3—a measurement that considers both hardware- and operating system-related downtime. Not only does this validate the effectiveness of fault-tolerant servers, but it confirms that Windows 2000 Advanced Server can deliver mission-critical availability out-of-the-box when used with the proper technologies, managed by well-trained people, and supported with solid processes.

Unique Benefits

Fault-tolerance on Windows is a compelling solution for companies that need to achieve mission-critical availability but want to avoid expensive and proprietary solutions. Unlike other fault-tolerant platforms that may require “checkpointing” at the application level, support for fault-tolerance in Windows 2000 Advanced Server is handled entirely in the kernel and hardware abstraction layer, a method that makes it transparent to applications.

In addition, Windows-based fault-tolerant servers must pass the same rigorous Windows Hardware Compatibility Tests (HCT) as other servers, ensuring that the applications running on them will behave no differently. As such, companies that embrace fault tolerance on Windows will not only achieve very high levels of availability, but they will also realize the full range of other benefits provided by the Microsoft platform and .NET technologies.

Reduced Time to Market

Because no special development skills or tools are required, solutions intended to run on Windows-based fault-tolerant servers can be developed and deployed as rapidly as any other Windows-based application. Companies can take advantage of the rich functionality provided in the .NET Framework and the highly-productive Visual Studio .NET integrated development system to rapidly develop custom solutions, or can choose from the full range of off-the-shelf Windows applications. Similarly, existing Windows-based solutions can be moved to fault-tolerant servers without modification, enabling companies to increase availability levels with only an investment in new hardware.

Ease of Integration

With native support for industry standards such as XML Web services, the Microsoft platform and .NET technologies make it easy to integrate Windows-based solutions running on fault-tolerant servers with other systems. Microsoft BizTalk Server extends these capabilities even further, with more than 300 plug-in BizTalk Adapters available to simplify enterprise application integration and enable companies to comply with industry-specific electronic transaction formats such as HIPAA or EDI.

Ease of Management

Windows-based solutions running on fault-tolerant servers can be easily administered using the comprehensive management tools provided in the Microsoft platform. For example, Microsoft Operations Manager enables companies to subject applications running on Windows-based servers to granular, real-time monitoring, enabling administrators to detect many problems before they can affect system availability. Because Stratus and NEC expose their servers’ self-monitoring capabilities through Windows Management Instrumentation (WMI), the management tools provided in the Microsoft platform can be used to monitor hardware status as well.
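
As a minimal illustration of monitoring through WMI, the Python sketch below polls a few standard WMI classes using the third-party wmi package; the specific classes through which Stratus and NEC expose hardware status are vendor-specific and are not shown, so treat this as an assumption-laden sketch rather than vendor documentation.

```python
# Requires the third-party "wmi" package (pip install wmi) on a Windows host.
import wmi

def check_server_health() -> None:
    """Poll a few standard WMI classes for basic availability-related signals."""
    c = wmi.WMI()

    # Operating system uptime and memory pressure.
    for os_info in c.Win32_OperatingSystem():
        print("Last boot:", os_info.LastBootUpTime)
        print("Free physical memory (KB):", os_info.FreePhysicalMemory)

    # Logical disks running low on space are a common precursor to outages.
    for disk in c.Win32_LogicalDisk(DriveType=3):
        free_gb = int(disk.FreeSpace) / 1024 ** 3
        print(f"Drive {disk.DeviceID}: {free_gb:.1f} GB free")

if __name__ == "__main__":
    check_server_health()
```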

Lower Hardware Costs

Fault-tolerant servers for Windows are available starting at under $20,000—a fraction of the typical $200,000-plus starting price for proprietary fault-tolerant platforms. Combined with the superior cost-effectiveness of the Microsoft platform, this order-of-magnitude decrease in hardware costs makes fault-tolerance on Windows economically justifiable in a far broader range of situations than fault-tolerance on proprietary platforms.

Greater Price-Performance

Not only will companies adopting fault-tolerance on Windows realize significant price-performance advantages over proprietary fault-tolerant systems today, but they can expect to see this difference continue to increase over time. Fault-tolerant servers for Windows rely on the same industry-leading processors as other high-end Intel-based servers, so their price-performance will continue to be driven to new heights by the massive economies of scale for Intel-based microprocessors and Intel’s multibillion-dollar annual research budget.

Superior Return on Investment

Due to the above benefits, fault-tolerant solutions on Windows typically carry a far lower total cost of ownership than solutions built on other fault-tolerant platforms. Companies switching from proprietary fault-tolerant solutions to fault tolerance on Windows can reduce costs without compromising availability. Similarly, companies in industries with lower costs of downtime, such as manufacturing, can now improve the availability of mission-critical systems and still realize a positive return on investment in a reasonable timeframe.

Recommendations

Fault Tolerance and Clustering: The Ultimate Solution

Systems architected with both software-based failover clustering4 and hardware-based fault-tolerance provide the highest achievable levels of availability. This powerful combination extends the benefits of fault-tolerance on Windows from the server and operating system level to the application level, an absolute necessity in maximizing service availability for mission-critical solutions. The combination of fault-tolerance and clustering accomplishes this by minimizing downtime in two additional situations:

Application failures. With clustering, failure of an application on one node of a cluster results in its workload being transferred to an identical copy of the application running on another node in the cluster, with minimal impact on application availability. Once the cause of failure on the primary node is identified and repaired, the cluster is “failed back” and the workload is returned to the primary node.

Although software-based failover clustering does require a small amount of time to remount storage, rebuild application state, and so on, this can be accomplished in as few as 10 seconds and is done automatically. As such, protection from application failure through clustering results in significantly less downtime than the time required to bring a failed application back online in a non-clustered environment (e.g., waiting for someone to notice or be notified that an application has crashed, followed by the time it takes to reboot the server, remount its storage, and restart the application).

Hardware/software maintenance and upgrades. Failover clustering provides the capability to perform “rolling upgrades”, a technique that enables system administrators to perform operating system upgrades, install service packs or security patches, or even move to newer, more powerful servers, all with minimal impact on a solution’s overall service availability.

In this scenario, desired changes are first made to the “passive” node of the cluster, after which the cluster is manually failed over. Maintenance is then performed on the primary node, and the cluster is failed back to the primary node. This same technique also provides a safety net in case a software upgrade does not perform as expected upon failover; the cluster can immediately be failed back to the primary (unmodified) node while the secondary node is restored to its previous state.
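
Conceptually, a rolling upgrade is an ordered sequence of maintenance steps and failovers. The sketch below expresses that sequence in Python against hypothetical cluster-control callables (apply_updates, move_workload_to, and verify_health are illustrative placeholders, not Microsoft Cluster Service commands).

```python
def rolling_upgrade(primary: str, secondary: str,
                    move_workload_to, apply_updates, verify_health) -> None:
    """Upgrade both nodes of a two-node cluster with minimal service interruption.

    The callables stand in for whatever cluster-administration mechanism
    an organization actually uses; they are placeholders, not a real API.
    """
    # 1. Patch the passive node first; the running workload is untouched.
    apply_updates(secondary)

    # 2. Fail over so the freshly patched node carries the workload.
    move_workload_to(secondary)
    if not verify_health(secondary):
        # Safety net: fall back to the unmodified primary immediately.
        move_workload_to(primary)
        raise RuntimeError(f"Upgrade on {secondary} failed verification; rolled back")

    # 3. Patch the now-idle primary, then fail back to restore the original roles.
    apply_updates(primary)
    move_workload_to(primary)

# Example wiring with no-op placeholders:
rolling_upgrade("NODE-A", "NODE-B",
                move_workload_to=lambda node: print(f"workload now on {node}"),
                apply_updates=lambda node: print(f"patching {node}"),
                verify_health=lambda node: True)
```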

System Sizing

The Stratus ftServer 6500 and NEC Express5800/340 are the largest Windows-based fault-tolerant servers currently on the market, each a four-way system based on 2.0 GHz Intel Xeon MP processors with 2 MB of internal Level 3 cache. While these configurations provide ample processing power for the large majority of server-based applications, companies should still perform careful system sizing, including expected growth in workload over the solution’s lifecycle, to ensure that performance needs will continue to be met.
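
Sizing for growth is largely compound-growth arithmetic. The sketch below (illustrative numbers, not vendor sizing data) projects peak workload over a five-year lifecycle and compares it against a server's rated capacity with a safety margin.

```python
def projected_peak_load(current_peak: float, annual_growth_rate: float, years: int) -> float:
    """Peak workload (e.g., transactions/second) after compound annual growth."""
    return current_peak * (1.0 + annual_growth_rate) ** years

def fits_on_server(current_peak: float, annual_growth_rate: float, years: int,
                   rated_capacity: float, headroom: float = 0.30) -> bool:
    """True if projected peak load stays under rated capacity minus a safety margin."""
    return projected_peak_load(current_peak, annual_growth_rate, years) <= rated_capacity * (1.0 - headroom)

# Example: 400 tps today, 20% annual growth, 5-year lifecycle, server rated for 1,500 tps.
print(projected_peak_load(400, 0.20, 5))   # ~995 tps at end of life
print(fits_on_server(400, 0.20, 5, 1500))  # True: ~995 <= 1,050 (70% of rated capacity)
```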

Disaster Recovery

To protect against catastrophic events such as the failure of an entire datacenter, companies often use some form of software- or hardware-based replication to keep two geographically separated facilities synchronized. Depending on how this is implemented, a period of service interruption, similar to that of a cluster failover, can result while a redundant datacenter is brought online. If continued availability in these situations is required, companies may want to consider Marathon Technologies’ Assured Availability SplitSite solution, which consists of redundant server “halves” that can be connected over a distance of up to 55 kilometers with single-mode fiber-optic cable.

Justifying the Added Costs

Companies should evaluate their total cost of ownership over a five-year period to determine the value that fault-tolerance can provide. For example, in a manufacturing scenario with an existing availability level of three nines, the cost to achieve five nines of availability is likely to be far less than the estimated $277,200 reduction in downtime-related costs over the solution’s five-year lifecycle.5 This calculation assumes that the system operates 8 hours per day, Monday through Friday; if it must instead be available 24x7x365, the savings from reduced downtime increase to approximately $1.2 million.
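
The figures above follow from the arithmetic detailed in footnote 5. The Python sketch below reproduces both the 2,000-hour (8x5) case and the 24x7x365 case using the $28,000/hour manufacturing cost from Table 2.

```python
def five_year_downtime_savings(operating_hours_per_year: float,
                               hourly_cost: float,
                               availability_before: float,
                               availability_after: float,
                               lifecycle_years: int = 5) -> float:
    """Downtime cost avoided over the solution lifecycle by raising availability."""
    cost_before = operating_hours_per_year * (1 - availability_before) * hourly_cost
    cost_after = operating_hours_per_year * (1 - availability_after) * hourly_cost
    return (cost_before - cost_after) * lifecycle_years

# 8 hours/day, 250 days/year (2,000 hours), three nines -> five nines.
print(five_year_downtime_savings(2_000, 28_000, 0.999, 0.99999))  # ~277,200
# 24x7x365 operation (8,760 hours/year), same availability improvement.
print(five_year_downtime_savings(8_760, 28_000, 0.999, 0.99999))  # ~1,214,000 (about $1.2 million)
```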

Covering All the Bases

Although this whitepaper focuses primarily on the technical causes of downtime, people- and process-related incidents account for a far greater percentage of unplanned downtime than software and hardware failures combined. Where this is the case, funds earmarked to increase availability may be better spent on improving an organization’s people and processes (e.g., more training for system administrators, better change control procedures, real-time system monitoring, and more thorough testing of applications).

Microsoft offers guidance in all three of these areas (people, process, and technology) and across all phases of the solution lifecycle: vision, scope, requirements definition, implementation, rollout, and day-to-day operations. Some of the key resources that can help companies achieve mission-critical levels of availability, manageability, security, and scalability for Windows-based solutions include:

  • Microsoft Solutions Framework (MSF). MSF covers the “plan and build” phases of implementing solutions based on the Microsoft platform. Microsoft collects best practices from its product developers as well as its worldwide network of consultants, customers, and partners, analyzes them for repeatable success factors, and integrates those factors into MSF principles and practices for use by Microsoft Consulting Services (MCS), partners, and customers.

  • Microsoft Operations Framework. Designed to help companies operate, manage, and optimize Windows-based solutions, the Microsoft Operations Framework (MOF) is based on the IT Infrastructure Library (ITIL) from Britain’s Central Computer and Telecommunications Agency, an agency chartered with development of IT-related best practices. MOF combines the collaborative industry standards and best practices identified by ITIL with specific guidelines for using Microsoft products and technologies.

  • Microsoft Systems Architecture. The Microsoft Systems Architecture (MSA) program provides standardized architectures for enterprise-class, Windows-based solutions. Tested in Microsoft labs and optimized for the Microsoft platform, MSA configurations scale from departmental through enterprise to internet data centers, enabling companies building solutions of all sizes to benefit from rapid implementations, predictable costs, reduced risk, and faster time to benefit.

Conclusion

When combined, fault-tolerance and software-based clustering provide a very powerful set of tools for achieving mission-critical availability. However, companies looking to minimize downtime and its associated costs need to remember that no amount of technology can make up for lack of experience, improper training, or poor processes.

As such, every organization needs to determine its own cost of downtime and examine the reasons for this downtime, assessing strengths and weaknesses across all three components of the high-availability equation: people, process, and technology. Only after this is done can the proper course of action be determined and the costs to achieve higher availability weighed against the consequences of not doing so.

Fortunately, Windows-based fault-tolerant solutions carry far lower costs than proprietary solutions, enabling companies in all industries to achieve a positive ROI in a reasonable timeframe across a much broader range of scenarios. And because support for fault-tolerant servers in Windows 2000 Advanced Server is implemented in a manner that is transparent to applications, companies embracing fault-tolerance on Windows as a means of achieving mission-critical availability can expect to realize all the other benefits inherent to the Microsoft platform, including reduced time to market, ease of integration, and simplified management.

For More Information

Fault-Tolerant Server Vendors More information on fault-tolerant solutions for Windows may be found at:

Case Studies Examples of how companies across a broad range of industries are benefiting from fault-tolerance on Windows may be found at:

Failover Clustering Information on the clustering capabilities in Windows 2000 Advanced Server and Windows 2000 Datacenter Server may be found at:

Training and Certification Extensive training and certification resources are available to ensure that IT professionals have the skills required to build, deploy, and maintain highly available Windows-based solutions.

Microsoft Operations Framework (MOF)

Microsoft Solutions Framework (MSF)

Microsoft Systems Architecture (MSA)

1 Source: Contingency Planning Research, 1996. © Eagle Rock Alliance, LTD. All Rights Reserved. More information on Eagle Rock Alliance may be found at https://www.eaglerockalliance.com.
2 Based on unit shipments for new servers for the 12 months ending November 2002.
3 For additional details on this measurement, go to: https://www.stratus.com/uptime/ftserver.htm.
4 Failover clustering on Windows uses the Microsoft Cluster Service (MSCS)—a software-based mechanism included in Windows 2000 Advanced Server and Windows 2000 Datacenter Server—to transfer the workload from a failed server to another server in the cluster. Windows 2000 Advanced Server supports 2-node clusters, and Windows 2000 Datacenter Server supports up to 4-node clusters.
5 Calculated as follows, using the $28,000/hour cost of downtime from Table 2 and assuming 250 operating days per year at 8 hours per day (2,000 hours per year). Cost of downtime at 99.9% availability: 2,000 hours/year x 0.1% x $28,000/hour = $56,000. Cost of downtime at 99.999% availability: 2,000 hours/year x 0.001% x $28,000/hour = $560. Total savings = ($56,000/year - $560/year) x 5 years = $277,200.