Cluster Strategy: High Availability and Scalability with Industry-Standard Hardware
Archived content. No warranty is made as to technical accuracy. Content may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist.
This paper explains Microsoft's vision for enhancing Microsoft® Windows NT® Server, Enterprise Edition and the Microsoft BackOffice™ family through clustering to provide greater availability, scalability, and manageability. Clustering technology, which is a standard service of Windows NT Server, Enterprise Edition, brings data-center capabilities and performance to a wider range of customer installations. Windows NT Server, Enterprise Edition clustering services will combine the ease of configuring and maintaining Windows® operating systems with the economy of using industry-standard hardware. In addition to clustering, Windows NT Server, Enterprise Edition is differentiated by its extraordinary scalability, supporting more than 4-way SMP servers and the aggressive use of large physical memories by large-memory-aware applications such as Microsoft SQL Server, Enterprise Edition. These features are beyond the scope of this paper, but they are also important reasons for selecting Windows NT Server, Enterprise Edition for mission-critical, headroom-demanding applications.
Microsoft is committed to providing a computing platform specifically designed for today's enterprise business applications, while anticipating the needs of the most demanding information systems of tomorrow. To meet these diverse requirements, Microsoft is developing systems with leading technology vendors that provide increased availability and scalability of Information Technology (IT) services through clustering technology.
Many enterprise customers now use clustering technology to provide greater availability and scalability for their high-end mission-critical applications such as customer order entry. However, these clustering solutions are complex, difficult to configure, and use expensive proprietary hardware. Microsoft and industry partners are working to bring the benefits of clustering technology to mainstream client/server computing. Microsoft is developing clustering services for Microsoft® Windows NT® Server, Enterprise Edition, using open specifications, industry-standard hardware, and the ease-of-use customers have come to expect from Microsoft products.
Microsoft is delivering the Clustering Service in two phases. The first phase, delivered in October 1997, allows one server to automatically fail over to another server, creating a high-availability Windows NT Server environment. The second phase will extend clustering by unifying Windows NT Server and cluster administration, and adding support for more than two nodes to the features offered in Phase 1. It will also provide the infrastructure for building enterprise level, distributed, and scalable applications. Clustering services are a feature of Windows NT Server, Enterprise Edition 4.0 and are one of the major reasons for the increasing deployment of this product in the marketplace.
In broad terms, a cluster is a group of independent systems working together as a single system. A client interacts with a cluster as though it is a single server. Cluster configurations are used to address both availability and scalability:
Availability: When a system in the cluster fails, the cluster software responds by dispersing the work from the failed system to the remaining systems in the cluster.
Scalability: When the overall load exceeds the capabilities of the systems in the cluster, additional systems may be added to the cluster. At the present time, customers who plan to expand their system's capacity must make up-front commitments to expensive, high-end servers that provide space for additional CPUs, drives, and memory. Using clustering technology, customers will be able to incrementally add smaller, standard systems as needed to meet overall processing power requirements.
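The availability behavior described above can be illustrated with a small sketch (Python is used purely for illustration; the node and resource-group names are hypothetical, and real cluster software makes the reassignment decision with far more policy input):

```python
# Minimal failover sketch: when a node fails, the resource groups it
# owned are reassigned to the surviving nodes (names are hypothetical).

def fail_over(nodes, owner_of, failed):
    """Reassign every resource group owned by `failed` to the
    remaining nodes in round-robin order; other groups keep
    their current owner."""
    survivors = [n for n in nodes if n != failed]
    if not survivors:
        raise RuntimeError("no surviving nodes in cluster")
    moved = {}
    i = 0
    for group, owner in owner_of.items():
        if owner == failed:
            moved[group] = survivors[i % len(survivors)]
            i += 1
        else:
            moved[group] = owner
    return moved

# Two-node cluster (Phase 1 style): everything on NodeA moves to NodeB.
owners = {"SQL": "NodeA", "FileShare": "NodeA", "Print": "NodeB"}
print(fail_over(["NodeA", "NodeB"], owners, failed="NodeA"))
# → {'SQL': 'NodeB', 'FileShare': 'NodeB', 'Print': 'NodeB'}
```

With more than two nodes (the Phase 2 case), the same routine disperses the failed node's groups across all survivors rather than onto a single standby.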
Why Clustering is Important
It is estimated that system downtime costs U.S. businesses $4.0 billion per year1. The average downtime event results in a $140,000 loss in the retail industry and a $450,000 loss in the securities industry2. Clustering promises to minimize downtime by providing an architecture that keeps systems running in the event of a system failure. Clustering also gives organizations the ability to combine separate servers into a single computing facility, providing flexibility for future installation growth. In addition to providing failover protection for unplanned outages, the clustering services in Windows NT Server, Enterprise Edition help customers avoid planned downtime: one node can be taken offline for hardware and software upgrades or testing while the remaining node continues to provide services.
Recent research by the Business Research Group indicates that only 8% of the systems customers define as mission-critical are 8-hours-a-day systems. Very high degrees of system availability, for both internal users and external stakeholders, are becoming the rule, not the exception.
Clustering Example: Data Availability in the Retail Industry
The point-of-sale system is the heartbeat of any retail operation. It provides critical ongoing access to the store's database of products, codes, names, and prices. If the point-of-sale system fails, cashiers cannot log sales and the operation loses money, customers, and its reputation for quality service.
In this case, clustering technology can deliver high system availability. The clustering solution would allow a pair of servers to access the multi-port storage devices (Disk Array) on which the database resides. In the event of a server failure3 on Server 1, the workload is automatically moved to Server 2 and end-users are switched over to the new server—with no operator intervention and minimal downtime. Note that the disk array itself can be protected by the fault-tolerant disk technology built into Windows NT Server. The addition of clustering technology means that the overall system remains online.
Clustering Example: Scalability in the Financial Services Industry
It has been said that a chief information officer's two greatest fears are system success and system failure. If a system fails, the CIO's staff is inundated with complaints. Conversely, if a system is successful, usage demands tend to outstrip the capacity of the system as it grows. Windows NT Server, Enterprise Edition helps address both of these issues.
Clustering can greatly reduce system downtime. In addition, clustering and SMP can help Information Technology (IT) departments design systems that grow with the demands of an organization.
For example, billions of dollars have been poured into mutual funds companies in recent years. Although this type of growth is positive in financial terms, the technological burden of managing the corresponding information systems growth can be overwhelming. As a result, chief information officers and their staffs must develop systems that not only meet current system demand, but also provide for future system growth. Formerly, the system choices were rather limited: extremely expensive mainframes and minicomputers.
Windows NT Server clustering can provide a competitive advantage for the IT department. This technology allows faster system deployment, automatic re-tasking, and easier maintenance with smaller staffs—all while using inexpensive PC components. These components are available from multiple sources, which not only ensures competitive pricing, but also ensures parts availability. Consequently, IT departments can incrementally expand their hardware without the burden of single-supplier shortages.
Clustering technology also gives IT departments greater flexibility. Multiple servers can be tied together in one system, and additional servers can be integrated into the system as usage requirements dictate. Windows NT Server clustering gives a choice to system architects that they have never enjoyed before—availability and scalability on inexpensive, mainstream platforms.
Traditional Architectures for High-Availability
Several types of architectures are now used to increase availability of computer systems. A duplicate hardware system with fully replicated components is the traditional hardware structure for achieving high availability. The traditional software model for utilizing this hardware is one system that runs the application while the other sits idle, acting as a standby ready to take over if the primary system fails. The drawbacks of this approach include increased hardware costs with no improvement in system throughput and the lack of protection from intermittent application failures.
Traditional Architectures for Scalability
Several different architectures are used to enhance scalability. One hardware structure for achieving scalability beyond a single processor is the symmetric multiprocessor (SMP) system. In an SMP system, several processors share a global memory and I/O subsystem. The traditional SMP software model, known as the shared memory model, runs a single copy of the operating system with application processes running as if they were on a single-processor system. If applications that do not share data are run on an SMP system, such a system provides high scalability. Windows NT Server, Enterprise Edition supports SMP scalability, with excellent scaling through 8-way SMP in many workloads, and through 32-way SMP in some workloads.
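The shared memory model can be shown in miniature: several threads of one program operate on a single global datum, serializing their access with a lock, much as application processes on an SMP system coordinate access to shared memory (a toy illustration, not SMP-specific code):

```python
# Shared-memory model in miniature: several threads update one global
# counter; a lock serializes access to the shared datum, analogous to
# processes on an SMP system synchronizing access to shared memory.
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:  # serialize access to the shared datum
            counter += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # → 4000: every worker sees the same memory
```

The convenience of this model is exactly its drawback at scale: all work funnels through one shared memory and its synchronization, which is the bottleneck discussed next.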
The major drawbacks to SMP systems at the hardware level are physical limitations in bus and memory speed that are expensive to overcome. As microprocessor speeds increase, shared memory multiprocessors become increasingly expensive. Today there are large price steps as a customer scales from one processor to two to four processors, and especially when scaling beyond eight processors.
Finally, neither the SMP hardware structure nor its traditional software model provides inherent availability benefits over single-processor systems.
Only one architecture has proven advantages for availability and scalability in business-critical computing applications: the cluster.
A cluster is a set of loosely coupled, independent computer systems that behave as a single system. Some of these nodes can be, and frequently are, SMP systems. Client applications interact with a cluster as if it is a single high-performance, highly reliable server. System managers view a cluster much as they see a single server. Cluster technology is readily adaptable to low-cost, industry-standard computer technology and interconnects.
Clustering can take many forms. A cluster may be nothing more than a set of standard personal computers interconnected by Ethernet. At the other end of the spectrum, the hardware structure may consist of high-performance SMP systems connected via a high-performance communications and I/O bus. In both cases, processing power can be increased in small incremental steps by adding another commodity system. To a client application, the cluster provides the illusion of a single server, or single-system image, even though it may be composed of many systems.
Additional systems can be added to the cluster as needed to process more complex or an increasing number of requests from the clients. If one system in a cluster fails, its workload can be automatically dispersed among the remaining systems. This transfer is frequently transparent to the client.
Shared Disk Model
Two principal software models are used in clustering today: shared disk and shared nothing. In the shared disk model, software running on any system in the cluster may access any resource (e.g., a disk) connected to any system in the cluster. If two systems need to see the same data, the data must either be read twice from the disk or copied from one system to another. As in an SMP system, the application must synchronize and serialize its access to shared data. Typically a Distributed Lock Manager (DLM) is used to help with this synchronization. A DLM is a service provided to applications that tracks references to resources throughout the cluster. If more than one system attempts to reference a single resource, the DLM will recognize and resolve the potential conflict. DLM coordination, however, may cause additional message traffic and reduce performance because of the associated serialized access to additional systems. One approach to reducing these problems is the "shared nothing" software model.
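The core behavior of a DLM, granting a resource to one requester and queuing conflicting requests until the holder releases, can be sketched as follows (a hypothetical, greatly simplified model: exclusive locks only, where real DLMs also support shared/read modes and distributed operation):

```python
# Toy lock manager sketch: exclusive locks only. The first system to
# request a resource is granted it; later requests queue until the
# current holder releases the resource.
from collections import defaultdict, deque

class ToyDLM:
    def __init__(self):
        self.holder = {}                   # resource -> system holding it
        self.waiters = defaultdict(deque)  # resource -> queued systems

    def acquire(self, system, resource):
        if resource not in self.holder:
            self.holder[resource] = system
            return "granted"
        self.waiters[resource].append(system)  # conflict: queue the request
        return "queued"

    def release(self, system, resource):
        assert self.holder.get(resource) == system
        if self.waiters[resource]:
            # hand the resource to the next waiter
            self.holder[resource] = self.waiters[resource].popleft()
        else:
            del self.holder[resource]

dlm = ToyDLM()
print(dlm.acquire("Server1", "disk0"))  # → granted
print(dlm.acquire("Server2", "disk0"))  # → queued: conflict resolved by waiting
dlm.release("Server1", "disk0")
print(dlm.holder["disk0"])              # → Server2
```

Even in this toy form, the cost is visible: every conflicting access generates extra coordination traffic and serializes the requesters, which is precisely the overhead the shared nothing model avoids.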
Shared Nothing Model
In the shared nothing software model, each system within the cluster owns a subset of the cluster's resources. Only one system may own and access a particular resource at a time, although, on a failure, another dynamically determined system may take ownership of the resource. In addition, requests from clients are automatically routed to the system that owns the resource.
For example, if a client request requires access to resources owned by multiple systems, one system is chosen to host the request. The host system analyzes the client request and ships subrequests to the appropriate systems. Each system executes the subrequest and returns only the required response to the host system. The host system assembles a final response and sends it to the client.
A single request on the host system describes a high-level function (such as retrieving multiple data records) that may generate a great deal of local system activity (such as multiple disk reads); the associated traffic does not appear on the cluster interconnect until the final desired data is found. By using an application that is distributed over multiple clustered systems, such as a database, overall system performance is not limited by a single computer's hardware limitations.
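The host/subrequest flow described above can be sketched as follows (the ownership map, node names, and record layout are all hypothetical; a real shared nothing database would ship SQL subrequests over the cluster interconnect rather than call a local function):

```python
# Shared-nothing request-shipping sketch: the host splits a query by
# resource owner, "ships" each subrequest to the owning node, and
# assembles the responses into one answer for the client.

OWNER = {"accounts": "NodeA", "orders": "NodeB"}   # who owns which table
DATA = {
    "NodeA": {"accounts": [("alice", 100), ("bob", 250)]},
    "NodeB": {"orders": [("alice", "o-1"), ("carol", "o-2")]},
}

def execute_subrequest(node, table, key):
    # Runs on the owning node: only the matching rows would cross the
    # interconnect, not the raw disk traffic needed to find them.
    return [row for row in DATA[node][table] if row[0] == key]

def host_query(tables, key):
    result = {}
    for table in tables:
        node = OWNER[table]  # route the subrequest to the owner
        result[table] = execute_subrequest(node, table, key)
    return result

print(host_query(["accounts", "orders"], "alice"))
# → {'accounts': [('alice', 100)], 'orders': [('alice', 'o-1')]}
```

Note that each node answers only for the partition it owns, so no cross-node locking is needed; adding a node and assigning it a partition adds capacity without adding coordination.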
The shared disk and shared nothing models can be supported within the same cluster. Some software can most easily exploit the capabilities of the cluster through the shared disk model. This software includes applications and services that require only modest (and read-intensive) shared access to data, as well as applications or workloads that are very difficult to partition. Applications that require maximum scalability should use the cluster's shared nothing support.
Cluster Application Servers
While clusters can bring availability and scalability to most server-based software, cluster-aware applications can take full advantage of the environment's benefits. Database server software must be enhanced either to coordinate access to shared data in a shared disk cluster, or to partition a SQL request into a set of sub-requests in a shared nothing cluster. In a shared nothing cluster, the database server may want to take further advantage of the partitioned or replicated data by making intelligent, parallel queries for execution across the cluster. Server applications may take advantage of cluster load balancing by dynamically distributing the application load among all cluster members. The application server software may also be enhanced to detect component failures and initiate fast recovery via cluster APIs.
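The load-balancing idea mentioned above, dynamically distributing application work among all cluster members, can be sketched with a simple least-loaded policy (node names and load figures are hypothetical; real cluster-aware applications would use richer load metrics):

```python
# Dynamic load-balancing sketch: each new unit of application work is
# assigned to the least-loaded cluster member at that moment.

def assign(load):
    """Return the member with the lowest current load
    (ties broken by dictionary order)."""
    return min(load, key=load.get)

load = {"Node1": 0, "Node2": 0, "Node3": 0}
placement = []
for job_cost in [5, 3, 4, 2, 6]:
    node = assign(load)
    load[node] += job_cost
    placement.append(node)

print(placement)  # → ['Node1', 'Node2', 'Node3', 'Node2', 'Node3']
print(load)       # → {'Node1': 5, 'Node2': 5, 'Node3': 10}
```

The same policy naturally absorbs membership changes: a failed member simply drops out of the load table, and a newly added member starts at zero load and attracts the next work.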
Microsoft Windows NT Server Clusters
Windows NT Server Today
Windows NT Server provides all the components necessary to support mission-critical applications. The system is built on a fully 32-bit microkernel foundation. It is multithreaded, offers preemptive multitasking, and provides memory protection for both applications and the operating system itself. It scales to run on hardware with up to 32 processors, 4 GB of RAM, and 17 million terabytes of disk space. In addition, Windows NT Server supports the Intel x86/IA-32 architecture and the Compaq Alpha RISC family of microprocessors.
Microsoft's vision is to enhance the Windows NT Server platform to support clustering for a broad base of customers who will benefit from a cost-effective method of delivering increased availability and scalability. Microsoft believes the following factors must be provided to enable broad market acceptance:
Industry standard APIs: Microsoft, in conjunction with technology partners, will work to establish industry standards for clustering application programming interfaces (APIs). The cluster APIs will expose specific cluster features for use in developing high-availability applications and, in the future, more scalable applications. File, print, and database servers, transaction processing monitors, and other software will be able to use the cluster APIs to fully exploit the capabilities of the Windows NT Server cluster.
Industry standard hardware: Windows NT Server clusters will take advantage of today's industry-standard PC platforms and existing network technology. The Windows NT Server layered driver model will allow Microsoft to quickly add support for special purpose, high-performance clustering technology (such as low-latency interconnects) as hardware vendors bring solutions to market.
Programming environment and programming model for enterprise-level applications: Clustering Service will provide the infrastructure and a simple, widely accepted programming model for building the next generation of enterprise level distributed and scalable applications.
Server application support: The Microsoft BackOffice® family of products will be enhanced to use the clustering API and take full advantage of the scalability and availability characteristics of clusters. Of course, Microsoft will encourage other vendors to exploit Windows NT Server cluster services.
Cluster enhancement without user disruption: Because Windows NT Server already implements a cluster-compatible security and user administration model, businesses can easily add clustering to a current Windows NT Server installation without user disruption. In addition, cluster administration will be exposed through enhancements to existing Windows NT Server administration.
Ease of configuration and maintenance: Clusters must be simple to configure and maintain with nondedicated support staff. Windows NT Server clustering will take advantage of the existing central management capabilities of Windows NT Server. Once a Windows NT Server cluster is installed, cluster management will be performed with a series of graphical cluster and network management tools.
Support for rolling upgrades: Microsoft is committed to providing smooth upgrades between releases of its products, and whenever possible will use clustering services to provide seamless rolling upgrades between releases of Microsoft server software on Windows NT Server, Enterprise Edition clusters.
Windows NT Server Clusters
Windows NT Server already contains many of the basic components for constructing a clustered system, including:
Single-logon capability inherent in Windows NT Directory Services
Multisystem monitoring capability of administration tools and the Windows NT Performance Monitor
The ability to route requests via the redirector
Windows NT Server cluster enhancements represent a spectrum of technologies that will continue to be phased into the Windows NT Server and BackOffice products over time. Microsoft has prioritized additional clustering features based on customer requirements.
A Two-Phased Approach
Deploying Cluster Technology
Microsoft is developing cluster application programming interfaces (APIs) that allow applications to take advantage of Windows NT Server in a clustered environment. The company will deliver clustering products in two phases:
Phase 1: Fail-Over Solution
A fail-over solution improves data availability by allowing two servers to share the same hard disks within a cluster. When a system in the cluster fails, the cluster software will recover and disperse the work from the failed system to another server within the cluster. As a result, the failure of a system in the cluster will not affect the other systems, and in most cases, the client applications will be completely unaware of the failure. This means high server availability for the users. Phase 1 clusters became available in 1997 as a standard part of Windows NT Server, Enterprise Edition 4.0. Phase 1 clustering will be significantly enhanced in Windows NT Server, Enterprise Edition 5.0.
Phase 2: Multiple Node Solution
Phase 2 will enable more than two servers to be connected together for higher performance and reliability. As a result, when the overall load exceeds the capabilities of the systems in the cluster, additional systems may be added to the cluster and the load will be dynamically redistributed. This incremental growth enables customers to easily add processing power as needed. The maximum number of servers in a Phase 2 cluster will be determined based on what customers require, but Microsoft's nominal goal is to support clusters with up to 16 servers in the initial Phase 2 development.
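The incremental-growth idea can be illustrated with a toy partitioning scheme: client requests are spread across the current members by hashing a client key, so that adding a server enlarges the set the hash space maps onto and part of the load moves to the new member (the hash function and server names are purely illustrative; a real cluster would use a redistribution scheme that minimizes how much work moves):

```python
# Incremental-growth sketch: requests are partitioned across the current
# members by hashing the client key; when a server is added, the hash
# space spreads over the larger set and some load lands on the new member.

def owner(key, servers):
    """Toy modulo-hash partitioning: map a client key to one server."""
    return servers[sum(key.encode()) % len(servers)]

keys = [f"client-{i}" for i in range(12)]
two = [owner(k, ["S1", "S2"]) for k in keys]            # two-node cluster
three = [owner(k, ["S1", "S2", "S3"]) for k in keys]    # after adding S3

print(sorted(set(three) - set(two)))  # → ['S3']: the new server carries work
```

Simple modulo hashing moves more keys than strictly necessary when membership changes; schemes that minimize such movement are a well-known refinement, but the point here is only that capacity grows by adding a member, not by replacing the whole system.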
Larger clusters will require more advanced disk connection and higher performance intracluster communications. Microsoft will work with the industry to support evolving disk standards such as Fibre Channel Arbitrated Loop that simplify configuration of large clusters. For intracluster communications, Phase 2 clustering services will include interconnect drivers based on the Virtual Interface (VI) Architecture.
At the heart of Phase 2 clustering will be new services that simplify the creation of highly scalable, cluster-aware applications that run in parallel on multiple servers in a cluster. These services will be based on a shared nothing development model that partitions workload across available servers. They will include low-level utilities such as input/output shipping and Distributed Message Passing (DMP), plus component-based services that exploit Transaction Services and COM+. Cluster developers will use the Windows NT Server 5.0 Active Directory as their cluster name service.
Cluster features will be integrated into existing Windows NT Server system management tools to enable system administrators who are already familiar with Windows NT Server systems to easily set up and configure their clusters. The initial product provides base operating system support for clusters, including components to configure, maintain, and monitor membership in the cluster, support for a cluster-wide name-space, communication, and fail-over support. Additional services will support the two primary cluster software models. As with all of the distributed services provided by Windows NT Server, ease of set up and cluster management tools will be a very high priority.
The Microsoft BackOffice family of products, such as Microsoft SQL Server™ and Microsoft Exchange Server, has already been enhanced to support the cluster's fail-over capability. The cluster-aware versions of these key BackOffice products are Microsoft SQL Server, Enterprise Edition 6.5 and Microsoft Exchange Server, Enterprise Edition 5.5. Both take advantage of failover services from clustered systems. SQL Server, Enterprise Edition 6.5 takes particularly aggressive advantage of clustering, allowing both cluster nodes to run SQL Server simultaneously against different databases. An example would be a cluster where one node is dedicated to OLTP, and the second node manages an extract of this data and other data as a data warehouse. In the event that one system is unavailable for any reason, both databases can still be accessed, and two instances of SQL Server, Enterprise Edition 6.5 will run on the surviving node.
SQL Server also plans to support a partitioned data model and parallel execution to take full advantage of the shared nothing environment. Later releases of BackOffice will fully exploit the scalability aspects of cluster technology.
Cluster Application Development
Microsoft development tools will be enhanced to support the easy creation of cluster-aware applications. Facilities will be provided for automatic fail-over of applications. It is also important to note that not all server applications will need to be cluster-aware to take advantage of cluster benefits. Applications that build on top of cluster-aware core applications, such as large commercial database packages (such as an accounting or financial database application on top of SQL Server), will benefit automatically from cluster enhancements made to the underlying application (such as SQL Server). Many server applications that take advantage of database services, client/server connection interaction, Internet/intranet web publishing, and file and print services will benefit from clustering technology without application changes.
Microsoft will work with the computer hardware and software industry to deliver clustering for the Microsoft Windows NT Server network operating system and Microsoft BackOffice. Clustering technology will enable customers to connect a group of servers together to improve data availability/fault tolerance and performance, using industry-standard hardware components. The goal is to continue to build upon the strengths of Windows NT Server as an enterprise server, and to offer customers the greatest flexibility to design, develop, and implement systems for their most demanding future business needs.
For More Information
For the latest information on Windows NT Server, check out Microsoft TechNet or our World Wide Web site at http://www.microsoft.com/ntserver/ or the Windows NT Server Forum on the Microsoft Network (GO WORD: MSNTS).
1 FIND/SVP Strategic Research Division Report, 1992
3 Server failure due to hardware (CPU/motherboard, storage adapter, network card, etc.), application failure, or operator error