Geographically Dispersed Clusters

Applies To: Windows Server 2003, Windows Server 2003 R2, Windows Server 2003 with SP1, Windows Server 2003 with SP2

Q. Can Server clusters span multiple sites?

A. Yes. Server clusters support a single cluster spanning multiple sites. This is known as a geographically dispersed cluster. All qualified geographically dispersed cluster solutions appear on the Microsoft Hardware Compatibility List (HCL) (https://go.microsoft.com/fwlink/?LinkID=67738). Only cluster solutions listed on the HCL are supported by Microsoft.

Q. How is a geographically dispersed cluster defined?

A. A geographically dispersed cluster is a Server cluster that has the following attributes:

  1. Has multiple storage arrays, at least one deployed at each site. This ensures that in the event of failure of any one site, the other site(s) will have local copies of the data that they can use to continue to provide the services and applications.

  2. Nodes are connected to storage in such a way that in the event of a failure of either a site or the communication links between sites, the nodes on a given site can access the storage on that site. In other words, in a two-site configuration, the nodes in site A are connected directly to the storage in site A and the nodes in site B are connected directly to the storage in site B. The nodes in site A can continue without accessing the storage in site B, and vice versa.

  3. The storage fabric or host-based software provides a way to mirror or replicate data between the sites so that each site has a copy of the data. Different levels of consistency are available.

The following diagram shows a simple two-site cluster configuration.

[Diagram: a simple two-site cluster configuration]
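
As an illustration only, the first two attributes above can be expressed as a small configuration check. This is a minimal sketch under stated assumptions: the Site structure, the check_site function, and the node and array names are all hypothetical and are not part of any Server cluster tool. The third attribute (mirroring or replication between sites) depends on the storage vendor and is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """One site in the cluster. All names here are illustrative."""
    name: str
    storage_arrays: list = field(default_factory=list)
    # Maps each node to the storage arrays it is directly attached to.
    node_attachments: dict = field(default_factory=dict)


def check_site(site):
    """Return a list of problems; an empty list means the site passes."""
    problems = []
    # Attribute 1: at least one storage array is deployed at the site.
    if not site.storage_arrays:
        problems.append(f"{site.name}: no local storage array")
    # Attribute 2: every node can reach storage at its own site.
    local = set(site.storage_arrays)
    for node, arrays in site.node_attachments.items():
        if not local.intersection(arrays):
            problems.append(f"{site.name}: node {node} has no local storage path")
    return problems


site_a = Site("A", ["array-a1"], {"node1": ["array-a1"], "node2": ["array-a1"]})
# Misconfigured on purpose: node3 is attached only to storage at site A.
site_b = Site("B", ["array-b1"], {"node3": ["array-a1"]})

for site in (site_a, site_b):
    print(site.name, check_site(site) or "ok")
```

In this example site A passes, while site B is reported because its only node would lose access to storage if the link to site A failed.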

Q. Will geographically dispersed clusters give me disaster tolerance, or disaster recovery?

A. The goal of multi-site Server cluster configurations is to ensure that loss of one site in the solution does not cause a loss of the complete application for business continuance and disaster recovery purposes. Sites are typically up to a few hundred miles apart so that they have completely different power, different communications infrastructure providers, and are placed so that natural disasters (e.g., earthquakes) are extremely unlikely to take out more than one site.

Geographically dispersed clusters do not provide disaster tolerance, since, in some cases, manual intervention is required to restart the applications.

Q. Can a geographically dispersed cluster use asynchronous replication between sites?

A. Yes; however, there are several caveats:

  • The quorum data must be synchronously replicated between the sites. To ensure that the Server cluster guarantees of consistency are met, the cluster database must be kept consistent across all nodes. If the quorum disk is replicated across the sites, it MUST be replicated synchronously.

    In Windows Server 2003, a new quorum resource, Majority Node Set, can be used in these configurations as an alternative to replicating the quorum disk.

  • If data is replicated asynchronously, in the event of a site failure, the secondary site will be consistent, but out of date. You should check that the applications can handle going back in time and that the client experience makes sense for the business. Applications such as SQL Server can be hosted on asynchronous replication; however, there are a number of restrictions and warnings that you should be aware of (see the SQL Server high availability documentation for rules around multi-site replication of SQL Server data).

  • If data is replicated at the block level, the replication must preserve the order of writes to the secondary site. Failure to ensure this will lead to data corruption.
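
The difference between "consistent, but out of date" and corrupted can be shown with a toy model. The following is a minimal sketch, not a model of any real replication product: the disk is a dictionary of block numbers, and the write stream and block contents are invented for illustration. A secondary that has applied a prefix of the primary's writes in order always matches a state the primary once held. A secondary that applies the same writes in a different order can end up in a state the primary never held.

```python
def apply_writes(writes):
    """Apply (block_number, data) writes in the order given; return the disk."""
    disk = {}
    for block, data in writes:
        disk[block] = data
    return disk


# Hypothetical write stream issued at the primary site.
writes = [
    (7, "balance=100"),
    (3, "journal: balance=100"),
    (7, "balance=250"),
    (3, "journal: balance=250"),
]

# Every state the primary disk passed through, one per applied write.
history = [apply_writes(writes[:n]) for n in range(len(writes) + 1)]

# Order-preserving asynchronous replication that is two writes behind.
lagging = apply_writes(writes[:2])

# The same four writes delivered with the first and third swapped.
reordered = apply_writes([writes[2], writes[1], writes[0], writes[3]])

print(lagging in history)    # True: out of date, but a real past state
print(reordered in history)  # False: block 7 is stale while the journal says 250
```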

Q. Does Microsoft provide a complete end-to-end geographically dispersed cluster solution?

A. No. Microsoft does not provide a software mechanism for replicating application data from one site to another in a geographically dispersed cluster. Microsoft works with hardware and software vendors to provide a complete solution. All qualified geographically dispersed cluster solutions appear on the Microsoft Hardware Compatibility List (HCL) (https://go.microsoft.com/fwlink/?LinkID=67738). Only cluster solutions listed on the HCL are supported by Microsoft.

Q. What additional requirements are there on a Server cluster to support multiple sites?

A. The Microsoft server clustering software itself is unaware of the extended nature of geographically dispersed clusters. There are no special features in Server cluster in Windows 2000 or Windows Server 2003 that are specific to these kinds of configurations. The network and storage architectures used to build geographically dispersed clusters must preserve the semantics that the server cluster technology expects. Fundamentally, the network and storage architecture of geographically dispersed server clusters must meet the following requirements:

  1. The private and public network connections between cluster nodes must appear as a single, non-routed LAN (e.g., using technologies such as VLANs to ensure that all nodes in the cluster appear on the same IP subnets).

  2. The network connections must be able to provide a maximum guaranteed round-trip latency between nodes of no more than 500 milliseconds. The cluster uses heartbeats to detect whether a node is alive or not responding. These heartbeats are sent out on a periodic basis. If a node takes too long to respond to heartbeat packets, the cluster service starts a heavyweight protocol to figure out which nodes are really still alive and which ones are dead; this is known as a cluster re-group. The heartbeat interval is not a configurable parameter for the cluster service (there are many reasons for this, but the bottom line is that changing this parameter can have a significant impact on the stability of the cluster and the failover time). A 500 ms round trip is significantly below the threshold at which a re-group starts, so meeting this limit ensures that network latency alone does not trigger artificial re-group operations.

  3. Windows 2000 requires that a cluster have a single shared disk known as the quorum disk. The storage infrastructure can provide mirroring across the sites to make a set of disks appear to the cluster service as a single disk; however, it must preserve the fundamental semantics that are required by the physical disk resource:

    • Cluster service uses SCSI reserve commands and bus reset to arbitrate for and protect the shared disks. The semantics of these commands must be preserved across the sites even in the face of complete communication failures between sites. If a node on site A reserves a disk, nodes on site B should not be able to access the contents of the disk. These semantics are essential to avoid data corruption of cluster data and application data.

    • The quorum disk must be replicated in real-time, synchronous mode across all sites. The different members of a mirrored quorum disk MUST contain the same data.
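
The two network requirements above lend themselves to a simple pre-deployment sanity check. The following is a rough sketch only: the function names, node addresses, prefix length, and latency samples are all assumptions made for illustration. Sharing a subnet does not by itself prove that the LAN is non-routed, and observed samples cannot prove a guaranteed maximum, so a passing result is a necessary condition rather than a sufficient one.

```python
import ipaddress

MAX_ROUND_TRIP_MS = 500  # limit stated in requirement 2


def same_subnet(addresses, prefix_length):
    """True if every address falls in one IP subnet of the given prefix length."""
    networks = {
        ipaddress.ip_interface(f"{addr}/{prefix_length}").network
        for addr in addresses
    }
    return len(networks) == 1


def latency_ok(round_trip_samples_ms):
    """True if the worst observed round trip stays within the limit."""
    return max(round_trip_samples_ms) <= MAX_ROUND_TRIP_MS


# Hypothetical node addresses across two sites and measured round trips.
node_addresses = ["10.0.0.11", "10.0.0.12", "10.0.0.21", "10.0.0.22"]
samples_ms = [38.0, 41.5, 40.2, 120.7]

print(same_subnet(node_addresses, 24))
print(latency_ok(samples_ms))
```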

In Windows Server 2003, you can use either a mirrored/replicated quorum disk or a new resource, Majority Node Set, for a multi-site cluster.
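
This FAQ does not describe how Majority Node Set decides whether a cluster may keep running. As general background, a majority-based quorum lets a group of nodes continue only while it contains more than half of the configured nodes. The sketch below shows that arithmetic; the function name and node counts are illustrative and this is not the cluster service's implementation.

```python
def has_majority(nodes_in_partition, total_nodes):
    """True if the partition holds more than half of the configured nodes."""
    return nodes_in_partition > total_nodes // 2


# Four nodes split evenly across two sites: if the inter-site link fails,
# neither side holds a majority.
print(has_majority(2, 4), has_majority(2, 4))

# Five nodes split three and two: only the three-node side holds a majority.
print(has_majority(3, 5), has_majority(2, 5))
```

One consequence for two-site designs is that an even split cannot survive the loss of the inter-site link without intervention, which is consistent with the earlier point that manual intervention is sometimes required.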

Q. What additional requirements are there on applications?

A. As with the Server cluster itself, applications are unaware of the extended nature of geographically dispersed clusters. There is no topology or configuration information provided to applications to make them aware of the different sites.

Typically, no changes are required to ensure that an application runs as expected on a geographically dispersed cluster. However, you should check with the application vendors. In some cases, different failure timeout periods may be required, since disk accesses and failover times may be longer due to the extended distance between sites and the need to provide mirroring or replication of data between sites.