Overview of Microsoft HPC Pack and SOA in Failover Clusters
Updated: August 6, 2014
Applies To: Microsoft HPC Pack 2012, Microsoft HPC Pack 2012 R2
This guide describes how you can configure the HPC Pack head node in a failover cluster and then configure Windows Communication Foundation (WCF) broker nodes in separate failover clusters. This topic provides an overview of the configuration for failover clusters within a single site or data center. For a detailed list of requirements, see Requirements for HPC Pack in Failover Clusters.
In this section
If you want to provide high availability in an HPC cluster that uses applications based on service-oriented architecture (SOA), you can configure your head node in a failover cluster and then configure WCF broker nodes in separate failover clusters. The failover clusters contain servers that work together, so if one server in a failover cluster fails, another server in that cluster automatically begins providing service (in a process known as failover).
The word “cluster” can refer to a head node with compute nodes and WCF broker nodes running HPC Pack, or to a set of servers running Windows Server that use the Failover Clustering feature. The word “node” can refer to one of the computers in the HPC Pack cluster, or to one of the servers in a failover cluster. In this guide, servers in the context of a failover cluster are usually referred to as “servers,” to distinguish failover cluster nodes from HPC cluster nodes. Also, the word “cluster” is used in phrases that make clear which type of cluster is meant.
Each of the servers in a failover cluster must have access to the failover cluster storage. Figure 1 shows the failover of head node services that can run on either of two servers in a failover cluster. (Starting in HPC Pack 2012, HPC Pack supports a larger number of servers in the failover cluster.)
Figure 1 Failover of head node services in HPC cluster
To provide high availability in an HPC cluster that uses SOA-based applications, you configure the head node in a failover cluster and also configure one or more additional failover clusters. To support the head node, you must also configure SQL Server, either as a SQL Server failover cluster (for higher availability) or as a standalone SQL Server. You must configure WCF broker nodes in separate failover clusters. Figure 2 shows a configuration that includes three failover clusters: one runs SQL Server (to support the head node), one runs the head node, and one runs WCF broker nodes. In Figure 2, two of the failover clusters have two nodes each, but the failover cluster that runs WCF broker nodes contains three nodes: two active and one passive (a passive node is inactive, but it is ready to provide service if needed).
Figure 2 Failover clusters supporting head node, SQL Server, and WCF broker nodes
In Figure 2, the failover cluster storage for the head node includes one disk (LUN) for a clustered file server and one disk as a disk witness. A resource such as a disk witness is generally needed for a failover cluster that has an even number of nodes (the head node failover cluster in this example has two). For the WCF broker nodes, the failover cluster storage is needed because these nodes use Microsoft Message Queuing (also known as MSMQ). One instance of Message Queuing runs on each active WCF broker node, and each instance requires a disk in storage, so the failover cluster for the WCF broker nodes requires two disks in storage. That failover cluster has an odd number of nodes; therefore, it does not include a disk witness in the storage.
For clarity, Figure 2 shows a limited number of servers. Figure 3 shows more servers. As shown in Figure 3, in addition to having a failover cluster for the head node in your HPC cluster and a SQL Server that supports the head node, you can have multiple failover clusters for WCF broker nodes. In the same HPC cluster, you can also have individual WCF broker nodes, if those nodes serve only broker sessions that do not provide durable messages.
Figure 3 Multiple failover clusters and an additional WCF broker node
After you configure a WCF broker node to run in a failover cluster, if you decide to change it back to a broker node running on a standalone server, you must uninstall and then reinstall the broker node.
When the head node and broker nodes are in failover clusters, multiple failover clusters are required. Figure 4 illustrates that when you configure multiple failover clusters, you must limit the exposure of each storage volume or logical unit number (LUN) to the nodes in one failover cluster:
Figure 4 Two failover clusters, each with its own LUNs
Note that for the maximum availability of any server, it is important to follow best practices for server management—for example, carefully managing the physical environment of the servers, testing software changes before fully implementing them, and carefully keeping track of software updates and configuration changes on all servers in a failover cluster.
The following overview of node and configuration requirements provides supporting details for Figures 1, 2, and 3.
For detailed information about hardware, software, and networking requirements for running HPC Pack in a failover cluster, see Requirements for HPC Pack in Failover Clusters.
The following list provides a broad overview of the configuration requirements for deploying HPC Pack in failover clusters where the servers are running Windows Server.
If you choose to configure WCF broker nodes in failover clusters, you must configure the head node in a separate failover cluster.
As shown in the preceding figures, each failover cluster requires disks (LUNs) in cluster storage, and every node in a given failover cluster must be connected to the same disks as the other nodes in that failover cluster. For each failover cluster, two or three disks in cluster storage are usually required:
In the failover cluster that supports the head node, one disk in cluster storage is the disk witness, which is generally needed for a failover cluster with an even number of nodes. Another disk is used for the clustered file server that is a part of the head-node configuration.
In the failover cluster that supports one or more WCF broker nodes, each active node requires a disk in storage (to support the instance of Message Queuing running on that active node). If the failover cluster that supports a WCF broker node has an even number of failover cluster nodes, it generally requires an additional disk in cluster storage for the disk witness.
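The disk-count rules above can be sketched as a small helper. This is an illustrative calculation only, based on the rules stated in this guide: one Message Queuing disk per active WCF broker node, one clustered file server disk for the head node failover cluster, and a disk witness whenever a failover cluster has an even number of nodes. The function names are invented for this sketch.

```python
def head_node_cluster_disks(node_count):
    """Disks (LUNs) for the failover cluster that runs the head node:
    one disk for the clustered file server, plus a disk witness when
    the cluster has an even number of nodes."""
    file_server_disk = 1
    witness = 1 if node_count % 2 == 0 else 0
    return file_server_disk + witness

def broker_cluster_disks(node_count, active_nodes):
    """Disks (LUNs) for a failover cluster of WCF broker nodes:
    one disk per active node (for its Message Queuing instance),
    plus a disk witness when the node count is even."""
    if active_nodes > node_count:
        raise ValueError("active nodes cannot exceed total nodes")
    msmq_disks = active_nodes
    witness = 1 if node_count % 2 == 0 else 0
    return msmq_disks + witness

# The configuration in Figure 2: a two-node head node cluster needs
# 2 disks; a three-node broker cluster with two active nodes needs
# 2 disks (odd node count, so no disk witness).
print(head_node_cluster_disks(2))  # 2
print(broker_cluster_disks(3, 2))  # 2
```

Note how the three-node broker cluster in Figure 2 needs only the two Message Queuing disks, while a two-node broker cluster with two active nodes would need a third disk for the witness.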
Figures 2 and 3 show multiple failover clusters that support WCF broker nodes. The maximum numbers of servers and failover clusters are greater than shown in the figures. You can have as many as 8 failover clusters that support WCF broker nodes within an HPC cluster. In each of these failover clusters, you can have a minimum of 2 nodes and a maximum of 16 or 64 nodes (depending on your version of Windows Server).
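These limits can be expressed as a quick validation sketch. The numbers (at most 8 broker failover clusters per HPC cluster, 2 to 16 or 64 nodes each) come from this guide; the function name and structure are illustrative assumptions.

```python
def validate_broker_clusters(cluster_node_counts, max_nodes_per_cluster=16):
    """Check the WCF broker failover cluster limits described in this
    guide: at most 8 failover clusters of WCF broker nodes per HPC
    cluster, each with 2 to 16 nodes (or up to 64 nodes, depending on
    the Windows Server version -- pass max_nodes_per_cluster=64 then).

    cluster_node_counts: one entry per broker failover cluster, giving
    the number of failover cluster nodes in that cluster."""
    if len(cluster_node_counts) > 8:
        return False
    return all(2 <= n <= max_nodes_per_cluster for n in cluster_node_counts)

print(validate_broker_clusters([2, 3, 16]))  # True
print(validate_broker_clusters([1, 2]))      # False: below the 2-node minimum
print(validate_broker_clusters([32], max_nodes_per_cluster=64))  # True
```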
As shown in Figure 3, you can mix WCF broker nodes that are in failover clusters with individual WCF broker nodes that are not in failover clusters.
In HPC clusters where you want to configure high availability for WCF broker nodes, when you choose the network topology, we recommend either Topology 2 or Topology 4 (the topology shown in Figures 1, 2, and 3). In these topologies, there is an enterprise network and at least one other network. Using multiple networks in this way helps avoid single points of failure. For more information about network topologies in HPC Pack, see Appendix 1: HPC Cluster Networking in the Getting Started Guide for HPC Pack.
For a failover cluster that runs one or more WCF broker nodes, choosing the ratio of active nodes to passive nodes involves weighing several factors: your hardware budget, your availability requirements, and the performance requirements of your SOA applications. For example, for higher availability and performance, but at a higher hardware cost for your WCF broker nodes, you could have one passive node for every active node in your failover clusters. For lower cost, with somewhat reduced availability and performance, you could have a failover cluster with mostly active nodes, for example, an eight-node cluster with seven active nodes and one passive node; in that configuration, the failure of more than one node could impact performance and possibly also availability.
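The trade-off just described can be made concrete with a small calculation. The eight-node example (seven active, one passive) is from this guide; the function itself is only an illustrative model that assumes each active broker node contributes equally and that, in the worst case, every failure hits an active node.

```python
def active_nodes_after_failures(active, passive, failed):
    """Active WCF broker node roles still running after `failed` server
    failures, assuming passive nodes take over failed roles first
    (worst case: every failure affects an active role)."""
    # Passive nodes absorb failures until they run out; after that,
    # each additional failure reduces the number of active roles.
    absorbed = min(failed, passive)
    return max(0, active - (failed - absorbed))

# One passive per active (4 + 4): two failures leave all 4 active
# roles running, at the cost of four extra servers.
print(active_nodes_after_failures(4, 4, 2))  # 4
# Seven active, one passive: the second failure reduces capacity.
print(active_nodes_after_failures(7, 1, 2))  # 6
```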
Starting in HPC Pack 2012, you can configure more than two HPC Pack head nodes in a failover cluster and the failover cluster can span multiple sites (typically two). This configuration includes head nodes deployed in separate geographic regions, and allows the HPC cluster to continue to schedule and run jobs in case an entire site is unavailable.
The detailed steps for creating a multisite HPC Pack 2012 failover cluster differ from the scenario and steps in this guide, and they require advanced networking and Failover Clustering configuration that is beyond the scope of this guide. For important considerations, see Multisite configuration options in Overview of Configuring the HPC Pack Head Node for Failover.
This section summarizes some of the differences between running the head node for HPC Pack on a single server and running it in a failover cluster.
The following table summarizes what happens to the main HPC Pack services and resources during failover of the head node. Some items may not apply to your version or configuration of HPC Pack.
HPC SDM Store Service, HPC Job Scheduler Service, HPC Session Service, HPC Diagnostics Service, HPC Monitoring Server Service (starting in HPC Pack 2012), and HPC SOA Diag Mon Service (starting in HPC Pack 2012): These services fail over to another server in the failover cluster.
File shares that are used by the head node, such as REMINST: Ownership fails over to another server in the failover cluster.
HPC Management Service, HPC MPI Service, HPC Node Manager Service, HPC Reporting Service, and HPC Monitoring Client Service (starting in HPC Pack 2012): These services start automatically and run on each individual server. The failover cluster does not monitor these services for failure.
File sharing for compute nodes: Fails over to another server in the failover cluster if configured through Failover Cluster Manager.
As the previous diagrams in this topic show, a configuration with WCF broker nodes in failover clusters can contain many physical servers, many failover clusters, and many clustered instances of WCF broker nodes (that is, instances of WCF broker nodes that can fail over from one server to another in a particular failover cluster). Because of this, we recommend that you plan your naming scheme carefully before you start to configure WCF broker nodes in failover clusters. You will need the following names for WCF broker nodes that run in a failover cluster:
A unique name for each physical server.
A unique name for each failover cluster.
A unique name for each clustered instance of a WCF broker node running in the failover cluster. Each of these clustered instances runs the HPC Broker Service and Message Queuing, and they can fail over (or be moved) from one physical server to another in a particular failover cluster.
The two administrative consoles that are mentioned most often in this document, HPC Cluster Manager and Failover Cluster Manager, provide somewhat different information. In HPC Cluster Manager, you see the names of physical servers. In Failover Cluster Manager, you see the names of physical servers (nodes in the failover cluster), and you also see the names of failover clusters and clustered instances within those failover clusters. When you view HPC Cluster Manager, you might want a straightforward way to tell which physical servers are related to a given failover cluster or to a given clustered instance inside that failover cluster. Because of this, we recommend that you carefully plan the naming scheme for your WCF broker nodes.
The following table shows an example of a naming scheme. In this example, the servers that are together in a failover cluster have names that are identical except for the last character. The name for the failover cluster is similar, but it drops the last character and ends with Cluster. Similarly, the names for the clustered instances end with Inst plus a number. Shorter names would also work, as long as you can readily tell from each name what it refers to and which failover cluster it belongs to.
Example names for physical servers in a failover cluster
Example of a name for the failover cluster
Example of names for two clustered instances that run in the failover cluster
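A naming scheme like the one described above could be generated or checked programmatically. The sketch below follows the pattern in this guide (shared prefix, Cluster suffix, Inst suffix plus a number), but all concrete names it produces, including the CONTOSO-BN prefix, are hypothetical examples invented for illustration; substitute your own convention.

```python
def broker_naming_scheme(prefix, server_count, instance_count):
    """Generate a consistent set of names for one broker failover
    cluster: physical servers share a prefix and differ only in the
    last character, the failover cluster name ends with 'Cluster',
    and clustered instance names end with 'Inst' plus a number."""
    servers = [f"{prefix}{i}" for i in range(1, server_count + 1)]
    cluster = f"{prefix}Cluster"
    instances = [f"{prefix}Inst{i}" for i in range(1, instance_count + 1)]
    return servers, cluster, instances

# Hypothetical example: a two-server failover cluster that runs two
# clustered instances of WCF broker nodes.
servers, cluster, instances = broker_naming_scheme("CONTOSO-BN", 2, 2)
print(servers)    # ['CONTOSO-BN1', 'CONTOSO-BN2']
print(cluster)    # CONTOSO-BNCluster
print(instances)  # ['CONTOSO-BNInst1', 'CONTOSBNInst2'] -- see test
```

With names like these, an administrator looking at HPC Cluster Manager (which shows physical server names) can immediately tell which failover cluster and clustered instances a given server belongs to in Failover Cluster Manager.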