Overview of Configuring the HPC Pack Head Node for Failover
Updated: August 6, 2013
Applies To: Microsoft HPC Pack 2012, Microsoft HPC Pack 2012 R2
This guide describes how you can configure the head node of a Windows HPC cluster in a failover cluster. This topic provides an overview of the configuration for a failover cluster within a single site or data center. For a detailed list of requirements for the configuration, see Requirements for HPC Pack in Failover Clusters.
In an HPC cluster, if you want to provide high availability for the head node, you can configure it in a failover cluster. The failover cluster contains servers that work together, so if one server in the failover cluster fails, another server in the cluster automatically begins providing service (in a process known as failover).
The word “cluster” can refer to a head node with compute nodes running HPC Pack, or to a set of servers running Windows Server that are using the Failover Clustering feature. The word “node” can refer to one of the computers in an HPC Pack cluster, or to one of the servers in a failover cluster. In this guide, the members of a failover cluster are usually called “servers” to distinguish them from HPC cluster nodes. Also, the word “cluster” is qualified (“failover cluster” or “HPC cluster”) to show which type of cluster is meant.
Each of the servers in a failover cluster must have access to the failover cluster shared storage. Figure 1 shows the failover of head node services that can run on either of two servers in a failover cluster. (Starting in HPC Pack 2012, HPC Pack supports a larger number of servers in the failover cluster for the head node.)
Figure 1 Failover of head node services in HPC cluster
To support the head node, you must also configure SQL Server, either as a SQL Server failover cluster (for higher availability) or as a standalone SQL Server. Figure 2 shows a configuration that includes a failover cluster that runs the head node and a failover cluster that runs SQL Server.
Figure 2 Failover clusters supporting the head node and SQL Server
In Figure 2, the failover cluster storage for the head node includes one disk (LUN) for a clustered file server and one disk as a disk witness. A resource such as a disk witness is generally needed for a failover cluster that has an even number of nodes (the head node failover cluster in this example has two).
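The need for a witness when the number of nodes is even comes down to majority arithmetic. The following Python sketch is a simplified model only: it assumes that each server and the witness holds one vote and that the failover cluster keeps running while a strict majority of votes is reachable. The function names are hypothetical and the model ignores any refinements that the real quorum logic in Failover Clustering applies.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def has_quorum(servers_up: int, total_servers: int, witness: bool,
               witness_up: bool = True) -> bool:
    """Simplified model: one vote per server, plus one for an optional witness."""
    total_votes = total_servers + (1 if witness else 0)
    votes_present = servers_up + (1 if witness and witness_up else 0)
    return votes_present >= votes_needed(total_votes)


# Two servers, no witness: one surviving server holds 1 of 2 votes.
print(has_quorum(servers_up=1, total_servers=2, witness=False))  # False

# Two servers plus a disk witness: one server and the witness hold 2 of 3 votes.
print(has_quorum(servers_up=1, total_servers=2, witness=True))   # True
```

In this model, a two-server failover cluster without a witness cannot survive the loss of either server, which is why the example in Figure 2 includes a disk witness.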
When both the head node and SQL Server run in failover clusters, one option is to use separate failover clusters. Another option is to deploy the clustered head node services and clustered SQL Server on the same failover cluster nodes. Figure 3 illustrates that when you configure separate failover clusters for the head node and SQL Server, you must limit the exposure of each storage volume or logical unit number (LUN) to the nodes in one failover cluster:
Figure 3 Two failover clusters, each with its own LUNs
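The LUN exposure rule can be checked mechanically against an inventory of which servers can see which LUNs. The following Python sketch assumes such an inventory is available as plain dictionaries; the function name, server names, and LUN names are all hypothetical, and the code does not query any storage or Failover Clustering API.

```python
from typing import Dict, List, Set


def luns_shared_across_clusters(
    lun_exposure: Dict[str, Set[str]],
    cluster_members: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Map each offending LUN to the failover clusters whose servers can see it."""
    violations = {}
    for lun, servers in lun_exposure.items():
        clusters = sorted(
            name for name, members in cluster_members.items() if servers & members
        )
        if len(clusters) > 1:
            violations[lun] = clusters
    return violations


# Hypothetical inventory: two failover clusters with two servers each.
cluster_members = {
    "head-node-cluster": {"HN1", "HN2"},
    "sql-cluster": {"SQL1", "SQL2"},
}
lun_exposure = {
    "LUN-fileserver": {"HN1", "HN2"},
    "LUN-hn-witness": {"HN1", "HN2"},
    "LUN-sql-data": {"SQL1", "SQL2", "HN1"},  # exposed to HN1 on purpose
}

print(luns_shared_across_clusters(lun_exposure, cluster_members))
# {'LUN-sql-data': ['head-node-cluster', 'sql-cluster']}
```

An empty result means that every LUN in the inventory is exposed to the servers of at most one failover cluster.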
For the maximum availability of any server, it is important to follow best practices for server management. For example, carefully manage the physical environment of the servers, test software changes before fully implementing them, and keep careful track of software updates and configuration changes on all servers in a failover cluster.
When the head node is configured in a failover cluster, for the network topology, we recommend either Topology 2 or Topology 4 (the topology shown in Figures 1 and 2). In these topologies, there is an enterprise network and at least one other network. Using multiple networks in this way helps avoid single points of failure. For more information about network topologies, see Requirements for HPC Pack in Failover Clusters.
For more information about network topologies for HPC Pack, see Appendix 1: HPC Cluster Networking in the Getting Started Guide for HPC Pack.
Starting in HPC Pack 2012, you can configure more than two HPC Pack head nodes in a failover cluster and the failover cluster can span multiple sites (typically two). This configuration includes head nodes deployed in separate geographic regions, and allows the HPC cluster to continue to schedule and run jobs in case an entire site is unavailable.
The detailed steps for creating a multisite head node failover cluster differ from the scenario and steps in this guide, and require advanced networking and Failover Clustering configuration. These steps are beyond the scope of this guide, but the following considerations are important:
The sites must be connected by a stretch virtual local area network (VLAN) so that the head nodes in the sites are deployed to the same subnet. Because of this requirement, only Topology 5 (all nodes on an enterprise network) is supported for HPC nodes in a multisite failover cluster. For information about stretch VLANs and failover clusters, see Requirements and Recommendations for a Multi-Site Failover Cluster.
To avoid a single point of failure, you should deploy two or more clustered head nodes in each site. If only a single head node is deployed in each site, a network interruption between the sites could cause unexpected failovers between the sites.
The head nodes in the multisite failover cluster should be configured first to fail over within a site, and then to fail over to the other site.
Multisite failover cluster solutions generally require an advanced quorum configuration such as a file share witness or a third-party storage solution that can span the sites and act as the quorum device for the cluster that is located in a third, unrelated site. For information about quorum configurations, see Configure and Manage the Quorum in a Windows Server 2012 Failover Cluster.
The solution can include HPC cluster compute nodes that are deployed in the common subnet in both sites. This gives you flexibility to schedule HPC cluster jobs against a common pool of compute nodes, regardless of which site is currently running the head node services.
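The recommendation to fail over within a site before failing over to the other site amounts to an ordering of candidate servers. The following Python sketch shows one way to compute such an order from a mapping of head nodes to sites. The function name, server names, and site names are hypothetical; in a real deployment the order is configured through Failover Clustering, not computed by code like this.

```python
from typing import Dict, List


def failover_order(current_owner: str, site_of: Dict[str, str]) -> List[str]:
    """List the other head nodes, same-site servers first, then remote ones."""
    home_site = site_of[current_owner]
    candidates = [server for server in site_of if server != current_owner]
    # sorted() is stable and False sorts before True, so same-site servers
    # come first and each group keeps its original relative order.
    return sorted(candidates, key=lambda server: site_of[server] != home_site)


# Hypothetical layout: two clustered head nodes in each of two sites.
site_of = {"HN1": "SiteA", "HN2": "SiteA", "HN3": "SiteB", "HN4": "SiteB"}

print(failover_order("HN1", site_of))  # ['HN2', 'HN3', 'HN4']
print(failover_order("HN3", site_of))  # ['HN4', 'HN1', 'HN2']
```

With two head nodes per site, the first candidate is always in the same site as the current owner, so a single server failure does not move the head node services across sites.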
This section summarizes some of the differences between running the HPC Pack head node on a single server and running it in a failover cluster.
The following table summarizes what happens to the main HPC Pack services and resources during failover of the head node. Some items may not apply to your version or configuration of HPC Pack.
| Service or resource | What happens in a failover cluster |
| --- | --- |
| HPC SDM Store Service; HPC Job Scheduler Service; HPC Session Service; HPC Diagnostics Service; HPC Monitoring Server Service (starting in HPC Pack 2012); HPC SOA Diag Mon Service (starting in HPC Pack 2012) | Fail over to another server in the failover cluster. |
| File shares that are used by the head node, such as REMINST | Ownership fails over to another server in the failover cluster. |
| HPC Management Service; HPC MPI Service; HPC Node Manager Service; HPC Reporting Service; HPC Monitoring Client Service (starting in HPC Pack 2012) | Start automatically and run on each individual server. The failover cluster does not monitor these services for failure. |
| File sharing for compute nodes | Fails over to another server in the failover cluster if configured through Failover Cluster Manager. |
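For scripting or monitoring checklists, the service behavior in the table can be encoded as a simple lookup. The following Python sketch copies the service names from the table; the constant and function names are hypothetical, and the code only classifies names without querying HPC Pack or Failover Clustering.

```python
FAILS_OVER = "fails over to another server"
RUNS_ON_EACH_SERVER = "runs on each server, not monitored by the failover cluster"

# Service names as listed in the table; some may not apply to every version.
SERVICE_BEHAVIOR = {
    "HPC SDM Store Service": FAILS_OVER,
    "HPC Job Scheduler Service": FAILS_OVER,
    "HPC Session Service": FAILS_OVER,
    "HPC Diagnostics Service": FAILS_OVER,
    "HPC Monitoring Server Service": FAILS_OVER,
    "HPC SOA Diag Mon Service": FAILS_OVER,
    "HPC Management Service": RUNS_ON_EACH_SERVER,
    "HPC MPI Service": RUNS_ON_EACH_SERVER,
    "HPC Node Manager Service": RUNS_ON_EACH_SERVER,
    "HPC Reporting Service": RUNS_ON_EACH_SERVER,
    "HPC Monitoring Client Service": RUNS_ON_EACH_SERVER,
}


def services_with(behavior: str) -> list:
    """Return the sorted names of services that show the given behavior."""
    return sorted(name for name, b in SERVICE_BEHAVIOR.items() if b == behavior)


# Services that need separate monitoring because the failover cluster
# does not watch them for failure.
for name in services_with(RUNS_ON_EACH_SERVER):
    print(name)
```

Separating the two groups makes it easier to see which services need their own health checks on every server in the failover cluster.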