Running the HPC Pack Head Node in a Failover Cluster

Applies To: Microsoft HPC Pack 2012, Microsoft HPC Pack 2012 R2

This section provides information about running HPC Pack in a failover cluster, and it describes the failover process.

Important

  • For connections to a head node that is configured in the context of a failover cluster, do not use the name of a physical server. Use the name of the clustered head node (file server) that appears in Failover Cluster Manager.

  • To manage the starting and stopping of any service or resource that is configured within the failover cluster, use Failover Cluster Manager (not Server Manager or HPC Cluster Manager). You can see a list of these services and resources in Failover Cluster Manager by clicking the clustered instance of the head node. For more information about managing a failover cluster by using Failover Cluster Manager, see the TechNet Library documentation for Windows Server.

The failover process

When a failover cluster server within an HPC cluster fails, the specific services that are supported by that server begin to run on another server in that failover cluster. The steps in failing over are as follows:

  1. Detection: A failure is detected.

  2. Failover: The head node fails over to another server in the failover cluster.

  3. Client reconnect: Following a failure, clients reconnect. For the head node, this means that job scheduler clients reconnect to the HPC Job Scheduler Service on the server that is now the head node. The actual location of the service (on a server in the failover cluster) does not matter, because it appears to the clients under one consistent name offered by the failover cluster. Management clients will retry until they can reconnect to the HPC Management Service.
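The client reconnect step can be pictured as a simple retry loop. The following Python sketch is illustrative only and is not part of HPC Pack: the function name connect_with_retry, its parameters, and the use of ConnectionError are assumptions made for this example. The caller supplies a connect function that targets the clustered head node name, never a physical server name.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 10,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or the attempts run out.

    Hypothetical helper: connect() is expected to open a connection by
    using the clustered head node name, so the same call works no matter
    which server in the failover cluster currently hosts the service.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError as error:
            # The service may still be failing over; wait and try again.
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
    raise ConnectionError(
        f"could not reconnect after {attempts} attempts"
    ) from last_error
```

Because the client addresses only the clustered name, the loop does not need to know which physical server took over; it only needs to keep trying until the service is available again.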

Failure detection in a failover cluster

The servers in a failover cluster monitor one another through periodic network signals called heartbeats. By default, if a server misses five heartbeats, communication with that server is considered to have failed. You can use Failover Cluster Manager to configure the thresholds at which a server is considered to have failed.
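The threshold logic can be sketched in a few lines of Python. This is a simplified model and not the Failover Clustering implementation: the class name HeartbeatMonitor is hypothetical, and the sketch assumes that the misses must be consecutive, so a received heartbeat resets the count.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks missed heartbeats from one peer server (illustrative model)."""

    threshold: int = 5  # default number of missed heartbeats, per the text
    missed: int = 0     # current run of missed heartbeats

    def record(self, heartbeat_received: bool) -> bool:
        """Record one heartbeat interval.

        Returns True when the peer is considered to have failed.
        Assumption: only consecutive misses count toward the threshold.
        """
        if heartbeat_received:
            self.missed = 0
        else:
            self.missed += 1
        return self.missed >= self.threshold
```

Raising the threshold makes the cluster more tolerant of brief network interruptions, at the cost of slower detection of a real failure.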

You can also configure failover and failback settings in Failover Cluster Manager, but we recommend that you prevent failback unless you have a specific reason to allow it. By definition, failback causes the head node to return to running on a preferred physical server when possible. However, failback also causes a brief interruption in service. Preventing failback therefore decreases interruptions in service.

Failover Clustering also monitors some of the services (for example, the HPC Job Scheduler Service on the head node) to ensure that they are running. For detailed information about which services are monitored, see the tables at the end of the topics listed under Additional references.

Additional references

Configuring Microsoft HPC Pack for High Availability with SOA Applications 

Configuring the HPC Pack Head Node for High Availability