Running the Head Node in a Failover Cluster with Windows HPC Server 2008 R2

Applies To: Windows HPC Server 2008 R2

This section provides information about running Windows HPC Server 2008 R2 in a failover cluster, and it describes the failover process.

Important

  • When you connect to a head node that is configured in a failover cluster, use the name that appears in Failover Cluster Manager, not the name of a physical server. After the head node is configured in a failover cluster, it is not tied to a single physical server, and it does not have the name of a physical server. To see the name, in Failover Cluster Manager, in the appropriate failover cluster, expand Services and applications, select the clustered instance of the head node, and then, in the center pane, view the name under Server Name.

  • To start or stop any service or resource that is configured within the failover cluster, use the Failover Cluster Manager snap-in (not Server Manager or HPC Cluster Manager). To see a list of these services and resources, in Failover Cluster Manager, expand Services and applications, and then click the clustered instance of the head node. For more information about managing a failover cluster by using Failover Cluster Manager, see Managing a Failover Cluster (https://go.microsoft.com/fwlink/?LinkId=121208).

The failover process

When a server in the failover cluster fails, the services that were running on that server begin to run on another server in the same failover cluster. The failover process has the following steps:

  1. Detection: A failure is detected.

  2. Failover: The head node fails over to another server in the failover cluster.

  3. Client reconnect: After the failover, clients reconnect. For the head node, this means that job scheduler clients reconnect to the Job Scheduler on the server that is now the head node. It does not matter which server in the failover cluster is running the service, because the service appears to clients under one consistent name (offered by the failover cluster). Management clients retry until they can reconnect to a management service.
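The client reconnect step can be illustrated with a small retry loop. The following Python sketch is not part of Windows HPC Server 2008 R2 and does not use any HPC API. The function connect_with_retry, the simulated fake_connect function, and the name HPC-HEADNODE are hypothetical and are used only to show that a client that always connects to the clustered name can reconnect without knowing which physical server is currently the head node.

```python
import time


def connect_with_retry(connect, cluster_name, max_attempts=10,
                       delay_seconds=1.0, sleep=time.sleep):
    """Call connect(cluster_name) until it succeeds or attempts run out.

    connect is any callable that raises ConnectionError while the service
    is unavailable (hypothetical; stands in for a real client library).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Always use the clustered name, never a physical server name.
            return connect(cluster_name)
        except ConnectionError as error:
            last_error = error
            if attempt < max_attempts:
                sleep(delay_seconds)
    raise ConnectionError(
        f"could not reach {cluster_name!r} after {max_attempts} attempts"
    ) from last_error


# Simulated failover: the first two attempts fail while the head node moves.
attempts = []


def fake_connect(name):
    attempts.append(name)
    if len(attempts) < 3:
        raise ConnectionError("head node is failing over")
    return f"session on {name}"


# The sleep is replaced with a no-op so the simulation runs instantly.
session = connect_with_retry(fake_connect, "HPC-HEADNODE",
                             sleep=lambda seconds: None)
print(session)
```

In this sketch, every attempt uses the same clustered name, so the client code does not change when the head node moves to another server.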

Failure detection in a failover cluster

The servers in a failover cluster monitor one another through periodic network signals, called heartbeats. If a server misses five heartbeats, communication with that server is considered to have failed. In the Failover Cluster Manager snap-in, you can configure the thresholds at which a server is considered to have failed.
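The heartbeat rule can be modeled in a few lines. The following Python sketch is a simplified illustration, not the algorithm that failover clustering uses internally. The threshold of five missed heartbeats comes from the description above. The one-second heartbeat interval, the function detect_failed_servers, and the server names are assumptions made for this example.

```python
def detect_failed_servers(last_heartbeat, now,
                          interval_seconds=1.0, missed_threshold=5):
    """Return the servers whose missed heartbeats reached the threshold.

    last_heartbeat maps a server name to the time (in seconds) at which
    its most recent heartbeat was received. The interval is an assumed
    value for illustration only.
    """
    failed = []
    for server, last_seen in last_heartbeat.items():
        # Number of whole heartbeat intervals that passed with no signal.
        missed = int((now - last_seen) // interval_seconds)
        if missed >= missed_threshold:
            failed.append(server)
    return sorted(failed)


# NODE1 has been silent for 5 seconds; NODE2 was heard 0.5 seconds ago.
last_heartbeat = {"NODE1": 100.0, "NODE2": 104.5}
failed = detect_failed_servers(last_heartbeat, now=105.0)
print(failed)
```

Under these assumptions, the time to detect a failure is approximately the interval multiplied by the threshold, so a higher threshold tolerates brief network problems at the cost of slower detection.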

You can also configure failover and failback settings in Failover Cluster Manager, but we recommend that you prevent failback unless you have a specific reason to allow it. By definition, failback causes the head node to return to running on a preferred physical server when possible. However, failback also causes a brief interruption in service. Preventing failback therefore decreases interruptions in service.
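The effect of the failback setting on interruptions can be shown with a simple count. The following Python sketch is a simplified model under stated assumptions: there are two servers, each move of the head node counts as one brief interruption, and failback happens as soon as the preferred server is available again. The function count_interruptions and the server names are hypothetical, and the model ignores details such as failback time windows.

```python
def count_interruptions(events, preferred, standby, allow_failback):
    """Count how many times the head node moves between two servers.

    events is a list of (server, status) pairs, where status is
    "down" or "up". Each move is treated as one service interruption.
    """
    up = {preferred: True, standby: True}
    owner = preferred
    moves = 0
    for server, status in events:
        up[server] = (status == "up")
        other = standby if owner == preferred else preferred
        if not up[owner] and up[other]:
            # Failover: the current owner failed, so the other server takes over.
            owner = other
            moves += 1
        elif allow_failback and owner != preferred and up[preferred]:
            # Failback: return to the preferred server when it is available.
            owner = preferred
            moves += 1
    return moves


# The preferred server (NODE1) fails and later comes back online.
events = [("NODE1", "down"), ("NODE1", "up")]
with_failback = count_interruptions(events, "NODE1", "NODE2", allow_failback=True)
without_failback = count_interruptions(events, "NODE1", "NODE2", allow_failback=False)
print(with_failback, without_failback)
```

In this model, the same failure causes two interruptions when failback is allowed and one interruption when failback is prevented.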

Failover clustering also monitors some of the services (for example, the HPC Job Scheduler Service on the head node) to ensure that they are running. For detailed information about which services are monitored, see the tables at the end of the topics that are listed under Additional references.

Additional references

Configuring Windows HPC Server 2008 R2 for High Availability with SOA Applications (https://go.microsoft.com/fwlink/?LinkId=198300)

Configuring Windows HPC Server 2008 R2 for High Availability of the Head Node (https://go.microsoft.com/fwlink/?LinkId=198285)