Managing Windows HPC Server 2008 in a Failover Cluster

05/18/2009

Applies To: Windows HPC Server 2008

This section describes how to manage Windows HPC Server 2008 in a failover cluster, and offers some best practices.

Two management consoles are available: the Failover Cluster Management snap-in and the HPC Cluster Manager snap-in. For more information about managing a failover cluster, see Managing a Failover Cluster (https://go.microsoft.com/fwlink/?LinkId=121208).

The failover process

When the server that is currently acting as the head node fails, management and network services will begin to run on the other server in the failover cluster. For more information, see “Windows HPC Server 2008 in a failover cluster” in Overview and Requirements for Windows HPC Server 2008 in a Failover Cluster. The steps in failing over and failing back are:

Detection: A failure is detected.
Failover: The head node fails over to the other server in the failover cluster.
Client reconnect: Job scheduler clients reconnect to the Job Scheduler on the server that is now the head node.
Failback: After the failed server is fixed, services are returned to that server and begin running there again.
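
The four steps above can be illustrated with a toy model. This is only a sketch of the sequence, not how failover clustering is implemented: the class, its methods, and the server names SERVER-A and SERVER-B are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TwoNodeCluster:
    """Toy model of a two-server failover cluster (all names are hypothetical)."""
    servers: tuple = ("SERVER-A", "SERVER-B")
    owner: str = "SERVER-A"  # server currently acting as the head node
    healthy: dict = field(
        default_factory=lambda: {"SERVER-A": True, "SERVER-B": True}
    )
    log: list = field(default_factory=list)

    def other(self, name):
        """Return the name of the other server in the cluster."""
        return self.servers[1] if name == self.servers[0] else self.servers[0]

    def fail(self, name):
        """Detection, followed by failover if the failed server was the head node."""
        self.healthy[name] = False
        self.log.append(f"detected failure of {name}")
        if name == self.owner and self.healthy[self.other(name)]:
            self.owner = self.other(name)
            self.log.append(f"failover to {self.owner}")

    def connect(self):
        """Client reconnect: clients reach whichever server is the head node now."""
        return self.owner

    def repair_and_fail_back(self, name):
        """Failback: the repaired server takes the head node role again."""
        self.healthy[name] = True
        self.owner = name
        self.log.append(f"failback to {name}")


cluster = TwoNodeCluster()
cluster.fail("SERVER-A")
print(cluster.connect())
cluster.repair_and_fail_back("SERVER-A")
print(cluster.connect())
```

In this model, clients never name a specific server. They call `connect()` and are directed to the current owner, which mirrors the single consistent name that the failover cluster presents.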

Head node failure detection

Failover clustering monitors processes on servers in the failover cluster through periodic heartbeats. If a server misses five heartbeats, communication with that server is considered to have failed. Failover clustering also monitors services to ensure that they are running.

In the Failover Cluster Management snap-in, you can configure the thresholds at which a server is considered to have failed.
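
As a rough illustration of threshold-based heartbeat detection, the following sketch counts missed heartbeats and reports a failure once a configurable threshold is reached. The class and parameter names are hypothetical. The sketch also assumes that misses are counted consecutively and that a received heartbeat resets the count, which this article does not state.

```python
class HeartbeatMonitor:
    """Hypothetical heartbeat tracker for one peer server."""

    def __init__(self, missed_threshold=5):
        # Number of missed heartbeats after which the peer is considered failed.
        self.missed_threshold = missed_threshold
        self.missed = 0

    def record(self, heartbeat_received):
        """Record one heartbeat interval.

        Returns True when the peer should be considered failed.
        """
        if heartbeat_received:
            # Assumption: any received heartbeat resets the miss counter.
            self.missed = 0
        else:
            self.missed += 1
        return self.missed >= self.missed_threshold


monitor = HeartbeatMonitor(missed_threshold=5)
for received in [True, False, False, False, False, False]:
    failed = monitor.record(received)
print(failed)
```

Raising `missed_threshold` makes detection more tolerant of brief network interruptions, at the cost of a slower reaction to a real failure.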

There is no failover detection for replicated management services. If a replicated service fails on one server in the failover cluster, the services on the other server in the failover cluster continue to serve all clients.

Head node failure operation

If the failover cluster detects failure of the services that are provided by the head node, it starts the Job Scheduler service and the SDM Management service on the other server in the failover cluster. The management service may continue to run on the server it originally started on, with existing management clients continuing to connect to it there. New management clients will connect to the management service running on the other server.

If all the services (including the management services) fail over, then clients connect to the services on the server where they are currently running.

Client reconnect

When job scheduler clients are disconnected, they retry until they can reconnect to a server running the Job Scheduler service. The actual location of the service (on a server in the failover cluster) does not matter, because it appears to the clients under one consistent name (offered by the failover cluster). Management clients will retry until they can reconnect to a management service.
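
The retry behavior described above can be sketched as a simple loop. This is not the HPC client implementation: `reconnect`, its parameters, and the `connect` callable are hypothetical, and the fixed delay between attempts is an assumption.

```python
import time


def reconnect(connect, retry_delay_seconds=1.0, max_attempts=None, sleep=time.sleep):
    """Call connect() until it succeeds and return its result.

    connect is a hypothetical callable that opens a session against the
    cluster's consistent name and raises ConnectionError while the service
    is unavailable. With max_attempts=None the loop retries indefinitely.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            return connect()
        except ConnectionError:
            # The service may still be starting on the other server; wait and retry.
            sleep(retry_delay_seconds)
    raise ConnectionError(f"could not reconnect after {attempt} attempts")
```

Because the client always targets the same name, the loop does not need to know which server in the failover cluster is currently running the service.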

Repairing the head node and failback

After failover, you repair or replace the server that is having problems, and then restore (fail back) the services to that server through the Failover Cluster Management snap-in.