Performing Maintenance on WCF Broker Nodes in a Failover Cluster with Microsoft HPC Pack

Applies To: Microsoft HPC Pack 2012, Microsoft HPC Pack 2012 R2

This topic explains how to perform maintenance on WCF broker nodes in a failover cluster where the WCF broker nodes are running HPC Pack. For information about the process of configuring a WCF broker node in a failover cluster, see Steps for Setting up Microsoft HPC Pack with Failover Clustering for WCF Broker Nodes.

Overview of performing maintenance on WCF broker nodes in a failover cluster

For the sequence of actions to follow when performing maintenance on servers that run one or more WCF broker nodes in a failover cluster, see one of the following sections:

  • Performing maintenance on one physical server at a time

  • Performing maintenance on all the servers in a failover cluster at the same time

Performing maintenance on one physical server at a time

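If you can perform maintenance on one server at a time, plan to start with standby servers (servers that are not currently running a clustered instance of a WCF broker node) and finish with active servers (servers that are running a clustered instance). This way, a clustered instance is moved only when the server that owns it is about to be serviced. The following list outlines the sequence of actions to take: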

  1. Use HPC Cluster Manager to take a physical server (WCF broker node) offline, and optionally, to shrink running jobs on the server you are taking offline. This ensures that the server will not accept additional jobs. For more information, see Use HPC Cluster Manager to take a physical server offline or bring it online, later in this topic.

  2. Use Failover Cluster Manager to see where the clustered instance of the WCF broker node is currently running. If it is running on the server that you want to perform maintenance on, move the clustered instance to a different server. For more information, see Use Failover Cluster Manager to control a clustered instance.

  3. Use Failover Cluster Manager to pause the failover cluster node (server) that you want to perform maintenance on. For more information, see Use Failover Cluster Manager to pause or resume a failover cluster node.

  4. Perform the necessary maintenance on the server.

  5. Use Failover Cluster Manager to resume the failover cluster node that you performed maintenance on. For more information, see Use Failover Cluster Manager to pause or resume a failover cluster node.

  6. Use HPC Cluster Manager to bring the physical server (WCF broker node) online. This allows the server to begin accepting jobs again. For more information, see Use HPC Cluster Manager to take a physical server offline or bring it online, later in this topic.

  7. Repeat steps 1 through 6 for each remaining server, finishing with the active servers.
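The ordering rules in the steps above (standby servers first, and a clustered instance moved off its owner before that server is paused) can be sketched in Python as a plan generator. This is a hypothetical helper, not part of HPC Pack or Windows Server: the function name rolling_maintenance_plan, the server and instance names, and the owners mapping (clustered instance name to the server shown as its Owner Node in Failover Cluster Manager) are assumptions for illustration. It only returns the order in which to perform the documented actions; it does not call any cluster API.

```python
from typing import Dict, List


def rolling_maintenance_plan(servers: List[str], owners: Dict[str, str]) -> List[str]:
    """Hypothetical helper: order the one-server-at-a-time maintenance steps.

    servers: physical servers (WCF broker nodes) in the failover cluster.
    owners:  clustered instance name -> server currently shown as its Owner Node.
    Nothing here calls HPC Pack or Failover Clustering; it only returns text.
    """
    if len(servers) < 2:
        raise ValueError("at least two servers are needed to move clustered instances")
    owners = dict(owners)  # private copy, updated as instances are moved
    active = set(owners.values())
    # Standby servers (False sorts first) before active servers; sorted() is
    # stable, so the caller's order is kept within each group.
    ordered = sorted(servers, key=lambda s: s in active)
    done: List[str] = []
    steps: List[str] = []
    for server in ordered:
        steps.append(f"HPC Cluster Manager: take {server} offline")
        for instance, owner in sorted(owners.items()):
            if owner != server:
                continue
            # Prefer a server that has already been maintained; otherwise use
            # the first other server in the plan.
            target = done[0] if done else next(s for s in ordered if s != server)
            owners[instance] = target
            steps.append(f"Failover Cluster Manager: move {instance} from {server} to {target}")
        steps.append(f"Failover Cluster Manager: pause {server}")
        steps.append(f"Perform maintenance on {server}")
        steps.append(f"Failover Cluster Manager: resume {server}")
        steps.append(f"HPC Cluster Manager: bring {server} online")
        done.append(server)
    return steps


if __name__ == "__main__":
    # Hypothetical names: three servers, one clustered instance owned by BN1.
    for step in rolling_maintenance_plan(["BN1", "BN2", "BN3"], {"Broker1": "BN1"}):
        print(step)
```

Preferring a server that has already been maintained as the move target means that, once at least one server has been serviced, a moved instance lands on a server that will not be taken down again in the same pass.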

Performing maintenance on all the servers in a failover cluster at the same time

If you must perform maintenance on all the servers in a failover cluster at the same time, plan for downtime, and notify users as appropriate. The following list outlines the sequence of actions to take:

  1. Use HPC Cluster Manager to take all affected WCF broker nodes offline, and optionally, to shrink running jobs on the servers you are taking offline. This ensures that those servers will not accept additional jobs. For more information, see Use HPC Cluster Manager to take a physical server offline or bring it online, later in this topic.

  2. Use Failover Cluster Manager to take offline all clustered instances of WCF broker nodes that are running in the failover cluster. For more information, see Use Failover Cluster Manager to control a clustered instance.

  3. Perform the necessary maintenance on the servers.

  4. Use Failover Cluster Manager to bring all clustered instances of WCF broker nodes online. For more information, see Use Failover Cluster Manager to control a clustered instance.

  5. Use HPC Cluster Manager to bring all affected WCF broker nodes online. This allows these servers to begin accepting jobs again. For more information, see Use HPC Cluster Manager to take a physical server offline or bring it online, later in this topic.
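The downtime sequence above has a simple shape: the HPC layer and the failover cluster layer are taken down in one order and brought back in the reverse order. The following minimal Python sketch makes that shape explicit. The function name downtime_maintenance_plan and its parameters are hypothetical and are not part of any HPC Pack or Windows Server API; the function only returns step descriptions.

```python
from typing import List


def downtime_maintenance_plan(broker_nodes: List[str], instances: List[str]) -> List[str]:
    """Hypothetical helper: order the all-servers-at-once maintenance steps.

    broker_nodes: WCF broker nodes to take offline in HPC Cluster Manager.
    instances:    clustered instances of WCF broker nodes in Failover Cluster Manager.
    Returns text only; nothing is executed against either cluster.
    """
    # Shut down from the HPC layer inward: stop new jobs first, then stop the
    # clustered instances.
    steps = [f"HPC Cluster Manager: take {n} offline" for n in broker_nodes]
    steps += [f"Failover Cluster Manager: take {i} offline" for i in instances]
    steps.append("Perform maintenance on all servers")
    # Start up in the opposite order: clustered instances first, then let the
    # HPC layer accept jobs again.
    steps += [f"Failover Cluster Manager: bring {i} online" for i in instances]
    steps += [f"HPC Cluster Manager: bring {n} online" for n in broker_nodes]
    return steps
```

For example, with the hypothetical names downtime_maintenance_plan(["BN1", "BN2"], ["Broker1"]), the function returns seven steps, beginning with taking BN1 offline in HPC Cluster Manager and ending with bringing BN2 online.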

Procedures for performing maintenance on WCF broker nodes in a failover cluster

As outlined in the preceding lists, use the following procedures to perform maintenance on WCF broker nodes in a failover cluster.

Use HPC Cluster Manager to take a physical server offline or bring it online

After you use HPC Cluster Manager to take a physical server offline, the server will not accept additional jobs. When you bring the server back online, it will accept jobs again.

To use HPC Cluster Manager to take a physical server offline or bring it online

  1. In HPC Cluster Manager, in Node Management, navigate to a view that shows the WCF broker nodes that you want to perform maintenance on.

  2. In the views pane, right-click a node, and then click Take Offline or Bring Online.

  3. If you are taking a node offline, in the Take Offline dialog box, you can optionally select Force the node offline and shrink running jobs. If you do not select this check box, the node enters the Draining state: work that is already running on the node is allowed to finish, no new work starts on the node, and the node then goes offline.
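The difference between a forced and an unforced Take Offline can be summarized as a small state model. The Python sketch below is hypothetical and simplified: the class name, method names, and the single running-job counter are illustrative assumptions, not part of HPC Pack, and the real scheduler tracks jobs, tasks, and resources in more detail.

```python
from enum import Enum


class NodeState(Enum):
    ONLINE = "Online"
    DRAINING = "Draining"
    OFFLINE = "Offline"


class BrokerNodeModel:
    """Hypothetical, simplified model of the Take Offline behavior described above.

    It is not the HPC Pack API; it only tracks a count of running jobs.
    """

    def __init__(self, running_jobs: int) -> None:
        self.state = NodeState.ONLINE
        self.running_jobs = running_jobs

    def accepts_new_jobs(self) -> bool:
        # Only an online node accepts additional jobs.
        return self.state is NodeState.ONLINE

    def take_offline(self, force: bool = False) -> None:
        if force:
            # "Force the node offline and shrink running jobs": the node goes
            # offline at once and no longer runs the work it had.
            self.running_jobs = 0
            self.state = NodeState.OFFLINE
        elif self.running_jobs > 0:
            # Without the check box, running work is allowed to finish first.
            self.state = NodeState.DRAINING
        else:
            self.state = NodeState.OFFLINE

    def job_finished(self) -> None:
        if self.running_jobs == 0:
            raise RuntimeError("no running jobs to finish")
        self.running_jobs -= 1
        # A draining node goes offline once its last running job finishes.
        if self.state is NodeState.DRAINING and self.running_jobs == 0:
            self.state = NodeState.OFFLINE

    def bring_online(self) -> None:
        self.state = NodeState.ONLINE
```

In this model, a node with two running jobs stays in Draining after an unforced Take Offline until both jobs finish, whereas a forced Take Offline moves it straight to Offline.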

Use Failover Cluster Manager to control a clustered instance

In Failover Cluster Manager, you can control a clustered instance of a WCF broker node that is running in the failover cluster. You can move the clustered instance to a different server in the failover cluster, you can take the clustered instance offline, or you can bring the clustered instance online. Some user interface details may differ slightly in your deployment, depending on your version of Windows Server.

To use Failover Cluster Manager to control a clustered instance

  1. In Failover Cluster Manager, select or specify the cluster that you want.

  2. In the console tree, click the clustered instance of the WCF broker node and view its status in the center pane. Note the value of Owner Node, which shows the server that is currently running the clustered instance.

  3. In the console tree, right-click the clustered instance of the WCF broker node, and then select the appropriate command:

    • Move

    • Take offline

    • Bring online

  4. When prompted, confirm your choice.

Use Failover Cluster Manager to pause or resume a failover cluster node

You can pause a failover cluster node before performing maintenance on the node, and then resume the node after the maintenance is complete. Some user interface details may differ slightly in your deployment, depending on your version of Windows Server.

To use Failover Cluster Manager to pause or resume a failover cluster node

  1. In Failover Cluster Manager, expand the console tree under the cluster that you want, and then expand Nodes.

  2. Right-click the node that you want to pause or resume, and then click Pause or Resume.

Additional references

Overview of Microsoft HPC Pack and SOA in Failover Clusters