Understanding Node States, Health, and Operations

Applies To: Windows HPC Server 2008

Node state reflects a node’s deployment state, and whether or not an administrator wants the node to be available as a resource for cluster jobs. An administrator brings a node Online to indicate that the node should accept jobs or client requests.

Node health indicates whether or not there are any warnings or errors that the HPC services are aware of on that node. If the node has a node health value of Unreachable or Failed Provisioning, the node will not be able to accept jobs or client requests, even if the node state is Online.
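The rule above can be written as a small predicate. This is only an illustrative sketch: the function name and the plain-string representation of state and health values are assumptions made for this example, not part of any HPC API.

```python
# Health values that prevent a node from accepting jobs, per the paragraph above.
BLOCKING_HEALTH = {"Unreachable", "Failed Provisioning"}


def can_accept_jobs(node_state: str, node_health: str) -> bool:
    """Return True if the node is Online and its health does not block jobs."""
    return node_state == "Online" and node_health not in BLOCKING_HEALTH


print(can_accept_jobs("Online", "OK"))           # True
print(can_accept_jobs("Online", "Unreachable"))  # False
print(can_accept_jobs("Offline", "OK"))          # False
```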

During normal operations, compute nodes and Windows Communication Foundation (WCF) broker nodes have a node health value of OK, and a node state value of Online.

During normal operations, the head node has a node health value of OK, and a node state value of Offline. If the head node is also acting as a compute node or a WCF broker node, or if a head node is installed for high availability, then its normal node state value is Online.

Part of the process of monitoring and maintaining cluster health is finding deviations from the normal node state and health, and monitoring the state of cluster operations.
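As a rough illustration of checking for nodes that differ from the normal values described above, the following sketch compares each node with the state expected for its role. The Node class, the role labels, and the two flag fields are hypothetical and exist only for this example. A real cluster would get this data from the node List view or the HPC management tools.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str                        # hypothetical labels: "compute", "broker", "head"
    state: str
    health: str
    also_runs_jobs: bool = False     # head node that also acts as compute or broker node
    high_availability: bool = False  # head node installed for high availability


def expected_state(node: Node) -> str:
    """Normal node state for the node's role, as described above."""
    if node.role == "head" and not (node.also_runs_jobs or node.high_availability):
        return "Offline"
    return "Online"


def deviations(nodes):
    """Return (name, reason) pairs for nodes that differ from the normal values."""
    found = []
    for node in nodes:
        if node.health != "OK":
            found.append((node.name, f"health is {node.health}"))
        if node.state != expected_state(node):
            found.append(
                (node.name, f"state is {node.state}, expected {expected_state(node)}")
            )
    return found


nodes = [
    Node("HEAD01", "head", "Offline", "OK"),
    Node("CN001", "compute", "Online", "OK"),
    Node("CN002", "compute", "Offline", "Unreachable"),
]
for name, reason in deviations(nodes):
    print(name, reason)
```

In this example only CN002 is reported, once for its health value and once for its state.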

The sections in this topic describe the available values for:

  • Node states

  • Node health

  • Operation states

Node states

Node states reflect a node’s deployment state, and whether or not an administrator wants the node to be available as a resource for cluster jobs.

When the head node first detects a node on the network, the node appears in the Unknown state. When an administrator adds a node to the cluster by applying a node template, the node moves to the Provisioning state. When the node has successfully joined the cluster, it moves to the Offline state.

An administrator brings a node Online or takes a node Offline to indicate whether or not the node should accept and run cluster jobs. The HPC Job Scheduler Service will only try to start new jobs on nodes that are in the Online state. To make a node unavailable for new jobs, administrators can take the node Offline. Nodes must be in the Offline state to run some management actions, such as Reimage or Maintain.

You can use the node List view to display the state of each node and filter compute nodes by node state. For more information about the node List view, see Understanding Node List and Heat Map Views.

The following table describes node state values:

Node State Description

Online

This state indicates that the node should accept and run cluster jobs. The HPC Job Scheduler Service will only try to start new jobs on nodes that are in the Online state.

A node must be in the Online state and healthy to run jobs. If the node health is Unreachable or Failed Provisioning, jobs will not be able to start on that node.

The Online state is the normal operating node state for:

  • Compute nodes

  • WCF broker nodes

  • A head node that acts as a compute node or WCF broker node

  • A head node that is installed for high availability

Offline

This state allows a cluster administrator to run scripts, install software, and perform other tasks on the node. This is the default state of a compute or WCF broker node after a cluster administrator has approved the node for inclusion in the cluster. This is also the default state for a head node (unless it is installed for high availability).

If a node is taken offline while running jobs, it will first move through the Draining state. If an administrator chooses to force the node offline immediately, any running tasks will be canceled and requeued within their job.

Unknown

This state indicates that the node is not part of the cluster, or that a provisioning operation has failed on that node.

To join a node to the cluster, apply the Assign Node Template action to the node.

In a high availability cluster, after setup is run on the first head node, the second head node will be in the Unknown state until setup is run on that node. After setup, the second head node moves to the Online state.

Provisioning

This state indicates that the node is being configured as a compute node. The Assign Node Template, Reimage, and Maintain actions also put a node into the Provisioning state. After provisioning is complete, the node goes to the Offline state.

Starting

This state indicates that the node is transitioning from the Offline state to the Online state.

Draining

This state indicates that the compute node has been taken offline and is transitioning to the Offline state. The node completes currently running jobs before going to the Offline state. Draining nodes do not accept new jobs.

Removing

This state indicates that information about the node is being removed from the HPC Node Management Services database on the head node. Nothing is changed on the deleted node itself. The Delete action puts a node into this state.

If the node tries to rejoin the cluster, a new entry will be created for that node in the database, and the node will appear in the Unknown state.

Rejected

This state indicates that the node was rejected by a cluster administrator. A node in the Rejected state cannot join the cluster.
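The state changes described in the table above can be pictured as a small state machine. The sketch below is a simplified model for illustration only. The action names are made-up labels, not HPC commands, and it assumes that a node taken offline always passes through Draining unless it is forced offline. The Removing and Rejected states are left out because the text does not say which states they can be entered from.

```python
# (current state, action) -> next state, following the descriptions above.
# Action names are illustrative labels, not HPC cmdlet or API names.
TRANSITIONS = {
    ("Unknown", "assign_node_template"): "Provisioning",
    ("Provisioning", "provisioning_succeeded"): "Offline",
    ("Provisioning", "provisioning_failed"): "Unknown",
    ("Offline", "bring_online"): "Starting",
    ("Starting", "started"): "Online",
    ("Online", "take_offline"): "Draining",   # assumption: always drains first
    ("Online", "force_offline"): "Offline",   # running tasks are canceled and requeued
    ("Draining", "jobs_finished"): "Offline",
    ("Offline", "reimage"): "Provisioning",
    ("Offline", "maintain"): "Provisioning",
}


def next_state(state: str, action: str) -> str:
    """Return the state that follows `action`, or raise if it is not modeled."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"{action!r} is not modeled for a node in the {state} state"
        ) from None


# Walk a newly detected node through to the Online state.
state = "Unknown"
for action in ("assign_node_template", "provisioning_succeeded", "bring_online", "started"):
    state = next_state(state, action)
print(state)  # Online
```

Because Reimage and Maintain are only listed for the Offline state, calling next_state("Online", "reimage") raises an error, which mirrors the requirement that a node be Offline for those actions.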

Node health

Node health indicates whether or not there are any warnings or errors that the HPC services are aware of on that node. Node health is associated with icons that appear next to the node name in the node List view. When node health is OK, no icon appears next to the node name.

The following table lists the icons, and the alert types and health states that are indicated by the icons.

Icon      Alert type           Associated health state

[icon]    Error                Unreachable, Failed Provisioning

[icon]    Warning              Failed Diagnostics

[icon]    Pending Operation    Ongoing Operation

You can use the node List view to display the health of each compute node and filter nodes by node health. For more information about the node List view, see Understanding Node List and Heat Map Views.

The following table describes node health values:

Node Health Description

OK

This value indicates that the HPC services are not aware of any problems with the node.

Unreachable

This value indicates that the HPC Job Scheduler Service cannot contact the node.

The HPC Job Scheduler Service sends regular health probes to the HPC Node Manager Service on each node. If a compute node does not reply to the health probe, it has missed a heartbeat. If a node misses too many heartbeats, the HPC Job Scheduler Service flags the node as Unreachable. The following HPC Job Scheduler property settings apply to the health probes:

  • Heartbeat Interval: the frequency, in seconds, of the health probes (the default is 30 seconds)

  • Inactivity Count: the number of heartbeats a node can miss before it is flagged as Unreachable (the default is 3)

A node can become unreachable for many reasons, including:

  • Problems with network connectivity

  • The HPC Node Manager Service is not running

  • Authentication failure between the head node and the compute node

For troubleshooting information, see Node Health is Flagged as Unreachable.
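To make the heartbeat settings above concrete, here is a minimal sketch of how missed probes could be counted. It is not the HPC Job Scheduler Service implementation. The class and method names are hypothetical, and the sketch makes two assumptions that the text does not state: a reply resets the count (only consecutive misses matter), and the node is flagged as soon as the number of misses reaches the Inactivity Count.

```python
class HeartbeatTracker:
    """Counts missed health probes for one node (illustrative only)."""

    def __init__(self, heartbeat_interval: int = 30, inactivity_count: int = 3):
        self.heartbeat_interval = heartbeat_interval  # seconds between probes
        self.inactivity_count = inactivity_count      # misses tolerated
        self.missed = 0

    def record_probe(self, replied: bool) -> None:
        # Assumption: a reply resets the counter, so only consecutive misses count.
        self.missed = 0 if replied else self.missed + 1

    @property
    def unreachable(self) -> bool:
        # Assumption: the node is flagged once misses reach the inactivity count.
        return self.missed >= self.inactivity_count

    @property
    def seconds_to_flag(self) -> int:
        # Approximate length of silence before the node would be flagged.
        return self.heartbeat_interval * self.inactivity_count


tracker = HeartbeatTracker()  # defaults from the text: 30 seconds, 3 heartbeats
for replied in (True, False, False, False):
    tracker.record_probe(replied)
print(tracker.unreachable)      # True
print(tracker.seconds_to_flag)  # 90
```

Under these assumptions, the default settings mean that a silent node is flagged after roughly 90 seconds.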

Ongoing Operation

This value indicates that the node is performing an operation that a cluster administrator initiated, such as:

  • Maintain

  • Run Diagnostics

  • Apply Node Template

  • Take Offline

If the operation can be performed while the node is Online, then the HPC Job Scheduler Service can continue to run jobs on the node.

Diagnostics Failed

This value indicates that a cluster administrator ran diagnostic tests on the node, and one or more tests returned a result of Failure or Failed to Run.

The node leaves this health state if:

Provisioning Failed

This value indicates that the most recent provisioning operation failed. The node will also be in the Unknown state at this point.

Operation states

For information about how to view the operations log, see Read the Operations Log.

The following table describes the operation state values:

Operation State Description

Archived

This state indicates that the operation is more than 24 hours old or the diagnostics test has been cleared. When an operation is archived, it is removed from other status reports.

Committed

This state indicates that the operation completed successfully.

Executing

This state indicates that the operation is in progress.

Failed

This state indicates that the operation failed to execute, is being reverted, or failed to revert.

Reverted

This state indicates that the operation reverted after failure or cancellation.
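The condition for the Archived state in the table above can be sketched as a simple check. The function and parameter names are hypothetical, and the sketch assumes that the age of an operation is measured from a single timestamp recorded for it, which the text does not specify.

```python
from datetime import datetime, timedelta

ARCHIVE_AGE = timedelta(hours=24)


def is_archived(timestamp: datetime, now: datetime, cleared: bool = False) -> bool:
    """Return True if the operation is more than 24 hours old or its
    diagnostic test has been cleared."""
    return cleared or (now - timestamp) > ARCHIVE_AGE


now = datetime(2009, 6, 2, 12, 0)
print(is_archived(datetime(2009, 6, 1, 11, 0), now))                # True: 25 hours old
print(is_archived(datetime(2009, 6, 2, 9, 0), now))                 # False: 3 hours old
print(is_archived(datetime(2009, 6, 2, 9, 0), now, cleared=True))   # True: test was cleared
```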
