Understanding Error Handling

06/08/2020

This topic describes the error handling settings for the HPC Job Scheduler Service. For information about how to change the configuration options, see Configure the HPC Job Scheduler Service.

This topic includes the following sections:

Heartbeat options
Retry jobs and tasks
Task cancel grace period
Node release task timeout
Excluded nodes limit

Heartbeat options

The HPC Node Manager Service on each node sends regular health reports, called heartbeats, to the HPC Job Scheduler Service. Heartbeats verify that a node is available. If a node misses too many heartbeats, the HPC Job Scheduler Service flags the node as unreachable.

The following cluster property settings apply to heartbeats:

Heartbeat Interval: the time, in seconds, between heartbeats. The default is 30 seconds.
Missed Heartbeats (Inactivity Count): the number of heartbeats a node can miss before it is considered unreachable. The default is 3.

Note

Starting with HPC Pack 2012 with Service Pack 1 (SP1), separate settings are provided to configure the inactivity count for on-premises (local) nodes and Windows Azure nodes. Because of possible network latency when reaching Windows Azure nodes, the default inactivity count for Windows Azure nodes is 10.
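
The two heartbeat settings together bound how long a silent node goes unnoticed. The following Python sketch illustrates that arithmetic. It is only a model, not part of any HPC Pack API: the function name is hypothetical, and it assumes that a node is flagged once it has missed the configured number of consecutive heartbeats, so actual timing on a cluster may differ.

```python
def seconds_until_unreachable(heartbeat_interval_s: int, inactivity_count: int) -> int:
    """Approximate time a node can stay silent before it is flagged as unreachable.

    Assumption: the node is flagged after `inactivity_count` consecutive
    missed heartbeats, each `heartbeat_interval_s` seconds apart.
    """
    if heartbeat_interval_s <= 0 or inactivity_count <= 0:
        raise ValueError("interval and count must be positive")
    return heartbeat_interval_s * inactivity_count


# Defaults from this topic: 30-second interval, count of 3 (on-premises) or 10 (Windows Azure).
on_premises = seconds_until_unreachable(30, 3)
azure = seconds_until_unreachable(30, 10)
print(on_premises, azure)  # prints "90 300"
```

Under this model, halving the Heartbeat Interval halves the detection time but doubles the number of heartbeats on the network.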

Additional considerations

A node can miss a heartbeat for many reasons, including:
- Problems with network connectivity
- The HPC Node Manager Service is not running on the compute node
- Authentication failure between the head node and the compute node

If you send heartbeats more often (set a shorter Heartbeat Interval), you can detect failures more quickly, but you also increase network traffic. Increased network traffic can decrease cluster performance.
When a node is flagged as unreachable, jobs that are running on that node might fail. If you know that your network has frequent intermittent failures, you might want to increase the Inactivity Count to avoid unnecessary job failures. See also Retry jobs and tasks in this topic.

Retry jobs and tasks

The HPC Job Scheduler Service automatically retries jobs and tasks that fail due to a cluster problem, such as a node becoming unreachable, or that are stopped by preemption policy. After a specified number of unsuccessful attempts, the HPC Job Scheduler Service marks the job or task as Failed.

The following cluster property settings determine the number of times to retry jobs and tasks:

Job retry: the number of times to automatically retry a job. The default is 3.
Task retry: the number of times to automatically retry a task. The default is 3.

Additional considerations

Tasks are not automatically retried if the task property Rerunnable is set to false.
Jobs are not automatically retried if the job property Fail on task failure is set to true.
For more information, see Understanding Job and Task Properties.
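
The retry rules for tasks can be sketched as a small decision function. This Python example is a simplified model for illustration, not the scheduler's implementation: the `TaskState` class, the function name, and the returned state strings are hypothetical, and the sketch assumes that the retry limit counts retries after the first attempt.

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    rerunnable: bool = True   # models the Rerunnable task property
    retries_used: int = 0     # automatic retries consumed so far


def next_state_after_cluster_failure(task: TaskState, task_retry_limit: int = 3) -> str:
    """Decide what happens to a task that failed because of a cluster problem."""
    if not task.rerunnable:
        return "Failed"       # tasks that are not rerunnable are never retried
    if task.retries_used >= task_retry_limit:
        return "Failed"       # retry budget exhausted
    task.retries_used += 1
    return "Queued"           # hypothetical label for "will be retried"


task = TaskState()
outcomes = [next_state_after_cluster_failure(task) for _ in range(4)]
print(outcomes)  # prints "['Queued', 'Queued', 'Queued', 'Failed']"
```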

Task cancel grace period

When a running task is stopped during execution, you can allow time for the application to save state information, write a log message, or create or delete files, or for a service to finish computing its current service call. You can configure the amount of time, in seconds, to allow applications to exit gracefully by setting the Task Cancel Grace Period cluster property. The default Task Cancel Grace Period is 15 seconds.

Important

In Windows HPC Server 2008 R2, the HPC Node Manager Service stops a running task by sending a CTRL_BREAK signal to the application. To use the grace period, the application must process the CTRL_BREAK event. If the application does not process the event, the task exits immediately. For a service to use the grace period, it must process the ServiceContext.OnExiting event.

Additional considerations

A cluster administrator or a job owner can force cancel a running task. When a task is force canceled, the task and its sub-tasks skip the grace period and are stopped immediately. For more information, see Force Cancel a Job or Task.
You can adjust the grace period according to how the applications that run on your cluster handle the CTRL_BREAK signal. For example, if applications try to copy large amounts of data after they receive the signal, you can increase the grace period accordingly.
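
As an illustration of an application that processes the CTRL_BREAK event, the following Python sketch sets a flag when the signal arrives, stops at the next safe point, and saves its state. It is a minimal example under stated assumptions: on Windows, Python exposes the CTRL_BREAK event as `signal.SIGBREAK`; the sketch falls back to `SIGTERM` on other platforms so that it runs anywhere. The names `run_task` and `save_state` are hypothetical, and the cleanup work must still finish within the Task Cancel Grace Period.

```python
import signal
import threading

# Set when the task has been asked to stop.
stop_requested = threading.Event()

# SIGBREAK exists only on Windows; fall back to SIGTERM elsewhere (assumption for portability).
STOP_SIGNAL = getattr(signal, "SIGBREAK", signal.SIGTERM)


def _on_stop(signum, frame):
    # Keep the handler tiny: record the request and let the main loop react.
    stop_requested.set()


def install_handler():
    """Register the stop handler; call once from the main thread at startup."""
    signal.signal(STOP_SIGNAL, _on_stop)


def run_task(work_items, save_state):
    """Process items until finished or until a stop is requested, then save state."""
    done = []
    for item in work_items:
        if stop_requested.is_set():
            break             # stop at a safe point between work items
        done.append(item)     # placeholder for the real computation
    save_state(done)          # must complete within the grace period
    return done
```

Checking the flag between work items keeps the handler itself short, which matters because the time available after the signal is limited to the grace period.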

Node release task timeout

Job owners can add Node Release tasks to run a command or script on each node as it is released from the job. Node Release tasks can be used to return allocated nodes to their pre-job state or to collect data and log files.

The Node Release Task Timeout determines the maximum run time (in seconds) for Node Release tasks. The default value is 10 seconds.

Additional considerations

If a job has a maximum run time and a Node Release task, the scheduler cancels the other tasks in the job before the job's run time expires, at the job run time minus the Node Release task run time. This allows the Node Release task to run within the allocated time for the job.
Node Release tasks run even if a job is canceled. A cluster administrator or the job owner can force cancel a job to skip the Node Release task. For more information, see Force Cancel a Job or Task.
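
The run-time arithmetic for jobs that have both a maximum run time and a Node Release task can be sketched as follows. This Python example is illustrative only: the function name is hypothetical, and it assumes that the time reserved for the Node Release task equals the Node Release Task Timeout.

```python
def other_tasks_cancel_point_s(job_max_runtime_s: int, node_release_timeout_s: int = 10) -> int:
    """Seconds into the job at which the other tasks would be canceled.

    Assumption: the scheduler reserves the full Node Release Task Timeout
    at the end of the job's maximum run time.
    """
    if job_max_runtime_s < 0 or node_release_timeout_s < 0:
        raise ValueError("times must be non-negative")
    # Never return a negative offset if the timeout exceeds the job run time.
    return max(0, job_max_runtime_s - node_release_timeout_s)


# A one-hour job with the default 10-second timeout.
print(other_tasks_cancel_point_s(3600))  # prints "3590"
```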

Excluded nodes limit

The Excluded nodes limit specifies the maximum number of nodes that can be listed in the Excluded Nodes job property. The Excluded Nodes job property specifies a list of nodes that the job scheduler should stop using or refrain from using for a particular job.

If a job owner or a cluster administrator notices that tasks in a job consistently fail on a particular node, they can add that node to the Excluded Nodes job property. When the Excluded nodes limit is reached, attempts to add more nodes to the list fail. For more information, see Set and Clear Excluded Nodes for Jobs.

For SOA jobs, the broker node automatically updates and maintains the list of excluded nodes according to the EndPointNotFoundRetryPeriod setting (in the service configuration file). This setting specifies how long the service host should retry loading the service and how long the broker should wait for a connection. If this time elapses, the broker adds the node (service host) to the Excluded Nodes list. When the Excluded nodes limit is exceeded, the broker node cancels the SOA job.

Note

If you change the Excluded nodes limit for the cluster, the new limit applies only to excluded node lists that are modified after the new limit is set. That is, the number of nodes listed in the Excluded Nodes job property is validated against the cluster-wide limit only when the job is created or when the property is modified.
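
The validation behavior for the Excluded nodes limit can be modeled with a short Python sketch. This is a hypothetical model, not the scheduler's code: the exception class and function name are invented for illustration. It shows that the limit is checked only when the list is modified, so an existing list is not re-validated when the limit changes.

```python
class ExcludedNodesLimitError(Exception):
    """Raised when a modification would push the list past the limit (hypothetical)."""


def add_excluded_nodes(current, to_add, limit):
    """Return a new excluded-nodes list, validating against the limit at modification time."""
    merged = list(current)
    for node in to_add:
        if node not in merged:    # ignore nodes that are already excluded
            merged.append(node)
    if len(merged) > limit:
        raise ExcludedNodesLimitError(
            f"{len(merged)} excluded nodes exceeds the limit of {limit}"
        )
    return merged


excluded = add_excluded_nodes([], ["NODE01", "NODE02", "NODE03"], limit=3)

# Lowering the limit later does not touch the existing list;
# only the next modification is checked against the new limit.
try:
    add_excluded_nodes(excluded, ["NODE04"], limit=2)
except ExcludedNodesLimitError as err:
    print("rejected:", err)
```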
