Understanding Job and Task States
Updated: January 13, 2014
Applies To: Microsoft HPC Pack 2008 R2, Microsoft HPC Pack 2012, Microsoft HPC Pack 2012 R2
In Microsoft® HPC Pack, jobs and tasks have almost identical life cycle states. The main life cycle states are Configuring, Queued, Running, Finished, Failed, and Canceled. Jobs and tasks also move through brief transitional states. The following table summarizes all life cycle states.
Job and task states
Configuring: The job or task is in the system, but has not been submitted to the queue.
Submitted: The job or task has been submitted and is awaiting validation before it can be queued.
Validating: The HPC Job Scheduler Service is validating the job or task. During validation, the HPC Job Scheduler Service confirms permissions, applies default settings for any properties that the job owner did not specify, and validates each property against constraints. Default settings and constraints are defined by the job template. For more information about job templates, see Understanding Job Templates. The HPC Job Scheduler Service also confirms that job properties encompass all task properties (for example, no task has a run time that is longer than the run time of the job).
During validation, the job might also pass through a custom submission filter application that is defined by the cluster administrator.
If the job passes validation, it moves to the Queued state. If the job does not pass validation, an error message is displayed and the job moves to the Failed state.
Queued: The job or task passed validation, and is waiting to be scheduled and activated (run).
When a running job, a Basic task, or a Parametric Sweep sub-task is preempted by the HPC Job Scheduler Service, it moves back to the Queued state (unless the task is not rerunnable, in which case it is marked as Failed).
Dispatching: This state applies only to tasks. The HPC Job Scheduler Service has allocated resources to the task and is contacting the allocated nodes to start running the task. When the task starts, it moves to the Running state.
Running: The job or task is running on one or more nodes.
Finishing: The job or task completed, and job or task clean-up is in progress.
Finished: The job or task completed successfully.
Failed: The job or task failed to complete, stopped running, or returned an exit code that indicates failure (by default, any non-zero exit code).
Additionally, a running task is marked as Failed when it is canceled, or when it is preempted and is not rerunnable.
If a job or task fails to start because of a cluster failure, the job or task is automatically retried a specified number of times before it is marked as Failed.
For more information, see Troubleshooting Jobs.
Canceling: The job or task was canceled and clean-up is in progress.
Canceled: The job was canceled by the job owner, a cluster administrator, or the HPC Job Scheduler Service. For example, the HPC Job Scheduler Service can cancel a job if it exceeds its run time or if it is preempted.
The task was canceled by the job owner or a cluster administrator before it started running. If a running task is canceled, the task is marked as Failed.
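The task rules described above can be sketched as a small state model. This is an illustrative Python sketch only, not the HPC Pack API: the function names and parameters are hypothetical, the model skips the brief clean-up states (Finishing and Canceling), and it assumes that a task already in a final state ignores a cancel request.

```python
from enum import Enum


class State(Enum):
    """Life cycle states, using the HPC Pack job and task state names."""
    CONFIGURING = "Configuring"
    SUBMITTED = "Submitted"
    VALIDATING = "Validating"
    QUEUED = "Queued"
    DISPATCHING = "Dispatching"  # applies only to tasks
    RUNNING = "Running"
    FINISHING = "Finishing"
    FINISHED = "Finished"
    FAILED = "Failed"
    CANCELING = "Canceling"
    CANCELED = "Canceled"


# States a task can be in before it starts running.
NOT_YET_RUNNING = {
    State.CONFIGURING,
    State.SUBMITTED,
    State.VALIDATING,
    State.QUEUED,
    State.DISPATCHING,
}


def validate_run_times(job_run_time, task_run_times):
    """One validation check: no task run time may exceed the job run time.

    Returns QUEUED when the check passes and FAILED when it does not.
    (Hypothetical helper; real validation also checks permissions,
    defaults, and job template constraints.)
    """
    if all(t <= job_run_time for t in task_run_times):
        return State.QUEUED
    return State.FAILED


def on_task_exit(exit_code):
    """By default, any non-zero exit code indicates failure."""
    return State.FINISHED if exit_code == 0 else State.FAILED


def on_task_preempted(rerunnable):
    """A preempted task is requeued unless it is not rerunnable."""
    return State.QUEUED if rerunnable else State.FAILED


def on_task_canceled(current):
    """Canceling a running task marks it Failed; canceling earlier gives Canceled."""
    if current is State.RUNNING:
        return State.FAILED
    if current in NOT_YET_RUNNING:
        return State.CANCELED
    return current  # assumption: final states are left unchanged
```

For example, `on_task_canceled(State.QUEUED)` returns `State.CANCELED`, while `on_task_canceled(State.RUNNING)` returns `State.FAILED`, which mirrors the difference between canceling a task before and after it starts running.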