Investigating Job Failures

06/28/2010

Applies To: Windows HPC Server 2008

Jobs and tasks can fail for a number of reasons. The steps below provide a starting point for investigating failures.

Troubleshooting job failures
Troubleshooting task failures

For more information about using HPC Job Manager, see Overview of HPC Job Manager.

Troubleshooting job failures

To review job error messages

In the Navigation Pane, under My Jobs, click Failed.
Double-click a job (or right-click a job, and then click View Job) to see the job details.
In the View Job dialog box, click Results and Statistics.
Review the Error message field for information about why the job failed.

Common causes of job failure

One or more tasks in the job have failed. This is the most common cause of job failure, and it indicates that one or more tasks could not be run or did not complete successfully. To investigate this type of job failure, view the task-level error messages: in the View Job dialog box, click View Failed Tasks. For more information, see Troubleshooting task failures below.
A node assigned to the job could not be contacted. Jobs that fail because a node falls out of contact are automatically retried a limited number of times, but eventually fail if the problem continues. If you receive this error message, contact your cluster administrator.
The job’s run time expired. The Job Scheduler service cancels jobs that reach the end of their run time. If possible, modify the run time for your job, and then requeue your job. For more information, see Modify a Job and Requeue a Job or Task.
The job could not be started on one of its allocated nodes. The most common cause of this type of failure is that an invalid user name or password is associated with the job. You can use the job modify command to update the credentials attached to your job, and then requeue the job. For more information about command-line commands, see the Windows HPC Server 2008 Command Reference (https://go.microsoft.com/fwlink/?LinkID=120724).

Another common cause of this type of failure is a logon failure on the compute node. For more information, see Job Failed to Start because of Logon Failure.
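
The modify-and-requeue steps described above can also be scripted. The following Python sketch only builds the command lines; it does not run them. The helper name build_requeue_commands is hypothetical, and the /runtime: and /user: parameters of job modify are assumptions based on the Command Reference, so verify them for your version before relying on them.

```python
from typing import List, Optional


def build_requeue_commands(job_id: int,
                           runtime_minutes: Optional[int] = None,
                           user: Optional[str] = None) -> List[List[str]]:
    """Return the command lines that adjust a failed job and then requeue it.

    Hypothetical helper. The /runtime: and /user: parameters are assumed
    from the HPC command reference; confirm them on your cluster.
    """
    if job_id <= 0:
        raise ValueError("job_id must be a positive integer")

    modify = ["job", "modify", str(job_id)]
    if runtime_minutes is not None:
        if runtime_minutes <= 0:
            raise ValueError("runtime_minutes must be positive")
        # Assumed format: a plain number is interpreted as minutes.
        modify.append(f"/runtime:{runtime_minutes}")
    if user is not None:
        # The password is deliberately not placed on the command line.
        modify.append(f"/user:{user}")

    commands: List[List[str]] = []
    if len(modify) > 3:
        # Only emit "job modify" when there is something to change.
        commands.append(modify)
    commands.append(["job", "requeue", str(job_id)])
    return commands
```

On a computer where the HPC client utilities are installed, each returned list could be passed to subprocess.run in order, stopping at the first command that returns a non-zero exit code.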

Troubleshooting task failures

To review task error messages

In the Navigation Pane, click My Jobs.
Click a job. The Detail Pane displays information about the tasks.
In the Detail Pane, click the Task tab, and then double-click a task (or right-click the task, and then click View Task) to see the task details.
In the Task Properties dialog box, click the Results tab.
Verify that the correct task is selected, and then review the Error message field for information about why the task failed.

Common causes of task failure

The task failed during execution. This type of error occurs in the application itself. Check the output and error files for details. If you did not specify standard output and error files for the task, review the Output and Error fields in the Task Properties dialog box.

Note

This message indicates that the task’s command line returned a non-zero exit code, which the Job Scheduler service interprets as a failure. However, some applications might return a non-zero exit code even when they succeed. For more information, see Tasks That Complete Successfully Are Marked As Failed.
The task’s run time expired. The Job Scheduler service cancels tasks that reach the end of their run time. You can create a new copy of your task with a longer run time, and then requeue the job.
A file location required by the task could not be accessed. Tasks frequently fail because a required file location is inaccessible, including the locations of the standard input, output, and error files and the working directory. Check for the following possible causes:
- A permissions issue is preventing the task from accessing the specified file.
- A networking issue is preventing access to the file from the specified compute node.
- The working directory, input file, or output file location does not exist.
A node assigned to the task could not be contacted. Tasks that fail because a node falls out of contact are automatically retried a limited number of times, but eventually fail if the problem continues. If you receive this error message, contact your cluster administrator.
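
Two of the causes above, inaccessible file locations and applications that return a non-zero exit code on success, can be narrowed down with a small wrapper script used as the task's command line. The following Python sketch is one possible approach, not part of Windows HPC Server 2008: the function names and the success_codes parameter are hypothetical, and it assumes that Python is available on the compute nodes.

```python
import os
import subprocess
import sys
from typing import Collection, List, Sequence


def inaccessible_locations(read_paths: Sequence[str],
                           write_dirs: Sequence[str]) -> List[str]:
    """Return one message per location that the current account cannot use."""
    problems = []
    for path in read_paths:
        if not os.path.exists(path):
            problems.append(f"{path}: does not exist")
        elif not os.access(path, os.R_OK):
            problems.append(f"{path}: not readable")
    for directory in write_dirs:
        if not os.path.isdir(directory):
            problems.append(f"{directory}: does not exist")
        elif not os.access(directory, os.W_OK):
            problems.append(f"{directory}: not writable")
    return problems


def run_task(command: Sequence[str],
             success_codes: Collection[int] = frozenset({0})) -> int:
    """Run the application and normalize its exit code.

    Returns 0 when the application's exit code is listed in success_codes
    (hypothetical parameter); otherwise returns the original exit code so
    that the scheduler still sees a failure.
    """
    result = subprocess.run(list(command))
    return 0 if result.returncode in success_codes else result.returncode
```

A wrapper could call inaccessible_locations first, print any messages to standard error, and exit with a non-zero code before starting the application; otherwise it would end with sys.exit(run_task(...)). Note that os.access only approximates access checks on Windows because it does not evaluate access control lists, so a location that passes this check can still fail when the task opens it.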