Troubleshooting Jobs

Applies To: Windows Compute Cluster Server 2003

This section addresses problems you may have in creating, submitting, and completing parallel computing jobs in Microsoft Windows Compute Cluster Server 2003. For each problem, this section provides likely causes and recommended solutions.

What problem are you having?

Problem: My job has been at the top of the queue for an hour but it does not activate, and other jobs keep jumping the queue and running ahead of it.

Cause: Your job probably requires resources that running jobs are using. In the meantime, smaller jobs behind your job in the queue are moving into the backfill window of a running job.

Solution: Try modifying your job to require fewer or less specific resources. Also, check whether a running job is holding its reserved nodes until the end of a long or Infinite run time.
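
For example, the following command requests a flexible range of processors and a finite run time, both of which make a job easier to schedule and to backfill. This is a minimal sketch: the application name is hypothetical, and the /runtime value is illustrative (the format is assumed here to be days:hours:minutes; see the job submit reference for the exact syntax).

job submit /numprocessors:2-8 /runtime:0:4:0 MyApp.exe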

For more information, see "Backfilling" in Role of the Job Scheduler.

Problem: My MPI application appears to be running on only one processor.

Cause: You may not be running it within mpiexec. Many MPI applications will run as stand-alone applications, but only as a single process. Or, you may be using mpiexec without specifying the minimum number of processors, which defaults to one. To spawn multiple processes, mpiexec must always be run with the job property /numprocessors: set to two or more (for example, /numprocessors:4).

Solution: Ensure that you have included mpiexec in the task command line and that you have specified more than one processor.
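
A minimal sketch of such a task command line, assuming a hypothetical application MyApp.exe and submitting the task directly as a single-task job:

job submit /numprocessors:4 mpiexec MyApp.exe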

Important

Do not use the mpiexec -n option, with or without using /numprocessors:. This will always force all processes to run on a single processor.

For more information, see Add MPI Tasks.

Problem: When I submit a job with an interactive program, the job fails.

Cause: The Job Scheduler does not by itself support interactive programs.

Solutions:

  1. Run the task in batch mode (using input and output files, as in the sketch after this list).

    -- OR --

  2. Submit the job to the Job Scheduler from within the application, using the job submit <command> [arguments] format. For help with specific applications or job types, consult the software provider.
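
A minimal sketch of the batch-mode approach, assuming a hypothetical application MyApp.exe and input and output files on a hypothetical share \\HeadNode\Data:

job submit /stdin:\\HeadNode\Data\input.txt /stdout:\\HeadNode\Data\output.txt MyApp.exe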

For more information, see Integrating with Interactive Programs.

Problem: When I create and run a job using Add Parametric Sweep, my output files end up scattered across different compute nodes.

Cause: You probably have the Working Directory property pointed to its default location, which is your home directory on the compute node actually performing the task.

Solution: To write all output files to a single directory on a single node, you must specify the UNC path to that directory using the Standard Output property or the Working Directory property.
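
For example, instead of accepting the default local working directory, you might set the Standard Output property of the sweep to a UNC path such as the following (a minimal sketch; the share name is hypothetical, and the asterisk stands for the sweep index that Add Parametric Sweep substitutes into each task):

\\HeadNode\Results\output*.txt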

For more information, see Store and Access Data Files.

Problem: I have an MPI application that I run in a job. Sometimes it finishes and sometimes it fails, saying it cannot find the file. I am submitting exactly the same job in both cases.

Cause: You probably have only one copy of the application, and it is on the head node, which you have also configured as a compute node. This works if the head node happens to be the compute node designated to run the master process. If the head node has more memory than any other compute node and is always available, it will always be the designated node. Otherwise, a different compute node may be designated. That node will look for the application in its own working directory, and when it fails to find it there, the job fails.

Solution:

  1. In your task command line, specify the MPI command as a UNC path to its location on the head node (see the sketch after this list).

    -or-

  2. Install a copy of the application on each compute node that will use it. This is recommended if the application is large or if the time required to copy it across the network at each run is otherwise a significant performance factor.
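
A minimal sketch of the first approach, assuming the application is shared from the head node as the hypothetical UNC path \\HeadNode\Apps:

job submit /numprocessors:4 mpiexec \\HeadNode\Apps\MyApp.exe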

For more information, see Install and License Parallel Applications.

Problem: When I specify long paths to my standard input and standard output files in the Job Manager Add Task screen, they work fine, but when I use the same paths in the CLI, the job fails.

Cause: Your paths may contain folder or file names with spaces, such as Project1 Output. The Job Manager can evaluate these properly with spaces included, but the CLI cannot.

Solution: Enclose such paths in double quotes. For example:

job submit /numprocessors:4 /stdout:"C:\Project1 Datafiles\Output" Myapp.exe

Note

It is usually possible to avoid spaces as follows:

  1. Do not create folders or files with spaces in their names.

  2. Where applicable, specify files using paths relative to your working directory. (By default, this is %USERPROFILE%, which is C:\Documents and Settings\<your_alias>.)

  3. Put the path containing spaces into an environment variable and use the variable in its place, as follows: %<env_var>%. (This is the same method used for the default working directory path shown above.)

  4. For files on remote nodes, set up your UNC paths to bypass any folder names with spaces.

Problem: The Job Manager is supposed to understand file paths containing spaces, but when I enter a command path containing spaces, the job fails.

Cause: All task command lines end up being executed by the command shell Cmd.exe /C, whether the job is submitted through the Job Manager or the CLI. Cmd.exe /C cannot automatically resolve directory names with spaces.

Solution: In a task command line, always place paths with spaces inside double quotes.
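
For example, a task command whose path contains spaces might be written as follows (a minimal sketch; the installation path and application name are hypothetical):

"C:\Program Files\MyApp\MyApp.exe" input.dat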

Note

For ways to avoid spaces in paths altogether, see the Note under the previous problem.

Problem: I forgot to specify more than one processor for my job, so all of my tasks ran on the same processor. This probably made it much slower than it should have been. Why wasn't I prevented from doing this?

Cause: The tasks appear to be using one processor, but they are actually using all the processors on the node (the job always reserves entire nodes, enough to meet the number of processors requested). The job will run faster if it reserves more nodes based on your having requested more processors, but this can also delay it while it waits for the additional nodes. For this reason, we always give you the option to reduce delays by requesting fewer nodes.

Solution: This behavior is by design. If you want to make optimal use of the cluster's resources while also avoiding long delays, it is best to request both a minimum and a maximum number of processors. This will tend to give you the fastest throughput available at the time your job reaches the top of the queue. A low minimum also gives the job a better chance to run ahead of its place in the queue as a backfill job.
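
A minimal sketch, assuming that the /numprocessors option accepts a minimum-maximum range (here, 4 to 16) and a hypothetical MPI application MyApp.exe:

job submit /numprocessors:4-16 mpiexec MyApp.exe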

See Also

Concepts

Best Practices for Using the Compute Cluster