Best Practices for Using the Compute Cluster

08/16/2010

Applies To: Windows Compute Cluster Server 2003

Set realistic run times

Instead of using the default Infinite run time, set a run time that is a reasonable estimate of how long your job will actually take to complete. A job with a finite run time has a better chance of moving up the queue into the backfill window of a job that is about to run, because the scheduler can tell whether it will finish within that window.
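The effect of a finite run time on backfilling can be illustrated with a minimal Python sketch. This is a simplified model, not the Job Scheduler's actual algorithm; the function name `fits_backfill` and the 45-minute window are hypothetical values chosen for illustration.

```python
import math

# Stands in for the default Infinite run time (an assumption of this sketch).
INFINITE = math.inf


def fits_backfill(run_time_minutes, window_minutes):
    """Return True if a job's declared run time fits inside the idle window.

    A job left at Infinite can never be shown to finish before the window
    closes, so in this model it is never eligible for backfill.
    """
    return run_time_minutes <= window_minutes


# Hypothetical window: minutes of idle time before a larger job is due to start.
window = 45

print(fits_backfill(30, window))        # job with a 30-minute estimate: True
print(fits_backfill(INFINITE, window))  # job left at the default: False
```

In this model, only the job with a realistic estimate can be moved ahead in the queue.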

Set realistic minimum and maximum number of processors

If there is an ideal number of processors on which to run your job, specify this number as a maximum, not a minimum. Then set as the minimum the lowest acceptable number of processors. This will ensure that your job does not have to wait longer than necessary in the queue.
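As a rough illustration of why the ideal count belongs in the maximum, the following Python sketch models a scheduler that starts a job as soon as the minimum is available and grants up to the maximum. The function `processors_granted` and the numbers used are hypothetical, and the real scheduling policy may differ.

```python
def processors_granted(minimum, maximum, idle):
    """Return the processor count a job would start with, or None if it must wait.

    Simplified model: the job starts once at least `minimum` processors are
    idle, and it receives no more than `maximum`.
    """
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    if idle < minimum:
        return None  # not enough idle processors yet, so the job stays queued
    return min(maximum, idle)


idle_now = 6  # hypothetical number of idle processors in the cluster

# Ideal count (16) used as the minimum: the job waits.
print(processors_granted(16, 16, idle_now))  # None

# Ideal count as the maximum, lowest acceptable count (4) as the minimum:
# the job starts now on the 6 idle processors.
print(processors_granted(4, 16, idle_now))   # 6
```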

Reserve specific nodes only when necessary

Reserve specific nodes for your job only if your job requires the special resources of those nodes. Waiting in the queue for specific nodes to become available can be a lengthy process.

Use the nonexclusive option for jobs if possible

By default, a job has exclusive use of a node it reserves. This means that no other jobs may use idle processors on the node, including other jobs owned by you. By changing the Exclusive property to false, you will not only be sharing unused processors with jobs that can make use of them, but you will be able to use idle processors on nodes belonging to other jobs.

Do not use the exclusive option for a task unless necessary

By default, a task has nonexclusive use of the node reserved by the job. Giving a task exclusive use of a node not only locks other jobs out of the node, it locks other tasks in the same job out of the node. This means the node will run your tasks one at a time, turning a parallel job into a serial one. The exclusive option is primarily for administrative operations, like starting or rebooting nodes.

Take advantage of the "Save as Template" option

Use the Save as Template option to save a job before you submit it. If job submission fails for any reason, there is no way to recover a job that has not first been saved. A job that is Finished, Canceled, or Failed can still be selected in the queue and saved as a template, but it remains in the queue for only a limited time.

Organize data files

By default, Job Scheduler looks for standard input files and writes standard output and error files to the working directory of the compute node or nodes on which the executable is run. To use the cluster effectively, create file shares that establish central and predictable locations for your data files. For more information, see Store and Access Data Files.
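One way to keep file locations predictable is to derive every input, output, and error path from a single share root. The Python sketch below shows the idea; the share name `\\headnode\data`, the job folder, and the file names are all hypothetical, and `ntpath` is used only so that the Windows-style paths are built the same way on any platform.

```python
import ntpath


def job_io_paths(share_root, job_name):
    """Build per-job working directory and standard I/O paths on a file share."""
    job_dir = ntpath.join(share_root, job_name)
    return {
        "workdir": job_dir,
        "stdin": ntpath.join(job_dir, "input.txt"),
        "stdout": ntpath.join(job_dir, "output.txt"),
        "stderr": ntpath.join(job_dir, "error.txt"),
    }


# Hypothetical share and job name, for illustration only.
paths = job_io_paths(r"\\headnode\data", "run01")
for name, path in paths.items():
    print(name, path)
```

Because every path is derived from the same root, each node resolves the same files regardless of its local working directory.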

Make your executables accessible to a cluster

In your cluster, many compute nodes will run the same program simultaneously, and each of these nodes needs to be able to locate that executable. To avoid "File not found" errors, do one of the following, depending on where the executable is installed:

If your executable is installed on each compute node, place the executable in the search path of each node using the operating system's Path environment variable. Then invoke it by its filename alone.
If the executable is installed in only one place, specify it on the command line by a UNC path that points directly to the executable. (The Path search path does not work with file shares.)
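The two cases above can be sketched as a small Python helper that decides how to name the executable on the command line. The helper names and the example paths are hypothetical; the sketch only checks the form of the path and does not verify that the file exists or that Path is set correctly on the nodes.

```python
import ntpath


def is_unc_path(path):
    """Return True if the path begins with a \\\\server\\share prefix."""
    drive, _ = ntpath.splitdrive(path)
    return drive.startswith("\\\\")


def command_for(executable, installed_on_every_node):
    """Return the name to use for the executable on the job's command line."""
    if installed_on_every_node:
        # Each node finds the program through its own Path variable,
        # so the bare filename is enough.
        return ntpath.basename(executable)
    if not is_unc_path(executable):
        # Path does not search file shares, so a shared executable
        # must be given as a full UNC path.
        raise ValueError("a shared executable needs a full UNC path")
    return executable


# Hypothetical locations, for illustration only.
print(command_for(r"C:\apps\solver.exe", True))
print(command_for(r"\\fileserver\apps\solver.exe", False))
```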