Role of the Job Scheduler

Applies To: Windows Compute Cluster Server 2003

When a job is submitted, it is placed under the management of Job Scheduler. Job Scheduler determines the job's place in the queue and allocates resources to the job when the job reaches the top of the queue and as resources become available. Job ordering in the queue is done according to a set of rules called scheduling policies. Resource allocation is based on resource sorting.

When the requested resources have been allocated, the scheduler dispatches the job tasks to the compute nodes and takes on a management and monitoring function. Management includes enforcing certain job and task options, as well as managing job or task status changes. Monitoring consists of reporting on the status of the job and its tasks, as well as the health of the nodes.

Job Scheduler serves as the sole gateway to the cluster for job submission, running the user application under the individual user's security context. For more information about credential handling, see Compute Cluster Security.

Scheduling policies

Job Scheduler implements the following scheduling policies:

  • Priority-based first come, first served scheduling

  • Backfilling

  • Exclusive scheduling

Priority-based first come, first served

Priority-based first come, first served (FCFS) scheduling combines FCFS and priority-based scheduling. Under this policy, the scheduler places a job into a higher or lower priority group according to the job's priority setting, but always places the job at the end of the queue within that group, because it is the most recently submitted job.
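This ordering can be sketched as a two-key sort. The following is a minimal illustration, not the actual Job Scheduler implementation; the class and field names are ours:

```python
# Sketch of priority-based FCFS ordering: jobs are grouped by priority,
# and within a priority group they keep submission order, so a newly
# submitted job always joins the end of its group.
from dataclasses import dataclass, field
from itertools import count

_submit_counter = count()  # monotonically increasing submission order

@dataclass
class Job:
    name: str
    priority: int  # higher value = higher priority
    order: int = field(default_factory=lambda: next(_submit_counter))

def queue_order(jobs):
    """Return jobs sorted as priority-based FCFS would queue them."""
    # Sort by descending priority, then by ascending submission order.
    return sorted(jobs, key=lambda j: (-j.priority, j.order))

jobs = [Job("a", 1), Job("b", 3), Job("c", 3), Job("d", 2)]
print([j.name for j in queue_order(jobs)])  # b and c keep their FCFS order
```

Note that jobs "b" and "c" share a priority, so they run in the order they were submitted.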

Backfilling

Backfilling maximizes node utilization by allowing a smaller job or jobs lower in the queue to run ahead of a job waiting at the top of the queue, as long as the job at the top is not delayed as a result.

When a job reaches the top of the queue, there may not be enough nodes available to meet its minimum processor requirement. When this happens, the job reserves any nodes that are immediately available and waits for the currently running job to complete. Backfilling then uses the reserved idle nodes as follows:

  • Based on the run time specified for the job that is currently running, a start time for the waiting job is established.

  • The start time is used to define a backfill window of nodes (n)x time (t). For example, four nodes idle for 15 minutes would create a 4 x 15 backfill window.

  • Job Scheduler searches for the first job in the queue that can complete within the backfill window. For example, a job that requires a minimum of eight processors (four nodes, assuming dual-processor nodes) and has a run time of 10 minutes would fit the 4 x 15 window exactly.

  • If a job is found that fits the window, it is activated and run ahead of the job waiting at the top of the queue.
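The search in the steps above can be sketched as a simple fit test against the backfill window. This is an illustrative sketch with hypothetical names, not the product's internals; the dual-processor-node assumption matches the example above:

```python
# Sketch of backfilling: find the first queued job that fits a backfill
# window of idle_nodes x window_minutes without delaying the job at the
# top of the queue.
def first_backfill_candidate(queue, idle_nodes, window_minutes,
                             processors_per_node=2):
    """Return the first queued job that fits the backfill window.

    Each job is a dict with 'name', 'min_processors', and 'runtime'
    (run time in minutes).
    """
    window_processors = idle_nodes * processors_per_node
    for job in queue:
        if (job["min_processors"] <= window_processors
                and job["runtime"] <= window_minutes):
            return job
    return None

# A 4-node x 15-minute window: an 8-processor, 10-minute job fits.
queue = [
    {"name": "big",   "min_processors": 16, "runtime": 60},
    {"name": "small", "min_processors": 8,  "runtime": 10},
]
print(first_backfill_candidate(queue, idle_nodes=4, window_minutes=15)["name"])
```

Here the 16-processor job at the head of the backfill search is skipped, and the smaller job runs ahead of it in the idle window.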

Exclusive scheduling

Exclusive scheduling restricts the resources of a node to a single job or task. In either case, exclusivity can be changed from its default setting: turned off to enable sharing or turned on to prevent it.

Job exclusivity

By default, a job has exclusive use of the nodes it reserves. This can leave idle reserved processors on a node: processors that are not used by the job but are also not available to other jobs. By turning off the exclusive property, the user allows the job to share its unused processors with other jobs that have also been set to nonexclusive. Nonexclusivity is therefore a reciprocal agreement among participating jobs, allowing each to take advantage of the others' unused processors.

Task nonexclusivity

By default, a task has nonexclusive use of the node it runs on; otherwise, the node would run only one task at a time, as if it had only one processor. The exclusive option is provided for exceptional cases, such as a task that reboots the node.

Resource sorting

To allocate nodes, the Job Scheduler sorts the candidate nodes by memory size, and then, within each size group, by processor speed. The nodes are then allocated in the resulting order.
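The two-key sort can be sketched as follows. The node data and the sort direction are assumptions for illustration; the source states the sort keys (memory size, then processor speed) but not whether larger or faster nodes come first:

```python
# Sketch of resource sorting: candidate nodes are ordered by memory size,
# and by processor speed within each memory group. Nodes are then
# allocated in the resulting order.
nodes = [
    {"name": "n1", "memory_mb": 4096, "cpu_mhz": 2400},
    {"name": "n2", "memory_mb": 8192, "cpu_mhz": 2000},
    {"name": "n3", "memory_mb": 8192, "cpu_mhz": 3000},
    {"name": "n4", "memory_mb": 4096, "cpu_mhz": 3000},
]

# Assumed direction: largest memory first, then fastest processor first.
allocation_order = sorted(nodes, key=lambda n: (-n["memory_mb"], -n["cpu_mhz"]))
print([n["name"] for n in allocation_order])
```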

Run-time job and task management

Job Scheduler is responsible for run time management of the job and its tasks. This includes the following functions:

  • Enforcement of job and task run time properties

  • State (status) transition

Enforcement of job and task properties

Job Scheduler enforces most job and task properties. Some properties are enforced before run time and are used either for job and task identification or for job ordering and resource allocation, as determined by the scheduling policies. These properties include the job and task names, job and task IDs, job priority, run time, number of processors, asked and required nodes, and exclusive or nonexclusive use of nodes. Most of the remaining properties are enforced at run time. The following entries list the properties that are passed to Job Scheduler and describe how Job Scheduler handles each one (or, in some cases, leaves enforcement to the node manager).

Job options

/runtime:[[[days:<num>]hours:<num>]minutes:<num> | infinite]

Definition: Maximum run time in day-hour-minute format. The job is canceled rather than allowed to run past the maximum run time. Default is Infinite.

Job Scheduler action: Job Scheduler holds all resources until the run time is used up or until the job ends, whichever happens first, and then releases them for the next job. The exception is when rununtilcanceled is set to TRUE.

/rununtilcanceled:true | false

Definition: Flag indicating that the job holds its resources until it is canceled or reaches its run-time limit. This way, additional tasks can be run on the nodes.

Job Scheduler action: When rununtilcanceled is set to TRUE, Job Scheduler holds all of the job's resources until the run time is used up or the job is canceled, and then releases them for the next job. If the run time is left at the default value, Infinite, the job runs indefinitely until it is canceled.
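The interaction between the runtime and rununtilcanceled options can be sketched as a release decision. This is an illustrative model with names of our choosing, not the product's internals:

```python
# Sketch of when a job's resources are released: when the run time is
# used up, or when the job ends -- unless rununtilcanceled holds the
# resources until the job is explicitly canceled.
def should_release(elapsed_minutes, runtime_minutes, job_ended,
                   run_until_canceled, canceled):
    """Return True when the job's resources should be released."""
    if runtime_minutes is not None and elapsed_minutes >= runtime_minutes:
        return True  # run time used up (runtime_minutes=None means Infinite)
    if run_until_canceled:
        return canceled  # hold resources until the job is canceled
    return job_ended  # normal case: release when the job ends

# A rununtilcanceled job keeps its nodes even after its tasks finish.
print(should_release(30, 60, job_ended=True,
                     run_until_canceled=True, canceled=False))  # False
```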

Task options

/rerunnable:true | false

Definition: Flag indicating that a task can be rerun after a failure. Default is TRUE.

Job Scheduler action: The scheduler allows a failed job to be requeued if the failure is due to an error that can be fixed without changing the task command line. If the task fails because of a system failure (for example, a node crashes), the scheduler requeues the job automatically. Only incomplete tasks are rerun.

/env:<name1=val1> /env:<name2=val2> … /env:<nameN=valN>

Definition: Specifies the environment variables for the task. (For more information about environment variables, see Use Environment Variables.)

Job Scheduler action: Job Scheduler retrieves the value of the environment variable and places it in the command line.

/runtime:[[[days:<num>]hours:<num>]minutes:<num> | infinite]

Definition: Maximum run time in day-hour-minute format. The task is canceled rather than allowed to run past the maximum run time. Canceling a task results in job failure. Default is Infinite.

Job Scheduler action: Job Scheduler holds all resources until the run time is used up or until the task ends, whichever happens first, and then releases them. The exception is when rununtilcanceled is set to TRUE.

/workdir:<path>

Definition: The full path of the work directory (the directory for input, output, and error files). The path can contain environment variables. Default is %USERPROFILE%.

Job Scheduler action: No action is taken by Job Scheduler. The work directory is set by the node manager.

/stdin:<file_name>

Definition: Takes standard input for the task from the file <file_name>.

Job Scheduler action: No action is taken by Job Scheduler. The node manager locates the file by its path relative to the working directory (or by its full path).

/stdout:<file_name>

Definition: Redirects standard output of the task to the file <file_name>.

Job Scheduler action: No action is taken by Job Scheduler. The node manager writes the file in the folder specified by its path relative to the working directory (or by its full path).

/stderr:<file_name>

Definition: Redirects standard error of the task to the file <file_name>.

Job Scheduler action: No action is taken by Job Scheduler. The node manager writes the file in the folder specified by its path relative to the working directory (or by its full path).

/depend:<task_name1,task_name2,…task_nameN>

Definition: Specifies that this task depends on the tasks named <task_name1> through <task_nameN>.

Job Scheduler action: Job Scheduler creates a dependency table that reorders the tasks according to their dependencies. By default, tasks are run first come, first served.
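The dependency reordering described for /depend amounts to a topological ordering of the tasks. The following is a minimal sketch using Python's standard library, with hypothetical task names; it is not the product's dependency-table implementation:

```python
# Sketch of dependency-based task ordering: tasks with no unmet
# dependencies run first; dependents run only after their predecessors.
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical tasks: 'collect' must finish before 'analyze', and both
# must finish before 'report'.
depends_on = {
    "collect": [],
    "analyze": ["collect"],
    "report":  ["collect", "analyze"],
}
run_order = list(TopologicalSorter(depends_on).static_order())
print(run_order)  # ['collect', 'analyze', 'report']
```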

State (status) transition

Job Scheduler is also responsible for job and task status transition from Queued to Running to Finished, Cancelled, or Failed. The scheduler will also requeue a task and the job containing it automatically, provided the task has its Rerunnable property set to True and it has failed due to a system failure.

Automatic job and task transitioning can be overridden by the user or administrator using the Cancel or Requeue commands.

See Also

Concepts

Compute Cluster Security
Task Execution