Understanding Job Scheduling Policies

Applies To: Windows HPC Server 2008

The HPC Job Scheduler Service queues jobs and tasks, allocates resources, and dispatches tasks to compute nodes. Job scheduling policies determine the order in which to run jobs from the queue, and determine how cluster resources are allocated to these jobs.

As a cluster administrator, you can adjust how resources are allocated to jobs, and how jobs are handled, by configuring job scheduling policy options and by creating job templates that leverage the job scheduling policies.

Job scheduling policy options

You can configure options for the following job scheduling policies:

Policy                                        Default setting
------------------------------------------    ------------------------------------------------------
Preemption                                    Graceful preemption
Adaptive resource allocation (grow/shrink)    Automatic grow and shrink both enabled
Backfilling                                   Enabled, with the backfill look ahead set to 1000 jobs

You can configure these job scheduling policies in HPC Cluster Manager, or by using the cluscfg setparams command-line tool or the Set-HpcClusterProperty cmdlet. For more information about how to configure job scheduling policies using HPC Cluster Manager, see Configure the HPC Job Scheduler Service.

Job templates

You can create job templates to define a set of job submission policies. Each job template consists of a list of job properties and associated value settings, and a list of users with permission to submit jobs using that job template. You can optimize cluster usage by creating job templates that work with the job scheduling policies. For more information about using and creating job templates, see Job Templates.

Note

In some cases, you may want to provide additional checks and controls on jobs that are submitted to your cluster, or even change job property values. You can enforce site-specific job submission policies and job activation policies by creating custom filters. For more information, see Creating and Installing Job Submission and Activation Filters in Windows HPC Server 2008 Step-by-Step Guide.

Job scheduling policies

The following sections describe the Windows® HPC Server 2008 job scheduling policies, and how you can set job scheduling policy options and use job templates to manage cluster usage.

Priority-based first come, first served (FCFS)

Combines priority sorting and FCFS to determine the order of the job queue. The priority level is based on the Priority job property, and can have one of the following values:

  • Highest

  • AboveNormal

  • Normal

  • BelowNormal

  • Lowest

All Highest priority jobs are queued ahead of AboveNormal priority jobs, and so on. The job submit time determines the order within each priority level.
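The resulting queue order amounts to a two-part sort key: priority rank (highest first), then submit time (earliest first). As a minimal sketch in Python, where the Job class, the PRIORITY_RANK mapping, and the queue_order function are illustrative assumptions rather than part of the HPC API:

```python
from dataclasses import dataclass

# Numeric ranks for the five Priority values; higher runs first.
PRIORITY_RANK = {"Highest": 4, "AboveNormal": 3, "Normal": 2,
                 "BelowNormal": 1, "Lowest": 0}

@dataclass
class Job:
    name: str
    priority: str
    submit_time: float  # e.g., seconds since some reference time

def queue_order(jobs):
    """Priority-based FCFS: higher priority first; ties broken by submit time."""
    return sorted(jobs, key=lambda j: (-PRIORITY_RANK[j.priority], j.submit_time))

jobs = [Job("a", "Normal", 1.0), Job("b", "Highest", 2.0), Job("c", "Normal", 0.5)]
print([j.name for j in queue_order(jobs)])  # ['b', 'c', 'a']
```

Note that job "b" runs first despite being submitted last, because priority is sorted before submit time.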

Job templates: Use job templates to define the default priority level and valid priority values that different sets of users can assign to their jobs.

Preemption

Allows higher priority jobs to take resources away from lower priority, preemptable jobs that are already running. Graceful preemption shrinks preempted jobs, allowing running tasks to finish so that work is not lost. Immediate preemption cancels all running tasks of the preempted jobs so that resources can be allocated to the high priority job immediately.

The Preemptable job property is defined by the administrator in the job template. A user cannot set Preemptable when submitting a job through HPC Cluster Manager, HPC Job Manager, HPC PowerShell, or the HPC command-line tools. It can be set only through the HPC API, and only if the selected job template specifies both True and False as valid values for the Preemptable job property.

Policy options: Set the scheduler configuration to one of the following:

  • Graceful preemption (default)

  • Immediate preemption

  • No preemption

Job templates: Use job templates to define the types of jobs that can or cannot be preempted, or the sets of users who can submit preemptable or nonpreemptable jobs.
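The difference between the two preemption modes can be sketched as follows. The preempt function and the job dictionary below are hypothetical illustrations, not the HPC scheduler's actual data model:

```python
def preempt(job, mode):
    """Illustrative preemption of a running, preemptable job."""
    if mode == "graceful":
        # Stop dispatching new tasks; running tasks finish, and the job
        # shrinks (frees cores) as each task completes, so no work is lost.
        job["accepting_new_tasks"] = False
    elif mode == "immediate":
        # Cancel running tasks so all cores are freed right away.
        job["running_tasks"] = []
        job["allocated_cores"] = 0
    return job

job = {"accepting_new_tasks": True, "running_tasks": ["t1", "t2"], "allocated_cores": 8}
print(preempt(dict(job), "graceful")["running_tasks"])     # ['t1', 't2']
print(preempt(dict(job), "immediate")["allocated_cores"])  # 0
```

Graceful preemption trades speed for safety: the high-priority job waits for tasks to finish, but no running work is discarded.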

Adaptive resource allocation (grow/shrink)

Dynamically adjusts the resources allocated to a job based on its tasks. Enabling resource adjustments can significantly improve cluster utilization and reduce job queue times, especially for clusters that run jobs composed of multiple tasks, such as parametric sweep computations. Only jobs that contain more than one task (including jobs with parametric sweeps) can benefit from automatic resource adjustment.

With automatic growth enabled, the HPC Job Scheduler Service can allocate free resources to running jobs that have additional tasks to run. The service will not allocate more resources than the maximum requested for the job. This results in jobs spending more time in the queue waiting for resources, but they finish more quickly after they are started. Available resources are allocated first to the highest-priority job in the system, whether this job is running or queued.

With automatic shrink enabled, the HPC Job Scheduler Service can release unused resources from running jobs that have no additional tasks to run. The service will not shrink resources below the minimum requested for the job. Automatic shrink results in better overall cluster utilization, but it may cause problems if you add tasks to a job that is already in progress, because resources that the job has already released may no longer be available.

Policy options: Automatic grow and shrink are both enabled by default. Use scheduler configuration settings to enable or disable either option.

Job templates: In the default job template, the job properties Auto Calculate Maximum and Auto Calculate Minimum are set to a default value of True. If a job template specifies that True is the only valid value for these properties, the submitting user will not have the option of specifying maximum and minimum resources for a job submitted with that template, and resources will be automatically calculated based on the tasks in the job.
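The grow/shrink behavior described above amounts to clamping a job's allocation between its minimum and maximum based on remaining work. A minimal sketch, assuming one core per pending task; the adjust_allocation helper and its parameters are assumptions for illustration, not the scheduler's actual algorithm:

```python
def adjust_allocation(allocated, pending_tasks, min_cores, max_cores, free_cores):
    """Sketch of automatic grow/shrink for one running job.

    Grow toward one core per pending task, but never above max_cores or
    beyond the free cores currently available; shrink idle cores away,
    but never below min_cores.
    """
    target = max(min_cores, min(pending_tasks, max_cores))
    if target > allocated:
        return min(target, allocated + free_cores)  # grow using free cores
    return target                                   # shrink releases the rest

print(adjust_allocation(6, 2, 1, 8, 0))   # 2: shrink releases 4 idle cores
print(adjust_allocation(2, 6, 1, 8, 3))   # 5: grow, limited by free cores
```

The two clamps correspond to the job's minimum and maximum resource request; the free-core limit is why growing jobs never take resources away from others (that is preemption's role).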

Backfilling

Maximizes cluster utilization and throughput by allowing smaller jobs lower in the queue to run ahead of a job waiting at the top of the queue, as long as the job at the top is not delayed as a result.

When a job reaches the top of the queue, a sufficient number of nodes may not be available to meet its minimum core requirement. When this happens, the job reserves any nodes that are immediately available and waits for the jobs that are currently running to complete.

Backfilling then utilizes the reserved idle nodes as follows:

  • Based on the run time specified for the job that is currently running, a start time for the waiting job is established.

  • The start time is used to define a backfill window of nodes (n) x time (t). For example, four nodes that are idle for 15 minutes create a 4 x 15 backfill window.

  • The HPC Job Scheduler Service searches for the first job in the queue that can complete within the backfill window. For example, a job that requires a minimum of eight cores (four nodes, assuming dual-core nodes) and has a run time of 10 minutes fits within the 4 x 15 window.

  • If a job is found that fits the window, it is activated and run ahead of the job that is waiting at the top of the queue.
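The window-fitting check in the steps above reduces to two comparisons. In this sketch the window is expressed in cores rather than nodes, to match the dual-core example; the fits_backfill_window helper is hypothetical:

```python
def fits_backfill_window(job_min_cores, job_run_time, window_cores, window_minutes):
    """True if a queued job could start on the reserved idle cores and
    finish inside the window without delaying the job waiting at the top."""
    return job_min_cores <= window_cores and job_run_time <= window_minutes

# Four dual-core nodes idle for 15 minutes -> an 8-core x 15-minute window.
print(fits_backfill_window(8, 10, 8, 15))  # True: can be backfilled
print(fits_backfill_window(8, 20, 8, 15))  # False: would delay the waiting job
```

This is also why a specified maximum run time matters: without it, job_run_time is effectively infinite and no job can ever pass the check.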

Job templates: Backfilling is only effective when jobs submitted to the cluster have a maximum run time specified. Use job templates to define a maximum run time on all jobs. For example, you can create a series of job templates named BigJob, MediumJob, and SmallJob with maximum run times of one day, one hour, and one minute, respectively.

Note that you can also write a job submission filter that checks that the runtime job term is not set to infinite. For more information, see Creating and Installing Job Submission and Activation Filters in Windows HPC Server 2008 Step-by-Step Guide.

Policy options: Backfilling is enabled by default. Use scheduler configuration settings to modify or disable backfilling. The BackfillLookAhead parameter specifies how far down the queue to look for jobs. The default value is 1000. A negative value indicates that the HPC Job Scheduler Service should search through the entire job queue to find jobs that can backfill the jobs at the top of the job queue.

Nonexclusive scheduling

By default, a job or a task has nonexclusive use of the nodes allocated to it. For example, when a task requests two cores on a cluster with four-core nodes, the task is assigned two cores on a node, and other tasks may run on the remaining two cores of that node. If the task were exclusive, it would be assigned the entire node.

When a job is Exclusive, all resources on the nodes assigned to the job will be assigned to that job; no other job can run on the same nodes. When a task is Exclusive, all resources on the nodes assigned to the task will be assigned to that task, and no other task can run on the same nodes. This can produce idle cores on a node: cores that are not used by the job or task but are also not available to other jobs or tasks.
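The idle cores produced by exclusivity are simple arithmetic: reserved capacity minus cores actually used. A small sketch (the idle_cores helper is hypothetical):

```python
def idle_cores(cores_per_node, nodes_assigned, cores_used):
    """Cores reserved by an Exclusive job or task but left unused:
    these cores do no work, yet no other job or task can use them."""
    return cores_per_node * nodes_assigned - cores_used

# An exclusive task using 2 cores of one four-core node idles 2 cores:
print(idle_cores(4, 1, 2))  # 2
```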

Note that you cannot have an exclusive task in a nonexclusive job.

The Exclusive option is provided for exceptional cases, such as a task that reboots the node. It can also be used for jobs that are sensitive to other jobs competing for resources such as CPU, memory, disk, or network bandwidth. For example, many MPI programs run only as fast as the slowest node, so sharing even one node with another application can severely affect performance.

Job templates: Use job templates to define the types of jobs or the sets of users that can enable job exclusivity.

Additional references