Understanding the Azure Node Availability Policy

The Azure node availability policy determines how and when the Azure nodes are started (the role instances are deployed in Azure) and stopped (the role instances are removed in Azure).

You have the following two options to configure availability for your Azure nodes:

  • Automatic: The nodes are automatically started (provisioned) and then brought to the Online state during one or more scheduled intervals each week. You can specify multiple time blocks each week when you want the nodes to be available to run jobs. At the end of each time block, the nodes are automatically stopped: the nodes are taken offline and the role instances are removed. Optionally, you can specify a time interval before the end of an online time block during which any jobs running on the nodes are drained.

  • Manual: To make the Azure nodes available to run jobs, you must first manually start (provision) the nodes, and then bring them online.

Additional considerations

  • Provisioning the Azure role instances can take several minutes under some conditions, and stopping and deleting the instances can also take several minutes.

  • The nodes are available to run jobs in an online time block only after the role instances have been provisioned in Azure. The scheduled time to start (and bring online) the nodes does not include the time that Azure takes to provision the role instances.

  • If an automatic availability policy is configured, as a best practice, plan for 60 minutes in each online time block for node deployment, in addition to the time that you want the nodes to be available to run jobs. You should also avoid scheduling online time blocks at short intervals.

  • Editing the Azure node availability policy changes the policy for nodes that are already added to the HPC cluster by using the node template, as well as for nodes that you add later. For example, you can edit the Azure node template so that nodes that are configured to start and stop automatically according to a weekly schedule are now configured to start and stop manually.

  • Depending on the configuration of the availability policy in the Azure node template and the Task Cancel Grace Period setting in Job Scheduler Configuration, the exact time when Azure nodes are stopped and the deployment ends can differ from the scheduled end of an online time block. This can occur when HPC tasks are still running near the end of the online time block. For more information, see the section Interaction of the availability policy with the Task Cancel Grace Period setting.
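
The scheduling guidance above can be sketched as simple date arithmetic. The following Python example is only an illustration: the function name plan_online_block and the choice to place the whole 60-minute deployment allowance at the start of the block are assumptions, not part of HPC Pack.

```python
from datetime import datetime, timedelta

# Best-practice allowance for node deployment taken from the text above.
DEPLOYMENT_BUFFER = timedelta(minutes=60)


def plan_online_block(available_from, available_until, buffer=DEPLOYMENT_BUFFER):
    """Return (block_start, block_end) for an online time block.

    Hypothetical helper: it assumes the deployment allowance is added
    before the time at which the nodes should be ready to run jobs.
    """
    if available_until <= available_from:
        raise ValueError("available_until must be later than available_from")
    return available_from - buffer, available_until


# Nodes wanted from 9:00 AM to 5:00 PM: schedule the block to start at 8:00 AM.
start, end = plan_online_block(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 17, 0))
print(start.strftime("%H:%M"), end.strftime("%H:%M"))
```

Actual provisioning time varies, so treat the computed start time as a planning estimate rather than a guarantee that the nodes are online at the requested time.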

Interaction of the availability policy with the Task Cancel Grace Period setting

When an automatic availability policy is configured, the Azure nodes do not start jobs after an online time block passes. However, HPC tasks that are still running at the end of an online time block can continue to run for a period if the Task Cancel Grace Period setting is configured. The Task Cancel Grace Period cluster property sets a time period for applications to save state information and clean up before exiting (the default period is 15 seconds). The exact time that a task ends depends on whether and how quickly the task responds to the CTRL_BREAK event (the equivalent of the CTRL+BREAK key combination). Tasks that do not process the event will exit immediately, while those that do process the event can take as long as the Task Cancel Grace Period to exit gracefully.
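
To illustrate what processing the event can look like, the following is a minimal Python sketch of a task that stops work and saves state when it receives the break signal. It assumes the task is a Python program, in which CTRL_BREAK is delivered as signal.SIGBREAK on Windows. The names install_break_handler, run_task, and save_state are hypothetical, and the fallback to SIGTERM exists only so that the sketch can be tried on other platforms.

```python
import signal
import threading

# SIGBREAK is defined only on Windows; the SIGTERM fallback is an assumption
# made so that this sketch can also be exercised on other platforms.
BREAK_SIGNAL = getattr(signal, "SIGBREAK", signal.SIGTERM)

stop_requested = threading.Event()


def on_break(signum, frame):
    """Only record the request; the work loop performs the actual cleanup."""
    stop_requested.set()


def install_break_handler():
    """Register the handler. Must be called from the main thread."""
    signal.signal(BREAK_SIGNAL, on_break)


def run_task(work_items, save_state):
    """Process items until finished or until a break has been requested."""
    done = []
    for item in work_items:
        if stop_requested.is_set():
            break  # stop promptly so cleanup fits within the grace period
        done.append(item)  # placeholder for the real unit of work
    save_state(done)  # persist progress before exiting
    return done
```

A task would call install_break_handler() once at startup and then run_task(...). Keeping each unit of work short relative to the Task Cancel Grace Period gives the loop a chance to notice the request and save state before the period expires.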

The following summary shows when HPC tasks stop running as a result of the interaction between the Azure node availability policy and the Task Cancel Grace Period setting. Possible impacts and workarounds are listed. The interaction differs depending on whether a "drain" period is configured in the availability policy. The drain period is an optional setting that specifies the number of minutes before the end of an online time block during which no new tasks will start on the nodes.

Task drain period configured in the availability policy: Yes

  • When Task Cancel Grace Period begins: Beginning of the drain period.

  • When running HPC tasks end: Between the beginning and the end of the Task Cancel Grace Period, depending on whether the task exits upon receiving the signal or uses the period of time provided by the Task Cancel Grace Period. This can be before the scheduled end of the online time block.

  • Example:
    - Scheduled end of online time block: 8:00 PM
    - Grace period: 5 min
    - Drain period: 10 min
    Running tasks will end between 7:50 and 7:55 PM.

  • Impacts:
    - Azure nodes are stopped and the deployment is taken down earlier than expected.
    - Usage of Azure resources for HPC tasks may not be optimal.

  • Workarounds:
    - Adjust the Task Cancel Grace Period to be the same as the drain period, or as similar as possible.
    - Specify small values for the drain period and grace period, if your applications allow them.

Task drain period configured in the availability policy: No

  • When Task Cancel Grace Period begins: End of the configured online time block.

  • When running HPC tasks end: Between the beginning and the end of the Task Cancel Grace Period, depending on whether the task exits upon receiving the signal or uses the period of time provided by the Task Cancel Grace Period. This can be after the scheduled end of the online time block.

  • Example:
    - Scheduled end of online time block: 8:00 PM
    - Grace period: 5 min
    Running tasks will end between 8:00 and 8:05 PM.

  • Impacts:
    - HPC tasks can continue running beyond the end of the online time block for as long as the Task Cancel Grace Period.
    - Azure node deployment can be extended beyond the end of the online time block for as long as the Task Cancel Grace Period.

  • Workarounds:
    - If your applications allow it, adjust the Task Cancel Grace Period to be a smaller value.
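
The two examples above follow the same arithmetic, which can be sketched in Python. The function task_end_window is a hypothetical helper written for illustration and is not part of HPC Pack. It assumes, as described above, that the grace period begins at the start of the drain period when one is configured, and at the scheduled end of the online time block otherwise.

```python
from datetime import datetime, timedelta


def task_end_window(block_end, grace, drain=None):
    """Return (earliest, latest) times at which running tasks end.

    block_end: scheduled end of the online time block
    grace:     Task Cancel Grace Period
    drain:     drain period, or None if no drain period is configured
    """
    # The grace period starts when draining starts, or at the scheduled
    # end of the block when no drain period is configured.
    grace_start = block_end - drain if drain is not None else block_end
    return grace_start, grace_start + grace


block_end = datetime(2024, 1, 1, 20, 0)  # 8:00 PM; the date is arbitrary

# Drain period configured: tasks end between 7:50 PM and 7:55 PM.
print(task_end_window(block_end, timedelta(minutes=5), timedelta(minutes=10)))

# No drain period: tasks end between 8:00 PM and 8:05 PM.
print(task_end_window(block_end, timedelta(minutes=5)))
```

Setting the grace period equal to the drain period makes the latest end time coincide with the scheduled end of the online time block, which is the effect of the first workaround listed for the drain case.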

See Also

Configuring an Azure Node Template for Microsoft HPC Pack
Understanding Node States, Health, and Operations
Task cancel grace period
Set the Number of Azure Proxy Nodes