Guidelines for Running MPI Applications in Azure

07/29/2022

This topic provides guidelines and procedures for enabling MPI applications to run on Windows Azure nodes. This information applies to Windows Azure nodes that are added to an on-premises Windows HPC cluster (in a Windows Azure “burst” scenario), or to nodes that are deployed as part of a Windows Azure service that uses the Windows Azure HPC Scheduler (Windows Azure only).

For general guidelines about data and file shares on Windows Azure nodes, see Guidelines for Running HPC Applications on Azure Nodes.

What kind of MPI jobs are best suited for Windows Azure?

MPI jobs that are not particularly sensitive to latency and bandwidth are more likely to scale well in the Windows Azure environment. Latency- and bandwidth-sensitive MPI jobs can still perform well as small jobs, where a single task runs on no more than a few nodes. For example, in the case of an engineering simulation, you can run many small jobs to explore and define the parametric space before increasing the model size. This can be particularly useful in situations where your access to on-premises compute nodes is limited, and you want to ensure that you are using your cluster time for the most appropriate models.

Why do we make these recommendations? MPI jobs often run on clusters with specialized low-latency, high-bandwidth network hardware. Windows Azure nodes are not currently connected with this type of network. Additionally, Windows Azure nodes are periodically reprovisioned by the Windows Azure system. If a node is reprovisioned while it is running an MPI job, the MPI job will fail. The more nodes you are using for a single MPI job, and the longer the job runs, the more likely it is that one of the nodes will be reprovisioned while the job is running.
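The effect of job size and run time on the risk of interruption can be illustrated with a minimal Python sketch. It assumes a simplified model that is not part of the Windows Azure documentation: each node is reprovisioned independently at a constant average rate. The function name and the rate value are hypothetical and chosen only for illustration.

```python
import math


def job_interruption_probability(nodes: int, hours: float,
                                 reprovisions_per_node_hour: float) -> float:
    """Estimate the chance that at least one node is reprovisioned mid-job.

    Assumption (illustrative only): nodes are reprovisioned independently
    at a constant rate, so the job survives with probability
    exp(-rate * nodes * hours).
    """
    if nodes < 0 or hours < 0 or reprovisions_per_node_hour < 0:
        raise ValueError("all arguments must be non-negative")
    exposure = reprovisions_per_node_hour * nodes * hours
    # -expm1(-x) equals 1 - exp(-x) but is more accurate for small x.
    return -math.expm1(-exposure)


if __name__ == "__main__":
    rate = 0.001  # hypothetical reprovisions per node per hour
    small = job_interruption_probability(nodes=4, hours=2, reprovisions_per_node_hour=rate)
    large = job_interruption_probability(nodes=64, hours=24, reprovisions_per_node_hour=rate)
    print(f"small job: {small:.4f}, large job: {large:.4f}")
```

Under this model the risk grows with the product of node count and run time, which is why many small, short jobs are a safer fit than one large, long-running job.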

Registering an MPI job with the firewall on Windows Azure nodes

An administrator must configure Windows Firewall to allow MPI communication between the compute nodes in Windows Azure. To do this, you can register each MPI application with the firewall (create an application-based exception). This allows MPI communications to take place on a port that is assigned dynamically by the firewall. You can configure firewall exceptions on your nodes by using the clusrun and hpcfwutil commands.

Note

For burst to Windows Azure nodes, an administrator can configure a firewall exception command to run automatically on all new Windows Azure nodes that are added to your cluster. After you run the hpcfwutil command and verify that your application works, you can add the command to a startup script for your Windows Azure nodes. For more information, see Configure a Startup Script for Windows Azure Nodes.

The following procedure describes how to add an exception for myApp.exe to all nodes. You must be an administrator for the Windows HPC cluster or for the Windows Azure HPC Scheduler deployment to perform the following steps.

To configure a firewall exception for myApp.exe

Connect to your head node in one of the following ways (using administrator credentials):
- Log in directly to the head node (on-premises).
- Run the command from a client computer that has the HPC Pack client utilities installed (on-premises). If the CCP_SCHEDULER environment variable is not set on the computer, then include the /scheduler:<yourHeadNodeName> parameter in the clusrun command.
- Use the Windows Azure Management Portal to make a Remote Desktop connection to a head node in your service deployment (Windows Azure HPC Scheduler).
Open a command prompt.

Type the following command:

clusrun hpcfwutil register myApp.exe e:\approot\myApp.exe

To run a clusrun command on a subset of nodes, you can specify a clusrun parameter such as /nodegroup:<node_group_name>, /nodes:<node_list> (comma separated), or /template:<node_template_name>. For more information, see clusrun.
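If you script the registration, the command line can be assembled programmatically. The following Python sketch only builds the argument list from the parameters named above (/scheduler, /nodegroup, /nodes, /template); it does not run anything. The function name and its keyword arguments are hypothetical, and placing the clusrun parameters before hpcfwutil is an assumption based on the other clusrun examples in this topic.

```python
from typing import List, Optional


def build_firewall_command(app_name: str, app_path: str,
                           scheduler: Optional[str] = None,
                           node_group: Optional[str] = None,
                           nodes: Optional[List[str]] = None,
                           template: Optional[str] = None) -> List[str]:
    """Return the clusrun argument list that registers an MPI application."""
    command = ["clusrun"]
    if scheduler:
        command.append(f"/scheduler:{scheduler}")
    if node_group:
        command.append(f"/nodegroup:{node_group}")
    if nodes:
        # The node list is comma separated.
        command.append("/nodes:" + ",".join(nodes))
    if template:
        command.append(f"/template:{template}")
    command += ["hpcfwutil", "register", app_name, app_path]
    return command


if __name__ == "__main__":
    args = build_firewall_command("myApp.exe", r"e:\approot\myApp.exe",
                                  node_group="AzureNodes")
    print(" ".join(args))
```

Returning a list rather than a single string makes it straightforward to pass the result to a process launcher later without additional quoting.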

Setting the MPI netmask for burst to Windows Azure nodes

When you run MPI jobs on Windows Azure nodes, ensure that the IP addresses of the Windows Azure nodes are within the range of accepted IP addresses that is specified for the MPI network mask. The MPI network mask determines from what range of IP addresses an MPI rank can accept communications. If jobs on Windows Azure nodes are failing with errors about connection failures, you might need to reset the netmask to enable communication between the nodes.
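When you diagnose connection failures, it can help to check whether a node's IP address actually falls inside a given mask. The following Python sketch uses the standard ipaddress module for that check. It assumes that a netmask written as address/mask (for example, 10.0.0.0/255.0.0.0) can be interpreted as an ordinary IPv4 network; the function name and the sample addresses are hypothetical.

```python
import ipaddress


def accepted_by_netmask(node_ip: str, netmask: str) -> bool:
    """Return True if node_ip lies inside the address/mask range."""
    # strict=False tolerates host bits in the address part of the mask.
    network = ipaddress.ip_network(netmask, strict=False)
    return ipaddress.ip_address(node_ip) in network


if __name__ == "__main__":
    for ip in ("10.28.3.4", "192.168.1.5"):
        print(ip, accepted_by_netmask(ip, "10.0.0.0/255.0.0.0"))
```

A node whose address is reported as outside the range is a likely candidate for the connection errors described above.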

The default cluster-wide range is defined through the CCP_MPI_NETMASK cluster environment variable. The value that is specified in this cluster variable is automatically set as a system environment variable on all cluster nodes. Depending on your requirements, an administrator can reconfigure the network mask at a cluster-wide level, or override the cluster settings at a node or node group level. A job owner can override cluster or node settings at the job level (for more information, see Environment variable hierarchy).

Cluster level

You can disable the netmask (allow all IP addresses) on the whole cluster. For example, run the following command on the head node:

cluscfg setenvs CCP_MPI_NETMASK=0.0.0.0/0.0.0.0

You can broaden the range to ensure that it includes your Windows Azure nodes and on-premises nodes. For example, if your Windows Azure nodes have a 10.x.x.x IP address and the default address range for on-premises nodes is 10.x.x.x or 10.1.x.x, you can set the netmask as follows:

cluscfg setenvs ccp_mpi_netmask=10.0.0.0/255.0.0.0

Important

If you configure connectivity between the Windows Azure nodes and on-premises resources, you should define separate netmasks for your on-premises and Windows Azure nodes. Ensure that the netmask for the Windows Azure nodes does not allow on-premises IP addresses.
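One way to verify this separation before applying it is to test whether the Windows Azure netmask overlaps any on-premises range. The following Python sketch does this with the standard ipaddress module. The function name and all address ranges shown are hypothetical, and it assumes that address/mask values can be treated as ordinary IPv4 networks.

```python
import ipaddress
from typing import Iterable


def azure_mask_excludes_on_premises(azure_netmask: str,
                                    on_premises_ranges: Iterable[str]) -> bool:
    """Return True if the Azure netmask admits no on-premises address."""
    azure = ipaddress.ip_network(azure_netmask, strict=False)
    return not any(
        azure.overlaps(ipaddress.ip_network(r, strict=False))
        for r in on_premises_ranges
    )


if __name__ == "__main__":
    on_prem = ["10.1.0.0/255.255.0.0"]  # hypothetical on-premises range
    print(azure_mask_excludes_on_premises("10.28.0.0/255.255.0.0", on_prem))
    print(azure_mask_excludes_on_premises("10.0.0.0/255.0.0.0", on_prem))
```

In this example, the narrower 10.28.0.0/255.255.0.0 mask stays clear of the on-premises range, while the broad 10.0.0.0/255.0.0.0 mask would also admit on-premises addresses.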

Node level

You can set the netmask, or disable it (allow all IP addresses), on just your Windows Azure nodes. For example, to disable the netmask, type the following command:

clusrun /nodegroup:AzureNodes setx CCP_MPI_NETMASK "0.0.0.0/0.0.0.0" /M

Job level

A job owner can specify the desired range (or disable it) at the job level by setting the MPI environment variable MPICH_NETMASK <range> in the mpiexec command arguments. For example, if the Windows Azure nodes have IP addresses starting with 10.28.x.x, type the following command:

job submit /nodegroup:azurenodes /numcores:32 /stdout:%CCP_PACKAGE_ROOT%\myApp\out.txt /workdir:%CCP_PACKAGE_ROOT%\myApp mpiexec -env MPICH_NETMASK 10.28.0.0/255.255.0.0 myApp.exe
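If you know the IP addresses of the nodes but not a suitable range, you can derive one. The following Python sketch computes the smallest single IPv4 network that contains a set of node addresses and returns it in address/mask form, which is the notation used for MPICH_NETMASK above. The function name and the sample addresses are hypothetical, and the tightest covering range is only one reasonable choice; you might prefer a broader range that leaves room for nodes added later.

```python
import ipaddress
from typing import Iterable


def smallest_covering_netmask(node_ips: Iterable[str]) -> str:
    """Return the tightest address/mask range that contains every node IP."""
    addresses = [ipaddress.IPv4Address(ip) for ip in node_ips]
    if not addresses:
        raise ValueError("at least one IP address is required")
    # Start from a single-host network and widen it one bit at a time.
    network = ipaddress.ip_network(f"{addresses[0]}/32")
    while not all(address in network for address in addresses):
        network = network.supernet()
    return network.with_netmask


if __name__ == "__main__":
    print(smallest_covering_netmask(["10.28.1.5", "10.28.200.9"]))
```

For the two sample addresses, the result is 10.28.0.0/255.255.0.0, which matches the range used in the job submit example.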

Tracing MPI applications on Windows Azure nodes

If you trace an MPI application on Windows Azure nodes by using the -trace argument in mpiexec, the default trace file size (10 GB) is too large, and the job submission is likely to fail with a message about insufficient disk space. You can reduce the trace file size by including the -tracefilemax argument. For example, to configure a trace file size of 1 GB, set -tracefilemax 1000.
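To pick a trace file size that fits the available disk space, you could compute the value from the free space on the node. The following Python sketch is one possible approach. It assumes that the -tracefilemax value is expressed in megabytes (inferred from the 1 GB = 1000 example above); the function name, the 1000 MB default, and the 50 percent safety fraction are hypothetical choices.

```python
import shutil


def choose_tracefilemax_mb(free_bytes: int, desired_mb: int = 1000,
                           max_fraction: float = 0.5) -> int:
    """Return a trace file limit in MB that uses at most max_fraction of free space."""
    if free_bytes < 0 or desired_mb < 0 or not 0.0 <= max_fraction <= 1.0:
        raise ValueError("invalid argument")
    budget_mb = int(free_bytes * max_fraction) // 1_000_000
    return min(desired_mb, budget_mb)


if __name__ == "__main__":
    # Free space on the drive that holds the current directory.
    free = shutil.disk_usage(".").free
    print(f"-tracefilemax {choose_tracefilemax_mb(free)}")
```

Capping the limit at a fraction of the free space leaves room for the application's own output files on the same drive.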

Identifying Windows Azure nodes in MPI error messages

Error messages for MPI applications typically use the host name to identify nodes. In Windows Azure, the host name is not the friendly name for the nodes and can be difficult to identify. You can use HPC Cluster Manager to view the node name in the HPC cluster and the Windows Azure instance name.

Additional consideration for burst to Windows Azure

MPI jobs cannot span on-premises and Windows Azure nodes, or different Windows Azure node deployments (Windows Azure nodes that are deployed by using different node templates). Separate Windows Azure node deployments are isolated, and MPI processes in one deployment cannot communicate with processes in another. You can prevent MPI jobs from spanning these boundaries by submitting MPI jobs to specific node groups. Node groups can be enforced by the administrator at the job template level, or can be specified by the job owner at the job level.
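Before submitting a job to an explicit node list, you could check that the nodes do not cross one of these boundaries. The following Python sketch illustrates the check. The node names, the deployment labels, and the mapping itself are hypothetical; in practice you would build the mapping from your own cluster inventory.

```python
from typing import Dict, Iterable, Set


def deployments_spanned(nodes: Iterable[str],
                        node_to_deployment: Dict[str, str]) -> Set[str]:
    """Return the set of deployments (or 'on-premises') that the nodes belong to."""
    return {node_to_deployment[node] for node in nodes}


def can_run_single_mpi_job(nodes: Iterable[str],
                           node_to_deployment: Dict[str, str]) -> bool:
    """An MPI job is only viable if all of its nodes share one deployment."""
    return len(deployments_spanned(nodes, node_to_deployment)) == 1


if __name__ == "__main__":
    inventory = {  # hypothetical inventory
        "ONPREM01": "on-premises",
        "AZURECN-0001": "AzureTemplateA",
        "AZURECN-0002": "AzureTemplateA",
        "AZURECN-0101": "AzureTemplateB",
    }
    print(can_run_single_mpi_job(["AZURECN-0001", "AZURECN-0002"], inventory))
    print(can_run_single_mpi_job(["ONPREM01", "AZURECN-0001"], inventory))
```

Defining one node group per deployment and submitting each MPI job to a single group achieves the same result without a separate check.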