Using Microsoft Message Passing Interface

Applies To: Windows Compute Cluster Server 2003

Microsoft® Windows® Compute Cluster Server 2003 is a two-CD package that includes the Microsoft® Windows Server™ 2003, Compute Cluster Edition operating system on the first CD and the Microsoft® Compute Cluster Pack on the second CD. The Compute Cluster Pack is a collection of utilities, interfaces, and management tools that bring personal high-performance computing (HPC) to readily available x64-based computers.

What is MPI?

The Microsoft® implementation of the Message Passing Interface (MPI), called MS MPI, is a portable, flexible, vendor- and platform-independent interface for messaging on HPC nodes. MPI is the specification; MS MPI, MPICH2, and others are implementations of that standard. MPI2 is an extension of the original MPI specification.

MPI provides a portable and powerful interprocess communication mechanism that simplifies some of the complexities of communication between hundreds or even thousands of processors working in parallel. It is this interprocess communication that distinguishes fine-grained parallelism from coarse-grained parallelism. In coarse-grained parallelism, processes run concurrently but completely independently of one another.

MS MPI vs. MPICH2

The best-known implementation of the MPI specification is the MPICH2 reference implementation created by Argonne National Laboratory. MPICH2 is an open-source implementation of the MPI2 specification that is widely used on HPC clusters. MS MPI is based on MPICH2 and is designed for maximum compatibility with it. The exceptions to that compatibility are confined to job launch and job management; the APIs that independent software vendors (ISVs) use are identical to those in MPICH2. These departures from the MPICH2 implementation were necessary to meet the strict security requirements of Windows Compute Cluster Server 2003.

Note

You are not required to use MS MPI when using Windows Compute Cluster Server 2003. You can use any MPI stack you choose. However, the security features that Microsoft has added to MS MPI may not be available in other MPI stacks.

MS MPI Features

MS MPI is a full-featured implementation of the MPI2 specification. It consists of two essential parts: an application programming interface (API) of more than 160 functions that ISVs can use for interprocess communication and control, and the mpiexec executable, which launches jobs and provides fine-grained control over the execution and parameters of each job.

Programming features of MS MPI

Although MS MPI includes more than 160 APIs, most programs can be written using about a dozen of them. MS MPI includes bindings that support the C, Fortran77, and Fortran90 programming languages. Microsoft® Visual Studio® 2005 includes a remote debugger that works with MS MPI in Visual Studio Professional Edition and Visual Studio Team System. Developers can start their MPI applications on multiple compute nodes from within the Visual Studio environment; Visual Studio then automatically connects to the processes on each node, so the developer can individually pause each process and examine its program variables.
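
As a minimal sketch of that core subset, the following C program uses just five of those calls (MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Get_processor_name, and MPI_Finalize) to have each process report its rank and the node it is running on:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                  /* start the MPI runtime */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes in the job */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank within MPI_COMM_WORLD */
    MPI_Get_processor_name(name, &namelen);  /* name of the node running this process */

    printf("Process %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();                          /* shut down the MPI runtime */
    return 0;
}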

MPI uses objects called communicators to control which collections of processes can communicate with each other. A communicator is a kind of “party line” of processes that can all receive messages from one another but ignore any messages in the communicator that are not directed to them. Each process has an ID, or rank, within the communicator. The default communicator, known as MPI_COMM_WORLD, includes all processes in the job. Programmers can create their own communicators to limit communications to a subset of MPI_COMM_WORLD where appropriate.
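
For example, a program can split MPI_COMM_WORLD into smaller communicators with MPI_Comm_split. The following sketch, which assumes the usual MPI_Init/MPI_Finalize setup shown earlier, groups processes by whether their rank is even or odd:

int world_rank;
MPI_Comm half_comm;

MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

/* Processes that pass the same "color" (here, rank parity) are placed in the
   same new communicator; the key argument orders ranks within it. */
MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &half_comm);

/* Messages and collective operations on half_comm involve only the processes
   in the same half of MPI_COMM_WORLD. */
MPI_Comm_free(&half_comm);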

The MS MPI communications routines also include collective operations, which allow the programmer to collect and evaluate the results of a particular operation across all processes in a communicator in a single call.
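
For example, the single MPI_Reduce call sketched below (the same routine used in the pi sample in the appendix, and likewise assuming MPI has already been initialized) sums one partial result from every process in the communicator and delivers the total to rank 0:

double local_sum = 0.0;   /* partial result computed by this process */
double global_sum = 0.0;  /* combined result, valid only on rank 0 */

/* Add together the local_sum value from every process in MPI_COMM_WORLD
   and deliver the total to rank 0 in a single call. */
MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);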

MPI supports fine-grained control of communications and buffers across the nodes of the cluster. The standard routines are appropriate for most situations, but when specific buffering behavior is needed, MPI supports that as well. Communications routines can be blocking or nonblocking, as appropriate.
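
As a sketch of the nonblocking variants (again assuming the usual MPI setup), a process can start a send with MPI_Isend, overlap it with computation, and wait for completion with MPI_Wait only when the result is needed:

double buffer[1024];
MPI_Request request;
MPI_Status status;

/* Start sending the buffer to rank 1 without waiting for it to complete.
   (The matching receive posted by rank 1 is omitted from this sketch.) */
MPI_Isend(buffer, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &request);

/* ... do other computation here while the transfer is in progress ... */

/* Block only when the send must be known to have completed. */
MPI_Wait(&request, &status);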

MPI supports both predefined data types and derived data types. The built-in data type primitives are contiguous, but derived data types can be contiguous or noncontiguous.
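
For example, the following sketch uses MPI_Type_vector to describe one column of a matrix (a noncontiguous pattern in memory) as a single derived data type that can then be sent in one call; the matrix dimensions are purely illustrative, and the usual MPI setup is assumed:

double matrix[10][10];
MPI_Datatype column_type;

/* Describe one column of the 10 x 10 matrix: 10 blocks of 1 double,
   each separated by a stride of 10 doubles in memory. */
MPI_Type_vector(10, 1, 10, MPI_DOUBLE, &column_type);
MPI_Type_commit(&column_type);

/* Send the (noncontiguous) first column to rank 1 as a single message.
   (The matching receive on rank 1 is omitted from this sketch.) */
MPI_Send(&matrix[0][0], 1, column_type, 1, 0, MPI_COMM_WORLD);

MPI_Type_free(&column_type);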

Mpiexec features

The normal mechanism for starting jobs in a Windows Compute Cluster Server 2003 cluster is the scheduler's job submit command, which in turn invokes mpiexec to execute the end task. A typical command line might be:

job submit /numprocessors:8 /runtime:5:0 mpiexec myapp.exe

This will submit the application myapp.exe to the Job Scheduler under mpiexec, assigning it to eight processors for a run time not to exceed five hours.

Note that the number of processors and the run time are specified as options to the job submit command, not to mpiexec. Mpiexec has its own set of options that duplicate these and other job and task options. In the context of a job, these mpiexec options function as a subset of the job and task options and can be used to obtain a higher degree of specificity. They can also be used when there is a need to run mpiexec directly, for example, during MPI application development.
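
For example, during development an application can be started directly, without the Job Scheduler, using a command such as the following (the process count shown here is only illustrative):

mpiexec -n 4 myapp.exe

This launches four processes of myapp.exe under mpiexec.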

An important improvement that MS MPI makes over MPICH2 is in how security is managed during the execution of MPI jobs. Each job is run with the credentials of the user. Credentials are present only while the job is being executed on the compute nodes and are erased when the job completes. Individual processes have access only to their logon token; they do not have access to the credentials (user name and password) that generated the token or to the credentials used by other processes.

Implementing MPI

The original MPI specification was formulated and agreed to by the MPI Forum in the early 1990s, and extended in 1996 by the same group of approximately 40 organizations to create the MPI2 specification. Today the MPI specifications are the de facto message-passing standards across almost all HPC platforms.

Programs written to MPI are portable across platforms and across various implementations of MPI without the need to rewrite source code. Although originally targeted at distributed systems, MPI implementations now support shared-memory systems as well.

What protocols and hardware are supported?

Windows Compute Cluster Server 2003 includes MS MPI as part of the Microsoft® Compute Cluster Pack. MS MPI uses the Microsoft WinSock Direct protocol for maximum compatibility and CPU efficiency. MS MPI can use any Ethernet interconnect that is supported by the Windows Server 2003 operating systems, as well as interconnects such as InfiniBand and Myrinet. Windows Compute Cluster Server 2003 supports the use of any network interconnect that has a WinSock Direct provider. Gigabit Ethernet provides a high-speed and cost-effective interconnect fabric, while InfiniBand and Myrinet are ideal for latency-sensitive and high-bandwidth applications.

The WinSock Direct protocol bypasses the TCP/IP stack, using Remote Direct Memory Access (RDMA) on supported hardware to improve performance and reduce CPU overhead. Figure 1 shows how MPI works with WinSock Direct drivers to bypass the TCP/IP stack where drivers are available.

Figure 1. WinSock Direct topology

Topology

Windows Compute Cluster Server 2003 supports a variety of network topologies, including those with a dedicated MPI network, those that use a private network for both cluster communications and MPI, and even a topology in which each node has a single network interface and all traffic shares the public network. The following five topologies are supported:

  • Three network adapters on each node. One network adapter is connected to the public (corporate) network; one to a private dedicated cluster management network; and one to a high-speed dedicated MPI network.

  • Three network adapters on the head node and two on each of the cluster nodes. The head node uses Internet Connection Sharing (ICS) to provide network address translation (NAT) services between the compute nodes and the public network, with each compute node having a connection to the private network and a connection to a high-speed dedicated MPI network.

  • Two network adapters on each node. One network adapter is connected to the public (corporate) network, and one is connected to the private dedicated cluster network.

  • Two network adapters on the head node and one on each of the compute nodes. The head node provides NAT between the compute nodes and the public network.

  • A single network adapter on each node with all network traffic sharing the public network. In this limited networking scenario, Remote Installation Services (RIS) deployment of compute nodes is not supported, and each compute node must be manually installed and activated.

Windows Compute Cluster Server 2003 is based on the Microsoft® Windows Server™ 2003, Standard x64 Edition operating system and is designed to integrate seamlessly with other Microsoft server products. For example, Microsoft® Operations Manager (MOM) can be used to monitor the cluster, or applications can integrate with Exchange Server to mail job status to the job owner.

Figure 2. Typical Compute Cluster Server network topology

In a debugging environment, the developer’s Visual Studio workstation needs direct access to the compute nodes to be able to do remote debugging.

Security and the MS MPI Implementation

The most significant difference between MS MPI and the reference MPICH2 implementation is in the way security is handled. MS MPI integrates with the Microsoft® Active Directory® directory service to simplify running jobs with user credentials instead of the root account.

Integration with Active Directory

Windows Compute Cluster Server 2003 is integrated with and dependent on Active Directory to provide security credentials for users and jobs that run on the cluster. Before the Compute Cluster Pack can be installed on the head node and the cluster actually created, the head node must be joined to an Active Directory domain or be promoted to be a Domain Controller for its own domain.

User accounts are used for all job creation and execution on the cluster. All jobs are executed using the credentials of the user submitting the job.

Credential and process management

When an MPI job is submitted, the credentials of the user submitting the job are used for that job. At no time are passwords or credentials passed in plaintext across the network. All credentials are passed using only authenticated and encrypted channels, as shown in Figure 3. Credentials are stored with the job data on the head node and deleted when the job completes. At the user’s discretion, the credentials can also be cached on the individual client computer in order to streamline job submission, but when this option is chosen, they are encrypted with a key known only to the head node. While a job is being computed, the credentials are used to create a logon token and then erased. Only the token is available to the processes being run, further isolating credentials from other processes on the compute nodes.

The processes that are running on the compute nodes are run as a single Windows job object, enabling the head node to keep track of job objects and clean up any processes when the job completes or is cancelled.

Figure 3. Credential management of MPI jobs

Conclusion

MPI and MPI2 are widely accepted specifications for managing messaging in high-performance clusters. Among the most widely used implementations of MPI is the open-source MPICH2 reference implementation developed by Argonne National Laboratory. Microsoft Windows Compute Cluster Server 2003 includes the Microsoft implementation of MPI (MS MPI). MS MPI is based on MPICH2 and is highly compatible with it: all 160+ APIs are identical to those defined by MPICH2, while MS MPI adds enhanced security and process management capabilities for enterprise environments. MS MPI uses WinSock Direct drivers to provide high-performance MPI network support for Gigabit Ethernet and InfiniBand adapters, and supports all adapters that have a WinSock Direct provider.

References

MPICH2 Home Page

(https://go.microsoft.com/fwlink/?LinkId=55115)

MPI tutorial at the Lawrence Livermore National Lab

(https://go.microsoft.com/fwlink/?LinkId=56096)

Deploying and Managing Compute Cluster Server 2003

(https://go.microsoft.com/fwlink/?LinkId=55927)

Using the Compute Cluster Server 2003 Job Scheduler

(https://go.microsoft.com/fwlink/?LinkId=55929)

Migrating Parallel Applications

(https://go.microsoft.com/fwlink/?LinkId=55931)

Parallel Debugging Using Visual Studio 2005

(https://go.microsoft.com/fwlink/?LinkId=55932)

Appendix: A Sample MPI Program

A frequently used example of an MPI program is calculation of pi, which can be expressed as follows:

#include <stdio.h>   /* printf */
#include <stdlib.h>  /* malloc */
#include <time.h>
#include <mpi.h>

void main(int argc, char *argv[])
{
    int    NumIntervals = 0;      // num intervals in the domain [0,1] of F(x) = 4 / (1 + x*x)
    double IntervalWidth = 0.0;   // width of intervals
    double IntervalLength = 0.0;  // length of intervals
    double IntrvlMidPoint = 0.0;  // x mid point of interval
    int    Interval = 0;          // loop counter
    int    done = 0;              // flag
    double MyPI = 0.0;            // storage for PI approximation results
    double ReferencePI = 3.141592653589793238462643;  // ref value of PI for comparison
    double PI;
    char   processor_name[MPI_MAX_PROCESSOR_NAME];
    char   (*all_proc_names)[MPI_MAX_PROCESSOR_NAME];
    int    numprocs;
    int    MyID;
    int    namelen;
    int    proc = 0;
    char   current_time[9];

    if (argc > 1)
    {
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &MyID);
        MPI_Get_processor_name(processor_name, &namelen);

        if (MyID == 0) {    /* if this is the Rank 0 node... */
            NumIntervals = (int)_strtoi64(argv[1], 0, 10);    /* read number of intervals from the command line argument */
            all_proc_names = malloc(numprocs * MPI_MAX_PROCESSOR_NAME);    /* allocate memory for all proc names */
        }

        /* send number of intervals to all procs */
        MPI_Bcast(&NumIntervals, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* get all processor names for this job */
        MPI_Gather(processor_name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, all_proc_names, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);

        if (NumIntervals == 0)
        {
            printf("command line arg must be greater than 0");
        }
        else
        {
            // approximate the value of PI
            IntervalWidth = 1.0 / (double)NumIntervals;

            for (Interval = MyID + 1; Interval <= NumIntervals; Interval += numprocs) {
                IntrvlMidPoint = IntervalWidth * ((double)Interval - 0.5);
                IntervalLength += (4.0 / (1.0 + IntrvlMidPoint * IntrvlMidPoint));
            }

            MyPI = IntervalWidth * IntervalLength;
            MPI_Reduce(&MyPI, &PI, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);