Security Considerations in High Performance Computing
by Richard Carpenter, Principal Consultant, Microsoft Services, Southern California and Sanjay Pandit, Senior Consultant, Microsoft Services, Southern California
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Windows® HPC Server 2008 is a cost-effective, high-performance computing (HPC) solution. One of its primary advantages is that Windows HPC Server 2008 can be deployed, managed, and extended using familiar Windows tools and technologies. This also means that the foundation for securing Windows HPC Server 2008 is the same as that for securing Windows Server® 2008, and can be executed by taking advantage of resources such as the Windows Server 2008 Security Guide.
In addition to the broad guidance offered by the Windows Server 2008 Security Guide, there are elements of security unique to HPC, and the cluster architecture can be a significant contributor to overall security. A compute cluster contains two types of computers. The head node is the controlling node: the server that performs all security checks and orchestrates the operation of the rest of the compute nodes. The second type is the compute node, where work is actually performed. The head node can also act as a compute node, but this is usually only the case for small clusters of fewer than 10 nodes.
Along with compute resources, there is also networking. Three types of networks compose a compute cluster. The first is the enterprise network, typically the corporate local area network (LAN), which allows users who are not directly connected to the compute cluster to access the head node and, potentially, the compute nodes. The second is the private network, which provides a dedicated connection between the head node and the compute nodes. For small clusters, this network can sometimes be the enterprise network, but as the compute cluster grows, the added traffic can impact the corporate LAN and provide unwanted access to the compute nodes. The third is a dedicated network (preferably with high bandwidth and low latency) that carries parallel Message Passing Interface (MPI) application communication between cluster nodes.
Compute cluster security combines two elements: user credentials and network configuration. All user access to the compute cluster is managed through Active Directory® groups, which grant access rights to the different resources of the compute cluster. Two groups are used in compute cluster configurations: administrators and users. Administrators can manage the configuration of the cluster (for example, adding or removing nodes and managing user permissions). Users are allowed to run jobs on the compute cluster; depending on their rights, users may have access to some or all of the resources in the cluster.
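As a sketch, these two groups can be mapped to cluster roles from HPC PowerShell on the head node. The group names below are hypothetical, and the cmdlet names and parameters should be verified against your installed HPC Pack:

```shell
rem Run in HPC PowerShell on the head node (group names are hypothetical examples).
rem Grant an Active Directory group the right to submit and run jobs.
Add-HpcMember -Name "CONTOSO\HpcClusterUsers" -Role User

rem Grant a second group administrative rights over the cluster configuration.
Add-HpcMember -Name "CONTOSO\HpcClusterAdmins" -Role Administrator

rem List current members and their roles to confirm the change.
Get-HpcMember
```

A member of the users group can then submit work with the job command-line tool (for example, `job submit myapp.exe`); the scheduler validates the caller's credentials against these groups before the job runs.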
Compute clusters often use multiple network connections to achieve the performance requirements of the jobs they are performing. With compute clusters, network latency can be a major factor in the success or failure of a compute job, so many compute clusters will have a dedicated low-latency network fabric that connects all of the compute nodes and the head node together. In addition to the need for a low-latency network for computation, compute clusters often have one or more additional network connections, depending on the types of jobs being run on the compute nodes. These additional connections typically carry data and control information to the compute nodes, and results and status reports from the compute nodes back to the head node. These multiple networks can also be leveraged to maintain and reinforce the security of the cluster.
There are three basic networking configurations for an HPC cluster. The first is a two-network configuration for small clusters, in which the enterprise network performs double duty as the private network, carrying command and control traffic. The second network is the data network, where the low-latency MPI protocol is used to communicate between the compute nodes.
In the second configuration, all three network types are on three separate network interfaces. The enterprise network provides access to the larger corporate network, where raw data and results can be stored. The private network is where command/control traffic is kept, which guarantees that the head node can reach each compute node without interference from data traffic. The data network allows the compute nodes to communicate with each other without being impacted by the traffic over the enterprise or private networks.
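The three-interface layout described above corresponds to one of the network topologies that HPC Server 2008 supports, and it can be selected from HPC PowerShell. The cmdlet and topology names below (`Set-HpcNetworkTopology`, "EnterprisePrivateApplication") are our assumption of the names shipped with HPC Pack and should be verified against your installation:

```shell
rem Run in HPC PowerShell on the head node (cmdlet and topology names assumed; verify locally).
rem Show the network topology currently configured for the cluster.
Get-HpcNetworkTopology

rem Select the three-network topology: enterprise, private, and application (MPI)
rem networks, each on its own network interface.
Set-HpcNetworkTopology -Topology "EnterprisePrivateApplication"
```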
A variation of the second configuration maintains the three network connections to the head node, but each compute node gets only two connections: the private network and the data network. The reason for doing this is to improve security by reducing the attack surface of the compute cluster. If a user wants to run a job that requires large amounts of data, that data must be moved to the head node, where the compute nodes can access it through the private network. This also means that any resulting data must be copied off the head node at the end of the job to prevent loss of data when the next job is run. To increase the security of this configuration, many organizations turn on the Windows Firewall or install a third-party firewall to block all inbound traffic except remote desktop connections. This requires that end users perform a remote logon to the compute cluster head node to perform data transfers to or from the enterprise network. From the remote desktop session, users would copy data from the enterprise network to temporary storage on the head node before initiating the job that will pull data from the head node.
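The firewall lock-down described above can be sketched with the built-in netsh commands on the head node. Port 3389 is the standard Remote Desktop port; the rule name is arbitrary:

```shell
rem Block all inbound traffic by default while still allowing outbound (all profiles).
netsh advfirewall set allprofiles firewallpolicy blockinbound,allowoutbound

rem Allow only Remote Desktop (TCP 3389) inbound, so users can log on to stage data.
netsh advfirewall firewall add rule name="Allow RDP" dir=in action=allow protocol=TCP localport=3389
```

With this in place, any data transfer to or from the enterprise network must pass through an interactive remote desktop session on the head node, as the article describes.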
Through adoption of these strategies and deployment of an appropriate HPC architecture, an attacker’s ability to access the compute cluster and push malicious data to it will be significantly reduced. Combined with general Windows Server security best practices, we can establish network configurations that help ensure that the HPC platform and its data are more secure.