3 Approaches and Recommendations

Article
12/15/2009

Applies To: Windows HPC Server 2008

In this chapter, we will explain the different approaches to offer several operating systems on a cluster. The approaches discussed in Sections 3.1 and 3.2 are summarized in Table 1 on the next page.

3.1 A single operating system at a time

Let us examine the case where all nodes run the same OS. The cluster OS of the cluster is selected at boot time. Switching from an OS to another can be done by:

Re-installing the selected OS on the cluster if necessary. But since this process can be long it is not realistic for frequent changes. This is noted as approach 1 in Table 1.
Deploying a new OS image on the whole cluster depending on the OS choice. The deployment can be done on local disks or in memory with diskless compute nodes. It is difficult to deal with the OS change on the management node in such an environment: either the management node is dual-booted (this is approach 7 in Table 1), or an additional server is required to distribute the OS image of the MN. This can be interesting in some specific cases: on HPC clusters with diskless CN when the OS switches are rare, for example. Otherwise, this approach is not very convenient. The deployment technique can be used in a more appropriate manner for clusters with 2 simultaneous OSs (i.e., 2 MNs); this will be shown in the next Section with approaches 3 and 11.
Dual-booting the selected OS from dual-boot disks. Dual-booting the whole cluster (management and computing nodes) is a good and very practical solution that was introduced in Chapters 1 and 2. This approach, noted 6 in Table 1, is the easiest way to install and manage a cluster with several OSs but it can only apply for small clusters with few users when no flexibility is required. If only the MNs are on a dual-boot server while the CNs are installed with a single OS (half of the CNs having an OS while the others have another), the solution has no sense because only half of the cluster can be used at a time in this case (this is approach 5). If the MNs are on a dual-boot server while the CNs are installed in VMs (2 VMs being installed on each compute server), the solution has no real sense either because the added value of using VMs (quick OS switching for instance) is cancelled by the need of booting the MN server (this is approach 8).

Whatever the OS switch method, a complete cluster reboot is needed at each change. This implies cluster unavailability during reboots, a need for OS usage schedules and potential conflicts between user needs, hence a real lack of flexibility.

In Table 1, approaches 1, 5, 6, 7, and 8 define clusters that can run 2 OSs but not simultaneously. Even if such clusters do not stick to the Hybrid Operating System Cluster (HOSC) definition given in Chapter 1, they can be considered as a simplified approach of its concept.

Approaches with two operating systems

Table 1 Possible approaches to HPC clusters with 2 operating systems

3.2 Two simultaneous operating systems

The idea is to provide, with a single cluster, the capability to have several OSs running simultaneously on an HPC cluster. This is what we defined as a Hybrid Operating System Cluster (HOSC) in Chapter 1. Each compute node (CN) does not need to run every OS simultaneously. A single OS can run on a given CN while another OS runs on other CNs at the same time. The CNs can be dual-boot servers, diskless servers, or virtual machines (VM). The cluster is managed from separate management nodes (MN) with different OSs. MN can be installed on several physical servers or on several VMs running on a single server. In Table 1, approaches 2, 3, 4, 9, 10, 11, and 12 are HOSC.

HPC users may consider HPC clusters with two simultaneous OSs rather than a single OS at a time for four main reasons:

To improve resource utilization and adapt the workload dynamically by easily changing the ratio of OSs (e.g., Windows vs. Linux compute nodes) in a cluster for different kinds of usage.
To be able to migrate smoothly from one OS to the other, giving time to port applications and train users.
Simply to be able to try a new OS without stopping the already installed one (i.e., install a HPCS cluster at low cost on an existing Bull Linux cluster or install a Bull Linux cluster at low cost on an existing HPCS cluster).
To integrate specific OS environments (e.g., with legacy OSs and applications) in a global IT infrastructure.

The simplest approach for running 2 OSs on a cluster is to install each OS on half (or at least a part) of the cluster when it is built. This approach is equivalent to building 2 single OS clusters! Therefore it cannot be classified as a cluster with 2 simultaneous OSs. Moreover, this solution is expensive with its 2 physical MN servers and it is absolutely not flexible since the OS distribution (i.e., the OS allocation to nodes) is fixed in advance. This approach is similar to approach 1 already discussed in the previous section.

An alternative to this first approach is to use a single physical server with 2 virtual machines for installing the 2 MNs. In this case there is no additional hardware cost but there is still no flexibility for the choice of the OS distribution on the CNs since this distribution is done when the cluster is built. This approach is noted 9.

On clusters with dual-boot CNs the OS distribution can be dynamically adapted to the user and application needs. The OS of a CN can be changed just by rebooting the CN aided by a few simple dual-boot operations (this will be demonstrated in Sections 6.3 and 6.4). With such dual-boot CNs, the 2 MNs can be on a single server with 2 VMs: this approach, noted 10, is very flexible and requires no additional hardware cost. It is a good HOSC solution, especially for medium-sized clusters.

With dual-boot CNs, the 2 MNs can also be installed on 2 physical servers instead of 2 VMs: this approach, noted 2, can only be justified on large clusters because of the extra cost due to a second physical MN.

A new OS image can be (re-)deployed on a CN on request. This technique allows changing the OS distribution on CNs on a cluster quite easily. However, this is mainly interesting for clusters with diskless CNs because re-deploying an OS image for each OS switch is slower and consumes more network bandwidth than the other techniques discussed in this paper (dual-boot or virtualization). This can also be interesting if the OS type of CNs is not switched too frequently. The MNs can then be installed in 2 different ways: either the MNs are installed on 2 physical servers (this is approach 3 that is interesting for large clusters with diskless CNs or when the OS type of CNs is rarely switched) or they are installed on 2 VMs (this is approach 11 that is interesting for small and medium size diskless clusters).

The last technique for installing 2 CNs on a single server is to use virtual machines (VM). In this case, every VM can be up and running simultaneously or only a single VM may run on each compute server while the others are suspended. The switch from an OS to another can then be done very quickly. Using several virtual CNs of the same server simultaneously is not recommended since the total performance of the VMs is bounded by the native performance of the physical server and so no benefit can be expected from such a configuration. Installing CNs on VMs makes it easier and quicker to switch from one OS to another compared to a dual-boot installation but performance of the CNs may be decreased by the computing overhead due to the virtualization software layer. Section 3.5 briefly presents articles that analyze the performance impact of virtualization for HPC. Once again, the 2 MNs can be installed on 2 physical servers (this is approach 4 for large clusters), or they can be installed on 2 VMs (this is approach 12 for small and medium-sized clusters). This latter approach is 100% virtual with only virtual nodes. This is the most flexible solution, and very promising for the future; however it is too early to use it now because of performance uncertainties.

For the approaches with 2 virtual nodes (2 CNs or 2 MNs) on a server, the host OS can be Linux or Windows and any virtualization software could be used. The 6 approaches using VMs have thus dozens of virtualization implementations. The key points to check for choosing the right virtualization environment are listed here by order of importance:

List of supported guest OSs
Virtual resource limitations (maximum number of virtual CPUs, maximum number of network interfaces, virtual/physical CPU binding features, etc.)
Impact on performance (CPU cycles, memory access latency and bandwidth, I/Os, MPI optimizations)
VM management environment (tools and interfaces for VM creation, configuration and monitoring)

Also, for the approaches with 2 virtual nodes (2 CNs or 2 MNs) on a server, the 2 nodes can be configured on 2 VMs or one can be a VM while the other is just installed on the server host OS. When upgrading an existing HPC cluster from a classical single OS configuration to an HOSC configuration, it might look interesting at first glance to configure a MN (or a CN) on the host OS. For example, one virtual machine could be created on an existing management node and the second management node could be installed on this VM. Even if this configuration looks nice and quick and easy to setup, it should never be used. Indeed, running any application or using resources of the host OS is not a recommended virtualization practice. This creates a non-symmetrical situation between applications running on the host OS and those running on the VM. This may lead to load balancing issues and resource access failures.

On an HOSC with dual-boot CNs, re-deployed CNs or virtual CNs, the OS distribution can be changed dynamically without disturbing the other nodes. This could even be done automatically by a resource manager in a unified batch environment.

Note

The batch solution is not investigated in this study but could be considered in the future.

The dual boot technique limits the number of installed OSs on a server because only 4 primary partitions can be declared in the MBR. So, on an HOSC, if more OSs are necessary and no primary partition is available anymore, the best solution is to install virtual CNs, and to run them one by one while the others are suspended on each CN (depending on the selected OS for that CN). The MNs should be installed on VMs as much as possible (like in approach 12), but several physical servers can be necessary (as in approach 4). This can happen in the case of large clusters for which the cost of an additional server is negligible. This can also happen so as to keep a good level of performance when a lot of OSs are installed on the HOSC and thus many MNs are needed.

3.3 Specialized nodes

In an HPC cluster, specialized nodes dedicated to certain tasks are often used. The goal is to distribute roles, for example, in order to reduce the management node (MN) load. We can usually distinguish 4 types of specialized nodes: the management nodes, the compute nodes (CN), the I/O nodes and the login nodes. A cluster usually has 1 MN and many CNs. It can have several login and I/O nodes. On small clusters, a node can be dedicated to several roles: a single node can be a management, login and I/O node simultaneously.

Management node

The management node (MN), named Head Node (HN) in the HPCS documentation, is dedicated to providing services (infrastructure, scheduler, etc.) and to running the cluster management software. It is responsible for the installation and setup of the compute nodes (e.g., OS image deployment).

Compute nodes

The compute nodes (CN) are dedicated to computation. They are optimized for code execution, so they are running a limited number of services. Users are not supposed to log in on them.

I/O nodes

I/O nodes are in charge of Input/Output requests for the file systems.

For I/O intensive applications, an I/O node is necessary to reduce MN load. This is especially true when the MNs are installed on virtual machines (VM). When a virtual MN handles heavy I/O requests it can dramatically impact the I/O level of performance of the second virtual MN.

If an I/O node is aimed at serving nodes with different OSs then it must have at least one network interface for each OS subnet (i.e., a subnet that is declared for every node that runs with the same OS). Section 4.4 and 4.5 show an example of OS subnets.

An I/O node could be installed with Linux or Windows for configuring a NFS server. NFS clients and servers are supported on both OSs. But the Lustre file system (delivered by Bull with XBAS) is not available for Windows clusters so Lustre I/O nodes can only be installed on Linux I/O nodes (for the Linux CN usage only). Other commercial cluster / parallel file systems are available for both Linux and Windows (e.g., CXFS).

Note

Lustre and GPFSTM clients for Windows are announced to be available soon.

The I/O node can serve one file system shared by both OS nodes or two independent file systems (one for each OS subnet). In the case of 2 independent file systems, 1 or 2 I/O nodes can be used.

Login nodes are used as cluster front end for user login, code compilation and data visualization. They are specially used to:

login
develop, edit and compile programs
debug parallel code programs
submit a job to the cluster
visualize the results returned by a job

Login nodes could run a Windows or Linux OS and they can be installed on dual-boot servers, virtual machines or independent servers. A login node is usually only connected to other nodes running the same OS as its own.

For the HPCS cluster, the use of a login node is not mandatory, as a job can be submitted from any Windows client with the Microsoft HPC Pack installed (with the scheduler graphical interface or command line) by using an account into the cluster domain. A login node can be used to provide a gateway to enter into the cluster domain.

3.4 Management services

From the infrastructure configuration point of view, we should study the potential interactions between services that can be delivered from each MN (e.g., DHCP, TFTP, NTP, etc.). The goal is to avoid any conflict between MN services while cluster operations or computations are done simultaneously on both OSs. This is especially complex during the compute node boot phase since the PXE procedure requires DHCP and TFTP access from its very early start time. A practical case with XBAS and HPCS is shown in Section 4.4.

At least the following services are required:

a unique DHCP server (for PXE boot)
a TFTP server (for PXE boot)
a NFS server (for Linux compute node deployment)
a CIFS server (for HPCS compute node deployment)
a WDS server (for HPCS deployment)
a NTP server (for the virtualization software and for MPI application synchronization)

3.5 Performance impact of virtualization

Many scientific articles deal with the performance impact of virtualization on servers in general. Some recent articles are more focused on HPC requirements.

One of these articles compares virtualization technologies for HPC (see [25]). This paper systematically evaluates various VMs for computationally intensive HPC applications using various standard scientific benchmarks using VMware Server, Xen, and OpenVZ. It examines the suitability of full virtualization, para-virtualization, and operating system-level virtualization in terms of network utilization, SMP performance, file system performance, and MPI scalability. The analysis shows that none match the performance of the base system perfectly: OpenVZ demonstrates low overhead and high performance, Xen demonstrated excellent network bandwidth but its exceptionally high latency hindered its scalability, VMware Server, while demonstrating reasonable CPU-bound performance, was similarly unable to cope with the NPB MPI-based benchmark.

Another article evaluates the performance impact of Xen on MPI and process execution for HPC Systems (see [26]). It investigates subsystem and overall performance using a wide range of benchmarks and applications. It compares the performance of a para-virtualized kernel against three Linux operating systems and concludes that in general, the Xen para-virtualizing system poses no statistically significant overhead over other OS configurations.

3.6 Meta-scheduler for HOSC

Goals

The goal of a meta-scheduler used for an HOSC can be:

Purely performance oriented: the most efficient OS is automatically chosen for a given run (based on backlog, statistics, knowledge data base, input data size, application binary, etc)
OS compatibility driven: if an application is only available for a given OS then this OS must be used!
High availability oriented: a few nodes with each OS are kept available all the time in case of requests that must be treated extremely quickly or in case of failure of running nodes.
Energy saving driven: the optimal number of nodes with each OS are booted while the others are shut down (depending on the number of jobs in queue, the profile of active users, job history, backlog, time table, external temperature, etc.)

OS switch techniques

The OS switch techniques that a meta-scheduler can use are those already discussed at the beginning of Chapter 3 (see Table 1). The meta-scheduler must be able to handle all the processes related to these techniques:

Reboot a dual-Boot compute nodes (or power it on and off on demand)
Activate/deactivate virtual machines that work as compute nodes
Re-deploy the right OS and boot compute nodes (on diskless servers for example)

Provisioning and distribution policies

The OS type distribution among the nodes can be:

Unplanned (dynamic): the meta-scheduler estimates dynamically the optimal size of node partitions with each OS type (depending on job priority, queue, backlog, etc.), then it grows and shrinks these partitions accordingly by switching OS type on compute nodes. This is usually called “just in time provisioning.”
Planned (dynamic): the administrators plan the OS distribution based on time, dates, team budget, project schedules, people vacations, etc. The size of the node partitions with each OS type are fixed for given periods of time. This is usually called “calendar provisioning.”
Static: the size of node partitions with each OS type are fixed once for all and the meta-scheduler cannot switch OS type. This is the simplest and less efficient case.