4 Technical Choices for Designing a Hybrid Operating System Cluster Prototype
Updated: October 7, 2009
Applies To: Windows HPC Server 2008
We want to build a flexible medium-sized HOSC with XBAS and HPCS. Only a small 5-server cluster is available for this purpose, but it is sufficient to simulate the usage of a medium-sized cluster. The cluster has InfiniBand and Gigabit networks. The complete description of the hardware is given in Appendix E. The possible approaches were discussed in the previous chapter; this chapter describes the choices made for this particular 5-server cluster. In the remainder of the document, this cluster is named the HOSC prototype.
4.1 Cluster approach
According to the recommendations given in the previous chapter, the most appropriate approach for medium-sized clusters is the one with 2 virtual management nodes on one server and dual-boot compute nodes. This is the approach numbered 10 in Table 1 of Section 3.
4.2 Management node
For the virtualization software, we cannot choose Hyper-V, since the XBAS VM must be able to use more than a single CPU to serve cluster management requests in the best conditions. Nor can we choose virtualization software that does not support HPCS. That leaves VMware and Xen, both of which fulfill the requirements for our prototype. RHEL5.1 is delivered with the XBAS 5v1.1 software stack, so it is the most consistent Linux choice for the host OS. We therefore chose Xen as our virtualization software, since it is included in the RHEL5.1 distribution. Figure 7 shows the MN architecture for our HOSC prototype.
4.3 Compute nodes
We have chosen approach 10, so CNs are dual-boot servers with XBAS and HPCS installed on local disks. We chose the Windows MBR for dual-booting CNs because it is easier to change the active partition of a node than to edit its grub.conf configuration file at each OS switch request. This is especially true when the node is running Windows, since the grub.conf file is stored on the Linux file system: a common file system (on a FAT32 partition, for example) would then be needed to share the grub.conf file.
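To illustrate why changing the active partition is a small operation, the following Python sketch flips the active flag in an in-memory copy of a classic MBR sector. It relies only on the standard MBR layout (four 16-byte partition entries starting at byte 446, status byte 0x80 for the active entry, boot signature 0x55AA at byte 510). The function names are hypothetical, and this is not the script used on the prototype (those scripts are listed in Appendices D.1.3 and D.2.3); it works on a byte buffer and never touches a disk.

```python
MBR_SIZE = 512
PARTITION_TABLE_OFFSET = 446   # first of four 16-byte partition entries
ENTRY_SIZE = 16
ACTIVE_FLAG = 0x80             # status byte of the active (bootable) entry
BOOT_SIGNATURE = b"\x55\xaa"   # bytes 510-511 of a valid MBR


def set_active_partition(mbr: bytes, index: int) -> bytes:
    """Return a copy of the MBR with only partition `index` (0-3) marked active."""
    if len(mbr) != MBR_SIZE or mbr[510:512] != BOOT_SIGNATURE:
        raise ValueError("not a valid 512-byte MBR sector")
    if not 0 <= index < 4:
        raise ValueError("partition index must be between 0 and 3")
    sector = bytearray(mbr)
    for i in range(4):
        offset = PARTITION_TABLE_OFFSET + i * ENTRY_SIZE
        # Exactly one entry keeps the active flag; the others are cleared.
        sector[offset] = ACTIVE_FLAG if i == index else 0x00
    return bytes(sector)


def active_partition(mbr: bytes):
    """Return the index of the active partition, or None if none is active."""
    for i in range(4):
        if mbr[PARTITION_TABLE_OFFSET + i * ENTRY_SIZE] == ACTIVE_FLAG:
            return i
    return None
```

Only four status bytes change, whichever OS is running, which is why no shared file system is needed for this method. The sketch does not check that the target entry describes a real partition; a real tool should.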
When the OS type of CNs is switched manually, we decided to allow OS type switch commands to be sent only from the MN that runs the same OS as the CN being switched. In other words, the HPCS MN can “give up” one of its CNs to the XBAS cluster and the XBAS MN can “give up” one of its CNs to the HPCS cluster, but no MN can “take” a CN from the cluster with a different OS. This rule was chosen to minimize the risk of switching the OS of a CN while it is being used for computation with its current OS configuration. When the OS type of CNs is switched automatically by a meta-scheduler, OS type switch commands are sent from the meta-scheduler server. To help switch the OS on CNs from the MNs, simple scripts were written. They are listed in Appendices D.1.3 and D.2.3, and an example of their use is shown in Sections 6.3 and 6.4. Depending on the OS type that is booted, the node has a different hostname and IP address. This information is sent by a DHCP server whose configuration is updated at each OS switch request, as explained in the next section.
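The manual switch rule can be expressed as a small check. The sketch below is only an illustration of the rule as stated above; the function name and the OS labels are assumptions, and the actual scripts are those listed in the appendices.

```python
KNOWN_OS = {"XBAS", "HPCS"}


def may_switch(requesting_mn_os: str, cn_current_os: str, target_os: str) -> bool:
    """Manual rule: an MN may only give up a CN that currently runs its own OS.

    Returns True when the MN sending the command runs the same OS as the CN
    and the CN is being moved to the other OS; False otherwise (for example
    when an MN tries to "take" a CN from the other cluster).
    """
    for os_type in (requesting_mn_os, cn_current_os, target_os):
        if os_type not in KNOWN_OS:
            raise ValueError(f"unknown OS type: {os_type}")
    return requesting_mn_os == cn_current_os and target_os != cn_current_os
```

For example, `may_switch("HPCS", "HPCS", "XBAS")` is allowed (the HPCS MN gives up one of its CNs), while `may_switch("HPCS", "XBAS", "HPCS")` is refused (the HPCS MN would be taking a CN from the XBAS cluster).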
Figure 7 Management node virtualization architecture
4.4 Management services
We have to make choices to create a global infrastructure architecture for deploying, managing and using the two OSs on our HOSC prototype:
The DHCP service is the critical part, as it is the single point of entry when a compute node boots. In our prototype it runs on the XBAS management node. The DHCP configuration file contains a section for each node with its characteristics (hostname, MAC and IP address) and the PXE information. Depending on the administrator's needs, this section can be changed to deploy and boot XBAS or HPCS on a compute node (see an example of dhcp.conf file changes in Appendix D.2.2).
WDS and/or TFTP server: each management node has its own server because the installation procedures are different. A booting compute node is directed to the correct server by the DHCP server.
Directory Service is provided by Active Directory (AD) for HPCS and by LDAP for XBAS. Our prototype will not offer a unified solution, but since synchronization mechanisms between AD and LDAP exist, a unified solution could be investigated.
DNS: this service can be provided by the XBAS management node or the HPCS head node. The DNS should be set as dynamic in order to provide simpler access for the AD. In our prototype, we set a DNS server on the HPCS head node for the Windows nodes, and we use /etc/hosts files for name resolution on XBAS nodes.
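As a rough illustration of the per-node DHCP section described above, the following Python sketch renders a host entry for one compute node and swaps its content depending on the OS to boot. It assumes an ISC dhcpd-style syntax for the configuration file. All hostnames, IP addresses, server addresses and boot file names are placeholders invented for the example; the real values are those of Table 2 and Appendix D.2.2.

```python
# Hypothetical per-OS settings for one compute node (placeholders only).
PROFILES = {
    "XBAS": {
        "hostname": "xbas1",
        "ip": "192.0.2.11",
        "next_server": "192.0.2.1",          # TFTP server on the XBAS MN
        "boot_file": "xbas-boot-file-placeholder",
    },
    "HPCS": {
        "hostname": "hpcs1",
        "ip": "192.0.2.21",
        "next_server": "192.0.2.2",          # WDS server on the HPCS HN
        "boot_file": "hpcs-boot-file-placeholder",
    },
}


def render_host_section(hostname, mac, ip, next_server, boot_file):
    """Render one host section in ISC dhcpd-style syntax."""
    return (
        f"host {hostname} {{\n"
        f"    hardware ethernet {mac};\n"
        f"    fixed-address {ip};\n"
        f'    option host-name "{hostname}";\n'
        f"    next-server {next_server};\n"
        f'    filename "{boot_file}";\n'
        f"}}\n"
    )


def section_for(mac, os_type):
    """Return the host section for the node with this MAC and the given OS."""
    profile = PROFILES[os_type]
    return render_host_section(
        profile["hostname"], mac, profile["ip"],
        profile["next_server"], profile["boot_file"],
    )
```

The MAC address is the only value that stays the same across both sections: the hostname, IP address and PXE target all follow the OS type, which is how a booting node is directed to the correct server.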
Recommendations given in Section 3.4 can be applied to our prototype by configuring the services as shown in Figure 8 and in Table 2.
Figure 8 Architecture of management services
Table 2 Network settings for the services of the HOSC prototype
The netmask is set to 255.255.0.0 because it must provide connectivity between Xen domain 0 and each DomU virtual machine.
Figures 9 and 10 describe the XBAS and HPCS compute node deployment steps, respectively, while Figures 11 and 12 describe the XBAS and HPCS compute node normal boot steps on our HOSC prototype. They show how the PXE operations detailed in Figures 3, 4, 5 and 6 of Chapter 2 are consistently adapted to our heterogeneous OS environment, with a single DHCP server on the XBAS MN and a Windows MBR on the CNs.
Figure 9 Deployment of a XBAS compute node on our HOSC prototype
Figure 10 Deployment of a HPCS compute node on our HOSC prototype
Figure 11 Boot of a XBAS compute node on our HOSC prototype
Figure 12 Boot of a HPCS compute node on our HOSC prototype
4.5 HOSC prototype architecture
The application network (i.e., InfiniBand network) should not be on the same subnet as the private network (i.e., gigabit network): we chose 172.16.0.[1-5] and 172.16.1.[1-5] IP address ranges for the application network address assignment.
The complete cluster architecture that results from the decisions taken in the previous sections is shown in Figure 13 below:
Figure 13 HOSC prototype architecture
If for some reason the IB interface cannot be configured on the HN, you should set up a loopback network interface instead and configure it with the IPoIB IP address (e.g., 172.16.1.1 in Figure 13). If for some reason the IB interface cannot be configured on the MN, its setup can be skipped, since connecting the IB interface on the MN is not mandatory.
In the next chapter we will show how to install and configure the HOSC prototype with this architecture.
4.6 Meta-scheduler architecture
Without a meta-scheduler, users need to connect to the management node of the required cluster in order to submit their jobs. In this case, each cluster has its own management node with its own scheduler (as shown on the left side of Figure 14). By using a meta-scheduler, we offer a single point of entry to use the power of the HOSC, whatever the OS type required by the job (as shown on the right side of Figure 14).
Figure 14 HOSC meta-scheduler architecture (in order to have a simpler scheme, the HOSC is represented as two independent clusters: one with each OS type)
On the meta-scheduler, we create two job queues: one for the XBAS cluster and one for the HPCS cluster. Depending on the user request, each job is automatically redirected to the correct cluster. The meta-scheduler also manages the switch from one OS type to the other according to the clusters' workload.
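The two roles of the meta-scheduler (routing jobs to a per-OS queue and deciding when a node should change OS) can be sketched as follows. The queue names and the switching policy are assumptions made for this illustration only; they are not the configuration or the policy used on the prototype, which is described in the next chapter.

```python
# Hypothetical queue names, one per OS type.
QUEUES = {"XBAS": "xbas_queue", "HPCS": "hpcs_queue"}


def route_job(required_os: str) -> str:
    """Return the queue a job should be submitted to for the requested OS."""
    if required_os not in QUEUES:
        raise ValueError(f"unknown OS type: {required_os}")
    return QUEUES[required_os]


def pick_switch(pending: dict, idle: dict):
    """Suggest an OS switch as (from_os, to_os), or None.

    Illustrative policy: move a node toward a cluster that has pending jobs
    and no idle node, but only from a cluster that has idle nodes and no
    pending jobs of its own.
    """
    for to_os in QUEUES:
        from_os = "HPCS" if to_os == "XBAS" else "XBAS"
        starving = pending[to_os] > 0 and idle[to_os] == 0
        donor_free = idle[from_os] > 0 and pending[from_os] == 0
        if starving and donor_free:
            return (from_os, to_os)
    return None
```

With 3 XBAS jobs pending, no idle XBAS node, and 2 idle HPCS nodes with an empty HPCS queue, `pick_switch` suggests moving one node from HPCS to XBAS. When both clusters have idle nodes or both have pending work, it suggests nothing.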
We chose PBS Professional as the meta-scheduler for our prototype because of the experience we already have with it on Linux and Windows platforms. The PBS server should be installed on a node that is accessible from every other node of the HOSC. We chose to install it on the XBAS management node. PBS MOM (Machine Oriented Mini-server) is installed on all compute nodes (HPCS and XBAS) so that they can be controlled by the PBS server.
In the next chapter we will show how to install and configure this meta-scheduler on our HOSC prototype.