7 Conclusion and Perspectives

Applies To: Windows HPC Server 2008

We studied 12 different approaches to HPC clusters that can run two operating systems. We focused in particular on those that can run both operating systems simultaneously, which we named Hybrid Operating System Clusters (HOSC). The 12 approaches have dozens of possible implementations; we discussed the most common alternatives and derived technical recommendations for designing an HOSC.

This collaborative work between Microsoft and Bull provided the opportunity to build an HOSC prototype that delivers computing power under Linux Bull Advanced Server for Xeon and Windows HPC Server 2008 simultaneously. The prototype has two virtual management nodes installed on two Xen virtual machines running on a single host server with RHEL5.1, and four dual-boot compute nodes that boot with the Windows master boot record. We also provided a methodology for dynamically switching the OS type of some compute nodes without disturbing the other compute nodes.

A meta-scheduler based on Altair PBS Professional was implemented. It provides a single submission point for both Linux and Windows jobs, and it automatically adapts the distribution of OS types among the compute nodes to user needs (i.e., the pool of submitted jobs), using some simple rules given as examples.
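To make the idea of such a rule concrete, here is a minimal sketch in Python. It is not the rule set used in the prototype and it does not call any PBS Professional or HPCS interface: the `Node` class, the function names, and the proportional-share policy are all hypothetical, chosen only to illustrate how a distribution of OS types could follow the pool of submitted jobs. Demand is expressed as the number of nodes requested by queued jobs for each OS, and only idle nodes are proposed for switching so that running jobs are not disturbed.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of a dual-boot compute node."""
    name: str
    os: str     # "linux" or "windows"
    busy: bool  # True if a job is currently running on the node


def target_linux_count(total_nodes, linux_demand, windows_demand):
    """Return how many nodes should run Linux, proportionally to demand.

    Returns None when no job is queued, meaning "leave things as they are".
    """
    demand = linux_demand + windows_demand
    if demand == 0:
        return None
    target = round(total_nodes * linux_demand / demand)
    # Assumed policy: an OS with queued jobs keeps at least one node
    # (this only holds when the cluster has at least two nodes).
    if linux_demand > 0:
        target = max(target, 1)
    if windows_demand > 0:
        target = min(target, total_nodes - 1)
    return target


def nodes_to_switch(nodes, linux_demand, windows_demand):
    """Return a list of (node name, new OS) pairs for idle nodes to reboot."""
    target = target_linux_count(len(nodes), linux_demand, windows_demand)
    if target is None:
        return []
    current = sum(1 for n in nodes if n.os == "linux")
    delta = target - current
    if delta > 0:
        # More Linux nodes are needed: pick idle Windows nodes.
        candidates = [n for n in nodes if n.os == "windows" and not n.busy]
        new_os = "linux"
    else:
        # More Windows nodes are needed (or nothing to do when delta == 0).
        candidates = [n for n in nodes if n.os == "linux" and not n.busy]
        new_os = "windows"
    return [(n.name, new_os) for n in candidates[:abs(delta)]]


if __name__ == "__main__":
    cluster = [
        Node("n1", "linux", busy=True),
        Node("n2", "linux", busy=False),
        Node("n3", "windows", busy=False),
        Node("n4", "windows", busy=False),
    ]
    # Queued jobs request 6 Linux nodes and 2 Windows nodes in total.
    print(nodes_to_switch(cluster, linux_demand=6, windows_demand=2))
```

In a real deployment the demand figures would come from the scheduler queues and the returned list would drive the OS switch procedure. A production rule would probably also add hysteresis or a minimum interval between switches to avoid rebooting nodes back and forth.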

This successful project could be continued with the aim of improving the features of the current HOSC prototype. Possible improvements are to:

  • develop a single monitoring tool for the compute nodes of both OS types (e.g., based on Ganglia [35])

  • centralize user account management (e.g., with Samba [36])

  • work on interoperability between PBS and the HPCS job scheduler (e.g., by using the tools of OGF, the Open Grid Forum [37])

We could also work on the security aspects that were intentionally left aside during this first study. More intensive and exhaustive performance tests with virtual machines (e.g., InfiniBand ConnectX virtualization feature, virtual processor binding, etc.) could also be carried out. Finally, a third OS could be installed on our HOSC prototype to validate the general nature of the method presented.

More generally, the framework presented in this paper should be considered as a building block for more specific implementations. The varied requirements of real applications, environments, or loads could lead to significantly different or more sophisticated developments. We hope that this initial building block will help those who add subsequent layers, and we are eager to hear about successful production environments built on it.

Note

Do not hesitate to send your comments about this paper and your HOSC experiments to the authors: patrice.calegari@bull.net and thomas.varlet@microsoft.com.