Research Notes

Using High-Performance Computing in the Fight against HIV

Kristin Firth and Mia Matusow

When you think of people conducting medical research, do you envision men and women in white lab coats working with test tubes and microscopes? Perhaps a team of chemists at a university or pharmaceutical company? These are common and certainly real images of medical research, but they represent just one aspect of the field. In reality, some medical research is performed in very different settings.

You might be surprised to learn that Microsoft Research is playing a key role in the effort to develop a vaccine for the human immunodeficiency virus (HIV). What's more, Microsoft is doing this without a single Bunsen burner in sight. The setting looks like any other office space, furnished with regular desks and lots of PCs.

So what role does Microsoft play in all this research? Those people in white lab coats are the scientists at various universities and research centers, including Massachusetts General Hospital in Boston, the University of British Columbia, the University of Washington, the Fred Hutchinson Cancer Research Center, and Murdoch University in Australia, who gather vast amounts of data in their projects. The eScience group at Microsoft Research, which consists of about half a dozen people, mostly located in Redmond, WA (with one person in New Mexico), helps those researchers process and analyze that data. Microsoft Research first got involved in this work back in 2003, and today the eScience group collaborates with scientists on a variety of projects dealing with issues that affect society. The team works closely with external scientists, building custom solutions, crunching numbers, and analyzing results.

In one high-priority project, Microsoft is helping analyze data related to HIV mutation patterns. David Heckerman, MD, PhD, Senior Researcher for Microsoft, explains that HIV mutates rapidly when attacked by an infected person's immune system. "We're working on a study that is attempting to determine how HIV mutates in response to the host's immune system. To do so, we are looking for correlations between an individual's immune system type and the HIV protein sequences that infect him."

Scientists use samples from people infected with HIV to determine their immune system types and HIV sequences. That's where the number crunching comes in. The study searches for correlations between 3,000 HIV amino acids and hundreds of immune system types across many hundreds of individuals. "We've devised statistical tests that produce more trustworthy correlations, with a reduced number of false positives and negatives," says Carl Kadie, PhD, Principal Research Software Development Engineer for Microsoft. "However, those tests demand a great deal of compute power, and the more subjects we include, the better. The idea is to run millions of simulations to reveal the most trustworthy correlations."
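The article doesn't spell out the statistical machinery, but the general shape of such a test can be sketched. The Python fragment below is a minimal, hypothetical illustration, not the researchers' actual method: it scores the association between one immune system type and one amino-acid variant, then estimates how often chance alone would produce a score that strong by rescoring shuffled data. Multiplying this by thousands of amino-acid positions, hundreds of immune types, and millions of shuffles is what makes the compute demand so large.

import random

def association(hla_positive, has_mutation):
    # Count co-occurrence of an immune type and an amino-acid variant.
    # Both arguments are parallel lists of booleans, one per subject.
    return sum(1 for h, m in zip(hla_positive, has_mutation) if h and m)

def permutation_p_value(hla_positive, has_mutation, n_shuffles=100_000):
    # Estimate how often chance alone produces an association at least
    # as strong as the observed one. A small p-value suggests a real
    # link between the immune type and the viral mutation.
    observed = association(hla_positive, has_mutation)
    shuffled = list(has_mutation)
    at_least_as_strong = 0
    for _ in range(n_shuffles):
        random.shuffle(shuffled)  # break any true link between the lists
        if association(hla_positive, shuffled) >= observed:
            at_least_as_strong += 1
    return (at_least_as_strong + 1) / (n_shuffles + 1)

# Toy data: 8 subjects with hypothetical immune and mutation status.
hla = [True, True, True, True, False, False, False, False]
mut = [True, True, True, False, False, False, True, False]
print(permutation_p_value(hla, mut, n_shuffles=10_000))

The team's actual tests are presumably more sophisticated, but the pattern holds: each position/type pair is an independent piece of work, which is exactly the shape of problem that parallel hardware handles well.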

Using HPC to Fight HIV

Just a few years back, processing limitations hampered research efforts. Armed with only half a dozen computers, the Microsoft researchers involved in this project didn't have enough processing power to perform analysis in an adequate amount of time. Looking across every position along the genome of the virus and the different immune types would have taken an entire year for just 200 subjects. Even if the researchers had 20 computers to dedicate to this analysis, they still would have faced the fundamental problem of running the tests manually on 20 separate computers, receiving 20 individual sets of results, and needing another program (as well as more time) to tabulate the separate sets of results. Managing multiple jobs, gathering partial output, and all the other individual tasks involved would have taken too much time. So they turned to the High-Performance Computing (HPC) group at Microsoft.

In 2006, the researchers implemented Windows® Compute Cluster Server 2003. This HPC solution offered a straightforward way for the researchers to harness the power of many computers working together. Essentially, Windows Compute Cluster Server allows the work to be distributed across server nodes running in parallel. The task of distributing data among nodes, managing the data, and combining the results is all automated. The solution includes setup procedures, a suite of management tools, and an integrated job scheduler.
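In other words, the cluster automates the classic scatter-gather pattern that the researchers would otherwise have scripted by hand. As a rough, hypothetical illustration (using Python's multiprocessing module on a single machine as a stand-in for cluster nodes), the pattern looks like this:

from multiprocessing import Pool

def analyze(work_item):
    # Stand-in for one unit of correlation analysis.
    return work_item * work_item

def merge(partials):
    # Stand-in for tabulating the partial result sets.
    return sum(partials)

if __name__ == "__main__":
    work_items = range(1_000)
    # The pool plays the role of the cluster: it scatters work items
    # across workers and gathers the partial results back in order.
    with Pool(processes=8) as pool:
        partials = pool.map(analyze, work_items, chunksize=125)
    print(merge(partials))

On a real cluster the scheduler handles the scattering, gathering, and failure recovery across physical machines, which is precisely the bookkeeping that had previously consumed the researchers' time.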

Finally, with this solution, the most significant technical roadblocks were overcome: compute power was no longer limited and the process of distributing work and managing the data was fully automated.

The Setup

For Microsoft Research, the compute cluster approach was a great fit. Windows Compute Cluster Server 2003 is well suited for projects that involve applications performing the same operation over and over—for instance, when problem-solving and analysis can be accelerated by running tasks in parallel. However, deploying this solution should not be undertaken lightly, since high-performance computing can be capital-intensive—especially in terms of power and cooling. It involves multiple servers that often run at 100 percent CPU utilization for weeks on end.

The next steps are to determine how large a cluster the organization can afford and where it can be housed. If an environment is already Windows-based with Active Directory® in place, administrators have the infrastructure they need to install, deploy, and support a cluster. In environments where Active Directory isn't already in use, some additional configuration steps are required.

You should also decide how to provision nodes within the compute cluster, which means choosing between the deployment tools that come with Windows Compute Cluster Server 2003 and your own internal deployment techniques. You need to deploy Windows Compute Cluster Server 2003 itself along with the software applications that you want to run on the cluster. User access must also be established so users can connect to the cluster and submit jobs, which can be done through the included graphical UI or through command-line interfaces.

Windows Compute Cluster Server 2003 is a 64-bit operating system. (The typical architecture of a Windows Compute Cluster Server 2003 environment is illustrated in Figure 1.) At the time of writing this column (June 2007), Microsoft Research was running a variety of different applications on a cluster built upon 25 IBM eServer 326 servers. Each of these servers features two AMD Opteron processors running at 2.6 GHz.

Figure 1 Harnessing power with Windows Compute Cluster Server 2003


After deploying Windows Compute Cluster Server, the group updated the application that it uses to perform genetic correlation to run on the cluster. The complexity of deploying applications on a compute cluster varies; the application itself determines how much additional programming is necessary.

At first, the group used the tools built into Windows Compute Cluster Server to quickly establish a generic UI. That was just a quick fix, however, and the group soon created its own custom Web application, which provides greater flexibility and supports the option of exposing some of the cluster nodes to scientists outside of Microsoft. In fact, Microsoft is part of the Open Grid Forum, and some of the Microsoft Research clusters are made available to users from universities around the world, enabling researchers to collaborate and share the workload.

In addition to its graphical UIs, Windows Compute Cluster Server also supports command-line operations, allowing users to write scripts. And it provides rich APIs that can be used to write programs that interact directly with the Windows Compute Cluster Job Scheduler—a technique that Microsoft Research chose to utilize.
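What the scripted route might look like is easy to sketch. The hypothetical Python script below shells out to the job command-line tool that ships with Windows Compute Cluster Server 2003 to submit a 50-slice parameter sweep; treat the exact switch names as illustrative, since this article doesn't document them.

import subprocess

# Hypothetical parameter sweep: submit one single-processor job per data
# slice. "job submit" is part of the Windows Compute Cluster Server 2003
# command-line tools; the switch names shown here are illustrative.
for slice_id in range(1, 51):
    subprocess.run(
        ["job", "submit", "/numprocessors:1",
         f"/stdout:results\\slice{slice_id}.txt",
         "analyze.exe", f"/slice:{slice_id}"],
        check=True)  # raise if the scheduler rejects the submission

Microsoft Research went a step further than scripting, using the scheduler APIs so that its custom Web application could create and track jobs programmatically.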

Getting Results

Thanks to its use of high-performance computing, the eScience group at Microsoft Research is making enormous strides in the race to develop an HIV vaccine. "With high-performance computing, we've been able to accelerate our time to insight," says Heckerman. "Several of the external groups with which we're collaborating now use our statistical techniques and share their findings. And as a result, scientists are already coming up with new hypotheses for us to test. Before, getting results would have taken a year for every step of the process. Now it takes just a single day."

Before Windows Compute Cluster Server, the methods used by Microsoft Research would have been impractical due to the time required to perform analysis. "With Windows Compute Cluster Server," says Kadie, "we can run 50 jobs of 200,000 work items each in the same amount of time that it used to take to run 1 job."

Now that the eScience group has massive computation capabilities at hand, it can conduct tests with many runs on simulated data. These simulated runs are critical for determining which results on real data are interesting; the more simulations performed, the more reliable the results.

Lessons Learned

Microsoft Research has obviously benefited from the use of Windows Compute Cluster Server, but the relationship is symbiotic. Developers in the High-Performance Computing group, which is working on a second version of Windows Compute Cluster Server, continue to gain insight from the feedback the eScience group provides.

In particular, the HPC group has learned a great deal about resource allocation by monitoring and analyzing the behavior of the Microsoft Research cluster to determine the best way to balance the cluster's resources among multiple users. For instance, on this project, a user typically creates one job at a time with an average of 50 tasks associated with each job. The user submits the job and the Windows Compute Cluster Job Manager allocates enough resources to manage all 50 tasks. To do this, it immediately claims all available nodes on the cluster to work on the tasks. This scenario would not be problematic in a single-user environment, but it becomes an issue in a multi-user setting where it is very important that projects progress simultaneously.

Currently, when 10 of a job's 50 tasks are complete, the 10 servers they ran on are not freed up. The Job Manager waits until all 50 tasks within the job have completed before releasing the servers to tackle another job. HPC developers are now working on a way to reallocate resources upon completion of each task, not the overall job.
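The impact of that change is easy to see with a back-of-the-envelope model. The sketch below uses hypothetical task durations, not measurements of the real scheduler, to compare when nodes come back to the pool under the two policies:

def nodes_free_at(task_durations, release_per_task):
    # Given one job's task durations (in hours), each task running on its
    # own node, return the times at which nodes return to the scheduler.
    if release_per_task:
        return sorted(task_durations)  # each node freed as its task ends
    # Per-job policy: every node is held until the slowest task finishes.
    return [max(task_durations)] * len(task_durations)

durations = [1.0] * 40 + [8.0] * 10  # 50 tasks; a few long stragglers
per_job = nodes_free_at(durations, release_per_task=False)
per_task = nodes_free_at(durations, release_per_task=True)
print("per-job policy: first node free after", per_job[0], "hours")
print("per-task policy: first node free after", per_task[0], "hours")

Under the per-job policy, every node sits idle until the slowest task finishes; under per-task release, 40 of the 50 nodes would be available to other users after the first hour.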

Microsoft Research also looked to the HPC group to help resolve specific issues that pertained to the way that the researchers wanted to use Windows Compute Cluster Server. Notably, the researchers wanted to set up security so that user credentials would automatically propagate from the Web front end to the compute cluster. The solution was to work within the Microsoft® .NET Framework and extend the ASP.NET forms authentication so that the Web application would be able to supply the cluster nodes with full user credentials every time a user submits a job.

Now the eScience team members at Microsoft Research and their colleagues from around the world can focus on the most important parts of their jobs and leave all the tasks involved in processing and managing data to the computers. "As a result, we're able to go further as a team," says Heckerman. "We're pushing out the boundaries of what is known about HIV, making good progress in the fight against the disease."

Kristin Firth and Mia Matusow, both of Blue Line Writing & Editing, have spent the past decade creating strategic content for enterprise organizations in the public and private sectors on three continents.

© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.