How to use GPU compute nodes

Starting in HPC Pack 2012 R2 Update 3, you can manage and monitor the GPU resources and schedule GPGPU jobs on the compute nodes to fully utilize the GPU resources.

Deployment

No extra steps are needed to deploy compute nodes that have supported GPU cards installed. You can follow any of the existing methods to add compute nodes to your HPC Pack cluster.

Note

Currently, HPC Pack supports only CUDA-compatible products, such as the NVIDIA Tesla series. See the NVIDIA website for a detailed list of supported products.

Management and monitoring

GPU node group

After the GPU compute nodes are successfully deployed, a new default node group (GPUNodes) is created automatically. All nodes on which supported GPU resources are detected are added to this group.

You can also create custom node groups that contain GPU nodes, based on how you want to organize the compute resources in your cluster. This works the same way as creating custom groups for other node types.
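
For example, you can create a custom group and add the GPU nodes to it with HPC PowerShell. The following is only a sketch; the group name MyGpuNodes is a placeholder, and the nodes are taken from the default GPUNodes group:

New-HpcGroup -Name "MyGpuNodes"
Get-HpcNode -GroupName "GPUNodes" | Add-HpcGroup -Name "MyGpuNodes"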

Node properties

When you select a GPU compute node in HPC Cluster Manager, basic GPU information appears on the GPU Information tab of the node property details. This information includes:

  • GPU card PCI bus ID

  • GPU model name

  • Total GPU memory

  • GPU SM clock (streaming multiprocessor clock, in MHz)
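
If you want to cross-check these values directly on a node, the nvidia-smi tool that ships with the NVIDIA driver reports the same properties. The following command line is a sketch; verify the query field names for your driver version with nvidia-smi --help-query-gpu:

nvidia-smi --query-gpu=pci.bus_id,name,memory.total,clocks.sm --format=csv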

GPU heat map

The following GPU metrics are added to the cluster node heat map views:

  • GPU Time (%)

  • GPU Power Usage (Watts)

  • GPU Memory Usage (%)

  • GPU Memory Usage (MB)

  • GPU Fan Speed (%)

  • GPU Temperature (degrees C)

  • GPU SM Clock (MHz)

You can customize your own view of the heat maps to monitor GPU usage in the same way you do with other existing heat map operations.

If a compute node has multiple GPUs, metric values for each GPU are shown.
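
You can also read these counters from HPC PowerShell instead of the heat map view. The following is a rough sketch; the exact GPU metric names and cmdlet parameters can vary by HPC Pack version, so list the metric definitions first:

# List the metric definitions whose names mention GPU
Get-HpcMetric | Where-Object { $_.Name -like "*GPU*" }

# Get the latest sampled values for the nodes in the GPUNodes group
Get-HpcNode -GroupName "GPUNodes" | Get-HpcMetricValue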

GPU job scheduling

If your applications follow the CUDA coding practices for HPC Pack described later in this article, HPC Pack can schedule GPU jobs so that the GPU resources in the cluster are fully utilized. You can submit GPU jobs in HPC Cluster Manager, in HPC PowerShell, or through the HPC Pack job scheduler API.

HPC Cluster Manager

To submit a GPU job in HPC Cluster Manager
  1. In Job Management, click New Job.

  2. In Job Details, under Job resources, select GPU.

  3. Under Resource Selection, select a node group that includes GPU nodes; otherwise, the job will fail when it is scheduled on GPU resources.

    You can also specify the nodes that you want to run the job on. These nodes must also be GPU nodes.

  4. Complete the remaining settings and submit the job.

HPC PowerShell

To submit a GPU job through HPC PowerShell, specify the NumGpus parameter, as in the following sample commands:

$job = New-HpcJob -NumGpus 1  
Add-HpcTask -Job $job -CommandLine "<CommandLine>"  
Submit-HpcJob -Job $job  
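
If the job template that you use does not already restrict the job to GPU nodes, you can also pass a node group on the command line. The following sketch reuses the default GPUNodes group described earlier:

$job = New-HpcJob -NumGpus 1 -NodeGroups "GPUNodes"
Add-HpcTask -Job $job -CommandLine "<CommandLine>"
Submit-HpcJob -Job $job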
  

HPC Pack job scheduler API

If you want to submit a GPU job through the job scheduler API, set the UnitType to JobUnitType.Gpu, as shown in the following code snippet:

// Connect to the cluster and create a job
Scheduler sc = new Scheduler();
sc.Connect(<ClusterName>);
ISchedulerJob job = sc.CreateJob();
sc.AddJob(job);
// Schedule the job on GPU resource units
job.UnitType = JobUnitType.Gpu;
ISchedulerTask task = job.CreateTask();
task.CommandLine = "<CommandLine>";
job.AddTask(task);
sc.SubmitJob(job, <username>, <password>);
  

Job template

To simplify GPU job scheduling, create a job template that prepares your GPU-related settings in advance.

To create a job template with GPU settings
  1. In Configuration, create a new job template with proper initial settings.

  2. In Job Template Editor, under Node Groups, add one or more node groups that contain GPU nodes, and specify them as Required.

  3. Under Job template properties, add a new Unit type setting. Make GPU the default value and select it as the valid value. This ensures that GPU jobs can be easily scheduled based on GPUs.
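
After the template is created, jobs submitted with it pick up the GPU settings automatically. For example, in HPC PowerShell (the template name GpuJobTemplate is a placeholder for whatever you named your template):

$job = New-HpcJob -TemplateName "GpuJobTemplate" -NumGpus 1
Add-HpcTask -Job $job -CommandLine "<CommandLine>"
Submit-HpcJob -Job $job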

CUDA code sample

GPU scheduling in HPC Pack (starting in HPC Pack 2012 R2 Update 3) is straightforward: the HPC Pack job scheduler assigns jobs and tasks only to compute nodes that have available GPU resources.

When a task is assigned to a node, the first thing it should do is make sure that it uses the right GPU on that machine. Read the HPC Pack environment variable CCP_GPUIDS to get this information directly, as in the following code snippet:

#include <cstdlib>
#include "cuda_runtime.h"

/* Host main routine */
int main(void)
{
    // Get the free GPU ID assigned by the HPC Pack scheduler and use it in this process.
    cudaSetDevice(atoi(getenv("CCP_GPUIDS")));
    // Other CUDA operations
    return 0;
}
// Once the task finishes, the GPU is freed and the job scheduler recognizes it as available for other jobs and tasks.
  

The return value of the getenv("CCP_GPUIDS") call is a string that contains the indices of the GPUs allocated to the task, separated by spaces. In the most common case, each task needs one GPU, and the return value is a string containing a single index digit, such as 0 or 1.

The following sample provides more details:

  
#include "cuda_runtime.h"  
#include "device_launch_parameters.h"  
  
#include <stdio.h>  
#include <cstdlib>  
#include <string>  
#include <vector>  
#include <sstream>  
#include <iostream>  
#include <ctime>  
#include <algorithm>  
  
__global__ void spinKernel(int count)  
{  
	for (int i = 0; i < count; i++)  
	{  
		for (int j = 0; j < 1024; j++)  
		{  
			int hello = 0;  
			int world = 1;  
			int helloWorld = hello + world;  
		}  
	}  
}  
  
bool sortFunc(std::pair<int, std::string> *p1, std::pair<int, std::string> *p2)  
{  
	return p1->second.compare(p2->second) < 0;  
}  
  
void getSortedGpuMap(std::vector<std::pair<int, std::string> *> &map)  
{  
	int count = 0;  
	cudaGetDeviceCount(&count);  
  
	for (int i = 0; i < count; i++)  
	{  
		char buffer[64];  
		cudaDeviceGetPCIBusId(buffer, 64, i);  
		// Build a mapping between cuda Device Id and Pci Bus Id  
		map.push_back(new std::pair<int, std::string>(i, std::string(buffer)));  
	}  
  
	// Sort by Pci Bus Id  
	std::sort(map.begin(), map.end(), sortFunc);  
}  
  
int main()  
{  
	// In case different processes get different cuda Device Ids, sort them by Pci Bus Id  
	// Not necessary if the devices always give the same Ids in a machine  
	std::vector<std::pair<int, std::string> *> deviceMap;  
	getSortedGpuMap(deviceMap);  
  
	cudaError_t cudaStatus = cudaSuccess;  
  
	std::vector<int> myGpuIds;  
  
	// Get the process-wide environment variable set by HPC  
	std::istringstream iss(getenv("CCP_GPUIDS"));  
  
	clock_t startTime = clock();  
	clock_t endTime;  
  
	do  
	{  
		std::string idStr;  
		iss >> idStr;  
  
		// Drop the ending chars  
		if (idStr.length() == 0)  
		{  
			break;  
		}  
  
		int gpuId = atoi(idStr.c_str());  
  
		std::cout << "GPU ID parsed: " << gpuId << std::endl;  
  
		// Set the device ID  
		cudaStatus = cudaSetDevice(deviceMap[gpuId]->first);  
		if (cudaStatus != cudaSuccess) {  
			fprintf(stderr, "cudaSetDevice failed! Current GPU ID: %d", deviceMap[gpuId]->first);  
			return 1;  
		}  
  
		std::cout << "GPU with HPC ID " << gpuId << " has been set" << std::endl;  
		std::cout << "Cuda Device ID: " << deviceMap[gpuId]->first << std::endl << "Pci Bus Id: " << deviceMap[gpuId]->second << std::endl;  
  
		// Launch a kernel to spin GPU, async by default  
		spinKernel <<<1024, 1024>>> (1024);  
  
		endTime = clock();  
		double elapsed = (double)(endTime - startTime) / CLOCKS_PER_SEC;  
		startTime = endTime;  
  
		printf("Spin kernel launched after %.6f seconds\n\n", elapsed);  
  
		// Record the device IDs for further usage  
		myGpuIds.push_back(deviceMap[gpuId]->first);  
  
	} while (iss);  
  
	for (int i = 0; i < deviceMap.size(); i++)  
	{  
		// Clean up the map since device IDs have been saved  
		delete deviceMap[i];  
	}  
  
	std::cout << "Waiting for all the kernels finish..." << std::endl << std::endl;  
  
	for (std::vector<int>::iterator it = myGpuIds.begin(); it != myGpuIds.end(); it++)  
	{  
		// Set the device ID  
		cudaStatus = cudaSetDevice(*it);  
		if (cudaStatus != cudaSuccess) {  
			fprintf(stderr, "cudaSetDevice failed! Current GPU ID: %d", *it);  
			return 1;  
		}  
  
		cudaDeviceSynchronize();  
	}  
  
	endTime = clock();  
	double elapsed = (double)(endTime - startTime) / CLOCKS_PER_SEC;  
	startTime = endTime;  
  
	printf("Spin kernels finished after %.6f seconds\n\n", elapsed);  
  
	for (std::vector<int>::iterator it = myGpuIds.begin(); it != myGpuIds.end(); it++)  
	{  
		// Set the device ID  
		cudaStatus = cudaSetDevice(*it);  
		if (cudaStatus != cudaSuccess) {  
			fprintf(stderr, "cudaSetDevice failed! Current GPU ID: %d", *it);  
			return 1;  
		}  
  
		std::cout << "GPU with Device ID " << *it << " has been set" << std::endl;  
  
		// Do cleanup  
		cudaStatus = cudaDeviceReset();  
		if (cudaStatus != cudaSuccess) {  
			fprintf(stderr, "cudaDeviceReset failed! Current GPU ID: %d", *it);  
			return 1;  
		}  
  
		std::cout << "Reset device done" << std::endl << std::endl;  
	}  
  
	return 0;  
}
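
To try the sample end to end, one possible workflow is to build it with the CUDA toolkit's nvcc compiler, stage the binary on a file share that the compute nodes can reach, and submit it as a one-GPU-per-task job so that HPC Pack sets CCP_GPUIDS for each task at run time. The file name spin.cu and the share path below are placeholders:

# Build the sample with the CUDA toolkit compiler
nvcc spin.cu -o spin.exe

# Submit it as a GPU job; each task gets one GPU and reads CCP_GPUIDS at run time
$job = New-HpcJob -NumGpus 1 -NodeGroups "GPUNodes"
Add-HpcTask -Job $job -CommandLine "\\<FileShare>\spin.exe"
Submit-HpcJob -Job $job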