Monitoring Nodes

A key step in monitoring and maintaining cluster health is to identify any deviation from the normal operational state or performance. HPC Cluster Manager enables you to view cluster and node status at a glance, identify problem nodes, and drill down into node details for further investigation.

In this topic:

View cluster status at a glance

In Node Management, you can monitor your cluster at a glance using the node List view or the node Heat Map view. In Charts and Reports, the monitoring charts display current and recent data about node health and cluster utilization. For more information, see:

Drill down into individual node details

The List and Heat Map views provide a starting point for identifying problem areas. Double-click a compute node to see detailed information such as hardware, operating system properties, and current performance metrics. You can also select one or more nodes, then drill down into the node details to investigate performance.

Monitor node operations

Tracking recent and ongoing cluster operations is another aspect of monitoring that is critical to administering a cluster. For more information, see:

Correlate the monitoring information between nodes, jobs, operations, and diagnostics

In HPC Cluster Manager, you can use the Pivot To actions to correlate the monitoring information between nodes, jobs, operations, and diagnostics. For example, you can select one or more nodes in the views pane, and then pivot to Jobs for the Selected Nodes. This takes you to a job list view that is filtered by the nodes that you selected.

The supported pivot paths are:

  • Nodes: pivot to jobs, test results, and operations.

  • Jobs: pivot to nodes.

  • Test results: pivot to failed nodes and operations.
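
The pivot paths listed above can be pictured as a small lookup table. The following Python sketch is purely illustrative and is not part of any HPC Pack API: the names PIVOT_PATHS, can_pivot, and jobs_for_selected_nodes, and the sample job records, are hypothetical. It shows only the idea of checking whether a pivot is supported and of filtering a job list by the selected nodes.

```python
# Illustrative model only; none of these names come from HPC Pack.

# Supported pivot paths, transcribed from the list above.
PIVOT_PATHS = {
    "nodes": {"jobs", "test results", "operations"},
    "jobs": {"nodes"},
    "test results": {"failed nodes", "operations"},
}


def can_pivot(source, target):
    """Return True if the list above names a pivot from source to target."""
    return target in PIVOT_PATHS.get(source, set())


def jobs_for_selected_nodes(jobs, selected_nodes):
    """Keep only the jobs that use at least one of the selected nodes."""
    selected = set(selected_nodes)
    return [job for job in jobs if selected & set(job["nodes"])]


# Hypothetical sample data for demonstration.
sample_jobs = [
    {"id": 1, "nodes": ["NODE01", "NODE02"]},
    {"id": 2, "nodes": ["NODE03"]},
]

print(can_pivot("nodes", "jobs"))
print(can_pivot("jobs", "operations"))
print([job["id"] for job in jobs_for_selected_nodes(sample_jobs, ["NODE02"])])
```

With the sample data, selecting NODE02 keeps only job 1, which mirrors how the filtered job list view shows only jobs related to the nodes that you selected.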

Monitor cluster usage and statistics over time

HPC Cluster Manager provides several built-in charts and reports for monitoring and analyzing cluster resource usage, as well as job and node statistics, over time. The HPCReporting database also supports custom reporting. For more information, see Charts and Reports: HPC Cluster Manager.

In this section