Key Monitoring Scenarios

The Windows HPC Server 2008 R2 Management Pack includes a number of configurable key monitoring scenarios. These scenarios are implemented through monitors, rules, alerts, and tasks that help you manage an HPC cluster.

This section contains the following topics:

  • Head node monitoring

  • Job scheduler monitoring

  • Broker node, compute node, and workstation node monitoring

  • Placing monitored objects in maintenance mode

Head node monitoring

The head node is the most important component in an HPC cluster. The head node is responsible for both cluster management and job scheduler functionalities. The “Head Node” folder contains alert, state, and performance views for head node monitoring.

The following table lists the monitoring scenarios for the cluster management service on the head node. The monitoring scenarios for the job scheduler functionalities are discussed separately in Job scheduler monitoring, later in this topic.

Scenario

Monitoring Elements

Cluster metrics monitoring

Important cluster-wide metrics can indicate whether the cluster is congested. These metrics include:

  • Cluster CPU usage

  • Cluster disk throughput

  • Cluster network usage

  • Cluster power consumption efficiency

These monitors are disabled by default. To use them, enable them and configure their thresholds. For more information, see Enabling Performance Threshold Monitors.

Management component monitoring

The health of the management component is tracked by monitoring the registry configuration, services, and events.

Registry configuration

Check that the registry settings under SOFTWARE\Microsoft\HPC are correct.

Services

Check whether the following services are running:

  • HPC Basic Profile Web Service (monitoring can be disabled if the HPC Basic Profile is not being used)

  • HPC System Definition Model (SDM) Store Service

  • HPC Management Service

  • HPC MPI Service

  • Windows Deployment Services (WDS) Server service (monitoring can be disabled if WDS is not used on the head node, for example, when only an enterprise network topology is used for the cluster)

  • HPC Session Service

  • HPC Node Manager Service

  • HPC Diagnostics Service

  • HPC Reporting Service

Events

Monitor events for the HPC Management Service to track operations errors.

Node performance

These metrics can indicate whether the node is performing properly. These metrics include:

  • CPU utilization

  • Memory utilization

These monitors are disabled by default. To use them, enable them and configure their thresholds. For more information, see Enabling Performance Threshold Monitors.

Database monitoring

Check connectivity to the HPC databases (installed either on the head node or on remote servers that are running SQL Server). If the databases are installed on remote servers, the SQL Server Library Management Pack can be imported to provide additional monitoring capabilities.

Head node network monitoring

Head node network monitoring contains some basic network configuration and performance monitoring elements:

  • NetworkDirect configuration

  • Network firewall configuration

  • Basic MPI ping-pong test performance (this monitor uses the MPI Ping Pong: Lightweight Throughput diagnostic test in Microsoft HPC Pack 2008 R2)
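The service checks in the table above amount to comparing the observed state of each listed service against the expected "Running" state. The following is a minimal sketch of that comparison, not the management pack's actual implementation. The function name `services_not_running`, the `ignored` parameter, and the status strings are assumptions made for illustration; the service display names are copied from the list above.

```python
from typing import Dict, FrozenSet, List

# Display names copied from the head node service list in this topic.
HEAD_NODE_SERVICES: List[str] = [
    "HPC Basic Profile Web Service",
    "HPC System Definition Model (SDM) Store Service",
    "HPC Management Service",
    "HPC MPI Service",
    "Windows Deployment Services (WDS) Server",
    "HPC Session Service",
    "HPC Node Manager Service",
    "HPC Diagnostics Service",
    "HPC Reporting Service",
]


def services_not_running(
    observed: Dict[str, str],
    required: List[str],
    ignored: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return the required services whose observed status is not 'Running'.

    `observed` maps a service display name to a status string (hypothetical
    input; how the status is collected is outside this sketch). A service
    that is missing from `observed` is treated as not running. Services in
    `ignored` are skipped, which mirrors disabling monitoring for optional
    services such as the HPC Basic Profile Web Service or WDS.
    """
    return [
        name
        for name in required
        if name not in ignored and observed.get(name) != "Running"
    ]


if __name__ == "__main__":
    status = {name: "Running" for name in HEAD_NODE_SERVICES}
    status["HPC MPI Service"] = "Stopped"
    print(services_not_running(status, HEAD_NODE_SERVICES))
```

Passing the optional services in `ignored` corresponds to disabling their monitors when the HPC Basic Profile or WDS is not used on the head node.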

Job scheduler monitoring

To schedule jobs on the cluster, the HPC Job Scheduler Service and the compute nodes on which the jobs are scheduled need to be healthy. The “Job Scheduler” folder contains alert, state, and performance views for HPC Job Scheduler monitoring.

The following table lists the monitoring scenarios for the job scheduler.

Scenario

Monitoring Elements

Job scheduling monitoring

The following monitors track whether jobs are running smoothly on the cluster:

  • HPC Job Scheduler Service monitor

  • HPC Job Scheduler Service event monitor tracks the number of warning and error events produced by the HPC Job Scheduler Service

  • Failed job proportion monitor tracks the percentage of failed jobs against all completed jobs. (This monitor should be configured for individual cluster usage.)

  • Unreachable node proportion monitor tracks the percentage of nodes in the cluster that cannot be reached. (This monitor should be configured for individual cluster usage.)

  • Daily job queue time monitor tracks the average daily time that jobs wait in the queue. (This monitor is disabled by default. To use it, enable it and configure its threshold.)

Submission and activation filter monitoring

The monitors in this category track job submission and activation filter configuration and performance:

  • Submission and activation filters availability

  • Submission and activation filters timeout
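The failed job proportion and unreachable node proportion monitors in the table above are both simple percentages. The sketch below shows the arithmetic only. It assumes that "completed jobs" counts every job that reached a final state, including failed jobs, and that an empty denominator yields 0%. The function names are hypothetical, and the management pack may compute these values differently.

```python
def percentage(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`; 0.0 when `whole` is zero."""
    if part < 0 or whole < 0:
        raise ValueError("counts must be non-negative")
    if part > whole:
        raise ValueError("part cannot exceed whole")
    if whole == 0:
        return 0.0
    return 100.0 * part / whole


def failed_job_percentage(failed_jobs: int, completed_jobs: int) -> float:
    """Percentage of failed jobs among all completed jobs.

    Assumption: `completed_jobs` includes the failed jobs.
    """
    return percentage(failed_jobs, completed_jobs)


def unreachable_node_percentage(unreachable_nodes: int, total_nodes: int) -> float:
    """Percentage of cluster nodes that cannot be reached."""
    return percentage(unreachable_nodes, total_nodes)


if __name__ == "__main__":
    print(failed_job_percentage(5, 200))        # 5 failed out of 200 completed
    print(unreachable_node_percentage(3, 12))   # 3 unreachable out of 12 nodes
```

Because acceptable failure and outage rates differ from cluster to cluster, the resulting percentages only become meaningful once compared with thresholds chosen for the individual cluster.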

Broker node, compute node, and workstation node monitoring

The Windows HPC Server 2008 R2 Management Pack provides basic monitoring capabilities for Windows Communication Foundation (WCF) broker nodes, compute nodes, and workstation nodes in the cluster. The “Broker Node”, “Compute Node”, and “Workstation Node” folders contain alert, state, and performance views for monitoring these nodes.

The following table lists the monitoring scenarios in this category.

Scenario

Monitoring Elements

Broker node monitoring

Connectivity

Check whether the node is reachable from the head node.

Services

Check whether the following services are running:

  • HPC Management Service

  • HPC MPI Service

  • HPC Node Manager Service

  • HPC Broker Service

  • MSMQ Service

  • Net.TCP Port Sharing Service

Events

Monitor events for the HPC Management Service to track operations errors.

Performance

These metrics can indicate whether the node is performing properly. These metrics include:

  • CPU utilization

  • Memory utilization

  • MSMQ utilization

The CPU and memory utilization monitors are disabled by default. To use them, enable them and configure their thresholds. For more information, see Enabling Performance Threshold Monitors.

Compute node and workstation node monitoring

Connectivity

Check whether the node is reachable from the head node.

Services

Check whether the following services are running:

  • HPC Management Service

  • HPC MPI Service

  • HPC Node Manager Service

Events

Monitor events for the HPC Management Service to track operations errors.

Performance

These metrics can indicate whether the node is performing properly. These metrics include:

  • CPU utilization

  • Memory utilization

The CPU and memory utilization monitors are disabled by default. To use them, enable them and configure their thresholds. For more information, see Enabling Performance Threshold Monitors.
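Once a performance threshold monitor is enabled, its configured thresholds determine the health state reported for a metric such as CPU or memory utilization. The following sketch illustrates one way such an evaluation could work, assuming a three-state monitor with separate warning and critical thresholds. The threshold values and the function name `evaluate_threshold` are hypothetical, and the actual monitors in the management pack may use a different state model or sampling logic.

```python
from enum import Enum


class HealthState(Enum):
    HEALTHY = "Healthy"
    WARNING = "Warning"
    CRITICAL = "Critical"


def evaluate_threshold(value: float, warning: float, critical: float) -> HealthState:
    """Map a utilization sample to a health state.

    Assumption: higher values are worse, and a sample at or above a
    threshold trips that threshold.
    """
    if warning > critical:
        raise ValueError("warning threshold must not exceed critical threshold")
    if value >= critical:
        return HealthState.CRITICAL
    if value >= warning:
        return HealthState.WARNING
    return HealthState.HEALTHY


if __name__ == "__main__":
    # Hypothetical thresholds: warn at 80% CPU utilization, critical at 95%.
    for sample in (42.0, 85.0, 97.5):
        print(sample, evaluate_threshold(sample, warning=80.0, critical=95.0).value)
```

Thresholds that suit one node role may not suit another; a compute node that is expected to run at high CPU utilization would likely need higher limits than a head node.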

Placing monitored objects in maintenance mode

When a monitored object, such as a computer or distributed application, goes offline for maintenance, Operations Manager 2007 detects that no agent heartbeat is being received and, as a result, might generate numerous alerts and notifications. To prevent these alerts and notifications, place the monitored object in maintenance mode. In maintenance mode, alerts, notifications, rules, monitors, automatic responses, and state changes are suppressed at the agent.

For general instructions about placing a monitored object in maintenance mode, see How to Put a Monitored Object into Maintenance Mode in Operations Manager 2007 (https://go.microsoft.com/fwlink/?LinkId=108358).