Key Monitoring Scenarios

Article
05/12/2011

The Windows HPC Server 2008 R2 Management Pack includes a number of key monitoring scenarios that are configurable. These monitoring scenarios are enabled with monitors, rules, alerts and tasks to help manage an HPC cluster.

This section contains the following topics:

Head node monitoring
Job scheduler monitoring
Broker node, compute node, and workstation node monitoring
Placing monitored objects in maintenance mode

Head node monitoring

The head node is the most important component in an HPC cluster. The head node is responsible for both cluster management and job scheduler functionalities. The “Head Node” folder contains alert, state, and performance views for head node monitoring.

The following table lists the monitoring scenarios for the cluster management service on the head node. The monitoring scenarios for the job scheduler functionalities are discussed separately in Job scheduler monitoring, later in this topic.

Scenario	Monitoring Elements
Cluster metrics monitoring	Important cluster-wide metrics can indicate whether the cluster is congested. These metrics include: cluster CPU usage, cluster disk throughput, cluster network usage, and cluster power consumption efficiency. These monitors are disabled by default. To use these monitors, they need to be enabled and their thresholds need to be configured. For more information, see Enabling Performance Threshold Monitors.
Management component monitoring	The health of the management component is tracked by monitoring the registry configuration, services, and events. Registry configuration: Check that the registry settings under SOFTWARE\Microsoft\HPC are correct. Services: Check whether the following services are running: HPC Basic Profile Web Service (monitoring can be disabled if the HPC Basic Profile is not being used); HPC System Definition Model (SDM) Store Service; HPC Management Service; HPC MPI Service; Windows Deployment Services (WDS) Server service (monitoring can be disabled if WDS is not used on the head node, for example, when only an enterprise network topology is used for the cluster); HPC Session Service; HPC Node Manager Service; HPC Diagnostics Service; HPC Reporting Service. Events: Monitor events for the HPC Management Service to track operations errors.
Node performance	These metrics can indicate whether the node is performing properly. These metrics include: CPU utilization and memory utilization. These monitors are disabled by default. To use these monitors, they need to be enabled and their thresholds need to be configured. For more information, see Enabling Performance Threshold Monitors.
Database monitoring	Check connectivity to the HPC databases (installed either on the head node or on remote servers that are running SQL Server). If the databases are installed on remote servers, the SQL Server Library Management Pack can be imported to provide additional monitoring capabilities.
Head node network monitoring	Head node network monitoring contains some basic network configuration and performance monitoring elements: NetworkDirect configuration; network firewall configuration; and basic MPI ping-pong test performance (this monitor uses the MPI Ping Pong: Lightweight Throughput diagnostic test in Microsoft HPC Pack 2008 R2).
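
The cluster metrics and node performance monitors in the table above stay inactive until they are enabled and given thresholds. The following Python sketch illustrates that behavior only. The `ThresholdMonitor` class, its field names, and the 80 and 95 percent thresholds are hypothetical and are not part of the Management Pack or of Operations Manager.

```python
from dataclasses import dataclass


@dataclass
class ThresholdMonitor:
    """Hypothetical model of a performance threshold monitor."""
    name: str
    warning: float          # warning threshold, in percent (assumed value)
    critical: float         # critical threshold, in percent (assumed value)
    enabled: bool = False   # mirrors "disabled by default"

    def evaluate(self, value: float) -> str:
        """Map a sampled metric value to a health state."""
        if not self.enabled:
            return "not monitored"
        if value >= self.critical:
            return "critical"
        if value >= self.warning:
            return "warning"
        return "healthy"


# Illustrative thresholds only; choose values that suit the cluster.
cpu = ThresholdMonitor("Cluster CPU usage", warning=80.0, critical=95.0)
print(cpu.evaluate(97.0))   # prints "not monitored" because the monitor is disabled

cpu.enabled = True
print(cpu.evaluate(97.0))   # prints "critical"
```

A disabled monitor reports no state at all rather than a healthy state, which is why both steps (enabling and configuring thresholds) are needed before these metrics contribute to head node health.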

Job scheduler monitoring

To schedule jobs on the cluster, the HPC Job Scheduler Service and the compute nodes on which the jobs are scheduled need to be healthy. The “Job Scheduler” folder contains alert, state, and performance views for HPC Job Scheduler monitoring.

The following table lists the monitoring scenarios for the job scheduler.

Scenario

Monitoring Elements

Job scheduling monitoring

The following monitors track whether jobs are running smoothly on the cluster:

HPC Job Scheduler Service monitor
HPC Job Scheduler Service event monitor tracks the number of warning and error events produced by the HPC Job Scheduler Service
Failed job proportion monitor tracks the percentage of failed jobs against all completed jobs. (This monitor should be configured for individual cluster usage.)
Unreachable node proportion monitor tracks the percentage of nodes in the cluster that cannot be reached. (This monitor should be configured for individual cluster usage.)
Daily job queue time monitor tracks the average daily wait time. (This monitor is disabled by default. To use this monitor, it needs to be enabled and its threshold should be configured.)
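
The proportion and queue time monitors above reduce to simple arithmetic. The following Python sketch shows one way to compute them. The function name, the sample counts, and the 10 percent threshold are hypothetical, and the sketch assumes that "completed jobs" means all jobs that finished, including the failed ones.

```python
from statistics import mean


def percentage(part: int, total: int) -> float:
    """Return part/total as a percentage, or 0.0 when total is zero."""
    return 100.0 * part / total if total else 0.0


# Failed job proportion: failed jobs against all completed jobs (sample data).
failed_pct = percentage(part=6, total=120)

# Unreachable node proportion: unreachable nodes against all nodes (sample data).
unreachable_pct = percentage(part=2, total=64)

# Daily job queue time: average wait of the jobs queued that day, in seconds.
daily_wait = mean([30, 90, 120])

# A per-cluster threshold decides when the monitor changes state (assumed value).
FAILED_JOB_THRESHOLD_PCT = 10.0
failed_jobs_unhealthy = failed_pct > FAILED_JOB_THRESHOLD_PCT

print(failed_pct, unreachable_pct, daily_wait, failed_jobs_unhealthy)
```

Because acceptable failure and unreachability rates differ between clusters, the threshold is kept as a separate value to adjust rather than built into the calculation.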

Submission and activation filter monitoring

The monitors in this category track job submission and activation filter configuration and performance:

Submission and activation filters availability
Submission and activation filters timeout

Broker node, compute node, and workstation node monitoring

The Windows HPC Server 2008 R2 Management Pack provides basic monitoring capabilities for Windows Communication Foundation (WCF) broker nodes, compute nodes, and workstation nodes in the cluster. The “Broker Node”, “Compute Node”, and “Workstation Node” folders contain alert, state, and performance views for monitoring these nodes.

The following table lists the monitoring scenarios in this category.

Scenario	Monitoring Elements
Broker node monitoring	Connectivity: Check whether the node is reachable from the head node. Services: Check whether the following services are running: HPC Management Service, HPC MPI Service, HPC Node Manager Service, HPC Broker Service, MSMQ Service, and Net.TCP Port Sharing Service. Events: Monitor events for the HPC Management Service to track operations errors. Performance: These metrics can indicate whether the node is performing properly. These metrics include: CPU utilization, memory utilization, and MSMQ utilization. The CPU and memory utilization monitors are disabled by default. To use these monitors, they need to be enabled and their thresholds need to be configured. For more information, see Enabling Performance Threshold Monitors.
Compute node and workstation node monitoring	Connectivity: Check whether the node is reachable from the head node. Services: Check whether the following services are running: HPC Management Service, HPC MPI Service, and HPC Node Manager Service. Events: Monitor events for the HPC Management Service to track operations errors. Performance: These metrics can indicate whether the node is performing properly. These metrics include: CPU utilization and memory utilization. The CPU and memory utilization monitors are disabled by default. To use these monitors, they need to be enabled and their thresholds need to be configured. For more information, see Enabling Performance Threshold Monitors.
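
The service checks in the table above differ by node role. The following Python sketch shows how such a check could be expressed. The service display names come from the table, but the `REQUIRED_SERVICES` mapping, the `stopped_services` function, and the `"Running"` state string are hypothetical. How the states are actually collected from a node is outside the scope of the sketch.

```python
# Services that should be running on each node role, as listed in the table.
COMMON_SERVICES = [
    "HPC Management Service",
    "HPC MPI Service",
    "HPC Node Manager Service",
]

REQUIRED_SERVICES = {
    "broker": COMMON_SERVICES + [
        "HPC Broker Service",
        "MSMQ Service",
        "Net.TCP Port Sharing Service",
    ],
    "compute": COMMON_SERVICES,
    "workstation": COMMON_SERVICES,
}


def stopped_services(role: str, states: dict[str, str]) -> list[str]:
    """Return the required services for a role that are not reported as running.

    `states` maps a service display name to its state; a missing entry
    is treated the same as a stopped service.
    """
    return [name for name in REQUIRED_SERVICES[role]
            if states.get(name) != "Running"]


# Sample states for a broker node (made-up data).
sample = {
    "HPC Management Service": "Running",
    "HPC MPI Service": "Running",
    "HPC Node Manager Service": "Running",
    "HPC Broker Service": "Stopped",
    "MSMQ Service": "Running",
}
print(stopped_services("broker", sample))
# prints ['HPC Broker Service', 'Net.TCP Port Sharing Service']
```

The same sample data is healthy for a compute node, because only the three common services are required for that role.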
