Key Monitoring Scenarios
The Windows HPC Server 2008 R2 Management Pack includes several configurable key monitoring scenarios. These scenarios are implemented with monitors, rules, alerts, and tasks that help you manage an HPC cluster.
This section contains the following topics:
Head node monitoring
Job scheduler monitoring
Broker node, compute node, and workstation node monitoring
Placing monitored objects in maintenance mode
Head node monitoring
The head node is the most important component in an HPC cluster: it is responsible for both cluster management and job scheduling. The “Head Node” folder contains alert, state, and performance views for head node monitoring.
The following table lists the monitoring scenarios for the cluster management service on the head node. The monitoring scenarios for the job scheduler functionalities are discussed separately in Job scheduler monitoring, later in this topic.
Scenario | Monitoring Elements |
---|---|
Cluster metrics monitoring |
Important cluster-wide metrics can indicate whether the cluster is congested. These metrics include:
These monitors are disabled by default. To use them, enable the monitors and configure their thresholds. For more information, see Enabling Performance Threshold Monitors. |
Management component monitoring |
The health of the management component is tracked by monitoring the registry configuration, services, and events.
Registry configuration: Check that the registry settings under SOFTWARE\Microsoft\HPC are correct.
Services: Check whether the following services are running:
Events: Monitor events for the HPC Management Service to track operational errors. |
Node performance |
These metrics can indicate whether the node is performing properly. These metrics include:
These monitors are disabled by default. To use them, enable the monitors and configure their thresholds. For more information, see Enabling Performance Threshold Monitors. |
Database monitoring |
Check connectivity to the HPC databases (installed either on the head node or on remote servers that are running SQL Server). If the databases are installed on remote servers, the SQL Server Library Management Pack can be imported to provide additional monitoring capabilities. |
Head node network monitoring |
Head node network monitoring contains some basic network configuration and performance monitoring elements:
|
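The behavior of a performance threshold monitor that is disabled by default can be sketched as follows. This is a conceptual model only, not the Operations Manager API: the `ThresholdMonitor` class, the state names, the metric name, and the threshold values 80 and 95 are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ThresholdMonitor:
    """Hypothetical model of a performance threshold monitor."""
    name: str
    warning: float           # at or above this value the state is "Warning"
    critical: float          # at or above this value the state is "Critical"
    enabled: bool = False    # mirrors the "disabled by default" behavior

    def evaluate(self, value: float) -> str:
        """Return the health state for one sampled metric value."""
        if not self.enabled:
            # A disabled monitor never changes state, whatever the value is.
            return "NotMonitored"
        if value >= self.critical:
            return "Critical"
        if value >= self.warning:
            return "Warning"
        return "Healthy"


# Example metric and thresholds are assumptions, not shipped defaults.
monitor = ThresholdMonitor(name="Example utilization (%)", warning=80.0, critical=95.0)
state_before = monitor.evaluate(99.0)   # monitor is still disabled
monitor.enabled = True                  # corresponds to enabling the monitor
state_after = monitor.evaluate(99.0)    # thresholds now take effect
```

The sketch shows why both steps matter: until the monitor is enabled, even a very high value produces no state change, and once it is enabled the configured thresholds decide which state is reported.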
Job scheduler monitoring
Jobs can be scheduled on the cluster only if the HPC Job Scheduler Service and the compute nodes on which the jobs run are healthy. The “Job Scheduler” folder contains alert, state, and performance views for HPC Job Scheduler monitoring.
The following table lists the monitoring scenarios for the job scheduler.
Scenario | Monitoring Elements |
---|---|
Job scheduling monitoring |
The following monitors track whether jobs are running smoothly on the cluster:
|
Submission and activation filter monitoring |
The monitors in this category track job submission and activation filter configuration and performance:
|
Broker node, compute node, and workstation node monitoring
The Windows HPC Server 2008 R2 Management Pack provides basic monitoring capabilities for Windows Communication Foundation (WCF) broker nodes, compute nodes, and workstation nodes in the cluster. The “Broker Node”, “Compute Node”, and “Workstation Node” folders contain alert, state, and performance views for monitoring these nodes.
The following table lists the monitoring scenarios in this category.
Scenario | Monitoring Elements |
---|---|
Broker node monitoring |
Connectivity: Check whether the node is reachable from the head node.
Services: Check whether the following services are running:
Events: Monitor events for the HPC Management Service to track operational errors.
Performance: These metrics can indicate whether the node is performing properly. These metrics include:
The CPU and memory utilization monitors are disabled by default. To use them, enable the monitors and configure their thresholds. For more information, see Enabling Performance Threshold Monitors. |
Compute node and workstation node monitoring |
Connectivity: Check whether the node is reachable from the head node.
Services: Check whether the following services are running:
Events: Monitor events for the HPC Management Service to track operational errors.
Performance: These metrics can indicate whether the node is performing properly. These metrics include:
The CPU and memory utilization monitors are disabled by default. To use them, enable the monitors and configure their thresholds. For more information, see Enabling Performance Threshold Monitors. |
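The connectivity scenario in the table above (checking whether a node is reachable from the head node) can be approximated outside Operations Manager with a small script. The sketch below is an assumption-laden stand-in: it uses a TCP connection attempt as the probe, which may differ from the probe the management pack actually performs, and the helper names `is_reachable` and `unreachable_nodes` are hypothetical. The node names and port to probe are supplied by the caller.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_nodes(nodes, port, timeout=2.0):
    """Return the subset of node names that did not accept a connection."""
    return [node for node in nodes if not is_reachable(node, port, timeout)]
```

Run from the head node against a list of compute, broker, or workstation node names, `unreachable_nodes` returns the nodes that would fail this simplified connectivity check. Choose a port that is known to be open on healthy nodes in your environment.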
Placing monitored objects in maintenance mode
When a monitored object, such as a computer or distributed application, goes offline for maintenance, Operations Manager 2007 detects that no agent heartbeat is being received and, as a result, might generate numerous alerts and notifications. To prevent these alerts and notifications, place the monitored object in maintenance mode. In maintenance mode, alerts, notifications, rules, monitors, automatic responses, state changes, and new alerts are suppressed at the agent.
For general instructions about placing a monitored object in maintenance mode, see How to Put a Monitored Object into Maintenance Mode in Operations Manager 2007 (https://go.microsoft.com/fwlink/?LinkId=108358).
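The effect of maintenance mode on alerting can be illustrated with a minimal conceptual sketch. It is not the Operations Manager API: the `MonitoredObject` class, the `raise_alert` function, the node name, and the timestamps are hypothetical, and the model covers only alert suppression during a fixed time window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MonitoredObject:
    """Hypothetical stand-in for a monitored computer or application."""
    name: str
    maintenance_until: Optional[datetime] = None

    def start_maintenance(self, now: datetime, duration: timedelta) -> None:
        """Put the object in maintenance mode for the given duration."""
        self.maintenance_until = now + duration

    def in_maintenance(self, now: datetime) -> bool:
        """Return True while the maintenance window is still open."""
        return self.maintenance_until is not None and now < self.maintenance_until


def raise_alert(obj: MonitoredObject, message: str, now: datetime) -> Optional[str]:
    """Return the alert text, or None if maintenance mode suppresses it."""
    if obj.in_maintenance(now):
        return None
    return f"{obj.name}: {message}"


start = datetime(2024, 1, 1, 22, 0)                 # arbitrary example time
node = MonitoredObject("NODE01")                    # hypothetical node name
node.start_maintenance(start, timedelta(hours=2))
during = raise_alert(node, "heartbeat missed", start + timedelta(minutes=30))
after = raise_alert(node, "heartbeat missed", start + timedelta(hours=3))
```

In this model, the missed heartbeat inside the two-hour window produces no alert, while the same condition after the window ends is reported normally.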