Understanding Diagnostic Test Results in Windows HPC Server 2008

Applies To: Windows HPC Server 2008

Windows® HPC Server 2008 provides a set of commonly used diagnostic tests. You can use these tests to help verify deployment, troubleshoot failures, and detect performance degradation.

Detailed results from the diagnostic tests are available as an HTML file. For information about viewing diagnostic test results in HPC Cluster Manager, see View and Save Test Results. You can also export test results to an HTML file by using the HPC PowerShell cmdlet Export-HpcTestResult.

Diagnostic tests are conceptually grouped by test suite. The following sections describe the tests in each suite and their results:

  • Scheduler diagnostics

  • Services diagnostics

  • Connectivity diagnostics

  • System Configuration diagnostics

  • SOA diagnostics

  • Performance diagnostics

Scheduler diagnostics

Job Submission Test

Submits a simple job to the HPC Job Scheduler Service using the clusrun command. This test verifies that the HPC Job Scheduler Service can accept and run a job on selected compute nodes.

| Result | Description |
|---|---|
| Success | The job finished successfully on all nodes. |
| Failure | The job failed to run on the selected nodes. |

Services diagnostics

All Services Running

Verifies that the required Windows HPC Server 2008 services are running on the selected nodes. Required services are determined by the role of the target node. This test may report the status of optional services, if they are present, but it only validates against the required services. For information about Service Oriented Architecture (SOA) services, use the SOA Service Configurations Report.

The following table lists the services and indicates which services are required for each node role. A service that is listed as “Not validated” does not cause the test to fail if that service is missing or not running.

| Service | Hosting Head node | Backup Head node | Compute node | WCF Broker node |
|---|---|---|---|---|
| HPC Management Service | Required | Required | Required | Not validated |
| HPC Node Manager Service | Required | Required | Required | Not validated |
| HPC Job Scheduler Service | Required | Must be installed and not running | Not validated | Not validated |
| HPC SDM Store Service | Required | Must be installed and not running | Not validated | Not validated |
| SQL Server | Required | Must be installed and not running | Not validated | Not validated |
| Windows Deployment Services Server | Required if private network exists | Required if private network exists | Not validated | Not validated |
| DHCP Server | Required if private network exists | Required if private network exists | Not validated | Not validated |
| HPC Basic Profile Web Service | Optional | Optional | Not validated | Not validated |
| HPC MPI Service | Not validated | Not validated | Required | Not validated |

| Result | Description |
|---|---|
| Success | All required services are installed and running. |
| Failure | Required services are missing or are not running. |

Expand the test results for the failed nodes to see the list of services that are missing or not running.
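The role-based rules in the services table for this test can be sketched in code. The following Python example is only an illustration of the validation logic, not the implementation used by the test. The role keys, the `service_states` input format, and the function name are assumptions made for this sketch.

```python
from enum import Enum


class Rule(Enum):
    REQUIRED = "Required"
    STOPPED = "Must be installed and not running"
    IF_PRIVATE = "Required if private network exists"
    OPTIONAL = "Optional"
    NOT_VALIDATED = "Not validated"


R, S, P, O, N = (Rule.REQUIRED, Rule.STOPPED, Rule.IF_PRIVATE,
                 Rule.OPTIONAL, Rule.NOT_VALIDATED)

# Hypothetical role keys; the column order matches the services table.
ROLES = ("hosting_head", "backup_head", "compute", "wcf_broker")

# Rows copied from the services table.
SERVICE_RULES = {
    "HPC Management Service": (R, R, R, N),
    "HPC Node Manager Service": (R, R, R, N),
    "HPC Job Scheduler Service": (R, S, N, N),
    "HPC SDM Store Service": (R, S, N, N),
    "SQL Server": (R, S, N, N),
    "Windows Deployment Services Server": (P, P, N, N),
    "DHCP Server": (P, P, N, N),
    "HPC Basic Profile Web Service": (O, O, N, N),
    "HPC MPI Service": (N, N, R, N),
}


def failed_services(role, service_states, has_private_network):
    """Return the services that would fail validation on one node.

    service_states maps a service name to "running" or "stopped".
    A service missing from the mapping is treated as not installed.
    """
    column = ROLES.index(role)
    failures = []
    for service, rules in SERVICE_RULES.items():
        rule = rules[column]
        state = service_states.get(service)  # None means not installed
        if rule is Rule.REQUIRED or (rule is Rule.IF_PRIVATE and has_private_network):
            ok = state == "running"
        elif rule is Rule.STOPPED:
            ok = state == "stopped"
        else:
            # Optional and not-validated services never fail the test.
            ok = True
        if not ok:
            failures.append(service)
    return failures
```

For example, a compute node that reports the HPC MPI Service as stopped would return that service in the failure list, while a missing HPC Basic Profile Web Service on a head node would not.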

Connectivity diagnostics

DNS Name Resolution

Verifies Domain Name System (DNS) name resolution between the selected compute nodes.

During the test, each node attempts to resolve the name of every other node in the cluster by using DNS and compares the result with the HPC Management Service records. These records are updated dynamically by an agent running on the test node, so the comparison is between the address recorded by DNS and the actual physical IP address of the node.

This test reports mismatches between pairs. Each local node performs its own DNS-only lookup, and then performs a host file and DNS lookup for each node in the cluster.

| Result | Description |
|---|---|
| Success | All the expected IP addresses are returned. |
| Warning | A private or application network IP address is returned. |
| Failure | No IP address or an unexpected IP address is returned, or the lookup threw an exception. |
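The result categories for a single lookup can be expressed as a small decision function. The following Python sketch is illustrative only. It assumes that the expected address is the one recorded by the HPC Management Service and that the caller supplies the node's private and application network addresses. The function and parameter names are hypothetical.

```python
def classify_dns_result(resolved_ip, expected_ip, other_network_ips=()):
    """Classify one lookup as "Success", "Warning", or "Failure".

    resolved_ip is None when the lookup returned nothing or raised an
    exception. other_network_ips holds the target node's private and
    application network addresses (an assumed input for this sketch).
    """
    if resolved_ip is None:
        return "Failure"
    if resolved_ip == expected_ip:
        return "Success"
    if resolved_ip in other_network_ips:
        return "Warning"
    return "Failure"
```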

Domain Connectivity

Verifies connectivity between the selected nodes and each domain controller. The test uses a simple Lightweight Directory Access Protocol (LDAP) query to look up the Active Directory RootDSE object.

| Result | Description |
|---|---|
| Success | All of the selected nodes are able to reach all domain controllers. |
| Warning | One or more selected nodes are able to reach some, but not all, domain controllers. |
| Failure | One or more selected nodes cannot reach any domain controllers. |
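The way per-node reachability rolls up into one overall result can be sketched as follows. This Python example is a minimal illustration. It assumes that Failure takes precedence over Warning when both conditions apply, which the result descriptions do not state explicitly. The input format and function name are hypothetical.

```python
def domain_connectivity_result(reachable, domain_controllers):
    """Return "Success", "Warning", or "Failure" for the selected nodes.

    reachable maps each node name to the set of domain controllers
    that the node was able to reach.
    """
    all_dcs = set(domain_controllers)
    reached = [set(dcs) & all_dcs for dcs in reachable.values()]
    if any(not r for r in reached):
        # At least one node reached no domain controller at all.
        return "Failure"
    if any(r != all_dcs for r in reached):
        # At least one node reached some, but not all, controllers.
        return "Warning"
    return "Success"
```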

Internode Connectivity

Verifies network connectivity between compute nodes by performing a ping test between each node and all other nodes in the selected group.

| Result | Description |
|---|---|
| Success | Each node can ping all other nodes. |
| Failure | One or more nodes are unreachable, or the ping timed out before a response was received. |

System Configuration diagnostics

Application Configurations Report

This test provides a report of the applications, including the version numbers, that are installed on the selected nodes.

In the Summary section of the test results, you can see a table that lists all of the installed applications, and a count of the nodes that have that application installed.

Expand the results for an individual node to see the applications installed on that node.

Firewall Configurations Report

This test reports on the firewall status (enabled or disabled) for the selected nodes. This test also reports the applications or services that are allowed access through the firewall (the firewall exceptions), including which port number they are using.

In the Summary section of the test results, you can see a table that summarizes the firewall status for the selected nodes. You can also see a table that lists all of the firewall exceptions, and a count of the nodes that have that firewall exception.

Expand the results for an individual node to see the firewall status, on a per-interface basis, and the firewall exceptions for that node.

Installed Software Updates Report

This test provides a report of the software updates (patches) that have been installed on each selected node. This test can take a long time.

You can expand the results for an individual node to see the list of updates that have been installed on that node.

Network Configurations Report

This test provides a report of the network configuration for the selected nodes. For each relevant network (Private, Enterprise, and Application), this test reports the values for the following network settings:

  • Description of the network adapter

  • Speed

  • Gateway

  • DhcpEnabled (True/False)

  • DhcpServer

  • Domain

  • DnsServers

  • ID

  • Interface Type

  • IP Addresses

  • Online (True/False)

  • MAC Address

  • Role

Expand the results for an individual node to see the network settings for that node.

Pending Software Updates

This test provides an overall report of the software updates that are available for all the nodes as well as a list of updates that are available for each node. The test reports on the updates that are identified as critical by Windows Server Update Services (WSUS) or Microsoft Update for each node in the cluster.  The diagnostic communicates with the Microsoft Update client, which filters the updates so that only those that are relevant to the node are reported to the diagnostic.

Important

This test fails if the WinHTTP proxy is not set on the compute node. Run the netsh winhttp show proxy command to determine whether the compute nodes have a proxy server set.

Expand the results for an individual node to see the pending updates for that node.
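The proxy check described in the Important note can be scripted. The following Python sketch runs the command on the local Windows computer and inspects the output. It assumes the English-language output of netsh, which reports "Direct access (no proxy server)" when no proxy is configured. The function names are hypothetical.

```python
import subprocess


def winhttp_proxy_is_set(netsh_output):
    """Return True when the netsh output reports a proxy server.

    Assumes English-language output, which contains
    "Direct access (no proxy server)" when no proxy is configured.
    """
    return "Direct access" not in netsh_output


def check_local_proxy():
    """Run the command from the note on the local Windows computer."""
    completed = subprocess.run(
        ["netsh", "winhttp", "show", "proxy"],
        capture_output=True, text=True, check=True,
    )
    return winhttp_proxy_is_set(completed.stdout)
```

To check several compute nodes, the same command would have to be run on each node, and the parsing function could then be applied to each node's output.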

Service Configurations Report

This test provides a report of the services that are running on the selected nodes, and indicates the startup configuration for each service.

In the Summary section of the test results, you can see a table that lists all of the running services, and a count of the nodes that have that service running.

Expand the results for an individual node to see the running services for that node.

Software Updates Required

This test compares the software updates that are installed on the compute node with the updates specified in the node template Apply Updates task. The report indicates if any compute nodes failed to meet the required update level (None, Critical, All), or are missing the specific updates, as defined in the node template.

Note

The node template must include the Apply Updates task to run this test. If the node template does not include this task, you can either run the Pending Software Updates test to see a list of available updates, or you can add the task to the node template. For information about adding the update task to the node template, see Add the Apply Updates Task to a Node Template.

| Result | Description |
|---|---|
| Success | There are no updates required by the node's template that have not yet been installed. |
| Failure | There are outstanding updates that should be installed, according to the node's template. |

Expand the results for an individual node to see the missing updates for that node.
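For the case where the node template names specific updates, the comparison amounts to a set difference. The following Python sketch illustrates that idea only. It does not model the update levels (None, Critical, All), and the function names and update identifiers are hypothetical.

```python
def missing_updates(required, installed):
    """Return the required updates that are not installed, sorted."""
    return sorted(set(required) - set(installed))


def updates_result(required, installed):
    """Return "Success" when nothing is missing, otherwise "Failure"."""
    return "Failure" if missing_updates(required, installed) else "Success"
```

For example, with placeholder identifiers, `missing_updates(["KB-A", "KB-B"], ["KB-A"])` returns `["KB-B"]`, and the node would be reported as a failure.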

SOA diagnostics

SOA Model Latency

This test verifies network connectivity over Hypertext Transfer Protocol (HTTP) and NetTCP on the selected nodes or node groups.

This test attempts to detect a Windows Communication Foundation (WCF) broker node, and then runs the broker service on it. After a WCF broker node is detected, the test starts two Service-Oriented Architecture (SOA) sessions on each selected node: one session over HTTP and the other over NetTCP. The sessions are run sequentially on each of the selected nodes.

If an SOA session is successfully created on a node, the SOA Model test then sends one thousand messages to that session and calculates the average latency of the messages sent to and received from each session. The test then sorts the latency results into three categories and reports the nodes in each category.

This test reports how many nodes fall into one of the three latency response categories. The following table describes the latency categories:

| Latency category | Description |
|---|---|
| Low | Less than 5 milliseconds |
| Medium | Between 5 and 10 milliseconds |
| High | Greater than 10 milliseconds |
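The categorization can be illustrated with a short Python sketch that sorts average latencies into the three categories and counts the nodes in each. The sketch assumes that values of exactly 5 and exactly 10 milliseconds fall into the Medium category, because the table does not say how the boundaries are treated. The function names are hypothetical.

```python
from collections import Counter


def latency_category(latency_ms):
    """Map an average latency in milliseconds to a category name."""
    if latency_ms < 5:
        return "Low"
    if latency_ms <= 10:  # boundary handling is an assumption
        return "Medium"
    return "High"


def nodes_per_category(avg_latency_ms_by_node):
    """Count how many nodes fall into each latency category."""
    counts = Counter(latency_category(v) for v in avg_latency_ms_by_node.values())
    return {name: counts[name] for name in ("Low", "Medium", "High")}
```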

| Result | Description |
|---|---|
| Success | SOA sessions were successfully created on all of the selected nodes. |
| Failure | An SOA session was not started successfully on either the HTTP or NetTCP bindings on the failed nodes. |
| Failed To Run | The test failed to find a WCF broker node, and it will report a broker configuration error. |

SOA Service Configurations Report

This test reports on the Service-Oriented Architecture (SOA) service configuration for the selected nodes. This test displays the following information:

  • Service name

  • Location of the service assembly

  • Service type

  • Contract type

  • Architecture (x86 or x64)

  • Environment variables

Note

A service is displayed as being installed on all the compute nodes if the service registration file is installed on a file share designated by the CCP_SERVICEREGISTRATION_PATH environment variable, and the file share is readable by everyone.

Performance diagnostics

MPI Ping Pong: Lightweight Throughput

This test measures network throughput between each node and two of the nodes closest to it. The test is performed in a ring to provide a reasonable indication of cluster throughput in a minimal amount of time. Unlike latency measurements, throughput measurements stress a cluster’s network switches, and few networking topologies can provide fully non-blocking links under throughput measurement conditions, where 4 MB packets are ping-ponging on all links. Therefore, each link is measured serially and ring-style: the first node ping-pongs with the node next to it, then that node ping-pongs with the node next to it, and so on until the ping-ponging has moved around the entire set of nodes under test.

This method creates a cycle graph of the cluster network, or N pairings of nodes (N = number of nodes under test), instead of the much larger N*(N-1)/2 pairings of nodes that are required for a complete graph of the cluster network. The number of pairings in a complete graph grows quadratically with the number of nodes.
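The difference between the ring (cycle graph) pairings and the complete graph pairings can be shown with a short Python sketch. It only enumerates node pairs and does not run any measurement. The node names and function names are hypothetical.

```python
from itertools import combinations


def ring_pairs(nodes):
    """Cycle graph: pair each node with the next one, wrapping around."""
    n = len(nodes)
    return [(nodes[i], nodes[(i + 1) % n]) for i in range(n)]


def all_pairs(nodes):
    """Complete graph: every unordered pair of nodes."""
    return list(combinations(nodes, 2))


nodes = [f"node{i:02d}" for i in range(1, 65)]  # 64 hypothetical nodes
print(len(ring_pairs(nodes)))  # 64 pairings, which is N
print(len(all_pairs(nodes)))   # 2016 pairings, which is N*(N-1)/2
```

For 64 nodes, the ring requires 64 measurements, while measuring every pair requires 2016.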

This test reports node latency in the following categories:

| Latency category | Description |
|---|---|
| Low | Less than 5 milliseconds |
| Medium | Between 5 and 10 milliseconds |
| High | Greater than 10 milliseconds |

The test also reports the following statistics:

| Statistic | Description |
|---|---|
| Average | The average latency. |
| Standard deviation | The standard deviation in latency. |
| Best link | The computer names of the node pair with the lowest measured latency and the latency value. |
| Worst link | The computer names of the node pair with the highest measured latency and the latency value. |
| Histogram data | The number of network links measured in each of several latency ranges. |
| Variability rating | A qualitative indication of the consistency of latency across the entire cluster. Variability is the “width” of the bell curve (normal distribution). It is calculated as the standard deviation divided by the mean of the measured latency of all network links in the cluster (σ/x̄). Values below 0.05 are reported as “Low”, values from 0.05 to 0.25 as “Moderate”, and values above 0.25 as “High” variability. |
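The variability rating calculation can be sketched in a few lines of Python. The example assumes the population standard deviation, because the description does not say whether the population or sample form is used. It also assumes that the boundary values 0.05 and 0.25 are rated as Moderate. The function name is hypothetical.

```python
from statistics import mean, pstdev


def variability_rating(link_latencies):
    """Rate variability as the standard deviation divided by the mean."""
    ratio = pstdev(link_latencies) / mean(link_latencies)
    if ratio < 0.05:
        return "Low"
    if ratio <= 0.25:
        return "Moderate"
    return "High"
```

For example, link latencies of 9, 10, and 11 give a ratio of about 0.08, which is rated as Moderate.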

Note

Latency is measured by first “warming up” the network between each node pair (sending packets back and forth that are ignored in the MPI Ping Pong calculations) and then sending a 4-byte data packet from the first node to the second and back to the first. The latency for a given pair is calculated as the average (over 1024 iterations) of one-half the round-trip time, in microseconds. In practice, very little noise is introduced into the simultaneous latency measurements of each round because the packets are small, and therefore even heavily over-subscribed network switches do not impede the packets. If highly accurate measurements are required, use the command-line version of MPI Ping Pong (mpipingpong) to make latency measurements on each link serially.
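The arithmetic in this note can be written out as a short Python sketch. It assumes that the round-trip times have already been collected after the warm-up exchanges were discarded. The function name and the sample values are hypothetical.

```python
def one_way_latency_us(round_trip_times_us):
    """Average of one-half of each round-trip time, in microseconds."""
    halves = [rtt / 2 for rtt in round_trip_times_us]
    return sum(halves) / len(halves)


# 1024 hypothetical round-trip samples of 100 microseconds each.
samples = [100.0] * 1024
print(one_way_latency_us(samples))  # 50.0
```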

| Result | Description |
|---|---|
| Warning | The average latency for all network links for that node is at least one standard deviation away from the mean latency for the cluster, and latency is at least 20% higher than the mean latency for the cluster. |
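The warning condition combines two checks, which the following Python sketch makes explicit. The sketch assumes that the cluster mean and standard deviation are computed over the per-node average latencies and that the population standard deviation is used. The description does not specify either detail. The function name is hypothetical.

```python
from statistics import mean, pstdev


def nodes_with_warning(avg_latency_by_node):
    """Return the nodes that meet both parts of the warning condition."""
    values = list(avg_latency_by_node.values())
    cluster_mean = mean(values)
    cluster_std = pstdev(values)
    flagged = []
    for node, latency in avg_latency_by_node.items():
        # At least one standard deviation away from the cluster mean.
        far_from_mean = abs(latency - cluster_mean) >= cluster_std
        # At least 20% higher than the cluster mean.
        much_higher = latency >= 1.2 * cluster_mean
        if far_from_mean and much_higher:
            flagged.append(node)
    return flagged
```

With per-node averages of 10, 10, 10, and 20, the cluster mean is 12.5, and only the node with an average of 20 meets both conditions.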

MPI Ping Pong: Quick Check

This test measures the bandwidth and latency of node-to-node communication. Since this is a performance test, to get accurate results, run this test on nodes that are offline and not running other jobs.

This test runs MPI Ping Pong between every pair of nodes. For a large number of nodes, this test might take a long time to complete. Specifically, for a target group of N nodes, the test runs against N*(N-1)/2 node-to-node connections serially. Average node latency and bandwidth are measured for each connection, and results are categorized into high, medium, and low performance categories.

The same statistics are reported for this test as for the MPI Ping Pong: Lightweight Throughput test. 

| Result | Description |
|---|---|
| Warning | The average latency for all network links for that node is at least one standard deviation away from the mean latency for the cluster, and latency is at least 20% higher than the mean latency for the cluster. |