Understanding Diagnostic Tests

Microsoft HPC Pack provides a set of commonly used diagnostic tests. You can use these tests to help verify deployment, troubleshoot failures, and detect performance degradation. This topic describes the System tests that are included by default when you install HPC Pack. For information about the HPC Services for Excel tests, newly available tests, and creating custom diagnostics, see online diagnostics resources.

The System diagnostic tests are conceptually grouped by suite. The following sections in this topic describe the tests in each suite and, if applicable, the configurable parameters for the tests.

Deployment Environment Validator

The tests in this suite can help you find common problems that can affect bare metal node deployment. For more information, see Validate your Environment Before Deploying Nodes.

Note

For a head node that is configured as a failover cluster, the deployment environment validator tests check only the active head node. For additional checking, you might want to fail over and run the tests on the other head node. Typically, head nodes in a failover cluster have similar configurations, and it is uncommon for the two head nodes to have different issues. However, it can be useful to run the tests after a recovery, on the newly recovered or restored head node after a fatal failure.

Diagnostic Description
Deployment: DHCP Test Verifies DHCP server availability for all networks.
Deployment: DNS Test Verifies DNS server availability for all networks and reports the DNS server IP addresses.
Deployment: Credentials Test Verifies that the installation credentials are those of a valid HPC user. For more information, see Provide Installation Credentials.
Deployment: Active Directory Connectivity Test Verifies connectivity to the domain controller and reports the response time.
Deployment: IPsec Test Checks if Internet Protocol security (IPsec) is enabled on the Enterprise network. If IPsec is enforced on your domain through Group Policy, you may experience issues during deployment. For example, IPsec can disallow the compute nodes from talking to the head node by blocking the ports.
Deployment: Windows Deployment Services Test Verifies that the Windows Deployment transport service is turned on and that the Deployment Server is not installed. Windows Deployment Services enables remote Windows installation to PXE-enabled computers.

HPC Pack uses only the Transport Server role service in the Windows Deployment Services role. The Deployment Server role service does not need to be installed.
Deployment: Windows Image and Install Share Test Verifies that the installation image in each node template and the Windows Preinstallation Environment (Windows PE) image used for deployment are not missing, corrupt or locked by another process, and that the size of the Windows PE image does not exceed 300 MB. Verifies that the shared folder used for installation exists and has the correct permissions.

See also Understanding Node Templates.

Windows PE is used to prepare a computer for Windows installation, to copy disk images from a network file server, to initiate Windows Setup, and to capture the image of a node.
Deployment: NAT Test Verifies that Network Address Translation (NAT) is correctly configured on the head node, so that compute nodes can communicate with the Enterprise network in some topologies.
Deployment: Firewall Test Test added in HPC Pack 2008 R2 with Service Pack 1.

Verifies that the firewall is turned OFF for the network adapters in the Private and Application cluster networks (that is, that those network adapters are excluded from Windows Firewall). It also verifies that the required inbound and outbound firewall rules on the head node are properly configured.

For more information, see the Windows Firewall configuration section in HPC Cluster Networking.
Deployment: Ports Open Test Test added in HPC Pack 2008 R2 with Service Pack 1.

Verifies that the following TCP ports are open in Windows Firewall:

- 1856
- 6729
- 6730
- 9094
- 9095
- 9096
- 9794
- 9892
- 9893
- 9894

For information about the ports that are required by HPC Pack for communication between cluster services on the head node and the other nodes in the cluster, see the Windows Firewall configuration section in HPC Cluster Networking.
Deployment: Binding Order Test Test added in HPC Pack 2008 R2 with Service Pack 1.

Verifies that the Enterprise network is set as the first in the binding order on the default network gateways. If the Private network is listed before the Enterprise network, this can cause problems communicating with the Active Directory domain controller on the Enterprise network.
Deployment: HA Virtual Network Resources Test Test added in HPC Pack 2008 R2 with Service Pack 1.

If the head node is configured in a failover cluster for high availability, verifies that the virtual network resources for the failover cluster are correctly configured. After the head node is configured in a failover cluster, because the failover cluster is not tied to a single physical server, it cannot have the name and IP address of a physical server. The failover cluster must have a virtual head node name and a corresponding virtual IP address that is different than the physical names and IP addresses of the two head nodes in the failover cluster. This allows communications on the Enterprise and Private networks to contact the head node that is currently active at any given time by the virtual name and IP address, ensuring that communication will not break if the active head node fails and functionality switches over to the passive head node.

For more information, see Configuring Microsoft HPC Pack for High Availability of the Head Node.
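The TCP ports listed for the Deployment: Ports Open Test can also be probed by hand from another machine. The following is a minimal Python sketch, not the mechanism that the built-in test uses: it attempts a TCP connection to each port, so a port is reported as reachable only when the firewall allows it and a service is listening. The function name check_tcp_ports and the timeout value are illustrative assumptions.

```python
import socket

# TCP ports listed for the Deployment: Ports Open Test in this topic.
HPC_TCP_PORTS = [1856, 6729, 6730, 9094, 9095, 9096, 9794, 9892, 9893, 9894]


def check_tcp_ports(host, ports=HPC_TCP_PORTS, timeout=1.0):
    """Return {port: bool}; True when a TCP connection to host:port succeeds.

    This is a hypothetical helper. A False result can mean that the port is
    blocked or simply that no service is listening on it.
    """
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

For example, check_tcp_ports("headnode") would return one entry per listed port, where "headnode" stands in for the name of your head node.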

MPI Performance

The Message Passing Interface (MPI) ping-pong tests measure network latency and throughput between nodes on the cluster by sending packets of data back and forth between paired nodes repeatedly. The latency is the average of half of the time that it takes for a packet to make a round trip between a pair of nodes, in microseconds. The throughput is the average rate of data transfer between a pair of nodes, in MB/second. When you run the MPI ping-pong tests, you can specify the running mode and the network to use.
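The two definitions above can be written out as a short Python sketch. The function names are illustrative, and the choice of 1,000,000 bytes per MB is an assumption, because this topic does not state which convention the tests use.

```python
def latency_us(round_trip_times_s):
    """Average one-way latency in microseconds.

    Each sample is a round-trip time in seconds; the latency is the
    average of half of each round trip.
    """
    if not round_trip_times_s:
        raise ValueError("at least one round-trip sample is required")
    half_trips = [t / 2 for t in round_trip_times_s]
    return sum(half_trips) / len(half_trips) * 1e6


def throughput_mb_per_s(bytes_transferred, elapsed_s, bytes_per_mb=1_000_000):
    """Average transfer rate in MB/second (bytes_per_mb is an assumption)."""
    return bytes_transferred / elapsed_s / bytes_per_mb
```

With round trips of 100, 200, and 300 microseconds, latency_us reports about 100 microseconds. Transferring 50,000,000 bytes in 0.5 seconds gives a throughput of 100 MB/second under the stated assumption.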

Important

To get accurate results with the MPI ping-pong tests, run the tests in Serial mode (if available) and ensure that the nodes are not running jobs. If the nodes are running jobs, the tests do not return accurate measures of latency and throughput.

The running mode parameter has the following values:

  • Ring: In a ring test (also known as a nearest neighbor test), nodes send packets to each other one pair at a time in a ring pattern. While one pair of nodes runs the test, all other nodes remain idle. The first node pairs with its immediate neighbor. When the test on the first pair is complete, the next node similarly pairs with a neighbor. This sequential pairing and testing continues until the test covers all of the nodes in the HPC cluster and each node has paired with two neighbors, one in each direction around the ring.

    You can use Ring mode to obtain a reasonable indication of the performance of an HPC cluster in a minimal amount of time. The ring test takes less time than a serial or tournament-style test because each node is tested with only two neighbor nodes instead of with all nodes in the cluster.

  • Serial: Serial mode runs the MPI ping-pong test on one node pair at a time. While one pair of nodes runs the test, all other nodes remain idle. When one pair of nodes finishes the test, the test runs for another pair of nodes, and this testing of individual pairs proceeds serially until all possible pairs of nodes are tested.

    You can use Serial mode to thoroughly test all of the individual network links between nodes when the HPC cluster has a small number of nodes. This mode provides the most accurate measure of latency or throughput. Because the serial test runs the MPI ping-pong test on all possible pairs of nodes one pair at a time, the test can take a long time for large numbers of nodes.

  • Tournament: Tournament mode runs the MPI ping-pong test in multiple rounds, similar to a tournament. In each round, all of the nodes in the HPC cluster pair off. The two nodes in each pair send packets to each other, with all of the pairs exchanging packets in parallel. When one round is complete, another round begins, using a different set of node pairings than was used in previous rounds. The rounds continue until all possible node pairs have been tested. Tests in this mode complete the fastest and the network switches are most highly loaded.

    You can use Tournament mode to test the infrastructure of the specified network and how it performs when loaded. The measured latency and throughput are those of a loaded cluster and thus may not compare favorably with the manufacturer’s specifications for your networking hardware.
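The difference between the three running modes comes down to which node pairs are tested and how many run at once. The following Python sketch illustrates the pairings only. The order in which HPC Pack actually pairs nodes is not described in this topic, and the tournament schedule here uses the standard round-robin "circle method" as a stand-in. All function names are illustrative, and the ring example assumes at least three nodes.

```python
from itertools import combinations


def ring_pairs(nodes):
    """Ring mode: each node pairs with its next neighbor, one pair at a time."""
    n = len(nodes)
    return [(nodes[i], nodes[(i + 1) % n]) for i in range(n)]


def serial_pairs(nodes):
    """Serial mode: every possible pair, tested one after another."""
    return list(combinations(nodes, 2))


def tournament_rounds(nodes):
    """Tournament mode: rounds of disjoint pairs that run in parallel.

    Uses the round-robin circle method (an assumption, not the documented
    HPC Pack schedule) so that every pair meets exactly once.
    """
    players = list(nodes)
    if len(players) % 2:
        players.append(None)  # odd node count: one node sits out each round
    n = len(players)
    rounds = []
    for _ in range(n - 1):
        pairs = [(players[i], players[n - 1 - i]) for i in range(n // 2)]
        rounds.append([p for p in pairs if None not in p])
        # Keep the first entry fixed and rotate the others by one position.
        players = [players[0]] + [players[-1]] + players[1:-1]
    return rounds
```

For four nodes, the ring produces four pairs, serial produces six pairs tested one at a time, and the tournament covers the same six pairs in three rounds of two parallel pairs. This is why Ring mode is the quickest check, Serial mode takes longest on large clusters, and Tournament mode loads the switches most heavily.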

Note

You can run the tests with additional arguments and get additional output by using the mpipingpong command.

Diagnostic Description
MPI Ping Pong: Latency This test measures the bandwidth and latency of node-to-node communication. Since this is a performance test, to get accurate results, run this test on nodes that are offline and not running other jobs.

Parameters: You can specify the network to use for the test and the running mode.

By default, this test runs in Tournament mode. When you use Tournament mode to measure latency, the test introduces little noise into the simultaneous latency measurements of each round, because the packets are small and thus even heavily over-subscribed network switches do not impede the packets. To obtain more accurate latency measurements, if necessary, use Serial mode to test all pairs serially.
MPI Ping Pong: Throughput Measures network throughput between adjacent nodes on the cluster.

Parameters: You can specify the network to use for the test and the running mode (Serial or Tournament).

By default, this test runs in Serial mode.
MPI Ping Pong: Simple Throughput Measures network throughput between adjacent nodes on the cluster.

Parameters: You can specify the network to use for the test.

Measures throughput only between pairs of adjacent nodes in the cluster using Ring mode. This provides a reasonable verification of connectivity across the specified network. For more accurate throughput measurements, run the MPI Ping Pong: Throughput test.

Network Status

The tests in this suite can help you verify the configuration of your cluster network. There are no parameters that you can configure for these tests.

Diagnostic Description
Firewall Configuration Report Reports on the firewall status (enabled or disabled) for the selected nodes. This test also reports the applications or services that are allowed access through the firewall (the firewall exceptions), including which port number they are using.

See also Understanding Firewall Configuration for HPC Networks.
Network Configuration Report Reports on the configuration of the network adapters for each selected node.

Network Troubleshooting

The tests in this suite can help you verify network connectivity for cluster nodes.

Diagnostic Description
DNS Test Verifies Domain Name System (DNS) name resolution between the selected nodes.
Domain Connectivity Test Verifies connectivity between the selected nodes and each domain controller.
Ping Test Verifies network connectivity between the selected nodes by performing a ping test between each node and all other nodes in the selected group.

Parameters: You can specify the network to use for the test and the number of pings per node.

Note

The HPC Pack Tool Pack includes the Network Troubleshooting Report, an additional diagnostic test that collects and analyzes network information in your HPC Pack-based cluster to help troubleshoot network issues. If you have an InfiniBand network, the report also includes the status and capabilities of the Host Channel Adapter (HCA) cards in that network. For more information, see Install and Use the Network Troubleshooting Report Diagnostic Test.

Services

The tests and reports in this suite can help you verify that the required HPC services are running on the selected nodes and troubleshoot service errors.

Diagnostic Description
Service Configuration Report Reports all the running services that are installed on the selected nodes and their startup configuration setting.
Service Status Report Reports on HPC events in the event log for the selected nodes.

Parameters: You can specify the Hour count to indicate how far back to check the event log (between 1 and 50 hours ago). You can also limit the number of events to report by setting the Log count parameter (1-100).
Services Running Test Verifies that the HPC services are running on the selected nodes. Expected services are determined by the role of the target node (head node, compute node, or WCF broker node). This test might report the status of optional services, if they are present, but it only validates against the required services.

SOA

The SOA Service Loading Test verifies that the DLLs for the specified service can be loaded on the specified nodes, and that any detected dependencies for the DLL are present on the nodes. By default, this test uses the built-in CcpEchoSvc service to verify SOA functionality on the cluster.

To verify that a particular service can be loaded, you can specify the name of the service in the test parameter. When you run the SOA Service Loading Test from Configuration (see Managing SOA Services in Microsoft HPC Pack), the service that you select is automatically specified in the parameter for the test.

System Configuration

The reports in this suite provide information about application configuration and software updates on the selected nodes.

Diagnostic Description
Active Power Scheme Report Test added in HPC Pack 2012.

Reports on the active power scheme (plan) and lists all existing power schemes that are configured in the operating system on the selected nodes.
Application Configuration Report Reports the applications, including the version numbers, that are installed on the selected nodes. The results include a table that lists all of the installed applications, and a count of the nodes that have that application installed. You can also view results by node.
Available Software Updates for Node Report Reports the software updates that are available for the selected nodes. The test reports on the updates that are identified as critical by Windows Server Update Services (WSUS) or Microsoft Update. The diagnostic communicates with the Microsoft Update client, which filters the updates so that only those that are relevant to the node are reported to the diagnostic.

This test fails if the winhttp proxy is not set on the compute node. Run the netsh winhttp show proxy command to determine if the nodes have a proxy server set.

For more information about applying updates by using an enterprise WSUS server or by using a node template, see the Best Practices topic in the updating nodes step-by-step guide.
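To check the proxy setting on several nodes, the output of the netsh winhttp show proxy command can be captured and inspected by a script. The following Python sketch is one way to do that. It assumes English-language output in which an unset proxy is reported with the phrase "Direct access"; that phrase, the sample host name, and the function names are assumptions, not part of HPC Pack.

```python
import subprocess


def has_winhttp_proxy(netsh_output):
    """Return True unless the output reports direct access (no proxy).

    Assumes English output from 'netsh winhttp show proxy'.
    """
    return "direct access" not in netsh_output.lower()


def show_winhttp_proxy():
    """Run the command from the text on the local Windows node."""
    completed = subprocess.run(
        ["netsh", "winhttp", "show", "proxy"],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

On a Windows node, has_winhttp_proxy(show_winhttp_proxy()) returns False when no WinHTTP proxy is configured, which is the condition under which the report is described as failing.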
HPC Soft Card KSP Test Test added in HPC Pack 2008 R2 with Service Pack 2.

Reports whether the HPC soft card key storage provider (KSP) is installed on the selected cluster nodes. This setting enables soft card authentication when running tasks on the nodes.

The KSP is a separate installation that is installed only on the head node and the compute nodes. It does not need to be installed on the client nodes.

The KSP component is used to perform the Smart Card logon for the tasks that run on compute nodes. The KSP is just used on machines where tasks are run.

If the test fails: The HPC soft card KSP is not installed on this computer. For information about installing it to enable soft card authentication, see the Microsoft HPC Pack release notes.

If the test passes: The HPC soft card KSP is installed on this computer.
Missing/Required Software Updates from Template Report Compares the software updates that are installed on the selected nodes with the updates specified in the Apply Updates task in the node template. The report indicates if any compute nodes failed to meet the required update level (None, Critical, All), or are missing the specific updates, as defined in the node template.

If this diagnostic reports that required updates are missing, take the indicated nodes offline and run the Maintain action. See Run Maintenance Tasks on Nodes.

The node template must include the Apply Updates task to run this test. If the node template does not include this task, you can either run the Available Software Updates for Node Report to see a list of available updates, or you can add the task to the node template. For information about adding the update task to the node template, see Add the Apply Updates Task to a Node Template.
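Conceptually, the comparison in the Missing/Required Software Updates from Template Report is a set difference between the updates that the node template requires and the updates installed on each node. The following Python sketch shows that idea only. It is not how the report is implemented, and the update identifiers and node names used with it are hypothetical.

```python
def missing_updates(required, installed):
    """Return the required update IDs that are not installed, sorted."""
    return sorted(set(required) - set(installed))


def nodes_needing_maintenance(required, installed_by_node):
    """Map each node that is missing updates to its missing update IDs."""
    report = {}
    for node, installed in installed_by_node.items():
        missing = missing_updates(required, installed)
        if missing:
            report[node] = missing
    return report
```

Any node that appears in the result of nodes_needing_maintenance corresponds to a node that the report would flag, and that you would take offline before running the Maintain action.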
Software Updates Installed on Nodes Report Reports the updates that are installed on the selected nodes.

Windows Azure

Suite added in HPC Pack 2008 R2 with Service Pack 2.

The tests in this suite can help you verify that you can deploy and run jobs on the Windows Azure nodes in your cluster.

Diagnostic Description
Windows Azure Firewall Ports Test Performs a simple test to verify communication from the head node to Windows Azure through any existing internal and external firewalls. This test always runs using the default diagnostic test credentials. You can run this test before deploying Windows Azure nodes to help ensure that any existing firewall is configured to allow deployment, scheduler, and broker communication between the head node and Windows Azure.

This test checks outbound communication on selected TCP ports from the head node to the hpcazureportcheck.cloudapp.net service in Windows Azure. The hpcazureportcheck.cloudapp.net service is hosted by Microsoft, to provide a communication endpoint for this test. Important: hpcazureportcheck.cloudapp.net is not related to any Windows Azure hosted service that you use for your Windows Azure node deployments. You do not need it to deploy Windows Azure nodes in your cluster, since it is only used by the Windows Azure Firewall Ports Test.

The firewall ports that the test checks are those that are required by the version of HPC Pack that is installed on the head node (starting with HPC Pack 2008 R2 with SP2). If you have installed HPC Pack 2008 R2 with SP2, communication on the following TCP ports is tested:

- 80
- 443
- 3389
- 5901
- 5902
- 7998
- 7999

If you have installed at least HPC Pack 2008 R2 with SP3, communication on the following TCP ports is tested by default:

- 443
- 3389

Important:
  • A failure can indicate that a port is blocked by your corporate firewall. If you have already unblocked all the listed ports and are still seeing a failure, it could mean that a proxy server or client, a software firewall, or other device that manages Internet traffic is not configured to allow the HPC services to communicate with Windows Azure.
  • Successful test results do not guarantee that the head node can communicate properly with a hosted service that you use for your Windows Azure node deployments.
  • If you choose to enable firewall access for this test, it is recommended that you enable access to the hpcazureportcheck.cloudapp.net hostname instead of its IP address, since the latter can change.
  • If you have installed at least HPC Pack 2008 R2 with SP3, you can configure a registry setting so that the head node communicates with Windows Azure using the network firewall ports that are required for HPC Pack 2008 R2 with SP2 instead of the default ports that are required for HPC Pack 2008 R2 with SP3. If you do this, the test checks communication on the ports that are required for HPC Pack 2008 R2 with SP2.


For more information about firewall ports for Windows Azure, see Requirements for Windows Azure Nodes in Microsoft HPC Pack.
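The rules above for which ports the Windows Azure Firewall Ports Test checks can be summarized in a few lines of Python. This sketch only models the selection logic described in this topic. The use_sp2_ports flag is a hypothetical stand-in for the registry setting, whose name is not given here, and the function name is illustrative.

```python
# Ports listed in this topic for each service pack level of HPC Pack 2008 R2.
SP2_PORTS = [80, 443, 3389, 5901, 5902, 7998, 7999]
SP3_DEFAULT_PORTS = [443, 3389]


def azure_ports_to_test(service_pack, use_sp2_ports=False):
    """Return the TCP ports that the test is described as checking.

    service_pack is the HPC Pack 2008 R2 service pack level (2, 3, ...).
    use_sp2_ports models the registry setting that switches an SP3 or
    later head node back to the SP2 port set (hypothetical flag).
    """
    if service_pack < 2:
        raise ValueError("the test is described starting with SP2")
    if service_pack == 2 or use_sp2_ports:
        return list(SP2_PORTS)
    return list(SP3_DEFAULT_PORTS)
```

For example, azure_ports_to_test(3) returns the two default ports, while azure_ports_to_test(3, use_sp2_ports=True) returns the seven ports required for SP2.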
Windows Azure MPI Communication Test Runs a simple ping-pong test between pairs of Windows Azure nodes to ensure that MPI communication is working properly. This test runs only on Windows Azure nodes, and on nodes in the same deployment.
Windows Azure Report Reports on the names of the role instances for Windows Azure nodes that have been deployed.

Important: After installation of HPC Pack 2008 R2 SP3, this test no longer provides the names of the role instances for the Windows Azure nodes. To work around this problem, you can run the following command on each node for which you want to see the name:

    Set COMPUTERNAME

You can also use a clusrun command, or create a new diagnostic test, to run this command on a group of nodes.
Windows Azure Services Connection Test Verifies that services running on the head node can connect to Windows Azure by using the subscription IDs and certificates specified in the Windows Azure node templates. This test always runs using the default diagnostic test credentials.

Parameter: You can specify the node template to use for the test. By default this test uses all node templates.
Windows Azure Virtual Network Test Test added in HPC Pack 2012.

Performs a test to verify that the Windows Azure Virtual Network sites configured in all the Windows Azure node templates are valid.

There are no parameters that you can configure for this test.

Additional references