Troubleshooting Compute Node Deployment

 

Applies To: Windows HPC Server 2008 R2, Windows HPC Server 2008

You need administrator privileges to perform the procedures in this topic.

If you are experiencing a problem during the Pre-Boot Execution Environment (PXE) boot on a compute node during bare metal deployment, see Troubleshooting PXE Boot Failures During Baremetal Node Deployment.

Problem joining domain error

An error message similar to "Problem joining domain" in the provisioning log can have multiple causes. One of the most common causes is that the compute nodes on the cluster do not have IP addresses on the enterprise network. If you see this error, verify on the head node that Network Address Translation (NAT) is enabled on the enterprise network.

Default node template for compute nodes does not include client utilities

In Windows HPC Server 2008, the default compute node templates do not install HPC PowerShell, HPC Cluster Manager, and HPC Job Manager. In many cases, the client utilities are not needed for compute nodes, but some applications may need them installed.

If you need your Windows HPC Server 2008 compute nodes to run an application that requires the client utilities, you can edit the node template to include a custom OS command that installs the client utilities. For more information, see Install HPC PowerShell, HPC Cluster Manager, and HPC Job Manager on the Compute Nodes.

Note

In Windows HPC Server 2008 R2, the Install HPC Pack node template configuration task automatically installs the client utilities during node deployment.

How to reset the compute node naming series to reuse the name of a deleted node

The name of a compute node is not reused after you delete a compute node. When you next deploy a node from bare metal, it is named automatically with a name that was not assigned previously. Because of this behavior, you may notice gaps in the sequence of compute node names on your cluster.

Cause

During bare metal deployment, Windows HPC Server 2008 automatically assigns the computer name for the compute node. This name is generated in sequence according to a naming series that you specify in the Specify Compute Node Naming Series dialog box. By default, a computer name in the naming series is not reused after a compute node is deleted.

Resolution

You can reset the node naming series so that a gap in the sequence of names of the compute nodes is filled when you next deploy a compute node.

To reset the node naming series

  1. If HPC Cluster Manager is not already open on the head node, open it. Click Start, point to All Programs, click Microsoft HPC Pack, and then click HPC Cluster Manager.

  2. In Configuration, in the Navigation Pane, click To-do List.

  3. In the To-do List, click Configure the naming of new nodes. The Specify Compute Node Naming Series dialog box opens.

  4. Do one of the following:

    • To reset the existing node naming series, click OK.

    • To use a different node naming series, specify the new series, and then click OK.

You can also reset the node naming series by running the HPC PowerShell cmdlet Set-HpcClusterProperty with the -NodeNamingSeries parameter. For more information, see Set-HpcClusterProperty.
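The difference between the default behavior and a reset naming series can be illustrated with a small model. This is only a sketch for reasoning about the behavior, not the way HPC Pack implements naming: the `PREFIX%start%` pattern syntax, the function names, and the node names are all assumptions made for the example.

```python
import re

def make_namer(series):
    """Return (next_name, reset) for a pattern such as 'NODE%1%'.

    Hypothetical model: the number between the percent signs is treated as
    the first value in the sequence.
    """
    match = re.fullmatch(r"(.*)%(\d+)%(.*)", series)
    if match is None:
        raise ValueError("series must contain one %<start>% placeholder")
    prefix, start, suffix = match.group(1), int(match.group(2)), match.group(3)
    state = {"counter": start}

    def next_name(names_in_use):
        # Advance the counter on every attempt, so a value that was handed
        # out once is not handed out again (the default behavior).
        while True:
            candidate = f"{prefix}{state['counter']}{suffix}"
            state["counter"] += 1
            if candidate not in names_in_use:
                return candidate

    def reset():
        # Restart the scan from the first value, so that gaps left by
        # deleted nodes are found and filled.
        state["counter"] = start

    return next_name, reset

# Illustration with made-up node names.
next_name, reset = make_namer("NODE%1%")
cluster = set()
for _ in range(3):
    cluster.add(next_name(cluster))   # NODE1, NODE2, NODE3
cluster.discard("NODE2")              # delete a node
after_delete = next_name(cluster)     # the gap at NODE2 is skipped
cluster.add(after_delete)
reset()
after_reset = next_name(cluster)      # the gap at NODE2 is filled
```

In this model, `after_delete` is `NODE4` and `after_reset` is `NODE2`, which mirrors the gap and the gap-filling described in this section.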

Note

If you want, you can rename a compute node after it is deployed. For example, you can change the computer name in Control Panel on the compute node. However, to use the renamed compute node for compute jobs in your Windows HPC Server 2008 cluster, you must take the node offline until the name is discovered in your cluster.

Compute node deployment fails during disk partitioning

On certain computers, the disk partitioning script Diskpart.txt that runs during the Configuration phase of bare metal deployment can fail, causing the node deployment to fail.

Cause

In Windows Preinstallation Environment (Windows PE) on some computers, the C: or D: volume letter is automatically assigned to a drive (such as a DVD drive) that is not a hard disk drive. The default Diskpart.txt script that subsequently partitions the hard disk on a compute node requires that the C: letter is assigned to a hard disk drive. If it is not, the Diskpart script fails.

Workaround

You can modify the Diskpart.txt script to remove existing volume mappings for disks that are not hard disk drives. For example, add the following lines to the script to remove the existing mappings for the C: and D: drives:

Select volume 0
Remove
Select volume 1
Remove
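If you maintain more than one variant of the script, the edit can be generated rather than typed by hand. The following is a minimal sketch, assuming that the removal lines belong at the top of the script so that they run before the partitioning commands. The function name and the stand-in script contents are hypothetical. Volume numbers 0 and 1 are taken from the example above. On a given computer, the `list volume` command in Diskpart shows the actual numbers.

```python
def prepend_volume_removals(script_text, volume_numbers):
    """Return script_text with 'Select volume N' / 'Remove' pairs added at the top."""
    lines = []
    for number in volume_numbers:
        lines.append(f"Select volume {number}")
        lines.append("Remove")
    if not lines:
        return script_text
    return "\n".join(lines) + "\n" + script_text

# Hypothetical stand-in for an existing script; not the contents of the
# default Diskpart.txt file.
existing_script = "select disk 0\nclean\n"
updated_script = prepend_volume_removals(existing_script, [0, 1])
```

Review the generated text before you copy it into Diskpart.txt, because removing the letter from the wrong volume can itself cause the deployment to fail.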

Compute node deployment hangs during the installation of the operating system

Under certain conditions, during the installation of the Windows operating system on a compute node that is being deployed from bare metal, the installation pauses indefinitely and a blank screen appears.

To work around this issue and to complete the deployment of the compute node successfully, restart the compute node. You can restart the node manually, or you can restart the node by using a power control script, such as an Intelligent Platform Management Interface (IPMI) script. For more information about using a power control script, see Appendix 5: Scriptable Power Control Tools.

To help prevent this issue, make sure that your Windows HPC cluster is updated with the most recent service pack for HPC Pack. You can download the latest service pack for HPC Pack 2008 at HPC Pack 2008 Service Pack 2 (SP2) (https://go.microsoft.com/fwlink/?LinkId=196363). For information about service packs for HPC Pack 2008 R2, see Windows HPC Server 2008 R2: Getting Started.

Compute node deployment hangs after the installation of the operating system

On certain computers, compute node deployment from bare metal pauses indefinitely after the installation of the operating system is finished. The compute node may restart continuously and fail to reboot to complete the deployment.

Cause

This issue can occur because of a problem with a device driver (such as a driver for a network adapter, or for a SATA controller) that is required for the compute node to boot properly. The problem may be one of the following:

  • A device driver that is required for the operation of the devices on the compute node is not included in the operating system image that was installed.

  • A device driver in the operating system image is corrupted, outdated, or unsigned.

You might experience this issue if the compute node computer is designed for a version of the operating system that is more recent than the one in the operating system image that was installed.

Resolution

Verify that the operating system image includes all of the device drivers that are required for the proper operation of the devices on each compute node. Then, try again to deploy the compute nodes.

As a best practice, test the installation of the operating system and any required drivers on a single representative node before you prepare an image to deploy on multiple compute nodes that have the same hardware configuration. If you need to install additional device drivers along with your operating system image, see Add Drivers for Operating System Images.

For prerequisites and troubleshooting steps for installing Windows Server 2008, see Installing Windows Server 2008 (https://go.microsoft.com/fwlink/?LinkID=119578).

For more information about devices that are compatible with Windows HPC Server 2008, see Windows HPC Server Hardware Compatibility.

Provisioning an existing compute node fails because of a reboot error

Under certain conditions, when you apply the Assign Node Template action (or the Reimage action) to an existing compute node, the provisioning of the compute node fails at the beginning of the process because the compute node fails to reboot (restart). In the provisioning log for the compute node, an error message similar to the following may appear: Waiting for the node to reboot.

To work around this issue, first manually reboot the compute node. Then, try again to provision the node: right-click the node, and then click Assign Node Template (or click Reimage).

Timeout or authorization failure during PXE boot when adding a compute node by importing a node XML file

Under certain conditions, when you add a compute node by importing a node XML file, provisioning fails because the compute node fails to boot in the Pre-Boot Execution Environment (PXE). This can happen when the compute node does not receive an IP address from the head node during the PXE request, so that the PXE request times out. An error message in the provisioning log may state that no PXE boot image file name was provided. Alternatively, the compute node may receive an IP address from the head node, but the PXE boot is not authorized.

To work around this issue, cancel any compute node deployments that are in progress, and then restart the HPC SDM Store Service and the HPC Management Service. Then, restart HPC Cluster Manager and try to add the node again.

To restart the HPC SDM Store Service and HPC Management Service

  1. Log on to the head node as a user with administrative permissions.

  2. Open the Services snap-in: Click Start, point to Administrative Tools, and then click Services.

  3. Right-click HPC SDM Store Service, and then click Restart. In the Restart Other Services dialog box, click Yes.

    Restarting the HPC SDM Store Service also restarts the HPC Management Service.

  4. Close the Services snap-in.

Note

You can also stop and restart the HPC Management Service and the HPC SDM Store Service in an elevated Command Prompt window by running the following commands:

net stop hpcmanagement
net stop hpcsdm
net start hpcmanagement
net start hpcsdm

Restart task fails during the Configuration phase of a node template

If you add a Restart task to the Configuration phase of a node template before the Install Windows task, the deployment of a compute node using that node template fails.

As a best practice, add a Restart task only in the Deployment phase of a node template.

For more information about node templates, see Appendix 3: Node Template Tasks and Properties in the Windows HPC Server 2008 Design and Deployment Guide.

How to edit a node template to deploy a new operating system image

After you configure a node template to deploy an operating system image (a Windows Image Media, or .wim, file) to a compute node, it is possible to use Node Template Editor to edit the node template to deploy a different .wim file that is available in the image store on the head node. However, you must edit each task in the Configuration phase of the node template that includes a path to the .wim file. For example, you may need to edit a property in each of several tasks, including Multicast Copy (or Unicast Copy), Apply WIM Image, and Install Windows. Deployment can fail if all of the references to the .wim file in the node template are not updated.

As a best practice, first copy the node template, and then edit the copy to deploy a different .wim file. Then, test the edited node template on a representative compute node. Alternatively, use the Create Node Template Wizard to create a separate node template for each operating system image that you need to deploy.
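One way to catch a reference that was missed is to scan a saved copy of the node template for .wim file names. The following is a rough sketch that treats the template as plain text and assumes only that the file names appear literally in it. The function names and the sample fragment are hypothetical, and the fragment does not reproduce the actual node template schema.

```python
import re

# A run of path-like characters that ends in ".wim" (case-insensitive).
WIM_PATTERN = re.compile(r"[^\s\"'<>=]+\.wim\b", re.IGNORECASE)

def find_wim_references(template_text):
    """Return the distinct .wim file names mentioned in template_text."""
    names = set()
    for match in WIM_PATTERN.finditer(template_text):
        # Keep only the file name, whichever path separator was used.
        names.add(match.group(0).replace("/", "\\").split("\\")[-1])
    return sorted(names)

def stale_wim_references(template_text, expected_wim):
    """Return the .wim names in template_text that differ from expected_wim."""
    return [name for name in find_wim_references(template_text)
            if name.lower() != expected_wim.lower()]

# Made-up fragment for illustration only.
sample = """
<Task Name="Unicast Copy"><Value>Images\\new.wim</Value></Task>
<Task Name="Apply WIM Image"><Value>Images\\new.wim</Value></Task>
<Task Name="Install Windows"><Value>Images\\old.wim</Value></Task>
"""
leftover = stale_wim_references(sample, "new.wim")
```

Here `leftover` contains `old.wim`, which indicates that one task still points to the previous image. A plain-text scan of this kind can only flag candidates. Confirm each one in Node Template Editor.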

For more information about node templates, see Appendix 3: Node Template Tasks and Properties in the Windows HPC Server 2008 Design and Deployment Guide.

Operating system specified in a node template is not installed on nodes that have HPC Pack already installed

In the following scenario, nodes that are assigned a node template that includes a step to deploy an operating system do not install the operating system that is specified in the template:

  • HPC Pack is already installed on the nodes, either because the nodes were previously deployed from bare metal, or because preconfigured nodes were added to the cluster. For example, the nodes might have HPC Pack 2008 R2 and an edition of Windows Server 2008 installed.

  • The nodes are then deleted from the cluster, but the nodes remain powered on.

  • A node template is assigned, specifying an operating system different from the one that is already installed on the nodes. For example, the new node template might specify an edition of Windows Server 2008 R2.

Under these conditions, the operating system on the nodes is not changed. Instead, only the tasks in the Maintenance phase of the node template are performed. This might cause the information about the nodes in HPC Cluster Manager to be misleading.

To work around this problem, after assigning the new node template, reimage the nodes. Alternatively, restart the nodes and redeploy the nodes from bare metal using the new node template. For more information, see Deploy Nodes from Bare Metal.

A read-only operating system image cannot be deleted using HPC Cluster Manager

If an operating system image (a Windows Image Media, or .wim, file) in the image store on the head node has the read-only attribute set, you cannot delete the image by using the Delete Image action in Configuration in HPC Cluster Manager. The read-only attribute on a .wim file could be set if you import (load) the .wim file from a read-only shared folder or from read-only media.

To work around this problem, use the file system on the head node to clear the read-only attribute on the .wim file (or otherwise change the file permissions), and then delete the file from the image store. By default, the .wim files for HPC Cluster Manager are stored in the following folder on the head node: %CCP_DATA%InstallShare\Images.

After you delete the .wim file using the file system, you can use the Delete Image action in Configuration to remove the image name from HPC Cluster Manager.
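The file-system step can also be scripted. The following is a minimal sketch in Python, assuming that the read-only attribute is the only thing that blocks the deletion and that the account running the script has permission to modify the file. The function name is hypothetical, and the path to pass in is the full path of the .wim file in the image store folder named above.

```python
import os
import stat

def delete_read_only_file(path):
    """Clear the read-only attribute on path, and then delete the file."""
    # On Windows, os.chmod can only set or clear the read-only flag;
    # stat.S_IWRITE clears it so that the file can be removed.
    os.chmod(path, stat.S_IWRITE)
    os.remove(path)
```

Removing the file this way does not remove the image name from HPC Cluster Manager, so the Delete Image action is still needed afterward.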

Note

You cannot use HPC Cluster Manager to delete an operating system image that is currently associated with a node template. You must edit or delete each node template to which the image is associated before you can delete the image.

A compute node name is duplicated during deployment of preconfigured compute nodes

Because of a timing issue when you are adding preconfigured compute nodes to your cluster, it is possible for two compute nodes to receive the same name. This problem can occur when the following conditions are true:

  • Windows Deployment Service mode is set to Respond to all PXE Requests.

  • You are adding one or more preconfigured nodes to the cluster.

  • An additional computer that is not a preconfigured node or a previously installed compute node PXE boots on the private network.

Under these conditions, HPC Cluster Manager could assign the same node name to both the additional computer and another compute node. In HPC Cluster Manager, in Node Management, two nodes with the same name will appear.

To determine which of the duplicated compute nodes to delete from the cluster, you can view the detailed properties of each node. To view the properties of a node, in Node Management, double-click the name of the node to open the Node Properties dialog box. You can use the information on the Properties tab or the Network tab of the Node Properties dialog box to help distinguish the two nodes. For example, you might be able to distinguish the two nodes by using the MAC address of the network adapter that is connected to each of the networks.
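If you have copied the node names and MAC addresses into a list (for example, by hand from the Node Properties dialog box), a short script can group them so that the duplicated names stand out. The following is a sketch only. The record layout, the function name, and the sample values are made up for illustration and do not come from HPC Cluster Manager.

```python
from collections import defaultdict

def duplicated_names(nodes):
    """Map each node name that occurs more than once to the MAC addresses seen for it."""
    by_name = defaultdict(list)
    for node in nodes:
        by_name[node["name"]].append(node["mac"])
    return {name: macs for name, macs in by_name.items() if len(macs) > 1}

# Made-up records for illustration.
nodes = [
    {"name": "NODE1", "mac": "00-00-00-00-00-01"},
    {"name": "NODE2", "mac": "00-00-00-00-00-02"},
    {"name": "NODE2", "mac": "00-00-00-00-00-03"},
]
duplicates = duplicated_names(nodes)
```

In this sample, `duplicates` holds one entry, for `NODE2`, with both MAC addresses. Comparing those addresses with the adapters in the physical computers shows which entry belongs to the unexpected computer.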

To prevent this problem from occurring, you can do one of the following:

  • When you use the Add Node Wizard to add preconfigured nodes, turn on only preconfigured nodes. Do not turn on another computer that can make a PXE request to the head node on the private network. For more information about adding preconfigured nodes, see Add Preconfigured Compute Nodes.

  • Set the Windows Deployment Service mode to Respond only to PXE requests that come from existing nodes. For more information, see Set the Windows Deployment Services Mode.

A node name is duplicated after assignment of a node template in a failover cluster

You may see a duplicate node name in Node Management in HPC Cluster Manager in a Windows HPC Server 2008 R2 cluster when the following conditions are true:

  • The head node is configured for high availability in the context of a failover cluster.

  • You are adding a preconfigured node to the cluster (for example, a preconfigured compute node or a workstation node).

  • You assign a node template to the preconfigured node immediately after the node name appears in the Unapproved state in HPC Cluster Manager.

If a node name is duplicated under these conditions, you can delete the duplicated node name and then reassign the node template to the node.

To remove a duplicate node name

  1. If HPC Cluster Manager is not already open on the head node, open it. Click Start, point to All Programs, click Microsoft HPC Pack, and then click HPC Cluster Manager.

  2. In Node Management, in the Navigation Pane, click Nodes.

  3. Cancel the provisioning of the preconfigured node by doing the following:

    • In the view pane, click the name of the preconfigured node that is being provisioned.

    • In the Detail Pane, on the Properties tab, click Cancel Operations.

  4. In the view pane, right-click a duplicated node name, and then click Delete.

  5. Turn off the preconfigured node, and then turn it back on.

  6. In HPC Cluster Manager, in Node Management, in the Navigation Pane, click Nodes.

  7. After the name of the preconfigured node appears, right-click the name and then click Assign Node Template.