Step 6: Start the Windows Azure Worker Nodes
Updated: November 14, 2011
Applies To: Windows HPC Server 2008 R2
To provision the role instances in Windows Azure, you must start the worker nodes that you added to the HPC cluster. Then you bring the nodes online so that they are available to run cluster jobs.
How the worker nodes are started and brought online depends on the availability policy that you configured in the Windows Azure node template, as follows:
-
Automatic: If you configured a policy to start the nodes (and bring them online) and stop them automatically, the nodes are placed in the Online state during one or more configured intervals each week. No further action is required.
-
Manual: If you configured a policy to start and stop the nodes manually, you must first start the worker nodes, and then bring them online to make them available to run cluster jobs.
-
In Node Management, in the Navigation Pane, click Nodes.
-
In the List or Heat Map view, select one or more worker nodes.
-
In the Actions pane, click Start. The Start Azure Worker Nodes dialog box appears.
-
If you selected worker nodes that were added by using different worker node templates, select a node template to specify the set of worker nodes to start. Then click Start.
-
During the starting process, the state of the worker nodes changes from Not Deployed to Provisioning. If you want to track the provisioning progress, select a worker node, and then in the Details Pane, click the Provisioning Log tab.
Note:
-
The provisioning log updates infrequently while Windows Azure completes the deployment of the role instances.
-
You can cancel the provisioning of the group of Windows Azure nodes while the node health is Transitional.
-
If there were errors during the provisioning of one or more worker nodes, the state of those nodes is set to Unknown and the node health is set to Unapproved. To determine the reason for the failure, review the provisioning logs for the nodes. You can find additional information about the status of the role instances in the Windows Azure Management Portal.
-
After a worker node starts successfully, the node state changes to Offline.
-
To bring the nodes online, select the nodes that are in the Offline state, right-click, and then click Bring Online.
Note: In Windows HPC Server 2008 R2 with SP3 or later, you can bring some nodes online and start running jobs on them as soon as they have moved from the Provisioning state to the Offline state, even if other nodes in the group that you started to provision are still in the Provisioning state. In this case, the health of the whole group of nodes still appears as Transitional. You do not need to wait for the health of the nodes to transition to OK.
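The node states named in the procedure above can be summarized as a small state table. The following Python sketch is only an illustration of that lifecycle, not part of HPC Pack: the action labels (`start`, `provisioned`, `error`, `bring_online`) and the function names are hypothetical, chosen for this example. It models only the transitions that the steps describe.

```python
from enum import Enum


class NodeState(Enum):
    """Worker node states named in the procedure."""
    NOT_DEPLOYED = "Not Deployed"
    PROVISIONING = "Provisioning"
    OFFLINE = "Offline"
    ONLINE = "Online"
    UNKNOWN = "Unknown"


# Transitions described in the steps. The action names are hypothetical
# labels for this sketch; they are not HPC Pack commands.
TRANSITIONS = {
    (NodeState.NOT_DEPLOYED, "start"): NodeState.PROVISIONING,
    (NodeState.PROVISIONING, "provisioned"): NodeState.OFFLINE,
    (NodeState.PROVISIONING, "error"): NodeState.UNKNOWN,
    (NodeState.OFFLINE, "bring_online"): NodeState.ONLINE,
}


def next_state(state, action):
    """Return the state that follows `state` when `action` occurs."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"Action {action!r} is not valid in state {state.value!r}"
        ) from None


def nodes_ready_to_bring_online(nodes):
    """Return the names of nodes that are Offline (provisioning finished)."""
    return [name for name, state in nodes.items() if state is NodeState.OFFLINE]


if __name__ == "__main__":
    state = NodeState.NOT_DEPLOYED
    for action in ("start", "provisioned", "bring_online"):
        state = next_state(state, action)
        print(f"{action} -> {state.value}")
```

The helper `nodes_ready_to_bring_online` mirrors the SP3 behavior described in the note: each node that has reached Offline can be brought online on its own, regardless of the state of the other nodes in the group.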
-
If there is a problem with your Internet connection or with the connection information for Windows Azure in the template, the Windows Azure node deployment can fail. If you experience a connection problem, validate the connection for Windows Azure in the node template in Node Template Editor. For more information, see Step 4: Create a Windows Azure Worker Node Template earlier in this guide.
-
If you are running Windows HPC Server 2008 R2 with SP2 or later, you can run the Windows Azure Firewall Ports diagnostic test and the Windows Azure Services Connection diagnostic test. These tests help you verify that the network firewall and other settings are properly configured for communication between Windows HPC Server 2008 R2 and Windows Azure, and they can help you troubleshoot connectivity problems.
-
If you experience partial failures in deployment, where the worker nodes do not come online, you can try running the following telnet command to see whether the HPC service name is reachable at the Windows Azure endpoint:
telnet <ServiceName>.cloudapp.net 7999
Note: To run this command, the Telnet Client feature must be installed in the operating system. For information about how to install Telnet Client by using Server Manager, see the Telnet Operations Guide.
-
A problem in Windows Azure can affect a subset of the worker nodes that are in a set. For example, if you are starting a large number of worker nodes, it is possible for the deployment to fail on one or more nodes. In this case, you will see appropriate status information for the failed nodes in Node Management.
-
Deployment status information appears in the service account information in the Windows Azure Management Portal. HPC Cluster Manager regularly queries this portal for updated status information. However, the information in the portal can differ from that in the provisioning logs or the operations log in HPC Cluster Manager.
-
If a deployment error occurs in Windows Azure, an error message and troubleshooting information may appear in the service account information in the Windows Azure Management Portal. If you cannot resolve the problem, visit Windows Azure Support. To assist in troubleshooting the problem, be prepared to provide the subscription ID that is configured in the worker node template and the deployment ID that appears in the provisioning log in HPC Cluster Manager and in the portal.
-
After you have provisioned a set of nodes in Windows Azure, you can start an additional set of nodes using the same node template. However, in some cases, the additional nodes fail to come online in HPC Cluster Manager, but they appear to be deployed successfully in Windows Azure. If this occurs, it may not be possible to use HPC Cluster Manager to stop or delete the failed nodes. If necessary, first stop and restart the HPC Management Service. Then, to delete the nodes, you must use the Windows Azure Management Portal.
-
During the provisioning process, Windows HPC Server 2008 R2 automatically retrieves the primary access key for the storage account that is specified in the worker node template.
-
When role instances are deployed, file packages that were previously uploaded to the storage account using the hpcpack command are automatically installed. You can also upload file packages to storage after the worker nodes are started, and then manually install them on the worker nodes. For more information, see hpcpack.
-
If your cluster is updated with HPC Pack 2008 R2 SP2 or later and you configured a startup script, it runs on the nodes when they are provisioned. If you need to troubleshoot a startup script, you can view the log files on the Windows Azure nodes. For more information, see Appendix 2: Configure a Startup Script for Windows Azure Nodes.
-
If you want to remove a set of role instances in Windows Azure, stop the nodes by using HPC Cluster Manager (apply the Stop action). This deletes the role instances from the service and changes the state of the worker nodes in the HPC cluster to Not Deployed.
-
When Windows Azure nodes are brought online, the HPC Job Scheduler Service immediately tries to start jobs on the nodes. If only a portion of your workload can run on Windows Azure nodes, ensure that you update or create job templates to define which job types can run on those nodes. For example, to ensure that jobs submitted with a template run only on on-premises compute nodes, you can add the Node Groups property to the job template and select Compute Nodes as the required value.
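If Telnet Client is not installed, the reachability check that the telnet command in the considerations above performs can be approximated with a short script. The following Python sketch assumes only that the check amounts to opening a TCP connection to port 7999 at `<ServiceName>.cloudapp.net`. The function name `is_port_reachable` and the timeout value are choices made for this example and are not part of HPC Pack.

```python
import socket
import sys


def is_port_reachable(host, port=7999, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers name resolution failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    if len(sys.argv) == 2:
        # The argument is the service name configured in the node template.
        host = f"{sys.argv[1]}.cloudapp.net"
        status = "reachable" if is_port_reachable(host) else "not reachable"
        print(f"{host}:7999 is {status}")
    else:
        print("usage: python check_endpoint.py <ServiceName>")
```

Like the telnet test, a successful connection shows only that the endpoint accepts TCP connections on that port. It does not confirm that the deployment itself is healthy.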