Troubleshoot Deployments of Azure Nodes with Microsoft HPC Pack
Updated: January 13, 2014
Applies To: Microsoft HPC Pack 2012, Microsoft HPC Pack 2012 R2
This topic contains information to help you or Microsoft Support troubleshoot deployments of Azure nodes with HPC Pack.
For general requirements and best practices to deploy Azure nodes with HPC Pack, see the following:
If there is a problem with your Internet connection or with the Azure subscription information provided in the node template, the Azure node deployment can fail. You can validate the connection settings for Azure in the node template. Open the template in Node Template Editor. Then, on the Connection Information tab, click Validate connection information.
If there is a problem with the configuration of the Azure management certificate, see Troubleshoot certificate problems.
If you are running at least HPC Pack 2008 R2 with SP2, you can run the Azure Firewall Ports diagnostic test and the Azure Services Connection diagnostic test to help verify that the network firewall and other settings are properly configured for communication between HPC Pack and Azure or to troubleshoot connectivity problems.
If the system time is not set accurately on the head node computer (or head node computers), certain Azure operations, such as creating a node template or deploying new nodes, can fail with an error similar to the following:
Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
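Because an inaccurate clock surfaces only indirectly through this authentication error, it can help to measure how far the head node clock has drifted before retrying the operation. The following Python sketch compares a local timestamp with the value of an HTTP Date header obtained from a server whose clock you trust. The function names and the 900-second default tolerance are illustrative assumptions and are not values taken from HPC Pack or Azure documentation.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def clock_skew_seconds(reference_date_header, local_now=None):
    """Return local time minus reference time, in seconds.

    reference_date_header is an HTTP Date header value, for example
    "Mon, 13 Jan 2014 12:00:00 GMT", taken from a trusted server.
    """
    reference = parsedate_to_datetime(reference_date_header)
    if local_now is None:
        # Use the current system time in UTC when no timestamp is supplied.
        local_now = datetime.now(timezone.utc)
    return (local_now - reference).total_seconds()


def is_clock_acceptable(skew_seconds, tolerance_seconds=900):
    # tolerance_seconds is an assumed threshold, for illustration only.
    return abs(skew_seconds) <= tolerance_seconds
```

A positive result means that the local clock is ahead of the reference. If the skew is large, correct the system time on the head node and then retry the failed operation.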
If you experience partial failures in deployment, where the Azure nodes do not come online, you can try running the following telnet command to see whether the cloud service specified in the node template is reachable at the Azure endpoint:
telnet <ServiceName>.cloudapp.net 7999
To run this command, the Telnet Client feature must be installed in the operating system. For information about how to install Telnet Client by using Server Manager, see the Telnet Operations Guide.
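If Telnet Client is not available, the same reachability check can be approximated with a plain TCP connection attempt. The following Python sketch is one way to do this. The function name `can_connect` and the 5-second timeout are assumptions chosen for illustration, and a successful connection shows only that the port accepts TCP connections, not that the deployment is healthy.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, calling `can_connect("<ServiceName>.cloudapp.net", 7999)` with the cloud service name from the node template mirrors the telnet command shown above.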
A problem in Azure can affect a subset of the Azure nodes that are in a set. For example, if you are starting a large number of nodes, it is possible for the deployment to fail on one or more nodes. In this case, you will see appropriate status information for the failed nodes in Node Management.
Deployment status information appears in the service account information in the Azure Management Portal. HPC Cluster Manager regularly queries this portal for updated status information. However, the information in the portal can differ from that in the provisioning logs or the operations log in HPC Cluster Manager.
If a deployment error occurs in Azure, an error message and troubleshooting information may appear in the cloud service information in the Azure Management Portal, or in the provisioning log in HPC Cluster Manager. If you cannot resolve the problem, you can review the trace logs that are generated on the role instances in the deployment. For more information, see Trace log files on Azure nodes in this topic.
You can also visit Azure Support. To assist in troubleshooting the problem, be prepared to provide the subscription ID that is configured in the node template and the deployment ID that appears in the provisioning log in HPC Cluster Manager and in the portal.
After you have provisioned a set of nodes in Azure, you can start an additional set of nodes using the same node template. However, in some cases, the additional nodes fail to come online in HPC Cluster Manager, but they appear to be deployed successfully in Azure. If this occurs, it may not be possible to use HPC Cluster Manager to stop or delete the failed nodes. If necessary, first stop and restart the HPC Management Service. Then, to delete the nodes, use the Azure Management Portal.
Starting with HPC Pack 2012 with SP1, to help troubleshoot Azure node deployments, you can opt to collect data on the head node about the availability, connectivity, and performance of Azure nodes and send that data to Microsoft. You might choose to do this if you need to open a support incident related to an Azure node deployment. To enable data collection, in HPC Cluster Manager, on the Options menu, click Azure Support Data Collection. Alternatively, configure the AzureMetricsCollectionEnabled cluster property by using the Set-HpcClusterProperty HPC PowerShell cmdlet. For more information about the data collection, see the Microsoft HPC Pack Privacy Statement.
Starting with HPC Pack 2008 R2 with SP4, trace log files are generated automatically on Azure worker nodes, and on the Azure HPC proxy nodes that are automatically provisioned for each deployment. The log files can help you or Microsoft Support troubleshoot issues during or after node provisioning – for example, conditions that can cause an Azure node to show a health state of Unreachable or Error, even though the Azure Management Portal might indicate a status of Ready.
The trace log files contain the following types of information about each node:
Bootstrapping information for the operating system.
Information about the HPC Pack services that should run on the node.
Information about the Hosts file.
Operating system performance counter data.
The log files are written to local storage on each node, as shown in the following table. The formats, characteristics, and naming of the trace log files depend on the version of HPC Pack.
The log files are only maintained in local storage on the Azure role instances while the nodes remain provisioned in Azure. Unless the files or data are copied to another location, you will not be able to review the trace log information after the Azure nodes are stopped or deleted. For more information, see Scenarios to store trace log data in this topic.
Version of HPC Pack
HPC Pack 2012
HPC Pack 2008 R2 with SP4
You can use the Configure settings for the cloud service in the Azure Management Portal to change the tracing level for specific processes on the Azure nodes (such as Microsoft.Hpc.Azure.AzureNodeManagerTracing).
The trace log files generated on Azure role instances remain in local storage on the role instances as long as the role instances are running. However, if you want to access the data after an Azure deployment is stopped or the nodes are deleted, you need to download or store the trace log files or data in persistent storage, such as Azure storage, while the role instances are running. The following are scenarios to store trace log files or data.
Starting with HPC Pack 2012 with SP1, the HPC cluster administrator can optionally enable the automatic transfer of trace log files from the Azure compute or proxy nodes in a deployment to a container in blob storage (hpclogs) in the Azure storage account for the deployment.
To enable automatic transfer of trace log files to blob storage in the Azure storage account, in HPC Cluster Manager, on the Options menu, click Azure Deployment Configuration. You can also set the AzureLogstoBlob HPC cluster property by using the Set-HpcClusterProperty HPC PowerShell cmdlet. You can choose to transfer logs for proxy nodes, worker nodes, or both. By default, transfer of log files to blob storage is disabled. Changing the AzureLogstoBlob property affects the transfer of log files only for future Azure node deployments. Current deployments are not affected. For more information, see Set-HpcClusterProperty.
Saving Azure deployment log files in blob storage uses storage space and generates storage transactions on the storage account associated with each deployment. If enabled, saving log files from worker nodes can affect the performance of all Azure deployments that use the same storage account, especially if you have large deployments, or several concurrent deployments. The storage space and the storage transactions will be billed to your account. After you disable transfer of log files, the log files will not be automatically removed from Azure storage. You may want to keep the log files for future reference by downloading them. The log files can be cleaned up by removing the hpclogs container from your storage account.
You can run the hpcazurelog command on the head nodes to download data from blob storage in the storage account to a local folder and to delete the files from blob storage. For more information, see hpcazurelog.
Starting with HPC Pack 2012, the HPC cluster administrator can optionally enable the transfer of trace log data from the Azure nodes in a deployment to an Azure diagnostics table (WADLogsTable) created for this purpose in the Azure storage account for the deployment.
To enable transfer of trace log data to the WADLogsTable table in the Azure storage account, set the AzureLoggingEnabled HPC cluster property to true by using the Set-HpcClusterProperty HPC PowerShell cmdlet. By default, only Critical, Error, and Warning events in the log files are included in the WADLogsTable table. Changing the AzureLoggingEnabled property affects logging only for future Azure node deployments. Current deployments are not affected. For more information, see Set-HpcClusterProperty.
Starting with HPC Pack 2012 with SP1, you can run the hpcazurelog command on the head node to download data from the WADLogsTable in the storage account to a local folder, and to specify the trace level of the data selected for storage in the table. For more information, see hpcazurelog.
To facilitate further analysis, you can manually download log files from Azure nodes to an on-premises computer, or upload them to an Azure storage account.
To download the log files, you can use one of the following procedures:
Run the hpcfile get command to download log files from each node individually.
Run a script that uses hpcfile get to download files from groups of worker nodes.
Use the Azure Management Portal to connect remotely to each node individually. You can then copy the log file or files to a local computer.
Run the hpcazurelog command on the head node to download files from Azure worker nodes or proxy nodes. This command was introduced in HPC Pack 2012 with SP1 and is not supported in previous versions. For more information, see hpcazurelog.
The following are example commands and scripts that use hpcfile get to download the log files from Azure worker nodes. For more information about command syntax, see hpcfile.
Example 1. To download the trace log files, including possible overflow files, from the Azure node AZURECN-001 on a cluster with an HPC Pack 2008 R2 with SP4 head node named myHeadNode to the current folder on the local computer, renaming the files to avoid overwriting files on the local computer:
hpcfile get /scheduler:myHeadNode /targetnode:AZURECN-001 /file:"C:\logs\hpcworker.log" /destfile:"worker001.log"
hpcfile get /scheduler:myHeadNode /targetnode:AZURECN-001 /file:"C:\logs\hpcworker.log.001" /destfile:"worker002.log"
hpcfile get /scheduler:myHeadNode /targetnode:AZURECN-001 /file:"C:\logs\hpcworker.log.002" /destfile:"worker003.log"
hpcfile get /scheduler:myHeadNode /targetnode:AZURECN-001 /file:"C:\logs\hpcworker.log.003" /destfile:"worker004.log"
hpcfile get /scheduler:myHeadNode /targetnode:AZURECN-001 /file:"C:\logs\hpcworker.log.004" /destfile:"worker005.log"
hpcfile get /scheduler:myHeadNode /targetnode:AZURECN-001 /file:"C:\logs\hpcworker.log.005" /destfile:"worker006.log"
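The commands in Example 1 differ only in the overflow suffix of the source file and the number in the destination name, so they can be generated rather than typed by hand. The following Python sketch builds the same command strings. The function name `hpcfile_get_commands` and its parameters are hypothetical, and the sketch only produces text by using the hpcfile get options already shown in Example 1; it does not run the commands.

```python
def hpcfile_get_commands(scheduler, node, base_file, overflow_count, dest_prefix):
    """Build one hpcfile get command for the base log and each overflow file."""
    commands = []
    for index in range(overflow_count + 1):
        # The base file has no suffix; overflow files end in .001, .002, ...
        suffix = "" if index == 0 else ".{:03d}".format(index)
        # Destination names are numbered from 001 to avoid overwriting files.
        dest = "{}{:03d}.log".format(dest_prefix, index + 1)
        commands.append(
            'hpcfile get /scheduler:{} /targetnode:{} /file:"{}{}" /destfile:"{}"'.format(
                scheduler, node, base_file, suffix, dest
            )
        )
    return commands
```

Calling the function with `"myHeadNode"`, `"AZURECN-001"`, the path `C:\logs\hpcworker.log`, an overflow count of 5, and the prefix `"worker"` yields the six commands listed in Example 1.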
Example 2. To download the hpcworker_000000.bin log files from the Azure nodes in node group WorkerNodes with names beginning AZURECN on a cluster with an HPC Pack 2012 head node named myHeadNode to the C:\myFiles\myLogs folder on the local computer:
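@echo off
set "extension=.bin"
set "fullfilepath=C:\myFiles\myLogs"
REM Create the destination folder on the local computer
mkdir "%fullfilepath%"
REM Download the log file from each listed node whose name contains AZURECN-, saving it as <node name>.bin
FOR /F "tokens=1" %%G IN ('node list /group:WorkerNodes ^| FIND "AZURECN-"') DO hpcfile get /scheduler:myHeadNode /targetnode:%%G /file:"C:\logs\hpcworker_000000.bin" /destfile:"%fullfilepath%\%%G%extension%"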
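The selection and naming logic of Example 2 can also be sketched in Python, which makes the two steps easier to test separately: picking node names out of a listing, and building a per-node destination path. The function names are hypothetical, and the layout of the listing text (node name as the first whitespace-separated token on each line) is an assumption that mirrors how the batch script reads the output of node list.

```python
import ntpath


def nodes_matching(listing, marker):
    """Return the first token of each non-empty line that contains marker."""
    names = []
    for line in listing.splitlines():
        tokens = line.split()
        if tokens and marker in line:
            names.append(tokens[0])
    return names


def destination_path(folder, node, extension):
    # ntpath applies Windows path rules on any platform and inserts
    # the backslash between the folder and the file name.
    return ntpath.join(folder, node + extension)
```

Naming each downloaded file after its node keeps the files distinct, because the log file has the same name on every worker node.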
You can use one of the following procedures to upload the trace log files from Azure worker nodes to an Azure storage account:
Download one or more log files to a local computer as described in the previous section, and then upload them to an Azure storage account by running the hpcpack upload command.
Run a script on one or more Azure nodes that uses hpcpack upload to upload the log files directly to the storage account.
To run a script on a group of Azure nodes, you can first upload the script from a local computer to the nodes.
As described in Scenario 1: Enable automatic transfer of trace log files to Azure blob storage, starting with HPC Pack 2012 with SP1, you can enable automatic transfer of trace log files to blob storage in the Azure storage account. However, if you are not using a version of HPC Pack that supports this capability, or you have not enabled automatic transfer of log files to blob storage, you can manually upload them to that location.
The following are example scripts that use hpcpack upload to upload the log files from Azure worker nodes to the Azure storage account. For more information about the command syntax, see hpcpack.
Because log files on worker nodes are named identically, you should avoid overwriting files when you upload them to the Azure storage account. For example, you can rename the log files with names that include the host name of the node, as shown in the following examples.
Example 3. To rename the hpcworker_000000.bin files from Azure worker nodes and upload them to the container MyLogs in the Azure storage account named MyStorageAccount with a primary key named MyPrimaryKey:
@echo off
REM Get the host name of the Azure node
FOR /F "usebackq" %%i IN (`e:\approot\mpiexec.exe -c 1 hostname`) DO SET filename=%%i
set "extension=.bin"
set "fullpath=C:\logs"
REM Build the renamed log file path (e.g., C:\logs\AzureCN-001.bin)
set "fullfilePath=%fullpath%\%filename%%extension%"
REM echo:%fullfilePath%
REM Create a temporary copy of the log file with the desired name
copy C:\logs\hpcworker_000000.bin %fullfilePath%
e:\approot\hpcpack upload %fullfilePath% /account:MyStorageAccount /container:MyLogs /key:MyPrimaryKey
REM Remove the temporary copy after the upload
del %fullfilePath%
Example 4. To upload a script Uploader.bat (similar to the script in Example 3) from the head node to a container named MyLogs in an Azure storage account named MyStorageAccount, download the script to Azure nodes in the node group named WorkerNodes, and then run Uploader.bat on the nodes in WorkerNodes:
hpcpack upload uploader.bat /account:MyStorageAccount /container:MyLogs /key:MyPrimaryKey
clusrun /nodegroup:WorkerNodes hpcpack download uploader.bat /account:MyStorageAccount /container:MyLogs /key:MyPrimaryKey /path:c:\logs
clusrun /nodegroup:WorkerNodes c:\logs\uploader.bat
clusrun /nodegroup:WorkerNodes del c:\logs\uploader.bat