
Best Practices for Large Deployments of Windows Azure Nodes with Microsoft HPC Pack

Updated: January 13, 2014

Applies To: Microsoft HPC Pack 2008 R2, Microsoft HPC Pack 2012, Microsoft HPC Pack 2012 R2, Windows HPC Server 2008 R2

Starting with HPC Pack 2008 R2 with Service Pack 1, Windows HPC cluster administrators and developers can increase the power of the on-premises cluster by adding computational resources on demand in Windows Azure. This HPC cluster “burst” scenario with Windows Azure nodes enables larger HPC workloads, sometimes requiring thousands of cores in addition to or in place of on-premises cluster resources. This topic provides guidance and best-practice recommendations to assist in planning and implementing a large deployment of Windows Azure nodes from an on-premises HPC Pack cluster. These recommendations should help minimize the occurrence of Windows Azure deployment timeouts, deployment failures, and loss of live instances.

Note
  • These best practices include recommendations for both the Windows Azure environment and the configuration of the on-premises head node. Most recommendations will also improve the behavior of smaller deployments of Windows Azure nodes. The exceptions are test deployments, where the performance and reliability of the head node services may not be critical, and very small deployments, where the head node services will not be highly stressed.

  • Many of the considerations for configuring an on-premises head node for a large Windows Azure deployment also apply to clusters that contain comparably large numbers of on-premises compute nodes.

  • These recommendations supplement the cluster, networking, and other requirements to add Windows Azure nodes to a Windows HPC cluster. For more information, see Requirements for Windows Azure Nodes.

  • These general recommendations may change over time and may need to be adjusted for your HPC workloads.


These recommendations are generally based on HPC Pack 2012, but they are also useful for large deployments performed with HPC Pack 2008 R2.

The following table lists the versions of HPC Pack and the related versions of Windows Azure SDK for .NET that these guidelines apply to.

 

HPC Pack                                      Windows Azure SDK
HPC Pack 2012 with Service Pack 1 (SP1)       Windows Azure SDK for .NET 2.0
HPC Pack 2012                                 Windows Azure SDK for .NET 1.8
HPC Pack 2008 R2 with Service Pack 4 (SP4)    Windows Azure SDK for .NET 1.7
HPC Pack 2008 R2 with Service Pack 3 (SP3)    Windows Azure SDK for .NET 1.6

A deployment of Windows Azure nodes for an HPC cluster is considered “large” when the configuration of the head node becomes a factor and when the deployment demands a significant percentage of the Windows Azure cluster resources that a single cloud service can use. A larger deployment risks deployment timeouts and the loss of live instances.

Important
Each Windows Azure subscription is allocated a quota of cores and other resources, which also affects your ability to deploy large numbers of Windows Azure nodes. At this time, the default quota of CPU cores per subscription is 20. To be able to deploy a large number of Windows Azure nodes, you might first need to contact Microsoft Support to request a core quota increase for your subscription.
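As a rough illustration of the quota check described above, the following sketch estimates how many cores a planned deployment consumes and compares that with a subscription quota. The core counts per virtual machine size are an assumption taken from the published Windows Azure size definitions of that period (they are not stated in this topic), the function names are hypothetical, and the sketch ignores the cores consumed by proxy nodes.

```python
# Assumed cores per virtual machine size (not listed in this topic; verify
# against the current Windows Azure size documentation before relying on them).
CORES_PER_SIZE = {
    "Small": 1,
    "Medium": 2,
    "Large": 4,
    "Extra Large": 8,
    "A6": 4,
    "A7": 8,
}

DEFAULT_CORE_QUOTA = 20  # default per-subscription quota quoted in this topic


def cores_required(vm_size: str, instances: int) -> int:
    """Cores consumed by `instances` role instances of the given size."""
    return CORES_PER_SIZE[vm_size] * instances


def fits_quota(vm_size: str, instances: int, quota: int = DEFAULT_CORE_QUOTA) -> bool:
    """True if the deployment fits within the subscription core quota."""
    return cores_required(vm_size, instances) <= quota


print(cores_required("Large", 100))  # 400
print(fits_quota("Large", 100))      # False
print(fits_quota("Small", 16))       # True
```

A result of False suggests that a quota increase should be requested before the deployment is attempted.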

The following table lists practical threshold numbers of role instances for a large deployment of Windows Azure nodes in a single cloud service. The threshold depends on the virtual machine size (predefined in Windows Azure) that is chosen for the Windows Azure role instances.

 

Virtual machine size    Number of role instances
A7 (see Note)           250
A6 (see Note)           250
Extra Large             250
Large                   500
Medium                  800
Small                   1000

Note
The A7 and A6 sizes are supported starting with HPC Pack 2012 with SP1.

For details about each virtual machine size, including the number of CPU cores and memory for each size, see Virtual Machine and Cloud Service Sizes for Windows Azure.

To deploy more than these threshold numbers of role instances in one service with high reliability usually requires the manual involvement of the Windows Azure operations team. To initiate this, contact your Microsoft sales representative, your Microsoft Premier Support account manager, or Microsoft Support. For more information about support plans, see Windows Azure Support.

Although there is no hard, enforceable limit that applies to all Windows Azure node deployments, 1000 instances per cloud service is a practical production limit.
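The thresholds in the table above can be captured in a small lookup for planning scripts. This is a minimal sketch: the dictionary values are copied from the table, while the function name and the idea of flagging a deployment for Windows Azure operations involvement are illustrative assumptions.

```python
# Practical thresholds of role instances per cloud service, from the table above.
LARGE_DEPLOYMENT_THRESHOLDS = {
    "A7": 250,
    "A6": 250,
    "Extra Large": 250,
    "Large": 500,
    "Medium": 800,
    "Small": 1000,
}


def exceeds_threshold(vm_size: str, instances: int) -> bool:
    """True if the instance count is above the practical threshold for the size."""
    return instances > LARGE_DEPLOYMENT_THRESHOLDS[vm_size]


print(exceeds_threshold("Large", 600))  # True
print(exceeds_threshold("Small", 400))  # False
```

A True result indicates a deployment for which you would likely need to contact Microsoft before deploying in a single cloud service.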

The following are general guidelines to successfully create and use large Windows Azure deployments with your HPC cluster.

Unless you have arranged to deploy to a dedicated Windows Azure cluster in a data center, the most important recommendation is to tell the Windows Azure operations team (through a Microsoft Support channel) ahead of time that you need a large amount of capacity, and to plan deployments accordingly so that capacity is not the bottleneck. This is also an opportunity to obtain additional guidance about deployment strategies beyond the ones that are described in this topic.

We recommend splitting large deployments into several smaller-sized deployments, by using multiple cloud services, for the following reasons:

  • To allow flexibility in starting and stopping groups of nodes.

  • To make it possible to stop idle instances after jobs have finished.

  • To facilitate finding available nodes in the Windows Azure clusters, especially when Extra Large instances are used.

  • To enable the use of multiple Windows Azure data centers for disaster recovery or business continuity scenarios.

There is no fixed limit on the size of a cloud service, but general guidance is fewer than 500 to 700 virtual machine instances or fewer than 1000 cores. Larger deployments would risk deployment timeouts, losing live instances, and problems with virtual IP address swapping.

The maximum tested number of cloud services for a single HPC cluster overall is 32.
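The splitting guidance above can be sketched as a small planning helper that divides a total node count into near-equal groups, one per cloud service. The default cap of 500 instances per service is an assumption taken from the lower end of the "500 to 700" guidance, the function name is hypothetical, and the sketch does not check the separate 1000-core guidance.

```python
import math
from typing import List

MAX_TESTED_CLOUD_SERVICES = 32  # maximum tested per HPC cluster, from this topic


def split_deployment(total_nodes: int, max_per_service: int = 500) -> List[int]:
    """Split total_nodes into near-equal groups of at most max_per_service."""
    if max_per_service <= 0:
        raise ValueError("max_per_service must be positive")
    if total_nodes <= 0:
        return []
    services = math.ceil(total_nodes / max_per_service)
    if services > MAX_TESTED_CLOUD_SERVICES:
        raise ValueError("plan needs more cloud services than the tested maximum")
    # Spread the remainder so that group sizes differ by at most one node.
    base, extra = divmod(total_nodes, services)
    return [base + 1] * extra + [base] * (services - extra)


print(split_deployment(1800))  # [450, 450, 450, 450]
print(split_deployment(1201))  # [401, 400, 400]
```

Near-equal groups keep each cloud service comfortably below the cap, rather than filling some services completely and leaving a small remainder in the last one.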

Note
You may encounter limitations in the number of cloud services and role instances that you can manage through HPC Pack or the Windows Azure Management Portal.

Dependencies on other services and other geographic requirements may be unavoidable, but it helps if your Windows Azure deployment is not tied to a specific region or geography. However, we do not recommend placing multiple deployments in different geographic regions unless they have external dependencies in those regions.

Having strict dependencies on a certain virtual machine size (for example, Extra Large) can impact the success of deployments at a large scale. Having flexibility to adjust or even mix-and-match virtual machine sizes to balance instance counts and cores can help.

It is recommended to use different Windows Azure storage accounts for simultaneous large Windows Azure node deployments and for custom applications. For certain applications that are constrained by I/O, use several storage accounts. Additionally, as a best practice, a storage account that is used for a Windows Azure node deployment should not be used for purposes other than node provisioning. For example, if you plan to use Windows Azure storage to move job and task data to and from the head node or to and from the Windows Azure nodes, configure a separate storage account for that purpose.

Note
You incur charges for the total amount of data stored and for the storage transactions on the Windows Azure storage accounts, independent of the number of Windows Azure storage accounts. However, each subscription limits the total number of storage accounts. If you need additional storage accounts in your subscription, contact Windows Azure Support.

Proxy nodes are Windows Azure worker role instances that are automatically added to each Windows Azure node deployment from an HPC cluster to facilitate communication between on-premises head nodes and the Windows Azure nodes. The demand for resources on the proxy nodes depends on the number of nodes deployed in Windows Azure and the jobs running on those nodes. You should generally increase the number of proxy nodes in a large Windows Azure deployment.

Note
  • The proxy role instances incur charges in Windows Azure along with the Windows Azure node instances.

  • The proxy role instances consume cores that are allocated to the subscription and reduce the number of cores that are available to deploy Windows Azure nodes.

HPC Pack 2012 introduced HPC management tools for you to configure the number of proxy nodes in each Windows Azure node deployment (cloud service). (In HPC Pack 2008 R2, the number is automatically set at 2 proxy nodes per deployment.) The number of role instances for the proxy nodes can also be scaled up or down by using the tools in the Windows Azure Management Portal, without redeploying nodes. The recommended maximum number of proxy nodes for a single deployment is 10.

Larger or heavily used deployments may require more than the number of proxy nodes listed in the following table, which is based on a CPU utilization below 50 percent and bandwidth less than the quota.

 

Windows Azure nodes per cloud service    Number of proxy nodes
<100                                     2
100 - 400                                3
400 - 800                                4
800 - 1000                               5

For more information about proxy node configuration options, see Set the Number of Windows Azure Proxy Nodes.
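The proxy node table can be expressed as a simple lookup. This is a sketch under one stated assumption: the table's ranges share their boundary values (400 and 800 each appear in two rows), so the code assigns a boundary value to the lower row. The function name is hypothetical, and the result is only a starting point, because heavily used deployments may need more proxy nodes, up to the recommended maximum of 10.

```python
def recommended_proxy_nodes(azure_nodes: int) -> int:
    """Starting number of proxy nodes for one cloud service, per the table."""
    if azure_nodes < 0:
        raise ValueError("azure_nodes must not be negative")
    if azure_nodes < 100:
        return 2
    if azure_nodes <= 400:   # assumption: boundary 400 belongs to this row
        return 3
    if azure_nodes <= 800:   # assumption: boundary 800 belongs to this row
        return 4
    if azure_nodes <= 1000:
        return 5
    raise ValueError("the table does not cover more than 1000 nodes per cloud service")


print(recommended_proxy_nodes(50))   # 2
print(recommended_proxy_nodes(400))  # 3
print(recommended_proxy_nodes(950))  # 5
```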

Large deployments of Windows Azure nodes can place significant demands on the head node (or head nodes) of a cluster. The head node performs several tasks to support the deployment:

  • Accesses proxy node instances that are created in a Windows Azure deployment to facilitate communication with the Windows Azure nodes (see Adjust the number of proxy node instances to support the deployment, in this topic).

  • Accesses Windows Azure storage accounts for blob (such as runtime packages), queue, and table data.

  • Manages the heartbeat interval and responses, the number of proxies (starting with HPC Pack 2012), the number of deployments, and the number of nodes.

As Windows Azure deployments grow in size and throughput, the stress put on the HPC cluster head node increases. In general, the key elements necessary to ensure your head node can support the deployment are:

  • Sufficient RAM

  • Sufficient disk space

  • An appropriately sized, well-maintained SQL Server database for the HPC cluster databases

The following are suggested minimum specifications for a head node to support a large Windows Azure deployment:

  • 8 CPU cores

  • 2 disks

  • 16 GB of RAM

For large deployments we recommend that you install the cluster databases on a remote server that is running Microsoft SQL Server, instead of installing the cluster databases on the head node. For general guidelines to select and configure an edition of SQL Server for the cluster, see Database Capacity Planning and Tuning for Microsoft HPC Pack.

As a general best practice for most production deployments, we recommend that head nodes not be configured with an additional cluster role (compute node role or WCF broker node role). Having the head node serve more than one purpose may prevent it from successfully performing its primary management role. To change the roles performed by your head node, first take the node offline by using the action in Node Management in HPC Cluster Manager. Then, right-click the head node, and click Change Role.

Additionally, moving cluster storage off the head node helps ensure that the head node does not run out of space and continues to operate effectively.

When the head node is operating under a heavy load, its performance can be negatively impacted by having many users connected with remote desktop connections. Rather than having users connect to the head node by using Remote Desktop Services (RDS), users and administrators should install the HPC Pack Client Utilities on their workstations and access the cluster by using these remote tools.

For large deployments, performance counter collection and event forwarding can put a large burden on the HPC Management Service and SQL Server. For these deployments, it may be desirable to disable these capabilities by using the HPC cluster management tools. For example, set the CollectCounters cluster property to false by using the Set-HpcClusterProperty HPC PowerShell cmdlet. There may be a tradeoff between improved performance and collecting metrics that may help you troubleshoot issues that arise.

To ensure a minimal hardware footprint from the operating system, and as a general HPC cluster best practice, disable any operating system services that are not required for operation of the HPC cluster. We especially encourage disabling any desktop-oriented features that may have been enabled.

Although HPC Pack allows quick configuration of the Routing and Remote Access service (RRAS) running on the head node to provide network address translation (NAT) and to allow compute nodes to reach the enterprise network, this may make the head node a significant bottleneck for network bandwidth and may also affect its performance. As a general best practice for larger deployments or deployments with significant traffic between compute nodes and the public network, we recommend one of the following alternatives:

  • Provide a direct public network connection to each compute node.

  • Provide a dedicated NAT router, such as a separate server that runs a Windows Server operating system, is dual-homed on the two networks, and runs RRAS.

The TtlCompletedJobs property of the cluscfg command and the Set-HpcClusterProperty HPC cmdlet control how long completed jobs remain in the SQL Server database for the HPC cluster. Setting a large value for this property ensures that job information is maintained in the system for a long time, which may be desirable for reporting purposes. However, a large number of jobs in the system will increase the storage and memory requirements of the system, since the database (and queries against it) will generally be larger.

HPC Pack uses a heartbeat signal to verify node availability. A compute node's lack of response to this periodic health probe by the HPC Job Scheduler Service determines if the node will be marked as unreachable. By configuring heartbeat options in Job Scheduler Configuration in HPC Cluster Manager, or by using the cluscfg command or the Set-HpcClusterProperty HPC cmdlet, the cluster administrator can set the frequency of the heartbeats (HeartbeatInterval) and the number of heartbeats that a node can miss (InactivityCount) before it is marked as unreachable. For example, the default HeartbeatInterval of 30 seconds could be increased to 2 minutes when the cluster includes a large Windows Azure deployment. The default InactivityCount is set to 3, which is suitable for some on-premises deployments, but it should be increased to 10 or more when Windows Azure nodes are deployed.

Note
Starting with HPC Pack 2012 with SP1, the number of missed heartbeats is configured separately for on-premises nodes and Windows Azure nodes. The InactivityCountAzure cluster property configures the number of missed heartbeats after which worker nodes that are deployed in Windows Azure are considered unreachable by the cluster. The default value of InactivityCountAzure is set to 10. Starting with HPC Pack 2012 with SP1, the InactivityCount property applies exclusively to on-premises nodes.
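To see the effect of these two settings together, the following sketch estimates how long a node can be silent before it is marked unreachable, as the heartbeat interval multiplied by the number of missed heartbeats allowed. This is an approximation for planning purposes; the exact timing in the HPC Job Scheduler Service may differ, and the function name is hypothetical.

```python
def seconds_until_unreachable(heartbeat_interval_s: int, inactivity_count: int) -> int:
    """Approximate silence (in seconds) tolerated before a node is marked unreachable."""
    return heartbeat_interval_s * inactivity_count


# Defaults quoted in this topic: 30-second interval, 3 missed heartbeats.
print(seconds_until_unreachable(30, 3))    # 90

# Suggested values for a large Windows Azure deployment: 2 minutes, 10 missed.
print(seconds_until_unreachable(120, 10))  # 1200
```

With the suggested values, a node is tolerated for roughly 20 minutes of silence instead of about 90 seconds, which trades slower detection of failed nodes for fewer false "unreachable" reports.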

If the head node or WCF broker nodes are configured for high availability in a failover cluster, you should also consider the heartbeat signal used by each failover cluster computer to monitor the availability of the other computer (or computers) in the failover cluster. By default, heartbeats are sent once every second, and if a computer misses five heartbeats, communication with that computer is considered to have failed. You can use Failover Cluster Manager to decrease the frequency of heartbeats, or increase the number of missed heartbeats, in a cluster with a large Windows Azure deployment.

If you are running service-oriented architecture (SOA) jobs on the Windows Azure nodes, you may need to adjust monitoring timeout settings in the service registration file to manage large sessions. For more information about the SOA service configuration file, see SOA Service Configuration Files in Windows HPC Server 2008 R2.

Starting with HPC Pack 2008 R2 with SP2, you can set a registry key on the head node computer to improve the performance of diagnostic tests, clusrun operations, and the hpcfile utility on large deployments of Windows Azure nodes. To do this, add a new DWORD value called FileStagingMaxConcurrentCalls in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\HPC. We recommend that you configure a value that is between 50% and 100% of the number of Windows Azure nodes that you plan to deploy. To complete the configuration, after you set the FileStagingMaxConcurrentCalls value, you must stop and then restart the HPC Job Scheduler Service.

Caution
Incorrectly editing the registry may severely damage your system. Before making changes to the registry, you should back up any valued data on the computer.
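The recommended range for FileStagingMaxConcurrentCalls (between 50% and 100% of the planned number of Windows Azure nodes) can be computed as follows. This sketch only calculates candidate values and does not touch the registry; the function name is hypothetical, and rounding the lower bound up is an assumption.

```python
from typing import Tuple


def file_staging_range(planned_azure_nodes: int) -> Tuple[int, int]:
    """Return (low, high) candidate values for FileStagingMaxConcurrentCalls."""
    if planned_azure_nodes <= 0:
        raise ValueError("planned_azure_nodes must be positive")
    low = (planned_azure_nodes + 1) // 2  # 50%, rounded up (assumption)
    high = planned_azure_nodes            # 100%
    return low, high


print(file_staging_range(300))  # (150, 300)
print(file_staging_range(301))  # (151, 301)
```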

© 2014 Microsoft. All rights reserved.