Capacity Planning Windows Azure Services for Windows Server

Contents

Capacity Planning Windows Azure Services for Windows Server Components

Performance Analysis: Load Testing the Service Management Tenant Portal and Service Management Tenant API

How to determine if a Service Management Portal or Service Management API is CPU bound

General Recommendations when Capacity Planning for the Service Management Portals and the Service Management APIs

Capacity Planning for Windows Azure Services Web Site Cloud

General Recommendation when Capacity Planning for Windows Azure Services Web Site Cloud

Capacity Planning for Windows Azure Services VM Cloud

General Recommendation when Capacity Planning for Windows Azure Services VM Cloud

Windows Azure Services for Windows Server 2012 allows hosting providers to use new technologies to quickly build and deploy private clouds for hosting web sites, databases, and Virtual Machines that subscribing Tenants can use as needed for running their applications. Because private cloud hosting solutions can encompass a very wide range of processing requirements, the testing teams at Microsoft have developed some general recommendations to ensure that private clouds are provisioned with sufficient resources to meet demand without costly overprovisioning and waste. Ideally, a private cloud would be provisioned with precisely enough capacity to handle all client requests, with no excess capacity. This is not a feasible goal because client processing requests are somewhat unpredictable, so a private cloud solution should be provisioned with some surplus capacity to handle occasional demand spikes. Provisioning should therefore aim to satisfy customer demand, including spikes, while avoiding resources that are seldom if ever used. Overprovisioning that leads to unused infrastructure can be as expensive as the customer dissatisfaction and lost opportunity associated with underprovisioning.

Capacity Planning Windows Azure Services for Windows Server Components

When capacity planning for Windows Azure Services you should be aware of the fundamental components of Windows Azure Services and then focus capacity planning on each of these components individually. The essential components of Windows Azure Services for Windows Server include:

  1. Service Management Admin Portal – IIS Web Site used by hosting service providers to configure and manage their Windows Azure Services for Windows Server environment. The Service Management Admin Portal can be scaled out to additional instances as dictated by load requirements.

  2. Service Management Tenant Portal – IIS Web Site used by Tenants, or end users that subscribe to the cloud services offered by hosting service providers. The Service Management Tenant Portal can be scaled out to additional instances to handle additional load.

  3. Service Management API – Exposed as a RESTful web service that accepts REST API calls from resource providers. All REST API calls from resource providers to the Service Management API must implement the RESTful contracts described by the Service Management API layer. The Service Management API comprises three different levels of functionality:

    1. Service Management Admin API – Exposes the highest possible level of access to the Service Management API.

    2. Service Management Tenant API – Exposes the level of access to the Service Management API necessary for Tenants to manage the cloud services included in any Plan(s) that they subscribe to.

    3. Service Management Tenant Public API – Also exposes the level of access to the Service Management API necessary for Tenants to manage the cloud services in any Plans that they subscribe to. The Service Management Tenant Public API uses a different authentication mechanism than the Service Management Tenant API and would be the recommended Tenant API to expose outside of a firewall if necessary.

  4. Infrastructure required to support the Service Management Portals and Service Management API:

    1. IIS Support - Because the Portals and the API run in instances of W3WP.exe, each instance of a Service Management Portal or Service Management API must run on a computer or VM with Internet Information Services (IIS) installed.

    2. SQL Server Support – Windows Azure Services for Windows Server keeps track of all runtime and configuration information in SQL Server databases. Therefore a computer or VM with SQL Server or SQL Server Express installed must be available when installing Windows Azure Services for Windows Server.

Performance Analysis: Load Testing the Service Management Tenant Portal and Service Management Tenant API

The Test team at Microsoft performed load testing of the Tenant Portal and API with Visual Studio Load Test. The following hardware configuration was used for the test:

Load Test Hardware Configuration

Four Hyper-V virtual machines ran on the same Hyper-V host computer, each allocated 4 GB of RAM and 2 cores. The following components were installed on these virtual machines:

  1. Service Management Tenant Portal

  2. Service Management Admin Portal

  3. Service Management Tenant API

  4. Service Management Admin API, Resource providers for SQL Server and MySQL Server and the WASfWS Usage Service.

Three virtual machines were provisioned for the following purposes:

  1. Hosting the SQL Server instance required for the WASfWS databases.

  2. Hosting the SQL Server instance required for the SQL Server resource provider.

  3. Hosting the MySQL Server instance required for the MySQL Server resource provider.

Load Test Configuration

All load testing was performed with Visual Studio Load Test configured with 5 Test Agents to ensure that sufficient load could be generated.

Load test scenario mix:

  • Tenant Portal Dashboard Load – 75%

  • Create web site – 5%

  • Delete web site – 3%

  • Update web site configuration – 3%

  • Create SQL database – 5%

  • Create MySQL database – 5%

  • Delete SQL database – 2%

  • Delete MySQL database – 2%

Other Load Test parameters:

  • Load test duration – 3 hours

  • # of tenants/end users before load testing started – 8000

  • # of subscriptions before load testing started (each tenant was allocated a single subscription) – 8000

  • # SQL Server tenant/end user databases configured at start of load test – 6500

  • # MySQL Server tenant/end user databases configured at start of load test – 6500

  • Maximum size in MB specified for all tenant/end user databases – 50 MB

  • # tenant/end user web sites configured at start of load test – 5000 web sites

  • # tenants/end users added during load test – 2000

  • # of subscriptions added during load test (each new tenant/end user was allocated a single subscription) – 2000

  • # tenants/end users at end of load test – 10000

  • # subscriptions at end of load test – 10000

Load Test Results for Tenant Portal

The table below displays the maximum concurrent users and # of Requests per second observed as the number of Tenant Portal instances was increased. For this load test, end users/tenants were configured to initiate a new request every minute.

The average CPU utilization observed on each Tenant portal instance was about 80% and responses to page requests were processed within 5 seconds. Response times for client posts to the Tenant portal which involved performing tasks such as creating web sites or creating end user databases were not included when measuring page request response times.

Maximum Concurrent Users    # of Tenant Portal Instances    Requests per Second
                            (2 Cores / 4 GB RAM per VM)
3000                        1                               97
5000                        2                               165
7500                        4                               214
10000                       6                               258

Tenant Portal Load Test Observations

  1. A single instance of a tenant portal was observed processing 97 requests per second with a response time of less than 5 seconds with 3000 concurrent users. A consistent response time of less than 5 seconds is considered satisfactory performance for most usage scenarios.

    Note

    The calculated page request response time does not include response times for client posts to the Tenant Portal which involve performing tasks such as creating web sites or creating end user databases.

  2. A throughput increase of roughly 70% (from 97 to 165 requests per second) was observed after scaling out the number of tenant portal instances from one to two.

  3. While adding tenant portal instances increases throughput, the increase is non-linear so each additional portal instance provided relatively less throughput improvement.

  4. Based upon observed performance metrics, 6 tenant portal instances will support 10000 concurrent users. For most production deployments this would be considered a very high load scenario.
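The diminishing return from each added portal instance can be seen by dividing the observed requests per second by the number of instances. The following is a minimal sketch that uses only the figures from the Tenant Portal results table; the function and variable names are illustrative and are not part of the product.

```python
# Figures copied from the Tenant Portal load test results table:
# (maximum concurrent users, portal instances, requests per second)
RESULTS = [
    (3000, 1, 97),
    (5000, 2, 165),
    (7500, 4, 214),
    (10000, 6, 258),
]


def per_instance_throughput(results):
    """Return (instances, requests per second per instance) for each row."""
    return [(instances, rps / instances) for _, instances, rps in results]


for instances, rps_each in per_instance_throughput(RESULTS):
    print(f"{instances} instance(s): {rps_each:.1f} requests/sec per instance")
```

Throughput per instance falls from 97 requests per second with one instance to 43 with six, which is the non-linear scaling described in observation 3.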

Load Test Results for the Tenant API

The table below displays the throughput of the Tenant API observed during the load test. The average CPU utilization observed on each Tenant API instance was about 80% and the load on the Tenant API instance(s) increased as the number of Tenant Portal instances was increased.

Requests per Second    # of Tenant API Instances
                       (2 Cores / 4 GB RAM per VM)
164                    1
252                    2

Tenant API Load Test Observations

  1. A single instance of the Tenant API was observed to reliably support 164 requests per second with an average CPU utilization of 80%.

  2. A single instance of the Tenant API was able to support 10000 concurrent users at 60% average CPU utilization with each user configured to initiate one request per minute.

  3. Scaling out the Tenant API from a single instance to two instances was observed to yield a 50% increase in throughput.
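As a rough cross-check of observations 1 and 2, the request rate generated by a population of users can be estimated from the interval between their requests. This is a sketch only; the function name and the assumption that requests are spread evenly over time are illustrative and are not taken from the load test itself.

```python
def requests_per_second(concurrent_users, seconds_between_requests=60):
    """Estimate the steady request rate if requests are evenly spread."""
    return concurrent_users / seconds_between_requests


rate = requests_per_second(10000)
print(f"{rate:.1f} requests/sec")
```

With 10000 users each initiating one request per minute, the estimate is about 167 requests per second, which is close to the 164 requests per second that a single Tenant API instance was observed to support.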

How to determine if a Service Management Portal or Service Management API is CPU bound

The Service Management Portals and Service Management APIs are critical components of a Windows Azure Services for Windows Server private cloud solution. To ensure that the performance of your private cloud solution is not bottlenecked by the computers or VMs running the Service Management Portals or Service Management APIs, periodically evaluate the performance of these components and consider scaling out any Portal or API instance that consumes 70% or more of the available CPU resources on the computer or VM on which the instance is running. To evaluate the % Processor Time used by the IIS Worker Process that is hosting a Service Management Portal or Service Management API, follow these steps:

  1. Complete the steps in Microsoft Knowledge Base article 281884, The Process object in Performance Monitor can display Process IDs (PIDs) (https://support.microsoft.com/kb/281884) to configure Performance Monitor to display the Process object with the Process ID (PID) associated with the Process object. This must be done so that you can identify the IIS Worker Process that is hosting a Service Management Portal or Service Management API and subsequently measure the % Processor Time and/or Requests / Sec of the associated IIS Worker Process.

  2. After you have applied the Registry change documented in The Process object in Performance Monitor can display Process IDs (PIDs), open Task Manager, click More Details at the bottom left of Task Manager, click the Details tab and then click the Name column heading to sort all running processes by name. Scroll through the sorted list of entries in the Name column until you locate one or more processes with the name w3wp.exe.

  3. If there is not already a column heading with the name PID displayed, right-click one of the existing column headings, click Select columns from the context menu, check the box to display the PID column and then click OK. Task Manager should now display a PID column along with the other columns.

  4. If there is not already a column heading with the name User name displayed, right-click one of the existing column headings, click Select columns from the context menu, check the box to display the User name column and then click OK. Task Manager should now display a User name column along with the other columns.

  5. Left-click the PID column heading and drag it next to the Name column heading. You can now see the process ID (PID) associated with any running instances of the IIS Worker Process w3wp.exe.

  6. Left-click the User name column heading and drag it to the right of the PID column heading. This makes it easy to identify the process ID of the IIS Worker Process associated with a running Service Management Portal or Service Management API instance, because the Service Management Portals and Service Management APIs all run in application pools that are configured with an Identity of ApplicationPoolIdentity and a Name that describes the service. As a result, the User name property of each of these processes matches the friendly name of the service, so it is a simple matter to associate an instance of w3wp.exe with the service it is hosting. Because you have applied the registry entry described in The Process object in Performance Monitor can display Process IDs (PIDs), each instance of the Process object in Performance Monitor also shows its Process ID (PID). Once you know the PID of an IIS worker process listed in Task Manager, you can correlate it to the corresponding Process object in Performance Monitor, add performance metrics for the process (such as CPU utilization or Requests / Sec) to the Performance Monitor graph, and evaluate these metrics in real time.

  7. Open Performance Monitor, expand Monitoring Tools and then click the green plus sign at the top of the graph to open the Add Counters dialog box.

  8. Scroll through the list of counters for <local computer> until you find the Process counter. Select the Process counter to populate the list below which includes all running processes on the computer or VM. Scroll down the list of running processes until you locate one or more instances of the w3wp process.

  9. Because of the registry entry modification applied previously, each instance of w3wp should be appended with an underscore and a number that correlates to the Process ID (PID) of the process. Select each instance of w3wp with a process ID correlating to an instance of Service Management Portal or Service Management API, as described in step 6. If you need to select more than one instance press and hold down the CTRL key on your keyboard while selecting instances.

  10. Click the down arrow next to Process to see the list of Process counters that can be added to Performance Monitor and select only the % Processor Time counter.

  11. Click the Add>> button under the list of process instances to add the % Processor Time counter for each selected process to the Performance Monitor chart and then click OK. You can now quickly determine the CPU utilization associated with any running instance of Service Management Portal or Service Management API. If CPU utilization for any of these processes, singly or collectively (when multiple services are installed on the computer or VM), approaches 80%, then these services are likely causing, or will soon cause, a CPU bottleneck that hinders performance of Windows Azure Services for Windows Server.
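Once % Processor Time samples have been recorded for each w3wp instance, they can be compared against the 70% scale-out guideline given at the start of this section. The sketch below is illustrative only: the instance names and sample values are hypothetical, and dividing by the number of logical processors is an assumption that the recorded per-process values are summed across processors rather than already expressed as a share of the whole computer or VM.

```python
from statistics import mean

SCALE_OUT_THRESHOLD = 70.0  # percent of available CPU, from the guidance above


def cpu_bound_instances(samples_by_instance, logical_processors=1):
    """Return instances whose average CPU share meets the threshold.

    samples_by_instance maps an instance name (for example "w3wp_4312")
    to a list of % Processor Time samples. Dividing by logical_processors
    is an assumption; pass 1 if the samples are already normalized.
    """
    flagged = {}
    for name, samples in samples_by_instance.items():
        share = mean(samples) / logical_processors
        if share >= SCALE_OUT_THRESHOLD:
            flagged[name] = share
    return flagged


# Hypothetical samples for two worker processes on a 2-core VM.
samples = {
    "w3wp_4312": [150.0, 160.0, 155.0],
    "w3wp_5120": [40.0, 55.0, 45.0],
}
print(cpu_bound_instances(samples, logical_processors=2))
```

Any instance that the function returns is a candidate for scaling up or scaling out.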

General Recommendations when Capacity Planning for the Service Management Portals and the Service Management APIs

In a typical production scenario the Service Management Tenant Portal and Service Management Tenant API have significantly higher load requirements than the Service Management Admin Portal and Service Management Admin API. In a production environment, consider installing more than one instance of the Service Management Tenant Portal and Service Management Tenant API to provide high availability and to ensure satisfactory performance during peak load times. When installing multiple instances of the Service Management Tenant Portal or Service Management Tenant API, use a hardware load balancer or Network Load Balancing (NLB) to distribute load equitably across the instances. As load requirements increase you can either continue to scale out by installing additional instances of the Tenant Portal and Tenant API or scale up existing Tenant Portal and Tenant API instances with additional CPU cores or memory. To evaluate CPU utilization of a Tenant Portal or Tenant API instance, follow the steps in How to determine if a Service Management Portal or Service Management API is CPU bound. If you determine that a Tenant Portal or Tenant API instance is CPU bound, consider scaling up that instance or scaling out by installing additional instances to provide satisfactory performance for your end users/Tenants. You should also periodically evaluate performance of the Service Management Admin Portal and Service Management Admin API and scale these up or out as necessary.

Capacity Planning for Windows Azure Services Web Site Cloud

Capacity planning for a Windows Azure Services Web Site Cloud is largely a function of properly configuring and tuning the Web Sites roles that are provisioned for the Web Site Cloud. Windows Azure Services Web Sites roles can be installed on physical computers or on Hyper-V virtual machines. Regardless of whether Web Sites roles are installed on physical hardware or on virtual machines, the host operating system must be Windows Server 2012. As the performance gap between virtual machines on Hyper-V and physical hardware continues to shrink, the advantages of running production software on virtual machines increase. For the great majority of usage scenarios, installing Web Sites roles on Hyper-V virtual machines provides a combination of excellent performance, superior flexibility and convenience that outweighs any minor performance gains that may be possible when installing Web Sites roles on physical hardware.

General Recommendation when Capacity Planning for Windows Azure Services Web Site Cloud

The list below provides a summary of the Web Sites roles to be installed on each Web Site Cloud as well as specific recommendations for optimizing performance of these roles:

  • Controller Role – The Web Sites Controller role is responsible for provisioning and managing the other Web Sites roles and runs an enhanced version of the Web Farm Framework (WFF). The Web Sites Controller maintains detailed logging while provisioning the remaining Web Sites roles and should be the first place to check if any problems occur while installing Web Sites roles. The Web Sites Controller WebFarmService.exe service monitors running processes, active requests and the status of any computers or VMs running other Web Sites roles required by the Web Sites Cloud. While the Web Sites Controller serves a vital function it does not directly engage in handling client requests and does not typically cause any performance bottlenecks in a Web Sites Cloud. A second Web Sites Controller can be created with a PowerShell cmdlet to provide fault tolerance / redundancy for the Web Sites Controller role.

  • Front End Role – The Web Sites Front End role consists of an IIS web server running the Application Request Routing (ARR) IIS module. It is recommended, although not strictly required, that a hardware load balancer or NLB be configured to initially receive and forward incoming requests to the Front End role computer(s) or VM(s). High availability for the Web Site Front End role can be accomplished by configuring a hardware load balancer or NLB to use round robin DNS to route incoming requests to 2 or more computers or VMs configured with the Web Sites Front End role. If multiple computers or VMs are configured with the Web Worker role, the Front End role computers or VMs continually monitor the state of the Web Site Cloud to evaluate which Web Worker role computers or VMs are best equipped to handle incoming requests and then route these requests accordingly. The Front End role computers or VMs are typically not a bottleneck for a Web Site Cloud but can be CPU intensive; the maximum throughput observed in lab conditions is ~100 requests per second per core, without consideration of the additional overhead required when processing and routing inbound SSL traffic. Consider deploying more than one instance of the Front End role for purposes of fault tolerance / redundancy.

  • REST API Role – The Web Sites REST API role exposes the Web Sites Management API via a REST endpoint. If several tenants are creating sites or performing other management tasks, computers or VMs running the REST API role may experience heavy CPU utilization, potentially creating a bottleneck on the Web Sites Cloud. The Web Sites REST API role requires only about 4 GB RAM in a production environment and if possible should be assigned two cores. Consider deploying more than one instance of the REST API role for purposes of fault tolerance / redundancy. To provision an additional computer or VM to run the REST API role follow these steps:

    1. Complete the steps outlined in the “Systems Requirements” and the “Role Account Preparation” sections of the Service Management Portal / Service Management API Installation Guide on a computer or VM that is running Windows Server 2012 to prepare the computer or VM to run a web sites role.

    2. Run the following PowerShell commands on the computer or VM configured to run the Controller role; specify the name of the newly prepared computer or VM for the <NewManagementServer> parameter:

      Add-PSSnapin WebHostingSnapin
      New-ManagementServer -ManagementServerName <NewManagementServer>

  • Publisher Role – The Publisher role computer or VM hosts both a web deploy and FTP service for purposes of application deployment. If several tenants are publishing simultaneously, computers or VMs running the Publisher role may experience heavy CPU utilization, potentially creating a bottleneck on the Web Sites Cloud. Consider deploying more than one instance of this role to ensure reliability.

  • File Server Role – The File Server role computer or VM houses all of the application files for every web site configured to run on the Web Site Cloud. When hundreds or even thousands of web sites are loading application files from the File Server role computer or VM, it will likely present a performance bottleneck if the underlying file storage is not optimized. For production environments the File Server role computer should use a dedicated SAN or high performance NAS to ensure acceptable performance. Because the file system in use will almost always be the source of any bottleneck for the File Server role computer, there is typically little performance to be gained by provisioning additional File Server role computers that share the same file system infrastructure. To provide fault tolerance for the File Server role the focus should typically be on the underlying file system.

  • Web Worker Role – The Web Worker role is responsible for processing outstanding client requests. The most critical resource for a Web Worker role computer or VM is available memory. When hosted web sites do not have sufficient memory, web site performance will incur a significant performance penalty when virtual memory is swapped from disk to accommodate the shortfall. It is recommended to provision at least 3 Web Worker role computers or VMs in production environments, more as dictated by memory requirements. To determine the number of Web Worker role computers or VMs to deploy consider the following:

    1. Each Web Worker role computer or VM requires ~ 1.2 GB of RAM for the operating system. RAM available above this threshold can be used to run web sites.

    2. On average, based on observed production workloads, roughly 5% of web sites in a Web Site Cloud are active. The percentage of web sites that are active at any given moment can be significantly higher or lower. When assuming an “active web site” rate of 5%, one may extrapolate the maximum number of web sites that can be provisioned to a Web Site Cloud as 100% / 5% or 20x the number of active web sites.

    3. The average memory footprint observed for websites running in production environments is ~ 70 MB. When using the observed production web site memory footprint of 70 MB, the amount of memory that should be allocated across all Web Worker role computers or VMs installed on a Web Site Cloud may be extrapolated as follows:

      (# of provisioned web sites * 70 MB * 5%) + (# of Web Worker roles * 1.2 GB)

      For example, if 5,000 web sites are provisioned on a Web Site Cloud that is running 10 Web Worker role computers or VMs, then the Web Worker role computers or VMs should be allocated about 29,500 MB of RAM in total, or roughly 2,950 MB (about 3 GB) each, determined as follows (taking 1.2 GB as 1,200 MB):

      (5,000 * 70 MB * .05) + (10 * 1,200 MB) = 17,500 MB + 12,000 MB = 29,500 MB

      This value is based on the percentage of active web sites observed in a production environment and the average memory footprint of web sites observed in a production environment.

  • SQL Server Database – Windows Azure Services Web Site Cloud makes extensive use of SQL Server. To ensure that SQL Server performance does not hinder overall performance of the Web Site Cloud, consider following these guidelines for allocating RAM, Disk, and CPU resources:

    1. Allocate at least 4 GB of RAM to your SQL Server for every 30,000 sites that are provisioned. When allocating resources for SQL Server consider that available memory is often the resource upon which SQL Server performance is most dependent. SQL Server will use as much memory as you allocate to it and SQL performance will benefit from additional memory for most scenarios.

    2. Allocate at least 4 GB of disk space to your SQL Server disks for every 10,000 sites that are provisioned. Additionally, perform testing to verify that your SQL Server disk subsystem allows your SQL Server installation to handle multiple simultaneous requests effectively. For information about how to simulate SQL Server activity on a disk subsystem see How to use the SQLIOSim utility to simulate SQL Server activity on a disk subsystem (https://support.microsoft.com/kb/231619).

    3. Allocate an additional CPU core to your SQL Server computer or VM for each increment of 750 actively running web sites / 15000 provisioned sites or when the % Processor Time of the SQL Server service as observed in Task Manager or Performance Monitor approaches 70%.
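The three SQL Server guidelines above can be combined into a single estimate for a given number of provisioned sites. The following is a minimal sketch, not a definitive sizing tool: the function name is hypothetical, and rounding RAM and disk up to the next 4 GB step while counting only complete increments of 15,000 provisioned sites for additional cores are assumptions about how the guidelines apply between the stated thresholds.

```python
import math


def sql_server_sizing(provisioned_sites):
    """Apply the SQL Server RAM, disk, and CPU guidelines to a site count."""
    return {
        # At least 4 GB of RAM for every 30,000 provisioned sites.
        "min_ram_gb": 4 * math.ceil(provisioned_sites / 30000),
        # At least 4 GB of disk space for every 10,000 provisioned sites.
        "min_disk_gb": 4 * math.ceil(provisioned_sites / 10000),
        # One additional core per 15,000 provisioned sites
        # (750 actively running sites at the 5% active rate).
        "additional_cores": provisioned_sites // 15000,
    }


print(sql_server_sizing(45000))
```

For 45,000 provisioned sites this yields at least 8 GB of RAM, at least 20 GB of disk space, and 3 additional cores. The % Processor Time check described in guideline 3 still applies regardless of the estimate.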

Capacity Planning for Windows Azure Services VM Cloud

Not surprisingly, the primary variable to consider when provisioning resources for VM Cloud hosting on Windows Azure Services for Windows Server is the number of Virtual Machines in use by subscribers to plans that provide VM Hosting. Appropriately sized hardware and software resources must be allocated to ensure satisfactory performance of subscribers’ Virtual Machines and the VM interface exposed by the Tenant portal.

General Recommendation when Capacity Planning for Windows Azure Services VM Cloud

The System Center Test team has developed the following guidance upon review of performance metrics observed in the System Center Lab running VMM Server with SPF Server to provide VM Cloud hosting to Windows Azure Services for Windows Server:

  1. Regardless of the number of VMs to be hosted, it is recommended that you host a maximum of 5000 VMs per VMM Server. To provide VM Hosting capabilities for between 1 and 5000 VMs, a minimum of 4 servers should be provisioned to ensure satisfactory performance:

    • A single VMM Server

    • A single VMM SQL Server

    • A single SPF Server

    • A single SPF SQL Server

    For this scenario, each computer should be configured with 4 cores and 8 GB of RAM.

    Note

    When provisioning the SQL Server computers required by VMM Server and SPF Server it is recommended to host the SQL Server databases on a high performance SAN with separate LUNs available to house the data files, the log files, and the tempdb database.

    Note

    Ensure that a 1:1 ratio of VMM Servers to SPF Servers is maintained for all scenarios. If this ratio is not maintained then the VM interface exposed in the Tenant Portal can become less responsive and provide unsatisfactory performance from the perspective of a subscribing VM end user / tenant. It is also recommended that a 1:1 ratio of VMM Servers to VMM SQL Servers is maintained.

  2. Use the values in the table below to determine how many VMM Servers, VMM SQL Servers, SPF Servers and SPF SQL Servers to provision in order to host the number of VMs specified in the # VMs column:

    # VMs      # VMM Servers                     # VMM SQL Servers                 # SPF Servers                     # SPF SQL Servers
               (Cores / RAM per server)          (Cores / RAM per server)          (Cores / RAM per server)          (Cores / RAM per server)
    < 5000     1 computer, 4 cores / 8 GB RAM    1 computer, 4 cores / 8 GB RAM    1 computer, 4 cores / 8 GB RAM    1 computer, 4 cores / 8 GB RAM
    < 10000    2 computers, 4 cores / 8 GB RAM   2 computers, 4 cores / 8 GB RAM   2 computers, 4 cores / 8 GB RAM   1 computer, 8 cores / 8 GB RAM
    < 15000    3 computers, 4 cores / 8 GB RAM   3 computers, 4 cores / 8 GB RAM   3 computers, 4 cores / 8 GB RAM   1 computer, 8 cores / 8 GB RAM
    < 20000    4 computers, 4 cores / 8 GB RAM   4 computers, 4 cores / 8 GB RAM   4 computers, 4 cores / 8 GB RAM   1 computer, 8 cores / 8 GB RAM
    < 25000    5 computers, 4 cores / 8 GB RAM   5 computers, 4 cores / 8 GB RAM   5 computers, 4 cores / 8 GB RAM   1 computer, 8 cores / 8 GB RAM
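The table follows a simple pattern: one VMM Server, VMM SQL Server and SPF Server per 5000 VMs, and a single SPF SQL Server that moves from 4 to 8 cores once more than one VMM Server is needed. The sketch below encodes that pattern. The function name is hypothetical, and two choices are assumptions: a count of exactly 5000 VMs is treated as fitting on one VMM Server (following the stated maximum of 5000 VMs per VMM Server), and counts above 25000 are rejected because the guidance does not cover them.

```python
import math

VMS_PER_VMM_SERVER = 5000  # recommended maximum per VMM Server
MAX_VMS_IN_TABLE = 25000   # largest VM count covered by the table


def vm_cloud_servers(vm_count):
    """Return the server counts suggested by the table for vm_count VMs."""
    if not 1 <= vm_count <= MAX_VMS_IN_TABLE:
        raise ValueError("the guidance covers 1 to 25000 VMs")
    servers = math.ceil(vm_count / VMS_PER_VMM_SERVER)
    return {
        "vmm_servers": servers,      # 4 cores / 8 GB RAM each
        "vmm_sql_servers": servers,  # 4 cores / 8 GB RAM each
        "spf_servers": servers,      # 4 cores / 8 GB RAM each
        "spf_sql_servers": 1,        # always a single computer, 8 GB RAM
        "spf_sql_cores": 4 if servers == 1 else 8,
    }


print(vm_cloud_servers(12000))
```

For 12000 VMs the sketch returns three VMM Servers, three VMM SQL Servers, three SPF Servers, and one SPF SQL Server with 8 cores, matching the "< 15000" row and preserving the 1:1 ratio of VMM Servers to SPF Servers.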