Managing resource efficiency in Azure and on-premises datacenters

05/11/2018

Technical Case Study

July 2015

When IT professionals transition to a hybrid cloud environment, they face the challenge of efficiently allocating server infrastructure resources. How is Microsoft IT approaching these challenges and controlling costs? This technical case study describes the strategic approach to analyzing and identifying underutilized resources and implementing the solutions used by Microsoft IT.

Situation: When deciding to transition all servers to Microsoft Azure, Microsoft needed a strategic approach to identify cold and frozen servers and to manage server migrations.

Solution: Identifying servers that are underutilized is the starting point. SCOM provides a customized report that classifies servers into five categories: frozen, cold, warm, hot, and on fire. The process, known in the industry as a “scream test,” gradually shuts down the functionality of underutilized servers (cold or frozen) until the owner responds. Then, a decision can be made about the disposition of the hardware.

Benefits:
Saved over $30M in seven months of total annualized cost avoidance.
Reduced total footprint 15 percent.
Decommissioned 7,000 servers, migrated 6,000+ servers to Azure, and enabled a lower fully burdened rate than on-premises servers.
Made maintenance and management easier because all Azure servers are on current, compliant configurations.
Increased density and efficiency of application infrastructure builds, beginning at the engineering level.
Improved asset management, ensuring the correct server owner is identified, enabling application-to-server mapping.

Products & Technologies:
Microsoft Azure
Azure Active Directory with Group Policy
WMI
System Center Operations Manager

Situation

Every business, even one running only a handful of servers, needs to manage server provisioning, allocation, overutilization, underutilization, and lifecycle efficiently. At Microsoft, many servers incur significant overhead hosting charges because:

Many poorly utilized assets could be removed or better optimized, both in datacenters on-premises and in Azure virtual servers.
Lifecycle management is receiving much-needed attention. For example, the team scrutinized areas that include new servers ordered too soon, servers that have been abandoned but not decommissioned, and servers that are overprovisioned compared to actual usage.
Inaccurate asset management information is problematic and affects the financial entities that manage purchasing and depreciation.
Servers that are overprovisioned and underutilized cost the company money, and that cost adds up regardless of the number of servers in use. There are no mandated and managed on/off protocols in place, and Azure virtual machines are expensive to leave running when they are not in use.
When server owners are asked about server usage, they typically respond that they need the server. In some cases, this is asset hoarding. Requests for exceptions to established usage metrics must be scrutinized to ensure that needs are not exaggerated.

Figure 1. Hosting Resource and Recovery (HRR) program team vision for server and application lifecycles.

Servers on-premises and in Microsoft Azure see four common usage patterns in which a server is not fully utilized at all times. Unlike Azure virtual machines, on-premises servers cannot be resized or turned off on demand, so they incur the fully burdened usage rate whether they are used or not.

On and off: On-and-off workloads (for example, batch jobs and test environments) waste overprovisioned capacity; time to market can be cumbersome.
Growing fast: Successful services need to scale to keep up with growth; this is a big IT challenge when IT cannot provision hardware fast enough.
Unpredictable bursting: Unexpected or unplanned peaks in demand cause sudden spikes and impact performance. This means IT cannot overprovision for extreme cases in a cost-conscious manner.
Predictable bursting: Services with micro-seasonality trends (for example, retail seasonality) peak during periodic increased demand. This increases IT complexity, which wastes overprovisioned capacity.

Solution

The Microsoft Hosting Resource and Recovery (HRR) program is a sustainable, end-to-end solution that lets Microsoft teams first examine and categorize server usage, and then bring underutilized servers into a tiered-response acquisition process. The process might end with a server being decommissioned, reallocated, or handled through another appropriate action. The HRR program’s overall goal is to drive proper usage of Microsoft IT capacity and to reduce the unused footprint by targeting such areas as underutilized servers, datacenter closures, fully depreciated hardware, and noncompliant categories.

System Center Operations Manager monitoring can be used to collect performance details and aggregate this data into categories of utilization from frozen to hot. (Note that most IT organizations running Windows, Azure Active Directory with Group Policy, or Windows Management Instrumentation (WMI) can easily gather similar data points.)

Identifying poorly utilized servers is the starting point. The HRR team monitors on-premises datacenters and Azure virtual servers daily, using System Center Operations Manager (SCOM) performance analytics. Processor, network, and hard drive data is gathered, and the team then makes a determination based on codified thresholds. The focus is on servers that are underutilized in terms of computing and throughput. SCOM provides a customized report that classifies servers into five categories: frozen, cold, warm, hot, and on fire.
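The classification step described above can be sketched in a few lines of Python. The case study does not publish the codified thresholds, so the percentage boundaries, the decision to score a server by its busiest resource, and every name in this example are assumptions made for illustration only.

```python
from dataclasses import dataclass

# Hypothetical upper bounds (percent utilization) for each category.
# The case study does not state the real thresholds.
THRESHOLDS = [
    (5.0, "frozen"),
    (20.0, "cold"),
    (60.0, "warm"),
    (90.0, "hot"),
]


@dataclass
class ServerSample:
    name: str
    cpu_pct: float      # average processor utilization
    network_pct: float  # average network utilization
    disk_pct: float     # average hard drive utilization


def classify(sample: ServerSample) -> str:
    """Return one of: frozen, cold, warm, hot, on fire."""
    # Assumption: a server is as busy as its busiest resource.
    score = max(sample.cpu_pct, sample.network_pct, sample.disk_pct)
    for upper_bound, category in THRESHOLDS:
        if score < upper_bound:
            return category
    return "on fire"


if __name__ == "__main__":
    idle = ServerSample("srv-idle", cpu_pct=1.2, network_pct=0.4, disk_pct=2.0)
    busy = ServerSample("srv-busy", cpu_pct=97.0, network_pct=40.0, disk_pct=55.0)
    print(idle.name, classify(idle))
    print(busy.name, classify(busy))
```

Keeping the thresholds in one table means they can be tuned without touching the classification logic.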

Underutilized servers are often abandoned, either because they are no longer needed or as a result of hardware degradation. Underutilized servers that are on-premises are prime candidates to move to the cloud. The team is migrating OS instances and SQL databases to Azure at a fast pace to vacate datacenters. Analyzing cold or frozen Azure servers ensures that these servers are either resized up or down, or turned off when not in use. The team also maintains an exception process for known scenarios where servers will be underutilized, such as disaster recovery and inactive cluster nodes.
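As a rough illustration of how the exception process might interact with the utilization categories, the following Python sketch selects cold or frozen servers that have no approved exception. The record layout (a dictionary with name, category, and exception keys) and the function name are hypothetical. Only the two exception scenarios come from the text above.

```python
# Exception reasons named in the case study; the string values are assumed.
EXCEPTION_REASONS = {"disaster recovery", "inactive cluster node"}


def follow_up_candidates(servers):
    """Return names of cold or frozen servers with no approved exception.

    Each server is a dict with 'name', 'category', and an optional
    'exception' key (a hypothetical record layout).
    """
    candidates = []
    for server in servers:
        underutilized = server["category"] in {"cold", "frozen"}
        exempt = server.get("exception") in EXCEPTION_REASONS
        if underutilized and not exempt:
            candidates.append(server["name"])
    return candidates


if __name__ == "__main__":
    inventory = [
        {"name": "app-01", "category": "frozen"},
        {"name": "dr-01", "category": "cold", "exception": "disaster recovery"},
        {"name": "web-01", "category": "hot"},
    ]
    print(follow_up_candidates(inventory))
```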

Tiered response acquisition process

The HRR team uses Azure Active Directory with Group Policy Objects on Organizational Units to gradually limit and reduce the functionality of servers to elicit server owner responses.

This process, widely known in the industry as a “scream test,” gradually shuts down the functionality of servers that are reporting as underutilized (cold or frozen) until the owner responds. A decision can then be made about the disposition of the hardware.

The HRR team typically uses this tiered-response acquisition process over a period of four weeks, with one week allocated to each stage. The timeline for each stage is easily adjustable. A huge benefit of using the Azure Active Directory option is that, by dragging and dropping servers into different organizational units, each stage can be initiated or reversed very quickly.

Stage 1: A custom logon window asks the user to contact the IT department.

Stage 2: The server begins restarting every four hours.

Stage 3: All services are disabled and the server is unusable.

Stage 4: Permission to decommission is requested.
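The four-stage timeline can be expressed as a small calculation. This Python sketch assumes that the clock starts on the day a server enters Stage 1, that each stage lasts a configurable number of days (seven by default, matching the one-week allocation), and that a server remains at Stage 4 once the last window has passed. It does not model the owner responding, which would take the server out of the process. The function and variable names are illustrative.

```python
from datetime import date
from typing import Optional

STAGES = {
    1: "Custom logon window asks the user to contact the IT department",
    2: "Server restarts every four hours",
    3: "All services are disabled and the server is unusable",
    4: "Permission to decommission is requested",
}


def current_stage(start: date, today: date, days_per_stage: int = 7) -> Optional[int]:
    """Return the stage (1-4) a server is in, or None if not yet started.

    Assumption: servers past the final window stay at stage 4.
    """
    elapsed = (today - start).days
    if elapsed < 0:
        return None
    return min(elapsed // days_per_stage + 1, len(STAGES))


if __name__ == "__main__":
    stage = current_stage(date(2015, 7, 1), date(2015, 7, 10))
    print(stage, STAGES[stage])
```

Passing a different days_per_stage value reflects the point that the timeline for each stage is adjustable.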

When the server owner responds, the team can discuss several possible outcomes with the owner: downsizing, right-sizing, or decommissioning.

Figure 2. Azure use cases

Azure snooze and resize

By properly leveraging the capabilities built in with Azure, wasted costs due to overprovisioning can be recovered. Cloud infrastructure enables elasticity by meeting the various infrastructure fluctuations. Azure infrastructure costs and usage can now reflect the actual demand and usage of your cloud infrastructure according to the four common use cases. Azure billing is a direct reflection of what is turned on and used, not an estimated footprint.

On and off: Users can turn off their Azure virtual machines when not in use and immediately turn them back on when they are in use. The server owner does not pay for servers when they are off.
Growing fast: Azure virtual machine deployments can be built within a few minutes versus a few days for on-premises datacenters, allowing service owners to provision servers as needed when their application is growing. The owner does not need to provision additional servers too early, before they are used, and ultimately pay for unused servers.
Unpredictable bursting: Azure virtual machine deployments can be built within a few minutes versus a few days for on-premises datacenters, which allows service owners to provision servers as needed when their application is growing. Owners can provision additional virtual machines for a periodic, unpredicted burst in demand.
Predictable bursting: Servers can be provisioned to meet maximum capacity demand but can be turned off or on, or resized, in times of low usage. This enables the owner to pay for the servers only when they are being used, while still having the capacity available to scale up on demand.
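The on-and-off case lends itself to a simple worked calculation. The Python sketch below uses a deliberately simplified billing model in which a virtual machine is charged a flat hourly rate only while it is turned on; it ignores any charges that might continue while a machine is off. The hourly rate in the example is a made-up figure, not an Azure price.

```python
HOURS_PER_WEEK = 24 * 7


def weekly_cost(hourly_rate: float, hours_on: float) -> float:
    """Cost of a VM billed only for the hours it is turned on."""
    if not 0 <= hours_on <= HOURS_PER_WEEK:
        raise ValueError("hours_on must be between 0 and 168")
    return hourly_rate * hours_on


def weekly_savings(hourly_rate: float, hours_on: float) -> float:
    """Savings compared with leaving the same VM on all week."""
    always_on = weekly_cost(hourly_rate, HOURS_PER_WEEK)
    return always_on - weekly_cost(hourly_rate, hours_on)


if __name__ == "__main__":
    rate = 0.50          # hypothetical hourly rate
    hours = 10 * 5       # ten hours a day, five days a week
    print("always on:", weekly_cost(rate, HOURS_PER_WEEK))
    print("scheduled:", weekly_cost(rate, hours))
    print("savings:  ", weekly_savings(rate, hours))
```

Under these assumptions, a test server that runs only during a ten-hour working day on weekdays is billed for 50 of the 168 hours in a week.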

Benefits

Cost reduction: Over $30M of total annualized cost avoidance, achieved in seven months.
Smaller datacenter footprint: The HRR team reduced the total footprint by 15 percent, decommissioned 7,000 servers, migrated more than 6,000 servers to Azure, and enabled a lower fully burdened rate than on-premises servers.
Easier maintenance and change/configuration management because all Azure servers are on current, compliant configurations.
Higher density and efficiency of application infrastructure builds, beginning at the engineering level.
Improved asset management, which ensures that the correct person is identified as owning a server and enables application-to-server mapping.
Security and risk management, including a reduction in redundancy and less downtime.
Resource and churn reduction enabled by simple, self-service, automated tooling at a single source for owners to leverage.
Alignment with future IT infrastructure capabilities, meaning fully elastic infrastructure.

The tiered-response acquisition process has processed 15,000 servers in seven months, generating a 98 percent response rate within 30 days, without a resource-intensive team.

Best practices

Change existing capacity allocation thinking in IT:

Provision only what you need.
Decommission what is not needed.
Drive to the cloud.
Move to an elastic infrastructure that grows and shrinks with demand.

Define new capacity models for use in the cloud:

Implement an "Azure first" mindset.
Migrate on-premises to Azure to fit the new IT mindset.
Resize to fit the need.
Schedule server hibernation or shutdown through automation.
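The scheduling practice in the list above can be illustrated with a minimal decision function. This Python sketch assumes a fixed weekday schedule of 08:00 to 18:00 local time, which is an invented example, and it only decides what an automation job should do. The actual start and stop calls against Azure are intentionally left out.

```python
from datetime import datetime, time

# Hypothetical schedule: weekdays, 08:00 to 18:00 local time.
WORK_START = time(8, 0)
WORK_END = time(18, 0)


def should_be_running(now: datetime) -> bool:
    """Return True if the VM should be on at the given local time."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = WORK_START <= now.time() < WORK_END
    return is_weekday and in_hours


def action_for(now: datetime, is_running: bool) -> str:
    """Decide what an automation job would do: 'start', 'stop', or 'none'."""
    wanted = should_be_running(now)
    if wanted and not is_running:
        return "start"
    if not wanted and is_running:
        return "stop"
    return "none"


if __name__ == "__main__":
    print(action_for(datetime(2015, 7, 6, 20, 0), is_running=True))
```

A job that runs this check periodically would turn machines off outside the schedule and back on when the schedule resumes.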

For more information

For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at (800) 933-4750. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the World Wide Web, go to:

www.microsoft.com

www.microsoft.com/ITShowcase

© 2015 Microsoft Corporation. All rights reserved. Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners. This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.