Microsoft IT Improves Operational Efficiency and Reduces Costs with Microsoft System Center
Technical Case Study
Published: February 2014
The following content may no longer reflect Microsoft’s current position or infrastructure. This content should be viewed as reference documentation only, to inform IT business decisions within your own company or organization.
Microsoft IT discusses its use of the System Center suite of services and the business and operational benefits realized across Application and Infrastructure automation.
|Situation||Solution||Benefits||Products and Technology|
Like other enterprises, Microsoft Information Technology (Microsoft IT) strives for operational excellence. With over 30,000 servers to manage, they needed to ensure that the shared, foundational services they provide to Microsoft business units were streamlined and automated while reducing operating expenses.
Microsoft IT leveraged four key System Center 2012 components to deliver operational efficiencies and to maximize their value to the business by simplifying and automating how they implement software updates and distribution, infrastructure management, server virtualization, and application management.
The Manageability team within the Microsoft IT Service Deployment and Operations (SDO) division is responsible for managing the shared information technology foundation required to support internal line-of-business (LOB) applications.
With more than 30,000 servers to manage, thousands of line-of-business applications spanning the world and workloads increasing approximately 10 to 15 percent annually, it was not financially practical to increase resources—people or tools—linearly to keep pace.
Microsoft IT leveraged System Center 2012 components to deliver predictable, reliable, and cost-effective monitoring and automation to drive operational efficiency. The following sections discuss how the Manageability team within Microsoft IT Operations utilizes System Center products and the business value that each component provides.
System Center Configuration Manager
System Center Configuration Manager is the component that provides change and configuration management, and lets you perform tasks such as:
- Take an inventory of the hardware and software that is deployed across various assets.
- Deploy security and non-security updates to ensure servers are secure and stable.
- Detect configurations that are known to cause issues and even set those configurations to known good values.
Configuration Manager enabled Microsoft IT to:
- Gain insight: Routine collection of inventory keeps the Configuration Management Database (CMDB) accurate and supports numerous business intelligence (BI) reports.
- Maintain compliance: Security threats close quickly—95 percent compliance across more than 30,000 servers within 19 days, every month.
- Prevent known problems from reoccurring: Desired Configuration Management allows Microsoft IT to define and detect configuration drift and remediate systems back to a known good state.
Additionally, System Center 2012 Configuration Manager exposes a wealth of data through data sets in the Configuration Manager SQL database, the Configuration Manager Software Development Kit (SDK), and the Windows PowerShell module. This allows Microsoft IT to run queries and produce reports that consolidate information from throughout the organization.
Likewise, by using Configuration Manager, Microsoft IT is able to schedule deployments and configuration management policies within approved maintenance windows for each server, so that actions taken on systems are planned around key business times.
Since Microsoft IT runs the Configuration Manager platform as a centralized service, they have defined some basic measurements to ensure that the shared responsibility for compliance is being met. Two KPIs they use to evaluate this shared responsibility are:
- Is our agent installed on your system, and is our agent healthy (is the agent fully functioning and on your system at the time we are ready to patch)?
- Did we patch your system when we said we would patch your system?
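As a rough illustration, the two KPIs above could be computed from per-server records along the following lines. The `ServerRecord` fields and the sample data are hypothetical and do not reflect an actual Configuration Manager schema.

```python
from dataclasses import dataclass


@dataclass
class ServerRecord:
    """Hypothetical per-server record; not an actual Configuration Manager schema."""
    name: str
    agent_installed: bool
    agent_healthy: bool
    patched_in_window: bool


def agent_health_rate(servers: list[ServerRecord]) -> float:
    """KPI 1: share of servers with an installed, healthy agent."""
    if not servers:
        return 0.0
    healthy = sum(1 for s in servers if s.agent_installed and s.agent_healthy)
    return healthy / len(servers)


def patch_on_schedule_rate(servers: list[ServerRecord]) -> float:
    """KPI 2: share of servers patched in their agreed maintenance window."""
    if not servers:
        return 0.0
    return sum(1 for s in servers if s.patched_in_window) / len(servers)


# Illustrative sample data only.
servers = [
    ServerRecord("srv-01", True, True, True),
    ServerRecord("srv-02", True, False, False),
    ServerRecord("srv-03", True, True, True),
    ServerRecord("srv-04", False, False, False),
]
print(f"Agent health: {agent_health_rate(servers):.0%}")
print(f"Patched on schedule: {patch_on_schedule_rate(servers):.0%}")
```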
Asset inventory has become a data source that server and application administrators use as a rapid auditing tool. In order to manage the tens of thousands of servers at Microsoft, Microsoft IT collects large amounts of asset inventory information, and with that inventory data the business is able to do the following:
- Understand current capacity and plan for future needs by knowing how many processors and how much RAM and storage each system has.
- Identify systems that are running out-of-date software that may present security, stability, or performance issues.
- Create BI scenarios where business units (BUs) join the data available in CMDB into their queries and reports and then pivot on their business-specific data.
- Create custom solutions such as cross-product scenarios within System Center. For example, they can use relationship data to identify which virtual machines (VMs) are hosted on which Hyper-V hosts and craft an alert consolidation scenario such that if a Hyper-V host goes offline, alerts generated by all the VMs on that host are ignored or consolidated into a single alert.
Microsoft IT has a baseline set of attributes that are collected for the purposes of keeping the CMDB up to date. Beyond that, any team can request that extensions be added to the inventory. Once that data is being collected, it can be exposed via a SQL database for teams to run BI against or to extract, transform, and load (ETL) as they see fit.
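The alert consolidation scenario in the last item can be sketched as follows. This is a minimal illustration, assuming the VM-to-host relationship is available as a simple mapping. The function name, data shapes, and sample names are hypothetical and are not part of any System Center API.

```python
from collections import defaultdict


def consolidate_alerts(alerts, vm_to_host, offline_hosts):
    """Collapse alerts from VMs on offline hosts into one alert per host.

    alerts: list of (source, message) tuples
    vm_to_host: dict mapping VM name -> Hyper-V host name (from inventory)
    offline_hosts: set of host names known to be offline
    """
    passthrough = []
    suppressed = defaultdict(list)
    for source, message in alerts:
        host = vm_to_host.get(source)
        if host in offline_hosts:
            # The VM alert is a symptom of the host outage; hold it back.
            suppressed[host].append(source)
        else:
            passthrough.append((source, message))
    # Emit a single consolidated alert for each offline host.
    for host, vms in suppressed.items():
        affected = ", ".join(sorted(set(vms)))
        passthrough.append((host, f"Host offline; affected VMs: {affected}"))
    return passthrough


# Illustrative sample data only.
vm_to_host = {"vm-a": "hv-01", "vm-b": "hv-01", "vm-c": "hv-02"}
alerts = [
    ("vm-a", "Heartbeat failure"),
    ("vm-b", "Heartbeat failure"),
    ("vm-c", "Disk space low"),
]
consolidated = consolidate_alerts(alerts, vm_to_host, {"hv-01"})
for source, message in consolidated:
    print(source, "-", message)
```

Here the two heartbeat alerts from VMs on the offline host are replaced by one host-level alert, while the unrelated alert from `vm-c` passes through unchanged.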
Security Patch Deployment and Software Distribution
Minimizing the threat of vulnerabilities requires that organizations have properly configured systems, use the latest software, and install the recommended software updates. Assessing and maintaining the integrity of software in a networked environment through a well-defined patch management program is a key first step toward successful information security. Microsoft IT uses Configuration Manager as the primary solution in its server patch management process.
Microsoft takes security compliance very seriously. Compliance is everyone’s responsibility, and everyone is held accountable. Security patches are released on the second Tuesday of every month (known as Patch Tuesday). Microsoft IT uses a 19-day cycle to complete their security update deployments, and the goal is 95 percent compliance within that time frame.
The 19-day cycle, developed in cooperation with executive leadership, operations, server and application owners, and Information Security, balances the desire to reduce risk against the need to give the business time to prepare and orchestrate updates across test and production servers. The process drives the activities of the teams that are accountable for security patching. Microsoft IT took the following approach to patch deployment management and recommends that you develop a similar approach for addressing your security needs.
Within that 19-day cycle, Microsoft IT starts with the servers with the lowest risk to the business and progresses to the ones with the highest risk. They also provide a change request process to accommodate BUs needing to delay patching of their systems. Microsoft IT structured the 19-day cycle as follows:
- Process Initiation (week 1)
  - List of required updates delivered by the Information Security team.
  - Configuration Manager package developed and WSUS updates authorized.
  - Configuration Manager infrastructure patched.
- Smoke Test (week 1)
  - Configuration Manager deployment targeted to Development, Test, and UAT systems.
- Production Patching (week 2)
  - Each day is divided into six 2-hour maintenance windows. Server owners use CMDB to schedule their maintenance window.
  - Self-patch exception where asset owners patch their servers manually.
- Forced Patching (week 3)
  - Any remaining noncompliant servers updated in accordance with the customer selected maintenance window.
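To make the timing concrete, the following sketch computes Patch Tuesday for a given month and the date by which the 95 percent compliance goal applies. It assumes the 19 days are counted from Patch Tuesday itself, which the text does not state explicitly. The function names are illustrative.

```python
import calendar
from datetime import date, timedelta


def patch_tuesday(year: int, month: int) -> date:
    """Return the second Tuesday of the given month."""
    tuesdays = [
        d for d in calendar.Calendar().itermonthdates(year, month)
        if d.month == month and d.weekday() == calendar.TUESDAY
    ]
    return tuesdays[1]


def compliance_deadline(year: int, month: int, cycle_days: int = 19) -> date:
    """Assumption: the cycle is counted in calendar days from Patch Tuesday."""
    return patch_tuesday(year, month) + timedelta(days=cycle_days)


print(patch_tuesday(2014, 2))        # 2014-02-11
print(compliance_deadline(2014, 2))  # 2014-03-02
```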
Beyond security updates, Microsoft IT also uses Configuration Manager to quickly deploy non-security software updates to systems. To ensure that these updates have minimal impact on the business, Microsoft IT uses Configuration Manager features to insert the software update step into the standard patching process, distributing the update and executing the change within the same patch maintenance windows.
Desired State Management
The Desired Configuration Management (DCM) feature of Configuration Manager allows you to assess the compliance of servers with regard to a number of configuration items. Because configuration drift, often a result of change, can be a significant driver of incidents in any environment, Microsoft IT developed a desired configuration service that reports on any configuration drift.
The centralized DCM service enables Microsoft IT to drive compliance based on changes to global and LOB-specific application standards. Across the data center, Microsoft IT has global baselines—standards applied to each system they support. In addition, each BU has LOB-specific applications that have specialized configurations that they want to monitor. Configuration Manager can monitor and detect the global and application configurations.
Microsoft IT evaluates compliance daily. An errant change made by support personnel is therefore reflected within 24 hours in the report that the server owner receives.
System Center Operations Manager
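Conceptually, drift detection compares each server's observed settings against a baseline and reports or remediates the differences. The sketch below shows that idea in its simplest form. The setting names are hypothetical, and this is not how Configuration Manager represents configuration items internally.

```python
def detect_drift(baseline: dict, actual: dict) -> dict:
    """Return {setting: (expected, found)} for every setting that deviates."""
    return {
        key: (expected, actual.get(key))
        for key, expected in baseline.items()
        if actual.get(key) != expected
    }


def remediate(actual: dict, drift: dict) -> dict:
    """Return a copy of the configuration with drifted settings reset."""
    fixed = dict(actual)
    for key, (expected, _found) in drift.items():
        fixed[key] = expected
    return fixed


# Hypothetical baseline and observed configuration.
baseline = {"firewall_enabled": True, "rdp_nla_required": True, "page_file_mb": 4096}
actual = {"firewall_enabled": True, "rdp_nla_required": False, "page_file_mb": 4096}

drift = detect_drift(baseline, actual)
print(drift)                     # {'rdp_nla_required': (True, False)}
print(remediate(actual, drift))  # matches the baseline again
```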
Businesses, small and large, are typically dependent on the services and applications provided by their computing environment. IT departments are responsible for ensuring the performance and availability of those critical services and applications. That means that IT departments need to know when there is a problem, identify where the problem is, and determine the source of the problem—ideally before the users of the applications encounter the problem. The more computers and devices in the business, the more challenging this task becomes.
To address this, Microsoft IT centralized the monitoring platform using Operations Manager. Operations Manager provided Microsoft IT with:
- Centralized "Core" Alert Response: A single administration console is used by a central team for "core" alerts related to the hardware, OS, and core services (for example: Active Directory, DNS, and so on). Core issues are identified quickly and handled consistently.
- Extensive out-of-the box monitoring: Foundational monitoring is provided as part of Operations Manager. Retail Management Packs provide the base monitoring for all servers and Microsoft technologies.
- Extensive configurability: App-specific monitoring and customized UI can be built as needed.
In addition to the centralized "core" monitoring that is offered to everyone, BUs with specific needs can pay for additional monitoring services. In these cases, the centralized IT team maintains the Operations Manager deployment, while the business itself manages the application-specific monitoring configuration and responds to the resulting alerts. This enables Microsoft IT to centralize the overhead and expertise of infrastructure management, while distributing the specifics of how applications are monitored to the teams that know the applications the best. The following sections detail these two service offerings further.
Core Monitoring Services
Microsoft IT monitors the customer’s core OS and hardware and automatically responds to incidents through the use of Operations Manager and a set of tuned retail management packs. The Core Monitoring service enables performance and health reporting and is measured by the following KPIs:
- Agent health: Maintain a 98 percent healthy state by checking agent health multiple times per day.
- Alert to troubleshooting guide: Ensure a one-to-one ratio where every workflow that can generate an alert has a troubleshooting guide associated with it within the company knowledge base.
- Alert to action: Ensure that alerts which are repeatedly being closed immediately are reviewed, as it is very likely that those alerts are not actionable and could be tuned or turned off.
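The alert-to-action review can be approximated by counting how often each rule's alerts are closed almost immediately. The sketch below assumes a five-minute threshold and a minimum of three occurrences. Both values, and the rule names in the sample, are illustrative assumptions rather than figures from Microsoft IT.

```python
from collections import Counter
from datetime import timedelta


def tuning_candidates(closed_alerts, quick_close=timedelta(minutes=5), min_count=3):
    """Return rules whose alerts are repeatedly closed right away.

    closed_alerts: iterable of (rule_name, time_to_close) pairs.
    """
    quick = Counter(rule for rule, ttc in closed_alerts if ttc <= quick_close)
    return sorted(rule for rule, count in quick.items() if count >= min_count)


# Illustrative sample data only.
closed = [
    ("Disk latency warning", timedelta(minutes=1)),
    ("Disk latency warning", timedelta(minutes=2)),
    ("Disk latency warning", timedelta(seconds=30)),
    ("Service stopped", timedelta(hours=2)),
    ("Service stopped", timedelta(minutes=3)),
]
print(tuning_candidates(closed))  # ['Disk latency warning']
```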
For servers managed by Microsoft IT, alerts are automatically sent to the server incident management team, which centrally handles first touch on incidents related to IT servers.
Dedicated Monitoring Services
BUs that want to fully utilize Operations Manager can opt to have their own instance of Operations Manager deployed for their use. In these cases, the team that runs the centralized infrastructure for the core monitoring service takes responsibility for keeping the Operations Manager infrastructure and agents up, running, and healthy, because they already have the process, automation, and expertise in place for supporting infrastructure components at scale. The business unit is then delegated full control of the Operations Manager environment so that they can author monitoring as they see fit and respond to the resulting alerts. Business units typically use their dedicated instances to focus on one or more of the following scenarios:
- End-to-end business monitoring
- User experience monitoring
- Alert response
- Performance collection
- Performance and health reporting
Application support teams can consume Operations Manager as a service from the central Manageability team, which lets them focus on monitoring services from end to end rather than on managing agents. BUs with a dedicated environment use it for their business needs, including:
- Using retail management packs provided by Microsoft software product groups combined with customized dashboards to bring all monitoring into a single view, whether it’s the web console, Microsoft SharePoint, or directly in Operations Manager.
- Tracing end-to-end business transactions through an asynchronous system and using Application Performance Management (APM) to gain deep insight into application performance and exceptions.
- Using the new Global Service Monitor (GSM) solution with Operations Manager 2012 to run web-based business scenario tests globally. The built-in GSM feature provides operators with an at-a-glance global view of the health status of all monitored servers, and the APM feature gives Microsoft IT deep insight into application performance and exceptions.
Note: To learn more about how Microsoft IT monitors their environment using System Center Operations Manager, see How Microsoft Does IT: Monitoring Best Practices.
System Center Virtual Machine Manager
Over the past 5 to 10 years, the concept of cloud-based computing has been evolving and maturing at an incredibly rapid pace. Microsoft IT has grown its approach to adopting cloud computing in step with those progressions, and over time a cloud strategy has emerged which provides a decision framework for how investments are made. Within the framework, on-premises server virtualization has been the cornerstone for cloud-based computing. As time progresses, Microsoft IT continues to rely on server virtualization for optimized use of on-premises assets, and they will invest more and more in migration from on-premises virtualization to the cloud or in moving to fully built services such as Microsoft Office 365.
For Microsoft IT, the move from on-premises to the cloud is occurring in phases over many years. The journey started with establishing a centralized "compute utility," where business units could acquire virtualized servers from Microsoft IT. With the introduction of Office online services and then Microsoft Azure service offerings, the strategy grew. The end goal is to develop a solution and supporting manageability processes to allow the use of IaaS and PaaS cloud technologies in unison, while utilizing the most cost-effective option for the job. The following figure illustrates this.
Figure 1: Microsoft IT cloud transformation strategy.
The System Center suite supports this strategy through the Virtual Machine Manager (VMM) product. As stated earlier, the first step in the migration to the cloud was the shift from on-premises physical systems to virtualized systems. This enabled Microsoft IT to significantly reduce their capital expenditures (CAPEX) over time by achieving higher density. As early as FY11, this amounted to $304.62 per virtual machine (VM) in monthly cost avoidance for both hardware and power consumption, and another $329,000 million in cost avoidance through VM automation.
Starting in FY13, Microsoft IT needed to determine how best to manage their existing budget and how to get more value out of it in the future. Specifically, they have $200 million in CAPEX hardware that is due to expire. With the current economy and business shifts, they will not be allocated that same budget to replace that hardware. Migrating those workloads to the cloud is their best choice. They can increase the virtualization density to save money, but ultimately eliminating the cost of hardware and using cloud services provided by other groups within Microsoft is the best solution for managing their budget going forward.
In addition to optimization through choosing the right cloud provider, Microsoft IT continues to build on top of Virtual Machine Manager to automate the end-to-end process of creating VMs. They defined a framework: a series of discrete, autonomous steps that break down the build process. These steps are optimized with automation in mind. Because the steps are discrete, Microsoft IT can insert a checkpoint at any step and, should a failure occur, roll that step back to its beginning without having to roll back the entire process.
Another benefit of using this autonomous, stepped approach is that if they need to make improvements or version changes, they don’t have to modify the entire sequence to implement the improvement or change. By isolating each of the pieces, it becomes easier to manage and adapt. For example, "service packs and pre-requisite updates" is something they iterate on every quarter, whereas "initiate compliance scan" is something they defined just once and don’t need to touch again.
Figure 2 illustrates the set of discrete steps that has worked well for Microsoft IT. They recommend using a similar pattern for building your discrete, autonomous steps for provisioning VMs—whether you write them as custom scripts or as orchestrated workflows.
Figure 2. Microsoft IT build process autonomous steps.
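The pattern of discrete steps with per-step rollback can be sketched in a few lines. The example below is a simplified illustration, not Microsoft IT's implementation: the `Step` type, the `run_build` function, and the simulated failure are all hypothetical, and only the step names are taken from this case study.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable[[dict], None]       # performs the step, mutating build state
    rollback: Callable[[dict], None]  # undoes only this step


def run_build(steps, state, start_at=0):
    """Run steps in order; on failure, roll back only the failing step.

    Returns the index of the failing step (a checkpoint to resume from),
    or None when every step succeeded.
    """
    for index in range(start_at, len(steps)):
        step = steps[index]
        try:
            step.run(state)
        except Exception:
            step.rollback(state)
            return index
    return None


attempts = {"count": 0}


def install_os(state):
    state["os"] = "installed"


def undo_os(state):
    state.pop("os", None)


def apply_updates(state):
    attempts["count"] += 1
    state["updates"] = "partial"
    if attempts["count"] == 1:
        # Simulated one-time failure for the demonstration.
        raise RuntimeError("update source unavailable")
    state["updates"] = "applied"


def undo_updates(state):
    state.pop("updates", None)


steps = [
    Step("install OS image", install_os, undo_os),
    Step("service packs and pre-requisite updates", apply_updates, undo_updates),
]

state = {}
failed_at = run_build(steps, state)                    # second step fails once
state_after_failure = dict(state)                      # OS step is preserved
resumed = run_build(steps, state, start_at=failed_at)  # resume at the checkpoint
print(failed_at, state_after_failure, resumed, state)
```

Because each step owns its own rollback, a failure in the update step leaves the completed OS step intact, and the build resumes from the failed step rather than from the start.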
Automating the VM provisioning process enabled Microsoft IT to reduce their VM provisioning time by 102 minutes, an 85 percent reduction. In addition, the process enabled them to increase customer satisfaction as measured by On Time Delivery (OTD) KPI, which assumes build completeness and correctness. By automating VM provisioning, OTD increased from 42 percent to 94 percent, averaged over days to deliver.
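The provisioning figures above imply before and after build times that the text does not state directly. The following sketch derives them, assuming the 85 percent figure is exact rather than rounded. The results are a derivation for illustration, not numbers reported by Microsoft IT.

```python
def provisioning_times(minutes_saved: int, percent_reduction: int) -> tuple[int, int]:
    """Derive (before, after) provisioning minutes from the reported savings."""
    # before * percent_reduction / 100 == minutes_saved
    before, remainder = divmod(minutes_saved * 100, percent_reduction)
    if remainder:
        raise ValueError("inputs do not give a whole number of minutes")
    return before, before - minutes_saved


before, after = provisioning_times(102, 85)
print(before, after)  # 120 18
```

Under that assumption, provisioning went from roughly two hours to about 18 minutes.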
Virtual Machine Manager Value Proposition
Virtual Machine Manager enabled Microsoft IT to:
- Reduce costs: Capacity costs are down; 60 percent server virtualization resulted in $54 million in cost avoidance in FY12.
- Reduce build times and increase On Time Delivery: Automated VM builds mean builds are performed 85 percent faster and done right, resulting in more than 94 percent On Time Delivery.
System Center Orchestrator
Incident and Change Management are among the major cost drivers within IT operations, beyond physical capacity costs. They are also the areas of IT operations that have the greatest impact on net satisfaction (NSAT) for the organization. Change requests not executed in a timely fashion or that result in an undesirable outcome reduce NSAT. Likewise, a change that negatively impacts a server, requiring the owner to perform their Business Continuity and Disaster Recovery (BCDR) process, also has a negative impact on NSAT.
System Center 2012 Orchestrator is a workflow management solution for the data center and an ideal solution for Change Management teams. Change Management teams can put together a routine process where they understand what is driving their change volumes. The teams use that analysis to determine which areas they should automate, and then iterate through that process on an ongoing basis. They can then use Orchestrator as their platform to host that automation and have their change team trigger that automation when needed. The benefits to using Orchestrator are:
- Automating a manual routine increases efficiency and accuracy, resulting in cost avoidance.
- Automating workflows using the Orchestrator platform provides scalability, auditing, ability to retry the workflow if it fails, and so on.
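The retry and auditing behavior mentioned above is provided by the Orchestrator platform itself. The sketch below only illustrates the general idea in plain code, with a hypothetical `run_with_retry` helper and a simulated transient failure.

```python
def run_with_retry(workflow, max_attempts=3):
    """Run a workflow callable, retrying on failure and recording each attempt."""
    audit_log = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = workflow()
        except Exception as exc:
            audit_log.append((attempt, "failed", str(exc)))
        else:
            audit_log.append((attempt, "succeeded", ""))
            return result, audit_log
    raise RuntimeError(f"workflow failed after {max_attempts} attempts: {audit_log}")


calls = {"n": 0}


def flaky_change():
    """Simulated change routine that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "done"


result, audit_log = run_with_retry(flaky_change)
print(result)
for entry in audit_log:
    print(entry)
```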
Orchestrator enabled Microsoft IT to:
- Increase efficiency: High-volume Known Change Types (KCTs) are fully automated. Nine runbooks have been created for KCTs thus far, saving an average of close to 32 hours a month for top-volume change requests.
- Increase accuracy: By having the procedures captured into a runbook, the steps and their outcomes are consistent every time.
- Reduce delivery time: Server build times reduced. Complex blade builds went from 28 days to seven days.
To date, the primary areas of focus for workflow automation in Microsoft IT have been the following three process areas:
- Bare Metal Builds of blade servers
- Complex Patching Routines
- Top volume drivers for Change Requests
Bare Metal Builds
Orchestrator is used to automate the process of building out blade servers to Microsoft IT standards. The process includes installing and configuring the OS image, service packs and pre-requisite updates, network and storage configuration settings, and security settings. Each of these activities is included in a runbook, which contains the instructions for the automated task or process.
Complex Patching Routines
Configuration Manager is used to patch approximately 85 to 90 percent of the servers in the data center. But there is a small portion, 10 to 15 percent, for which Configuration Manager is not the best tool for the job, such as:
- Legacy systems with complex shutdown, start-up, or validation processes.
- Systems that are configured for high availability or load balancing and that require specific steps for failover or taking systems out of or putting them back into rotation.
- Systems running applications with complex interdependencies where the sequence of how the systems are taken offline must be orchestrated.
- Systems that must be patched outside the maintenance windows.
For these systems, Microsoft IT uses Orchestrator to automate the process by which they patch these systems. The benefit is that the process is automated and eliminates the need for a large team of technicians to manually install the patches—decreasing delivery times, reducing error rates, and reducing resource costs.
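For systems with interdependencies, the order in which servers are taken offline can be derived from a dependency map. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to do this. The system names and the `shutdown_order` function are hypothetical, and this is not how Orchestrator runbooks are authored.

```python
from graphlib import TopologicalSorter


def shutdown_order(depends_on):
    """Return the order in which to take systems offline.

    depends_on maps each system to the systems it depends on. Dependents
    are taken offline before the systems they rely on; bringing systems
    back up would use the reverse of this order.
    """
    startup = list(TopologicalSorter(depends_on).static_order())
    return list(reversed(startup))


# Hypothetical three-tier application.
depends_on = {"web": {"app"}, "app": {"db"}}
print(shutdown_order(depends_on))  # ['web', 'app', 'db']
```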
Note: System Center Configuration Manager and Orchestrator are used as complementary technologies for patching within Microsoft IT. Configuration Manager is best suited for high scale patch deployments, whereas Orchestrator is best suited for patch deployments that require coordination across many systems.
Change Management Process
The Global Change Operations team at Microsoft is going through Known Change Types (KCTs), change type by change type, building runbooks to automate those routines and prioritizing by ticket volume. After the runbooks are built, instead of manually executing the change routine, Operations can use a web form to trigger a runbook in the background that implements the change.
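Prioritizing by ticket volume amounts to counting tickets per change type and working down the list. A minimal sketch, with hypothetical change type names and a hypothetical `automation_backlog` function:

```python
from collections import Counter


def automation_backlog(tickets, top_n=3):
    """Rank change types by ticket volume.

    tickets: iterable of change-type names, one entry per ticket.
    """
    return Counter(tickets).most_common(top_n)


# Illustrative sample data only.
tickets = (
    ["DNS record update"] * 5
    + ["Service account password reset"] * 3
    + ["Firewall rule change"]
)
backlog = automation_backlog(tickets, top_n=2)
print(backlog)
```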
System Center features provide enterprises with efficiency and reduced costs now—and the foundation for additional value gains in the future. By utilizing System Center features Microsoft IT has realized the following benefits:
- Millions of dollars in CAPEX cost avoidance brought about by virtualization of 60 percent of data center systems. System Center Virtual Machine Manager has allowed Microsoft IT to reduce hardware costs through server virtualization and allows for faster server builds via automated VM builds.
- Maintenance of 95 percent compliance across tens of thousands of servers within 19 days of release of updates. System Center Configuration Manager enabled Microsoft IT to automate the deployment of updates to servers, to gather accurate reports that keep CMDB inventory up to date, and to report on changes to enable them to be proactive.
- Achievement of a 15-minute time-to-respond SLA for critical alerts across thousands of applications and tens of thousands of servers. System Center Operations Manager gave Microsoft IT a central console for identifying server and network alerts, with retail management packs providing the base monitoring for all servers, while still allowing them to customize business process monitoring to fit specific needs. Application Performance Monitoring enables template-based application monitoring at the code level. Global Service Monitor gives customers dashboards with custom outside-in views of services.
- Reduced time to completion from days to hours and, in some cases, minutes while also reducing the number of people involved in implementing common and well-defined change types. System Center Orchestrator provides the foundation for Microsoft IT’s workflow management by utilizing runbooks (the instructions for an automated task) to automate the creation, monitoring, and deployment of server builds and common change types that are performed by the Change Management team.
For More Information
For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at (800) 933-4750. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the World Wide Web, go to:
© 2014 Microsoft Corporation. All rights reserved. Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners. This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.