Six Sigma in the datacenter drives a zero-defects culture

Article
09/14/2018

Technical Case Study

February 2016

Microsoft constantly strives to improve reliability and availability in its worldwide datacenter network. Microsoft IT developed an innovative defect reduction program that applies ITIL Problem Management and Six Sigma methodologies to datacenter operations and assets for the first time. Internal hosting consumers now use a greatly standardized and simplified environment. Supported by a robust data-driven framework, Microsoft IT professionals strive to eradicate defects in the datacenter and systematically reduce IT infrastructure failures.

Situation. Like many IT organizations, Microsoft IT wants to keep its global infrastructure available at all times. Scope, scale, and an environment where production code is frequently disrupted by software builds all contribute to the challenge of providing complete availability and reliability.
Solution. For the first time, Microsoft IT successfully applied Six Sigma methodologies to global datacenter operations. Server defects are now systematically identified and eradicated. The organization standardized its platform, built a robust BI system to identify and publicize defects at all levels of the organization, and empowered its staff to proactively address defects on an ongoing basis.
Benefits. Microsoft made a strong commitment to using statistical methods and data-driven decisions to drive down defects in its worldwide server infrastructure. The defect eradication framework has clearly improved availability, productivity, performance, and security. Other groups within the enterprise can also leverage this framework.

Situation

Worldwide, Microsoft IT manages more than 40,000 datacenter servers. Servers are spread across datacenters and subsidiaries in the Americas, Europe, Middle East, and Asia Pacific regions. The IT infrastructure is a critical component of the overall business: Microsoft servers run thousands of line-of-business applications and process $40B in sales annually.

Like other large enterprises that manage global datacenters, Microsoft finds it a challenge to keep its IT infrastructure available at all times. Interruptions in service or outright server failures disrupt business in many ways. Software development can slow, system failures can occur during critical business periods, and disruptions can potentially contribute to revenue losses.

Availability and reliability are even more at risk at a software and services company like Microsoft, where server availability can be easily disrupted by software changes, trials during incident troubleshooting, infrastructure changes, and so on.

The following figure shows the scale and complexity of the Microsoft IT server platform.

Figure 1. Server platform operations

Problem management framework

Microsoft was already leveraging the Information Technology Infrastructure Library (ITIL) Problem Management discipline in its datacenters, and had successfully improved datacenter processes using Six Sigma continuous improvement methodologies. Microsoft has taken Six Sigma beyond process improvement and applied it to datacenter assets and operations. At the same time, Microsoft did not want to lose its significant investment in ITIL Problem Management.

Microsoft chose to apply Six Sigma methodologies to its ITIL Problem Management practices, and established a comprehensive framework for defect management. The goal was to reduce defects, enabling the infrastructure to be more consistent, reliable, and stable worldwide. By applying Six Sigma methodologies to its ITIL Problem Management practices, Microsoft sought to create a zero defects culture, empowering its IT managers and server owners to consistently focus on proactive defect eradication.

Organization support

The IT organization knew that to successfully make such a significant investment, it would need to create both a comprehensive framework and a culture change. This meant that it had to get support at all levels of the organization, including executive management and its hosting service consumers.

However, the team knew that until it had hard data, it would be a challenge to get executive buy-in on the project and create internal support from stakeholders around the world. Therefore, Microsoft IT correlated server incidents with deviations from configuration standards, also referred to as defects, for 90 days. The results clearly showed that servers with fewer defects had fewer failures; servers with more defects had more failures. With the data in hand, the organization obtained the necessary approvals to move the project forward.
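
The kind of correlation described above can be illustrated with a small sketch. The case study does not say which statistic the team used, so the Pearson coefficient here is an assumption, and the per-server counts are invented purely for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-server counts over a 90-day window (not real data).
defects_per_server = [0, 1, 2, 4, 6, 9]
incidents_per_server = [0, 1, 1, 3, 4, 7]

r = pearson_r(defects_per_server, incidents_per_server)
print(f"Pearson r = {r:.2f}")
```

A coefficient close to 1 would be consistent with the finding that servers with more defects have more failures, although correlation alone does not establish cause.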

Solution

Microsoft approached identifying and eradicating defects by phasing in improvements gradually, accommodating the complex hosting environment. The project had three distinct phases. At a high level, they were:

Establish platform standards. Microsoft established platform configuration standards, reduced environment complexity across the enterprise by driving compliance to standards, and developed a comprehensive configuration baseline.
Measure and publish. With an established baseline in place, configuration gaps (defects) were identified. Robust data collection and business intelligence (BI) systems made defects visible to all levels of the IT organization, driving adoption.
Eradicate Defects. Defects with the greatest potential to cause failure were prioritized and remediated. Managers and server owners were empowered to understand the impact of defects in their infrastructure portfolio, and were given tools to remediate them.

The phases were implemented sequentially, over about five years. The scope and scale of the project were large, touching five global datacenters, almost 300 virtual branch offices, hundreds of storage switches and arrays, and thousands of Microsoft SQL Server instances.

The project spanned all ITIL processes, from build, to run, to support, and problem management. In the process, Microsoft moved from 2.5 Sigma to 4.0 Sigma, which reflects a significant decrease in defects, and which resulted in greater infrastructure availability and fewer incidents. In 2013, the DPMO (Defects per Million Opportunities) defect eradication program was launched across Microsoft IT.
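
To give a sense of scale for the move from 2.5 Sigma to 4.0 Sigma, the sketch below converts between a sigma level and DPMO. It assumes the conventional 1.5-sigma long-term shift that Six Sigma tables commonly use; the case study does not state which convention Microsoft IT applied, and the function names are illustrative.

```python
from statistics import NormalDist

SHIFT = 1.5  # assumed long-term shift; not stated in the case study

def dpmo_from_sigma(sigma_level, shift=SHIFT):
    """Defects per million opportunities implied by a sigma level."""
    return (1 - NormalDist().cdf(sigma_level - shift)) * 1_000_000

def sigma_from_dpmo(dpmo, shift=SHIFT):
    """Sigma level implied by a DPMO value (inverse of the above)."""
    return NormalDist().inv_cdf(1 - dpmo / 1_000_000) + shift

for level in (2.5, 4.0):
    print(f"{level} sigma is about {dpmo_from_sigma(level):,.0f} DPMO")
```

Under that assumed convention, 2.5 Sigma corresponds to roughly 158,655 DPMO and 4.0 Sigma to roughly 6,210 DPMO.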

The following sections explain in more detail how Microsoft developed its system to identify, accurately measure, and proactively correct defects in order to improve IT infrastructure availability and reliability.

Establish platform standards

Microsoft first needed to reduce complexity and standardize its datacenter platform. The goal was to work within a contained set of configuration standards, with the understanding that the latest versions of software and hardware are generally more stable than their predecessors. Microsoft also adopted the same product support lifecycle as external customers who receive product support from Microsoft.

In driving comprehensive adoption to the most current hardware and software standards, the organization accomplished two sets of goals. First, it established consistent and updated configuration standards for every server to adhere to, ensured that key stakeholders understood the standards, and drove compliance to them. Second, it created an effective baseline for defect measurement.

Detailed server platform configuration standards were defined with precise business impact mapping. Building on the thorough platform knowledge within the problem management discipline, a virtual team of subject matter experts with deep technical expertise contributed to this aspect of the initiative. Microsoft also leveraged a wide range of other sources for robust input. These include industry best practices, the results of root cause analysis from past major or high priority incidents, critical server failure event data, deep analysis of the infrastructure, product group recommendations, and customer engagement data. With this information, Microsoft was able to both define the standards, and map specific standards’ non-compliance to a definite set of infrastructure failures.

The areas where detailed server platform configuration standards were defined include:

Microsoft operating systems
Microsoft SQL Server and database management
Storage area network (SAN) storage
Clustering technologies
Hyper-V layer
Hardware models
Firmware
Third-party applications
Miscellaneous software packages

To help create a deep level of infrastructure accountability, this foundational phase also included Microsoft mapping servers to specific owners and teams within the organization.

Measure and publish

With consistent datacenter platforms and a baseline in place, Microsoft next leveraged System Center Configuration Manager (SCCM), System Center Operations Manager (SCOM), and an in-house tool to collect server data. At regular intervals, large amounts of data were collected to validate compliance to the configuration standards.

Microsoft created a robust BI system to analyze and report on the data. Compliance performance against the configuration standards was published and visible up to the CIO level. Role-specific views of reports were created, to support the needs for both long-term and immediate decision making. The data transparency alone gave stakeholders an incentive to invest in their server footprint; after data was published, close to 20 percent of defects were proactively remediated.

Collect data

SCCM and SCOM collected inventory and configuration data from servers, while monitoring them for availability. Additional configuration data was collected by the in-house tool. The daily and weekly data collection processes mine infrastructure assets without affecting their functionality.

Customized Windows Management Instrumentation requests, registry calls, and customized SQL Server queries captured additional configuration data. Results from the configuration scans were then compared to the platform configuration standards.
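
Comparing scan results to the standards can be pictured as a lookup of each collected setting against its required value. The following is a minimal sketch only: the setting names, values, and the `find_defects` helper are hypothetical and do not reflect the in-house tool or the actual standards.

```python
# Hypothetical platform standard: setting name -> required value.
STANDARD = {
    "os_version": "Windows Server 2012 R2",
    "firmware": "2.4.1",
    "antivirus_enabled": True,
}

def find_defects(scan, standard=STANDARD):
    """Return (setting, expected, actual) for every deviation from the standard.

    A setting that is missing from the scan is reported with actual=None.
    """
    return [
        (setting, expected, scan.get(setting))
        for setting, expected in standard.items()
        if scan.get(setting) != expected
    ]

# Hypothetical scan result for one server.
scan_result = {"os_version": "Windows Server 2008 R2", "firmware": "2.4.1"}

for setting, expected, actual in find_defects(scan_result):
    print(f"{setting}: expected {expected!r}, found {actual!r}")
```

Treating a missing setting as a deviation is a design choice in this sketch; a real collector might distinguish "not collected" from "non-compliant".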

Establish a BI system

Embracing a zero defects culture required a well-defined BI system. Microsoft developed a robust BI system for data analysis and reporting that allowed Microsoft IT and business groups to focus on the specific goal of reducing the number of server defects in production.

Microsoft needed to empower its stakeholders. Stakeholders needed to both understand the impact of defects in their infrastructure, and have the tools to remediate defects. The Server Deployment and Operations BI team built a series of self-service reports that help anyone in Microsoft IT manage defects in their server portfolio.

In the reports, ownership data was coupled with each server’s defects, the potential severity per defect, and the configuration remediation steps per defect. With this rollup, each IT organization could see a comprehensive representation of the defects in their own server footprint. Figure 2 shows the data available to the IT organizations. The red lines represent defects within a specific server, called out with their descriptions, priority, and remediation requirements. DPMO values are calculated from the underlying count of defects and servers.

Figure 2. Reporting structure
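
As a rough illustration of the rollup described above, the sketch below groups defect records by owning organization and counts them by priority. The record layout, organization names, and server names are all hypothetical; the actual report schema is not described in the case study.

```python
from collections import defaultdict

# Hypothetical records: (owner, server, priority, description, remediation).
records = [
    ("Finance IT", "srv-001", "P1", "Firmware below standard", "Update firmware"),
    ("Finance IT", "srv-001", "P2", "Unsupported driver", "Install approved driver"),
    ("Finance IT", "srv-002", "P3", "Legacy agent installed", "Remove agent"),
    ("HR IT", "srv-101", "P1", "OS out of support", "Upgrade OS"),
]

def rollup_by_owner(rows):
    """Count defects per owner, broken down by priority."""
    summary = defaultdict(lambda: defaultdict(int))
    for owner, _server, priority, _description, _remediation in rows:
        summary[owner][priority] += 1
    return {owner: dict(counts) for owner, counts in summary.items()}

for owner, counts in rollup_by_owner(records).items():
    print(owner, counts)
```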

Provide specific data views

Microsoft needed to provide data views for a variety of business functions, such as leadership and engineering. A CIO usually has different information needs than a network engineer. This differentiated approach supports data-driven decisions, regardless of where in an organization hierarchy the defects occur.

Microsoft created a leadership and manager view to provide insights on the health of the production environment for an extended period. DPMO trends for a 16-week period allowed this layer of the organization to drive decisions and priorities that are tied to critical business schedules.

Figure 3. Role-specific views

For engineers and IT Pros, flexible, self-service reports allow server owners to quickly find defects in their servers, to easily understand issues, and, most importantly, to take necessary remediation steps.

Eradicate Defects

In the final phase of the project, infrastructure risk was assessed, and then the most dangerous defects were identified, prioritized, and addressed. The team incorporated input from root cause analysis in order to assess and prioritize infrastructure risk. By analyzing existing problem records, Microsoft could quantify infrastructure components that could eventually fail, causing service disruptions to customers. After each remediation effort, the teams were empowered to perform simple hypothesis tests, which consistently demonstrated a statistically significant reduction in infrastructure failures (p-value < 0.05).

Prioritize risk

Microsoft used the Six Sigma Risk Priority Number (RPN) index to prioritize risk. The higher the index, the higher the priority. Configuration deviations that cause severe business impact are categorized as priority 1 (P1) defects. The rest are categorized as priority 2 and 3, based on their potential impact to underlying services. Using the RPN calculation, Microsoft categorized all defects with a score of 200 or more as P1.

The following figure shows how risk criteria were prioritized.

Figure 4. Risk Prioritization

The RPN calculation took into account potential impact to the business, likelihood of failure, and the ability to detect the risk. Finalized standards with appropriate risk categorization were then ported to a SQL Server database, enabling a data collection engine to collectively analyze and assign defect priority throughout the infrastructure.
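
The three factors named above match the usual FMEA form of the Risk Priority Number, in which severity, occurrence, and detection are each rated and then multiplied. The sketch below assumes 1 to 10 rating scales, which the case study does not confirm. The P1 threshold of 200 comes from the text; the P2 cut-off is a hypothetical placeholder because the source gives none.

```python
P1_THRESHOLD = 200  # from the case study: a score of 200 or more is P1
P2_THRESHOLD = 100  # hypothetical cut-off; the source does not specify one

def rpn(severity, occurrence, detection):
    """Risk Priority Number as the product of three 1-10 ratings (assumed scale)."""
    for rating in (severity, occurrence, detection):
        if not 1 <= rating <= 10:
            raise ValueError("ratings must be between 1 and 10")
    return severity * occurrence * detection

def priority(score):
    """Map an RPN score to a defect priority label."""
    if score >= P1_THRESHOLD:
        return "P1"
    if score >= P2_THRESHOLD:
        return "P2"
    return "P3"

score = rpn(severity=8, occurrence=5, detection=6)
print(score, priority(score))
```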

Apply Six Sigma

Once the business understood the areas of opportunity, the IT project team worked with the appropriate business teams to apply the Six Sigma DPMO framework. The DPMO framework was applied to P1 defects across the environment. The teams worked together to execute continuous defect eradication initiatives. As mentioned earlier, after defects began reporting up through the organization, 20 percent of defects were proactively remediated. After the business and IT teams started working together, another 40 percent of defects were resolved within a year.

The DPMO effort began in 2012, when a DPMO Proof of Concept was published. The DPMO components include:

Defects. Deviations from approved standards or accepted configurations.
Opportunities. Measurable service components that can deviate from agreed service levels and impact customer experience.

Once defects and opportunities are known, the DPMO ratio can be calculated. DPMO also can be used to calculate process efficiency and effectiveness. Once the defects, opportunities, and DPMO were measured through the data collection process, the team was ready to derive business insight and introduce a zero defects culture approach to IT infrastructure management.
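
The DPMO ratio follows the standard Six Sigma definition: defects divided by total opportunities, scaled to one million. The sketch below assumes every server is checked against the same number of configuration standards; the counts are hypothetical and the function name is illustrative.

```python
def dpmo(defects, units, opportunities_per_unit):
    """Defects per million opportunities.

    units: number of servers measured.
    opportunities_per_unit: configuration checks applied to each server
    (assumed to be the same for every server in this sketch).
    """
    total_opportunities = units * opportunities_per_unit
    if total_opportunities <= 0:
        raise ValueError("there must be at least one opportunity")
    return defects * 1_000_000 / total_opportunities

# Hypothetical example: 500 defects across 1,000 servers with 20 checks each.
print(dpmo(defects=500, units=1_000, opportunities_per_unit=20))
```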

The following figure depicts an overview of the defect reduction framework.

Figure 5. Framework overview

Benefits

By standardizing its server platform, and then measuring and eradicating defects in its IT infrastructure, Microsoft has created a more stable server platform, improved process efficacy, and reduced enterprise risk. Adopting a systematic, data-driven action framework that focuses on measurable and quantifiable results drove process improvements and real business impact. For example:

Before the close of each quarter, 40 percent of P1 defects are eradicated on servers hosting revenue-generating applications, revenue-processing applications, or both.
Publishing DPMO and Six Sigma RPN values in the CIO scorecard enabled the organization to improve the DPMO score by 20 percent, without intervention from a project team.
During the first year of implementation, Microsoft increased the number of remediated defects by 40 percent, without investing in additional resources.
Overall, mean time between sequential failures has gone from 18 days to 125 days.
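
A metric such as mean time between sequential failures can be derived from a list of failure dates. The sketch below shows one way to compute it; the dates are invented, and the case study does not describe how Microsoft IT calculated its figure.

```python
from datetime import date

def mean_days_between_failures(failure_dates):
    """Average gap, in days, between consecutive failures."""
    ordered = sorted(failure_dates)
    if len(ordered) < 2:
        raise ValueError("need at least two failures to measure a gap")
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical failure dates for one group of servers.
failures = [date(2015, 1, 1), date(2015, 1, 19), date(2015, 2, 6)]
print(mean_days_between_failures(failures))
```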

A strong commitment to using statistical methods to make data-driven decisions helped Microsoft drive significant business impact. As of publication, Microsoft IT servers fail six times less often, compared to three years prior.

Microsoft IT successfully applied the Six Sigma framework to datacenter operations for the first time. By comparing server defects and ticket-to-asset ratios, the organization determined that servers with large numbers of defects are more likely to have incidents, and may cause corresponding downtime for hosting customers. Merging Six Sigma methodology with Problem Management practices has provided a continuous improvement framework that Microsoft uses to drive down defects on an ongoing basis.

Reliability improvements

In just about 18 months, Microsoft IT servers reduced their ticket-to-asset ratio, compared to earlier servers with a comparable number of defects. For the consumers of the Microsoft IT hosting infrastructure, this means a platform with fewer issues overall.

The following figures show 2014 data compared to 2015 data.

Figure 6. Correlation of risk to ticket volume per asset, 2014

Figure 7. Correlation of risk to ticket volume per asset, 2015

Availability increased

Defect reduction has directly affected availability. The figure below represents the behavior of the environment quarter over quarter after defect remediation projects were implemented at different areas of the infrastructure. In every quarter, the failure rate after remediation consistently decreased by over 50 percent, regardless of the number of servers remediated.

Figure 8. Failure to asset ratio over 90 days

After each project, Microsoft saw a substantial drop in the number of unexpected failures per server. Failures take the server down for a considerable amount of time, which directly affects application availability for customers, in turn disrupts overall end-to-end business processes, and, in extreme scenarios, causes revenue loss.

Every quarter, Microsoft IT performed a simple hypothesis test to show the statistical significance in the reduction of infrastructure failures. The test results below, performed on three data sets from three consecutive quarters, show that there is a significant difference in failure rates before and after the remediation efforts.

Figure 9. Change in failure rates before and after remediation
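
The case study does not name the hypothesis test that was used. One common choice for comparing failure rates before and after a change is a one-sided two-proportion z-test, sketched below with invented counts. The function name and the inputs are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(fail_before, n_before, fail_after, n_after):
    """One-sided z-test that the failure rate dropped after remediation.

    Returns (z, p_value). The null hypothesis is that both rates are equal.
    """
    rate_before = fail_before / n_before
    rate_after = fail_after / n_after
    pooled = (fail_before + fail_after) / (n_before + n_after)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    z = (rate_before - rate_after) / std_err
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Hypothetical quarter: 60 of 500 servers failed before, 25 of 500 after.
z, p_value = two_proportion_z_test(60, 500, 25, 500)
print(f"z = {z:.2f}, significant at 0.05: {p_value < 0.05}")
```

A p-value below 0.05 would lead to rejecting the null hypothesis of equal failure rates, which is the threshold the case study cites.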

Portability

The defect eradication framework can be applied wherever a deviation can be measured. At time of publication, Microsoft IT is collaborating with other groups to adapt the framework for their use. For example, the Microsoft Azure product team is considering leveraging the framework to improve availability and reliability of the Azure Infrastructure as a Service (IaaS) offering. Microsoft IT is also partnering with the System Center product group to enable seamless integration with their product. Internally, Microsoft IT has begun using this framework to improve reliability of its network infrastructure and reduce infrastructure security risks.

Conclusion

For the first time, Microsoft IT moved beyond applying Six Sigma to its datacenter processes. The organization successfully adopted the ITIL Problem Management discipline and Six Sigma methodologies to address operations defects in its hosting environment. Microsoft IT managers are now empowered with a defect eradication framework that results in minimal infrastructure risks and greater availability worldwide.

Efforts to deploy structured defect eradication projects across the Microsoft IT environment were very effective for three reasons: compliance with configuration standards was significant, the risk to the infrastructure was well understood across the business, and a robust and accurate set of data was readily available for consumption.

Microsoft can now save time for consumers of its hosting infrastructure environment by improving availability, customer productivity, and satisfaction. Microsoft IT hosting customers can confidently rely on a service that is ready and able to deliver when they need it.

For more information

Microsoft IT

www.Microsoft.com/ITShowcase

For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at (800) 933-4750. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the web, go to:

www.Microsoft.com

© 2016 Microsoft Corporation. All rights reserved. Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners. This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.