How Microsoft IT Centralized Disaster Recovery
Published: December 2011
The following content may no longer reflect Microsoft’s current position or infrastructure. This content should be viewed as reference documentation only, to inform IT business decisions within your own company or organization.
Microsoft builds disaster recovery (DR) into its applications and systems. Recently, Microsoft Information Technology (Microsoft IT) improved DR planning by consolidating and centralizing its DR efforts. The new plan focuses on the critical processes and applications, ensuring minimal downtime and data loss in the event of an outage.
Planning and preparation are essential to business success. However, because disasters are rare, DR planning may not be a priority for some companies. Instead, decision makers often choose to focus on business challenges where the return on their investment is clear and measurable.
What would happen if your business experienced a major power outage or if a virus infected your systems? In the event of an earthquake or flood, do you have plans for restoring critical processes and data? Does your company know what systems and applications are the most important? Do you have redundancies and backup systems in place? You can answer these questions by implementing a solid DR plan.
Why Did Microsoft Need a Consolidated Disaster Recovery Program?
For more than 20 years, Microsoft has been building DR into applications and systems. Many business units have been working with DR strategies and tools to help make sure that valuable data and processes are safe and can be recovered if an unplanned outage occurs.
However, until recently, there was limited central DR coordination and planning across teams. With more than 1,500 internal applications running, Microsoft IT had limited visibility into which applications had tested and documented DR plans. Microsoft IT needed to review every application to determine which had the right level of DR in place. Microsoft IT set out to implement a DR plan that would first identify which applications were critical, and then determine which of those applications required DR.
Microsoft IT encountered other challenges that pointed to the need for a consolidated, centralized DR program:
Increasing legal requirements, such as the Voluntary Private Sector Preparedness Program (PS-Prep) and other regulations currently in progress, have specific standards that must be documented.
Results from DR implementation interviews and surveys suggested that the understanding and definition of DR was inconsistent across business units.
Engineering and application teams deployed applications independently. Each team determined which applications would have DR, and how much, without applying a standard methodology.
New feature improvements and DR depended on the same funding sources. Some teams did not request DR funding, leading to inconsistent DR investment.
Reporting was limited, and independent verification of ongoing DR testing did not exist.
Identifying Critical Applications
When asked if their business processes and applications are critical, Microsoft business unit owners will likely say, "Yes, of course it's critical." However, Microsoft IT created a way to evaluate which processes are truly critical.
Microsoft IT starts with the premise that processes, not applications, are critical.
Before the beginning of each fiscal year, people who are connected to the different business groups within Microsoft (for example, operations and Windows) work together to conduct a business impact analysis (BIA). The purpose of the BIA is to examine business processes in an objective, fact-based manner in order to determine what problems an outage to each process would cause.
The BIA evaluates several business triggers, including financial loss, legal issues, customer impact, and employee impact. If an outage causes a process to hit one or more of these triggers within the first 72 hours, that process is considered critical. If an outage does not cause a process to hit any of the triggers within the first 72 hours, the process is not considered critical.
This evaluation of specific triggers identifies which systems and processes have the largest impact on the business if they fail. It also helps determine the sequence necessary for the restoration of processes.
After a process is determined to be critical, an evaluation of the applications underneath that process determines whether the application is critical.
The BIA not only evaluates whether processes hit the triggers within the first 72 hours, it also analyzes the time that the processes take to hit the triggers. Some critical processes may have shorter times before they hit the triggers.
The BIA uses two measurements:
Recovery Time Objective (RTO) measures how long a process can be down.
Recovery Point Objective (RPO) measures how much data loss can occur.
One example of a critical process might be closing the financial books. This process might depend on an application such as SAP. The following sections discuss this example in more detail.
Critical Trigger Example (Part 1)
Analysis confirms that the critical process of closing the financial books cannot be down for more than 12 hours. SAP is the application underneath the process, so SAP inherits the criticality.
Critical Trigger Example (Part 2)
It is possible that the process and the application will not have the same RTO or RPO. The process of closing the financial books has an RTO of 12 hours. However, the critical application might need an RTO of 4 hours so that other processes that depend on that application can also recover. Each step in the process depends on the step before it, and all the steps must complete in time to meet the overall RTO of 12 hours.
Critical Trigger Example (Part 3)
A secondary analysis of the critical application identifies any secondary applications associated with it. The finance department indicates that SAP is critical to its processes, but it may not know about SAP's other application dependencies.
The BIA must incorporate input from people who technically understand how the application works to verify any secondary application dependencies. In this example, SAP depends on Active Directory Domain Services, wide area network (WAN) technologies, and other applications. Those applications now inherit the criticality and may have an even tighter RTO or RPO because they must be working before SAP can work again.
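The inheritance rule described above can be modeled as propagation through a dependency graph: anything a critical application depends on must meet at least as tight an RTO. The following sketch is illustrative only; the graph, RTO values, and application names are hypothetical.

```python
# Illustrative propagation of criticality through application dependencies:
# each dependency inherits the tightest (smallest) RTO of anything that
# depends on it, since it must be up before its dependents can recover.

def propagate_rto(deps: dict[str, list[str]], rto: dict[str, float]) -> dict[str, float]:
    """deps maps an application to the applications it depends on;
    rto maps applications to their required recovery time in hours."""
    inherited = dict(rto)
    changed = True
    while changed:  # iterate until stable; fine for small graphs
        changed = False
        for app, children in deps.items():
            for child in children:
                bound = inherited.get(app)
                if bound is not None and inherited.get(child, float("inf")) > bound:
                    inherited[child] = bound
                    changed = True
    return inherited

# SAP needs a 4-hour RTO; Active Directory and the WAN inherit it.
deps = {"SAP": ["ActiveDirectory", "WAN"]}
print(propagate_rto(deps, {"SAP": 4}))
# → {'SAP': 4, 'ActiveDirectory': 4, 'WAN': 4}
```

In practice a dependency may need an even tighter RTO than this upper bound, as the text notes, because it must be restored before its dependents can begin recovery.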
After an application is determined to be critical, the application owner must either build DR into the application or get a high-level company executive to sign a risk acknowledgement form stating that he or she accepts the risks associated with not building DR for that critical application.
Implementing Disaster Recovery
Due to funding and software changes, Microsoft IT also decided on a multiyear DR strategy for some processes that have large ecosystems. As an example, one critical licensing process has an ecosystem with 39 applications that are linked and interdependent. Because of the substantial work required to build DR for these larger ecosystems, DR build-out may take several years to complete.
Additionally, each year, BIAs are refreshed. As the business evolves and grows, new processes are evaluated and critical applications are added to the DR plan.
Managing the Centralized Disaster Recovery Program
Funding for DR is managed centrally, and the necessary hardware is also purchased centrally.
To control budgets and meet tight deadlines, Microsoft IT built three standard platforms for DR: Hyper-V virtual machines for general systems, dedicated individual blade servers for midrange use, and high-end dedicated stand-alone Microsoft SQL servers.
To help determine which technology is the best fit for DR needs, Microsoft IT created a technology selection matrix based on the RTO and RPO for specific applications. Figure 1 illustrates the selection matrix. The selection matrix reinforces using operating system (OS), SQL, and application-layer technologies whenever possible due to their lower costs. It also shows how storage area network (SAN) replication technologies can be expensive and are not necessary for most applications.
The selection matrix also illustrates how expensive it is to expect an RTO or RPO of zero. When asked how much downtime or data loss is acceptable, business owners typically answer "None." By using the selection matrix, business owners can see that an RTO or RPO of zero is possible, but a new data center may have to be built with a capital expense of more than $40 million US. This usually resets expectations to a more realistic RTO/RPO.
SQL, OS, or app replication: capital expenditure ~ low
SAN replication hardware: capital expenditure ~ $200,000 to $400,000
SAN controller software: capital expenditure ~ $1 million to $10 million
New data center: capital expenditure ~ $40 million+
Figure 1. Technology selection matrix for disaster recovery
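The selection matrix can be thought of as a mapping from recovery targets to technology tiers. The thresholds below are invented for illustration; the actual matrix maps specific application RTO/RPO values to tiers.

```python
# Hypothetical encoding of the selection-matrix idea: tighter RTO/RPO
# targets push toward more expensive replication tiers. The hour
# thresholds are illustrative assumptions, not the values in Figure 1.

def select_dr_technology(rto_hours: float, rpo_hours: float) -> str:
    target = min(rto_hours, rpo_hours)  # the tighter objective governs
    if target == 0:
        return "New data center (~$40M+)"
    if target < 1:
        return "SAN controller software (~$1M to $10M)"
    if target < 4:
        return "SAN replication hardware (~$200K to $400K)"
    return "SQL, OS, or app replication (low cost)"

# An application with a 12-hour RTO and 8-hour RPO needs only the
# lowest-cost tier.
print(select_dr_technology(rto_hours=12, rpo_hours=8))
```

A mapping like this makes the cost conversation concrete: asking for zero downtime selects the most expensive row, which is how the matrix resets expectations.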
Tracking DR Program Progress
Microsoft IT established central reporting across the enterprise to the chief information officer (CIO). Central reporting tracks 10 milestones per application, each with specific dates: three milestones for planning, four milestones for deployment, and three milestones for ongoing production testing.
In reporting to the CIO, milestones appear as green or red depending on their status. The intent is to give the CIO visibility to obstacles in the process while driving the behavior of application owners to meet deadlines. In the first year, 60 applications were tracked via these milestones, providing 600 data points.
Microsoft IT also created standard templates for people who are building DR within Microsoft. The templates are used for DR planning, DR testing playbooks, and project management in implementing test DR plans. The templates require standardized information, and they can be customized to meet the specific needs of the person or team that is building DR.
Managing Ongoing DR Production Testing
An important part of the DR plan is ongoing testing. Tests prove that a failover of a process and associated applications meets established DR requirements: production-level functionality, capacity, and performance, in addition to established RTO and RPO.
There are four levels of testing maturity that increase over time:
Testing Maturity Level One (year 1 required). Simulated failover test of the application. Potentially a small subset of the application is tested to prove the functionality.
Testing Maturity Level Two (year 1 recommended, year 2 required). Full production failover test of the application.
Testing Maturity Level Three (year 2 recommended, year 3 required). Full production failover test of the application and any other applications that are critical dependencies. This becomes a test of a larger ecosystem.
Testing Maturity Level Four (year 3 recommended, year 4 required). Full integrated testing—a full production failover test of the application and all dependent applications connected to this process. The goal is to confirm that the entire process will meet the established DR requirements if an outage occurs.
Figure 2 illustrates how a new wave of applications is introduced to the testing plan each year and how each wave progresses through the testing maturity levels over time.
Figure 2. Schedule for testing plan
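The wave schedule follows a simple rule: a wave introduced in year N must reach maturity level one that year, level two the next, and so on up to level four. The sketch below assumes that rule from the level descriptions above; wave start years are hypothetical.

```python
# Sketch of the testing-maturity schedule: each wave must reach level N
# by its Nth year in the program, capped at level 4 (full integrated
# testing). Wave start years are illustrative.

def required_maturity(start_year: int, current_year: int) -> int:
    """Required testing maturity level for a wave in a given program year."""
    years_in_program = current_year - start_year + 1
    return max(1, min(4, years_in_program))

# A wave introduced in year 1, tracked across the first five years:
print([required_maturity(1, y) for y in range(1, 6)])  # [1, 2, 3, 4, 4]
```

Overlapping waves mean that in any given year, different application groups sit at different required levels, which is the pattern Figure 2 illustrates.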
Microsoft IT used these strategies to ensure the success of the consolidated DR plan:
Let the process, not the application, drive criticality. If an application does not map back to a critical process that hits one of the triggers within 72 hours, the application is not critical.
Use a BIA to manage affected processes. By calculating the number of hours before processes hit criticality triggers, it is possible to determine which processes are critical and which applications need DR.
Know the testing maturity level and the last date that the application was tested. This knowledge provides confidence that a failover to the DR site will be successful and that the established DR requirements of functionality, performance, and capacity will be met.
Microsoft IT recognized these key benefits from the consolidated DR plan:
Improved CIO visibility to obstacles in the process while driving the behavior of application owners to meet deadlines.
Improved operations by standardizing the DR process through three standard platforms, standardized templates for anyone building DR within Microsoft, and a technology selection matrix that drives better DR technology choices.
Improved process management: BIA confirmation that critical processes and associated applications meet the established RTO and RPO.
A small project team is responsible for the overall DR program at Microsoft. It makes technology recommendations, provides central reporting, and operates as an independent witness to testing applications and processes. The DR program works because Microsoft IT consolidated and centralized the DR plan and defined objective standards to determine which processes are critical to the company. By using a collaborative model, Microsoft IT helps business units implement the right DR for their processes and influences the opinions of business leaders. This approach has increased the value of disaster recovery for Microsoft.
For More Information
For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (800) 563-9048. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information through the World Wide Web, go to:
© 2011 Microsoft Corporation. All rights reserved.
This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft, Active Directory, Hyper-V, and SQL Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.