Disaster Recovery and Business Continuity Planning in Action: Japan 2011
Published: July 2011
The Japan earthquake in March 2011 presented one of the greatest disaster-recovery challenges that Microsoft has ever faced. Contingency planning, combined with process, organization, and technology strategies, enabled Microsoft to quickly confirm the safety of all staff and restore business operations to a working state over a weekend. The lessons learned from this event can help Microsoft further improve the model and can help customers and partners plan for and cope with similar disasters.
Intended audience: Disaster-recovery administrators, organization leadership, and IT managers.
On Friday, March 11, 2011, a major (magnitude 9.0) earthquake and subsequent tsunami destroyed 400 kilometers of coastline of northeastern Japan. These natural events triggered accidents at the Fukushima and other nuclear power plants. They also caused a power shortage that affected 14 million households in the Kanto region, where Tokyo, Japan’s political and economic center, is located. Reduced public transport services led the Microsoft Shinagawa headquarters office to recommend that its employees work from home for one week as a precautionary measure. The effect was an interruption of Microsoft business activities across the entire Asian Pacific Japan (APJ) region.
The Microsoft corporate network in the APJ region decreased to 6 percent of its normal capacity. By Monday, March 14, all services were available at near-normal network capacity, with minimal impact on IT services. Figure 1 shows the significant milestones in the recovery effort.
Figure 1. Japan Earthquake Incident Timeline
The goal of the Microsoft Information Technology (Microsoft IT) team’s plan for disaster recovery and business continuity is to recover business systems as quickly and efficiently as possible. This is more than just a technology issue. Processes, technology, communications, and contingency planning all must be aligned to do this effectively.
The Japan earthquake demonstrates the need for corporations to have the ability to respond at a moment’s notice when unexpected events occur. At Microsoft, a number of interrelated programs and technologies enabled this to happen. For example, recent trends toward a more mobile working environment at Microsoft mean that employees are not tied to central office locations. This is safer for employees, and it aids rapid recovery in the event of a disaster.
Microsoft IT Response
Microsoft IT is organized through a shared-services model: Key technology experts in every region offer 24-hour, seven-day-a-week customer support, which makes it easier to react to unexpected events. Combining technology such as Microsoft® Lync™ 2010 communications software, the Windows® 7 operating system, DirectAccess, and virtualization has created an agile environment where Microsoft can deal with crises while maintaining an underlying principle of putting employee safety first. Within this structure, Microsoft IT identified the following areas as the main priorities:
Alignment with the local business
Disaster recovery and crisis management
Business continuity planning (BCP) and execution
Customer and partner support (community)
Priorities and Disaster Recovery
Immediately following the earthquake on March 11, Microsoft senior management in Japan established a daily disaster triage meeting. Key support teams, such as Microsoft IT and Real Estate and Facilities, mirrored this effort and conducted triage meetings. Microsoft organized teams and job functions at a detailed level to establish the extent of the disaster recovery effort that was needed.
Teams initially focused on confirming the safety of employees and their families. They accomplished this task in only a few hours. The next priority was to investigate the status of the data center and the branch offices closest to the earthquake's epicenter. Microsoft has two branch offices in northeastern Japan: the Takasaki and Sendai offices. Figure 2 displays Microsoft Japan office locations except Takasaki and Sendai.
Figure 2. Microsoft Japan Office Locations
During the weekend, the Microsoft Japan Leadership Team formed a subsidiary-wide Incident Management Team (IMT). Before Monday, the IMT had communicated the following priorities to all employees:
Confirmation of the safety of employees and their families
Actions and procedures for reacting to a limited power supply in Microsoft offices
Measures on how to help customers and partners
Measures for helping the government, nongovernmental organizations (NGOs), and citizenship activities
Communications (external and internal)
The IMT then enacted measures to address short-term safety concerns while disaster recovery and BCP proceeded. These measures included relocating staff, extending the enterprise voice (EV) network to staff and partner organizations, and relocating key Customer Service and Support (CSS) teams to offer continued service to customers.
As the situation at the Fukushima nuclear plant worsened over the weekend, Microsoft headquarters activated a Global Security Team (GST) to plan for the possible evacuation of Microsoft buildings and employees' families in the affected area. The IMT also created a subgroup to focus specifically on nuclear issues and work closely with the GST. A Microsoft IT business relationship manager became a member of both teams and acted as the single point of contact for all IT-related activities.
One key advantage that Microsoft had in the period immediately after the earthquake was the ability of staff to communicate via the Lync service. During this crucial period, most of the telecommunication services in Japan were offline for several hours because of infrastructure problems. This impact was particularly significant for the cellular network, which saw damage to thousands of base stations in northern Japan, followed almost immediately by a spike in call volume as people tried to check the safety of their families. Microsoft staff promptly shared all information about the welfare of fellow employees and their families with the Emergency Office formed by the president of Microsoft Japan.
Immediate Business Priorities
The CSS teams were critical to the recovery effort of both Microsoft and Microsoft customers. A safe, stable location with reliable network connectivity had to be identified for these teams. Microsoft IT’s biggest challenge in this regard was to create a working network as quickly as possible.
The earthquake occurred late in the business day on March 11. Microsoft locations within Japan maintained network connectivity. However, undersea cable networks that connected the region to the rest of the world sustained damage during the subsequent tsunami and aftershocks. The Microsoft IT network team working in Redmond (Washington, United States), Hyderabad (India), London (United Kingdom), and Singapore learned that three undersea cables that connect the Microsoft wide area network (WAN) backbone in the APJ region were affected.
Carriers that provide bandwidth to the Microsoft corporate network were no longer able to provide service across the damaged cables. Under normal circumstances, the network links between hub sites in Japan, Singapore, and the United States provide adequate capacity to run the Microsoft business in the APJ region. Shortly after the earthquake, the network decreased to a single network connection between Sydney (Australia) and the United States.
The severe reduction in normal WAN capacity presented significant business challenges, particularly for the following business units and services:
The India and China CSS teams
The China and India R&D business
The regional Product Activation business
Recovering connectivity to key business systems for Japan, the Republic of Korea, Taiwan, and other sites with systems located at the Tokyo data center was also a high priority.
Figure 3 displays network links and status after the earthquake.
Figure 3. Damage to Network Links
Networks and Data Center Management
A crucial component to recovery was an assessment of whether other carriers or telecommunications service providers could quickly provision network capacity. Within the first few hours of the crisis, only the Sydney-to-U.S. link was carrying traffic between the APJ region and the United States. Because regional carriers were struggling to recover capacity, the Microsoft IT network team contacted a related department within Microsoft—Global Foundation Services (GFS). GFS, the network arm of the Microsoft Online Services business unit, still had available capacity on its network backbone between Singapore and its network hub in Silicon Valley, United States.
Microsoft Global Network Operations Center (GNOC) management, in partnership with GFS, quickly extended the Microsoft communications network by adding two temporary links within the first 24 hours. Although much of this was standby capacity, it was enough to enable business activity and recovery to continue without significant impact on other subsidiaries on the APJ network while full service was restored on the damaged cables.
The additional links that the GFS team supplied also provided the shared-services teams enough time over the weekend to identify which key services and applications were at risk in the Kawaguchi data center. GFS then established contingencies in case the facility went offline. This secured the immediate business needs while failover measures were established to cover the possibility of long-term power outages. All this enabled normal business to resume quickly, with the emphasis on customer and community support in Japan.
In addition to the network management efforts, the Data Center Management team implemented the disaster recovery guidelines to ensure that 48-hour backup generator capacity was available, with enough fuel to enable continuous running for up to seven days. This was an essential component, given that blackouts were inevitable because of the impact on Japan’s electricity-generation capacity. These measures, coupled with power-saving procedures at Microsoft, helped alleviate the impact on Japan’s power grid.
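The relationship between generator fuel reserves and running time described above can be sketched as simple arithmetic. The 48-hour and seven-day targets come from the text; the fuel burn rate and the function names are hypothetical values chosen only for illustration, not figures from the Microsoft data center.

```python
HOURS_PER_DAY = 24


def runtime_hours(fuel_liters: float, burn_rate_lph: float) -> float:
    """Return how many hours a generator can run on the given fuel."""
    if burn_rate_lph <= 0:
        raise ValueError("burn rate must be positive")
    return fuel_liters / burn_rate_lph


def fuel_needed_liters(target_hours: float, burn_rate_lph: float) -> float:
    """Return the fuel required to run continuously for target_hours."""
    if burn_rate_lph <= 0:
        raise ValueError("burn rate must be positive")
    return target_hours * burn_rate_lph


# Hypothetical burn rate in liters per hour (an assumption, not source data).
ASSUMED_BURN_RATE_LPH = 100.0

# Targets taken from the text: 48 hours of capacity, up to seven days of fuel.
fuel_for_48_hours = fuel_needed_liters(48, ASSUMED_BURN_RATE_LPH)
fuel_for_7_days = fuel_needed_liters(7 * HOURS_PER_DAY, ASSUMED_BURN_RATE_LPH)
```

With any real burn rate substituted for the assumed one, the same two functions show how much extra fuel separates a 48-hour reserve from a seven-day reserve.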
In evaluating BCP impact, reports indicated that the Sendai branch office was not the only Microsoft office affected by the disaster. Rolling blackouts planned by the Tokyo Electric Power Company (TEPCO) also had a major bearing on business activity. Microsoft also has offices in Chofu City in Tokyo and Takasaki City in Gunma, both of which were in the areas targeted for power outages. The critical situation at the Fukushima nuclear power plant meant that 10 percent of Microsoft employees needed to evacuate the Kanto area.
Through close communication with the Real Estate and Facilities team, as well as other teams affected by rolling blackouts and evacuations, Microsoft IT took action to support business continuity in the following areas.
BCP for Microsoft Facilities
A key to supporting Microsoft office facilities was managing communications with the teams affected by the rolling blackout. These teams included:
The Power System Management team for office building support.
Facilities for power management of Microsoft IT–managed tools and applications.
The security team that monitored Microsoft IT services from outside Japan.
Microsoft IT and the Real Estate and Facilities teams worked closely to create a communication plan for all employees and minimize the risk for IT infrastructure and business processes. It was particularly important to manage time slots when electricity was shut off. Microsoft IT consolidated the maintenance schedule during this period to facilitate smooth communications with all teams involved.
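One way to picture the consolidation of blackout time slots is as a merge of overlapping time windows into a single schedule. The following is a minimal sketch under stated assumptions: the function name, the data layout, and the sample times are all hypothetical, and the source does not describe how Microsoft IT actually built its schedule.

```python
from datetime import datetime


def merge_outage_windows(windows):
    """Merge overlapping or touching (start, end) windows into a sorted list."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous window, so extend that window.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical outage slots reported by different teams for one day.
reported = [
    (datetime(2011, 3, 15, 9, 0), datetime(2011, 3, 15, 12, 0)),
    (datetime(2011, 3, 15, 11, 0), datetime(2011, 3, 15, 13, 0)),
    (datetime(2011, 3, 15, 18, 0), datetime(2011, 3, 15, 21, 0)),
]

consolidated = merge_outage_windows(reported)
```

In this example the first two slots overlap and collapse into one window, so every team sees the same two outage periods instead of three partly conflicting ones.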
BCP for the Organization
The CSS teams in the Chofu office suffered the biggest impact from the rolling blackouts. To continue offering ongoing customer support, 40 percent of CSS employees relocated to the Shinagawa office, 10 percent relocated to the Osaka office, and 10 percent relocated to the Nagoya office in the first week of planned rolling blackouts.
With network recovery underway, Microsoft proceeded to address business recovery duties for its customers and partners. As a result of the migration to Lync and full EV services that had no reliance on traditional private branch exchange (PBX) telephone switches, moving teams and individuals such as CSS staff was much easier than in previous years.
After CSS services relocated to the Shinagawa, Osaka, and Nagoya offices, they were able to operate immediately without requiring additional infrastructure changes. This level of mobility was also available to other teams that wanted to evacuate the affected area to safer locations, such as alternative offices or their homes.
This process was further aided by the mobile workforce that Microsoft had established around the globe, including Shinagawa. This policy of flexible workspaces enables movement of teams and quick changes to the environment to meet changing circumstances.
BCP for Individuals
Provisioning an adaptive mobile environment helps solve the immediate issue of physical safety in disaster scenarios. Lync, Windows 7, Windows Phones, DirectAccess, mobility features of Microsoft Exchange 2010, virtualization, and a range of other technologies combined to provide this level of instant mobility. The GFS cloud team already supported many of these services, helping to ensure continuous availability. Robust security features were also available through a combination of Microsoft Forefront® client security, smart-card network access, BitLocker® Drive Encryption, and BitLocker To Go® technology to help protect data on Windows 7-based PCs and portable storage devices.
Although many staff members had the full range of products available to make instant decisions about location, safety, and mobility, others were still tied to specific locations because they did not have full EV facilities. Also, some contingent staff either did not have DirectAccess or did not have the ability to establish a virtual private network (VPN) connection because of restrictions in smart-card authentication.
Microsoft IT resolved these issues within the first 48 hours by:
Expanding the IP range and the number of network ports.
Adding 300 EV migration numbers.
Expanding inbound phone numbers and analog lines in secondary offices.
Making security exceptions and extending full mobility services to contingent staff. (Seven days after the disaster, Microsoft was asked to remove this access to comply with local regulations.)
These efforts enabled both full-time and contingent employees to decide where, when, and how to work. Those living outside the immediate disaster areas had the ability to stay out of danger, while those inside the affected areas could leave and rejoin the recovery effort with their families in a safer location. Employees could be closer to their families during this time and still stay connected to Microsoft.
BCP for the Community
The ability to identify the areas most in need was an immediate concern, as was the commitment to partner with customer organizations in a mutual recovery effort.
While recovery was underway, Microsoft offered several hundred EV numbers to NGOs to help them stay connected during the crisis. Microsoft made additional support available to Premier customers who were conducting similar recovery programs.
Microsoft IT participated in an emergency cross-group initiative to provide Microsoft customers and partners with comprehensive guidelines for backing up and restoring data, shutting down servers, and migrating applications to cloud solutions such as the Windows Azure™ technology platform. Microsoft also created a portal site (http://support.microsoft.com/gp/jishin-taisaku/ja) to help customers and the industry cope with the rolling blackouts.
To help support ongoing relief efforts over the longer term, Microsoft has made an initial commitment of US$2 million, which includes $250,000 in cash as well as in-kind contributions. Microsoft also matches employee donations, and U.S. staff members have already donated more than $650,000.
The March 11, 2011, earthquake in the vicinity of Sendai, Japan, caused massive damage to Japan’s economy and infrastructure. For Microsoft and many other companies, this meant an almost total loss of the communications network in the APJ region, temporarily inhibiting the company’s ability to function effectively and join the recovery efforts.
A combination of processes, technology, communications, and contingency planning enabled Microsoft to react quickly and effectively to the crisis, recovering the internal communications infrastructure from 6 percent to almost 100 percent capacity within four days. Additionally, flexible workplace environments and mobility technology helped ensure employee safety early in the crisis, enabling staff to choose the mobility options that suited their own personal circumstances and those of their families.
By successfully managing the recovery efforts internally, Microsoft could then turn its attention to helping others, such as NGOs, meet the needs of the Japanese people. These efforts are still ongoing.
For More Information
For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at (800) 933-4750. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the World Wide Web, go to:
This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft, BitLocker, BitLocker To Go, Forefront, Lync, Windows, and Windows Azure are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.