Migrating Microsoft TechNet and MSDN to Hybrid Cloud using Windows Azure
In June 2011 the Enabling Platform Experience (EPX) group, part of Microsoft’s Developer Division, began a journey that would see two of its largest developer and IT professional websites migrated from an entirely on-premises infrastructure to take advantage of the reliability, scalability, and availability of the cloud. The goal was to move both the Microsoft Developer Network (MSDN) and Microsoft TechNet to Windows Azure, and achieve a number of significant benefits:
This case study describes how the benefits were achieved, and the requirements met, during the first phase of the migration—moving Microsoft TechNet to Windows Azure.
The MSDN and TechNet Websites
The Enabling Platform Experience (EPX) group at Microsoft is responsible for managing a number of Microsoft Developer online and offline experiences, including the Microsoft Developer Network (MSDN) and Microsoft TechNet. Usage of both sites is highly variable, with significant spikes when new products launch or during training events and conferences. To support such variability before the migration, EPX had to provision four times the number of servers required to support peak volumes, resulting in an aggregate server utilization of 20% over the course of a year.
The MSDN and TechNet websites have a similar architecture and are hosted on the same hardware, although TechNet receives less traffic overall than MSDN. Therefore, the decision was made to perform the migration for TechNet first, and then apply the lessons learned to the MSDN website. The midterm target for EPX is to move all applicable websites and applications to Windows Azure by the end of 2014.
The Architecture of TechNet before Migration
TechNet is designed for high web traffic. It experiences a high number of reads with no caching on the front-end web servers. TechNet has an average traffic level of over 25 million unique users per month, in addition to other requests such as indexing by search engines.
TechNet has a typical two-tier architecture, with web servers already in a virtualized environment. Visitors to TechNet access web front ends (WFEs) that, in turn, read content from a farm of database servers. This database layer is primarily hosted on high-end, four-terabyte servers with the content replicated in four different datacenters. Figure 1 illustrates the original on-premises architecture of TechNet.
Figure 1 - The TechNet on-premises architecture
Content for the site is pushed to the databases from a content publishing system. A complicating factor in the migration was that TechNet was in the midst of transitioning from one content management system to another, which meant that two code bases were sourcing the content for end users.
The Migration Process
Microsoft engaged Accenture, using its Azure Migration Accelerator assets, for the initial evaluation and feasibility assessment, together with the Premier Field Engineering (PFE) teams, who have extensive experience in debugging and troubleshooting performance issues. Then, in June 2011, a team of just three service engineers embarked on the migration within an aggressive time frame.
The final design for the overall migration of all sites is a hybrid application where a portion of the application runs on Windows Azure, while the data layer, monitoring and management functions, and content publishing system remain on-premises. The result is a future-state architecture to which the team can migrate applications over time through gradual re-architecting.
To move the TechNet infrastructure to the cloud, the team made design decisions at each layer of the infrastructure. For traffic routing, the team used TechNet’s existing global load balancing capabilities, provided by Akamai, to direct traffic to Windows Azure. This enabled the team to pilot the approach with live traffic and divert incremental amounts of traffic from on-premises datacenters.
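The case study does not describe how the load balancer was configured, so the following is only a minimal sketch of how an incremental, percentage-based traffic split can work in principle. The function name, the client identifier, and the hashing scheme are assumptions made for illustration; they are not taken from Akamai's product or from the EPX team's setup.

```python
import hashlib


def route(client_id: str, azure_percent: int) -> str:
    """Pick a destination for a client, sending roughly azure_percent
    percent of clients to Windows Azure (illustrative sketch only)."""
    if not 0 <= azure_percent <= 100:
        raise ValueError("azure_percent must be between 0 and 100")
    # Hash the client identifier into a stable bucket in the range 0..99,
    # so the same client always lands in the same bucket.
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    # Buckets below the threshold go to the cloud; the rest stay on-premises.
    return "azure" if bucket < azure_percent else "on-premises"


if __name__ == "__main__":
    clients = [f"client-{i}" for i in range(1000)]
    for percent in (0, 10, 40, 100):
        share = sum(route(c, percent) == "azure" for c in clients)
        print(f"{percent}% target: {share} of {len(clients)} clients on Azure")
```

Because each client maps to a fixed bucket, raising the percentage only moves additional clients to Windows Azure; clients already diverted stay there, which suits a gradual pilot with live traffic.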
For the web front ends, the team used the Windows Azure Virtual Machine (VM) role, which allowed them to use an existing VM image to seed the cloud migration. This reduced the probability that engineering changes would be required, and provided the team with more control over the migration and configuration. Because of the two code bases, the team packaged each into its own VM role and used a custom content switching solution to drive traffic to the appropriate role.
To achieve elasticity and the consequent minimization of runtime costs, the team chose the Enterprise Library Autoscaling Application Block from Microsoft’s patterns & practices group. This enables the application capacity to automatically mirror demand by starting and stopping role instances in response to a range of factors, such as server load and resource usage. Incorporating the Autoscaling Application Block and configuring the autoscaling rules took the team just a few hours.
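The Autoscaling Application Block is configured through its own rule definitions, which this case study does not reproduce. As a rough illustration of the idea only, the sketch below combines a bound on instance count with a load-based reaction. The class, field names, and thresholds are hypothetical and are not the block's API or the team's actual rules.

```python
from dataclasses import dataclass


@dataclass
class ScalingRule:
    """Hypothetical rule: instance bounds plus CPU thresholds (percent)."""
    min_instances: int
    max_instances: int
    scale_out_cpu: float  # add instances when average CPU exceeds this
    scale_in_cpu: float   # remove instances when average CPU falls below this
    step: int = 1         # how many instances to add or remove at a time


def target_instance_count(current: int, avg_cpu: float, rule: ScalingRule) -> int:
    """Return the instance count the rule asks for at the observed load."""
    if avg_cpu > rule.scale_out_cpu:
        desired = current + rule.step
    elif avg_cpu < rule.scale_in_cpu:
        desired = current - rule.step
    else:
        desired = current
    # The bounds always win, so a spike can never exceed the cost ceiling
    # and a quiet period can never drop below the availability floor.
    return max(rule.min_instances, min(rule.max_instances, desired))


if __name__ == "__main__":
    rule = ScalingRule(min_instances=2, max_instances=10,
                       scale_out_cpu=75.0, scale_in_cpu=30.0)
    print(target_instance_count(4, 90.0, rule))  # busy: scale out by one
    print(target_instance_count(4, 10.0, rule))  # idle: scale in by one
```

Evaluating a function like this on a schedule is what lets capacity follow demand instead of being provisioned for the peak all year.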
Databases were another key consideration during the migration. Because the TechNet content database is almost four terabytes, SQL Azure was not a viable option in the short term. Instead, the team created a hybrid cloud solution in which the web front end resides on Windows Azure, and the data tier remains on-premises.
For content switching, the on-premises platform used a hardware-based solution. For the migrated solution, the team created an Application Request Routing (ARR) role to achieve the same results.
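The case study does not say which criteria the content switch used to choose between the two code bases. The sketch below assumes, purely for illustration, that the decision is made on the URL path. The path prefixes and role names are hypothetical, and the code is not ARR configuration.

```python
# Hypothetical URL prefixes still served by the older content management system.
LEGACY_PREFIXES = ("/legacy/", "/library/archive/")

# Hypothetical names for the two VM roles, one per code base.
LEGACY_ROLE = "legacy-cms-role"
NEW_ROLE = "new-cms-role"


def select_role(path: str) -> str:
    """Return the name of the role that should serve the requested path."""
    normalized = path.lower()
    # str.startswith accepts a tuple and matches if any prefix matches.
    if normalized.startswith(LEGACY_PREFIXES):
        return LEGACY_ROLE
    return NEW_ROLE


if __name__ == "__main__":
    for path in ("/legacy/page.aspx", "/en-us/library/example.aspx"):
        print(path, "->", select_role(path))
```

Keeping the decision in one routing layer means that pages can move from one code base to the other by changing the prefix list, without touching either role.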
Figure 2 illustrates the hybrid application architecture at the time this case study was written.
Figure 2 - The current TechNet hybrid application architecture
To achieve network connectivity and authentication/authorization between the on-premises databases and cloud-hosted web front ends, the team chose to adopt a solution provided by Microsoft’s Global Foundation Services (GFS). GFS is the engine that powers the infrastructure and many services for Microsoft’s global datacenters. The GFS Services for Azure Applications (GSAA) team provides a solution that uses a Windows Azure plug-in or a base VM role image to manage traffic heuristics. It provides a framework that meets all of the medium business impact (MBI) requirements for Windows Azure data and GFS, which allows data to reside in and pass through the cloud.
GSAA also provides a new domain named azr.gbl (the Windows Azure domain) that has trust relationships with the Microsoft internal on-premises domain that hosts many online services. This enables the use of integrated authentication. With all of these controls in place, the GFS network team allows connectivity between Windows Azure hosted services and internal systems and hosted services. The GSAA solution enables Windows Azure hosted services to interact with Microsoft’s internal hosted services over the GFS network, without making intrusive security or network changes.
A major goal for the move to Windows Azure was to maintain or improve performance for the migrated applications. The EPX Performance and Reliability team compared the performance of the original on-premises and the Windows Azure hosted TechNet sites. The charts in Figure 3 illustrate the page load time for the initial user experience (PLT1) for pages served from four regional datacenters.
Figure 3 - TechNet on-premises and Windows Azure hosted performance comparison
Page load times are measured above fold time (AFT). Fold time is the point at which the browser clears the current page and starts to load the new page. Measuring performance from this point ensures that results are not skewed by factors such as DNS resolution time and proxy server negotiation.
For both the first-time user experience and the returning user experience, the difference in page load times was less than 200 milliseconds for the majority of pages, which is within the margin of acceptable performance. The difference is mainly due to the latency of the Content Delivery Network (CDN) and advertisement delivery.
A few pages exhibited differences of up to 400 milliseconds in some regional datacenters, and the team is investigating individual performance improvements for these cases. However, extensive performance and reliability testing has shown that the overall performance after migration to Windows Azure is equivalent to the on-premises applications, and in some cases better for certain pages.
The Proposed Near Term Implementation
Microsoft’s EPX team is now pursuing initiatives aimed at providing a future architecture for TechNet and other Windows Azure hosted websites and applications. The proposed architectural changes include:
Figure 4 illustrates the proposed future architecture.
Figure 4 - The proposed future TechNet hybrid application architecture
In addition, the experience gained from the initial TechNet migration will be used as a template for the migration of other EPX sites.
This initial migration of the TechNet website to Windows Azure provided EPX and other teams at Microsoft with many pointers that will be useful for future migrations. For example, the assessment outlined that numerous operational processes would need to change as EPX transformed to support cloud based solutions. These processes include:
At the time this case study was written, the EPX development team had migrated 40% of TechNet and MSDN traffic to Windows Azure using the design configuration just described. By approaching this as an infrastructure migration, with no core application code or architecture changes, the team reduced the effort and testing required and completed the initial migration within three months of completing the feasibility assessment. Since then, TechNet has been running for over sixty days on Windows Azure and three more sites are lined up for migration. The migration of TechNet traffic will enable EPX to reduce its forecasted server acquisitions by 20%.
Ultimately, through the full migration of all on-premises MSDN and TechNet web front ends to Windows Azure, estimates suggest that Microsoft could save between 18% and 25% on hosting costs. The benefits become even more compelling when considering the reduction in cost and management effort associated with spikes in capacity. With dynamic scaling and simple configuration changes, the team can adjust the number of role instances deployed in Windows Azure to match demand.
Furthermore, these benefits have been achieved without adverse effects on performance and service availability to customers. Early reviews of performance at the client for the Windows Azure hosted solution compared to the original on-premises deployment show that the two are statistically equivalent for all pages when using local resources. In other words, Windows Azure may be slightly faster or only slightly slower than the on-premises solution, depending upon the normal variations in Internet traffic.
In terms of reliability, scalability, availability, and minimizing costs, the migration proves that Windows Azure works—and is here to stay.
Purush is Director of Service Engineering for Microsoft Developer Division online properties. He is responsible for managing platforms for high-volume online properties such as http://msdn.microsoft.com and http://technet.microsoft.com, with over 300 million page views a month and over 60 million unique users worldwide. Purush is a longtime Microsoft veteran and has over 14 years’ experience managing online properties. He is passionate about improving the quality of service to customers while optimizing costs year over year.