Migrating Microsoft TechNet and MSDN to Hybrid Cloud using Windows Azure

Introduction

In June 2011 the Enabling Platform Experience (EPX) group, part of Microsoft’s Developer Division, began a journey that would see two of its largest developer and IT professional websites migrated from an entirely on-premises infrastructure to take advantage of the reliability, scalability, and availability of the cloud. The goal was to move both the Microsoft Developer Network (MSDN) and Microsoft TechNet to Windows Azure, and achieve a number of significant benefits:

  • To maximize resource utilization. The requirement to meet highly variable traffic patterns means that overall server utilization can fall to 20%. However, this overprovisioning is necessary to meet demand during busy periods. A cloud-hosted solution can provide elasticity through the easy addition and removal of servers to meet demand.
  • To reduce infrastructure and running costs. Every organization is looking for ways to reduce energy usage and cost, and to minimize investment in infrastructure. The use of hosted virtual servers can minimize initial and ongoing hardware, infrastructure, and maintenance costs, as well as achieve significant savings in day-to-day running costs.
  • To “green” the datacenter. A significant environmental (and often regulatory) focus for all companies today is to minimize their carbon footprint. Hosted solutions that support dynamic resource scaling can achieve a significant reduction in energy usage and help companies to meet emissions targets.
  • To act as a learning exercise. Microsoft will move many of its websites and applications to Windows Azure over time. The knowledge and experience gained during the initial phase of migrating Microsoft TechNet will be invaluable for the migration of other applications.


The team working on the migration also had some requirements that the migration process must meet:

  • No code or architecture changes. The migration should be accomplished with minimum changes to configuration, and with no changes at all to the architecture or code of the applications. By eliminating the need to re-architect the application to run on the cloud, this approach would enable a much faster and less expensive migration.
  • Equivalent or better performance. Performance of the migrated application must be equivalent to, or better than, that of the on-premises solution. In particular, it must be able to scale dynamically to meet demand, while minimizing running costs.
  • Ease of operation. The migrated application must be easy to operate, monitor, manage, and maintain.
  • Reduced on-premises requirements. The result must provide opportunities to minimize on-premises infrastructure requirements and costs.

This case study describes how the benefits were achieved, and the requirements met, during the first phase of the migration—moving Microsoft TechNet to Windows Azure.

The MSDN and TechNet Websites

The Enabling Platform Experience (EPX) group at Microsoft is responsible for managing a number of Microsoft Developer online and offline experiences, including the Microsoft Developer Network (MSDN) and Microsoft TechNet. Usage of both of these sites is highly variable, with significant spikes when new products launch, or during training or conferences. To support such variability before the migration, EPX had to provision four times the number of servers required to support peak volumes, resulting in an aggregate server utilization of 20% over the course of a year.

The MSDN and TechNet websites have a similar architecture and are hosted on the same hardware, although TechNet receives less traffic overall than MSDN. Therefore, the decision was made to perform the migration for TechNet first, and then apply the lessons learned to the MSDN website. The midterm target for EPX is to move all applicable websites and applications to Windows Azure by the end of 2014.

The Architecture of TechNet before Migration

TechNet is designed for high web traffic. It experiences a high number of reads with no caching on the front end web servers. TechNet has an average traffic level of over 25 million unique users per month, in addition to other requests such as indexing by search engines.

TechNet has a typical two tier architecture, with web servers already in a virtualized environment. Visitors to TechNet access web front ends (WFEs) that, in turn, read content from a farm of database servers. This database layer is primarily hosted on high-end, four terabyte servers with the content replicated in four different datacenters. Figure 1 illustrates the original on-premises architecture of TechNet.

 

Figure 1 - The TechNet on-premises architecture

Content for the site is pushed to the databases from a content publishing system. A complicating factor in the migration was that TechNet was in the midst of transitioning from one content management system to another, which meant that two code bases were sourcing the content for end users.

The Migration Process

Microsoft engaged Accenture, using its Azure Migration Accelerator assets, for the initial evaluation and feasibility assessment, along with the Premier Field Engineering (PFE) teams, who have extensive experience in debugging and troubleshooting performance issues. Then, in June 2011, a team of just three service engineers embarked on the migration within an aggressive time frame.

The final design for the overall migration of all sites is a hybrid application where a portion of the application runs on Windows Azure, while the data layer, monitoring and management functions, and content publishing system remain on-premises. The result is a future-state architecture to which the team can migrate applications over time through gradual re-architecting.

To move the TechNet infrastructure to the cloud, the team made design decisions at each layer of the infrastructure. For traffic routing, the team utilized TechNet’s existing global load balancing capabilities, provided by Akamai, to direct traffic to Windows Azure. This enabled the team to pilot the approach with live traffic and divert incremental amounts of traffic from on-premises datacenters.
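The idea of diverting incremental amounts of traffic can be sketched as a percentage-based split. This is only an illustration in Python, not Akamai's configuration or the team's actual routing logic; the function name `choose_origin`, the `client_id` key, and the percentage parameter are all assumptions made for the example.

```python
import hashlib


def choose_origin(client_id: str, azure_percent: int) -> str:
    """Return 'azure' or 'on-premises' for a client at a given rollout percentage.

    Hypothetical helper: hashes the client id into one of 100 buckets so that
    the same client is always routed the same way for a given percentage.
    """
    if not 0 <= azure_percent <= 100:
        raise ValueError("azure_percent must be between 0 and 100")
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "azure" if bucket < azure_percent else "on-premises"


if __name__ == "__main__":
    # Raising the percentage only ever moves clients towards Windows Azure.
    for percent in (0, 10, 40, 100):
        print(percent, choose_origin("client-42", percent))
```

Because each client keeps the same bucket, increasing the percentage adds clients to the cloud-hosted pool without moving earlier ones back, which is one way to keep a pilot with live traffic predictable.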

For the web front ends, the team used the Windows Azure Virtual Machine (VM) role, which allowed them to use an existing VM image to seed the cloud migration. This reduced the probability that engineering changes would be required, and provided the team with more control over the migration and configuration. Because of the two code bases, the team packaged each into its own VM role and used a custom content switching solution to drive traffic to the appropriate role.

To achieve elasticity and the consequent minimization of runtime costs, the team chose the Enterprise Library Autoscaling Application Block from Microsoft’s patterns & practices group. This enables the application capacity to automatically mirror demand by starting and stopping role instances in response to a range of factors, such as server load and resource usage. Incorporating the Autoscaling Application Block and configuring the autoscaling rules took the team just a few hours.
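As a rough illustration of what a rule that mirrors demand does, the following Python sketch adds or removes one instance based on average CPU load and then clamps the result to fixed bounds. It is not the Autoscaling Application Block's API or rule schema (the block is a .NET library); `ScalingRule`, `target_instance_count`, and the thresholds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRule:
    """Hypothetical rule: instance-count bounds plus CPU thresholds."""
    min_instances: int
    max_instances: int
    scale_out_cpu: float  # average CPU % above which one instance is added
    scale_in_cpu: float   # average CPU % below which one instance is removed


def target_instance_count(current: int, avg_cpu: float, rule: ScalingRule) -> int:
    """Return the instance count the rule asks for, given the current load."""
    if avg_cpu > rule.scale_out_cpu:
        desired = current + 1
    elif avg_cpu < rule.scale_in_cpu:
        desired = current - 1
    else:
        desired = current
    # The bounds always win over the load-based adjustment.
    return max(rule.min_instances, min(rule.max_instances, desired))


if __name__ == "__main__":
    rule = ScalingRule(min_instances=2, max_instances=10,
                       scale_out_cpu=80.0, scale_in_cpu=30.0)
    print(target_instance_count(4, 90.0, rule))  # busy period: scale out
    print(target_instance_count(4, 20.0, rule))  # quiet period: scale in
```

Keeping the lower bound above one instance is a common way to preserve availability during quiet periods, while the upper bound caps running costs during spikes.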

Databases were another key consideration during the migration. Because the TechNet content database is almost four terabytes, SQL Azure was not a viable option in the short term. Instead, the team created a hybrid cloud solution in which the web front end resides on Windows Azure, and the data tier remains on-premises.

In terms of content switching, the on-premises platform used a hardware-based solution. For the migrated solution, the team created an Application Request Routing (ARR) role to achieve the same results.
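Content switching of this kind amounts to choosing a back-end role from properties of the request. The Python sketch below uses longest-prefix matching on the URL path to pick between two roles, one per code base. It does not reflect ARR's actual rule syntax or the team's real routing table; the prefixes and role names are hypothetical.

```python
from typing import Sequence, Tuple


def select_role(path: str, routes: Sequence[Tuple[str, str]], default_role: str) -> str:
    """Pick the back-end role whose URL prefix is the longest match for path."""
    best_prefix = ""
    best_role = default_role
    for prefix, role in routes:
        if path.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix, best_role = prefix, role
    return best_role


# Hypothetical routing table: one role per code base.
ROUTES = [
    ("/library/", "new-cms-role"),
    ("/library/archive/", "legacy-cms-role"),
]

if __name__ == "__main__":
    for path in ("/library/azure", "/library/archive/2009", "/forums"):
        print(path, "->", select_role(path, ROUTES, default_role="legacy-cms-role"))
```

Longest-prefix matching lets a specific section stay on one code base while the rest of a path moves to the other, which suits a site that is part way through a content management transition.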

Figure 2 illustrates the hybrid application architecture at the time this case study was written.

 

Figure 2 - The current TechNet hybrid application architecture

To achieve network connectivity and authentication/authorization between the on-premises databases and cloud-hosted web front ends, the team chose to adopt a solution provided by Microsoft’s Global Foundation Services (GFS). GFS is the engine that powers the infrastructure and many services for Microsoft’s global datacenters. The GFS Services for Azure Applications (GSAA) team provides a solution that uses a Windows Azure plug-in or a base VM role image to manage traffic heuristics. It provides a framework that meets all of the medium business impact (MBI) requirements for Windows Azure data and GFS, which allows data to reside in and pass through the cloud.

GSAA also provides a new domain named azr.gbl (the Windows Azure domain) that has trust relationships with the Microsoft internal on-premises domain that hosts many online services. This enables the use of integrated authentication. With all of these controls in place, the GFS network team allows connectivity between Windows Azure hosted services and internal systems and hosted services. The GSAA solution seamlessly enables Windows Azure hosted services to interact with Microsoft’s internal hosted services over GFS’s world-class network, without making intrusive security or network changes.

Performance Comparisons

A major goal for the move to Windows Azure was to maintain or improve performance for the migrated applications. The EPX Performance and Reliability team compared the performance of the original on-premises and the Windows Azure hosted TechNet sites. The charts in Figure 3 illustrate the page load time for the initial user experience (PLT1) for pages served from four regional datacenters.

 

Figure 3 - TechNet on-premises and Windows Azure hosted performance comparison

Page load times are measured from above fold time (AFT). Fold time is the point at which the browser clears the current page and starts to load the new page. Measuring performance from this point ensures that results are not skewed by factors such as DNS resolution time and proxy server negotiation.

For both the first time user experience and the returning user experience, the difference in page load times was less than 200 milliseconds for the majority of pages, which is within the margin of acceptable performance. The difference is mainly due to latency of the Content Delivery Network (CDN) and advertisement delivery.

A few pages exhibited differences of up to 400 milliseconds in some regional datacenters, and the team is investigating individual performance improvements for these cases. However, extensive performance and reliability testing has shown that the overall performance after migration to Windows Azure is equivalent to the on-premises applications, and in some cases better for certain pages.

The Proposed Near Term Implementation

Microsoft’s EPX team is now pursuing initiatives aimed at providing a future architecture for TechNet and other Windows Azure hosted websites and applications. The proposed architectural changes include:

  • Migration of the web front ends from VM roles to Web roles in order to reduce support requirements, remove the need to manage the operating system, and to simplify deployment.
  • Migration of the databases to Windows Azure using the new Infrastructure as a Service (IaaS) capabilities.
  • Use of an on-premises virtual private cloud implemented with Windows Server 8 to allow content to be published from on-premises servers to Windows Azure.

Figure 4 illustrates the proposed future architecture.

 

Figure 4 - The proposed future TechNet hybrid application architecture

In addition, the experience gained from the initial TechNet migration will be used as a template for the migration of other EPX sites.

Lessons Learned

This initial migration of the TechNet website to Windows Azure provided EPX and other teams at Microsoft with many pointers that will be useful for future migrations. For example, the assessment outlined that numerous operational processes would need to change as EPX transformed to support cloud based solutions. These processes include:

  • Logging and monitoring. Windows Azure Diagnostics transfers Windows logs and other trace information to blob storage as scheduled jobs. Data must be downloaded from blob storage, and the team chose Microsoft System Center Operations Manager and a Virtual IP (VIP) monitoring tool (an internal HTTP monitoring solution) for this task. Local instance-level health checks are also performed, while third-party providers monitor application pages (Keynote) and perform network traffic management.
  • Business Continuity and Disaster Recovery (BCDR). Existing traffic management capabilities plus local instance health checks of pages enable a failover to or from Windows Azure at the cluster level. The health check pages incorporate functionality to test for issues such as loss of data layer connectivity.
  • Backup and Restore. Existing systems manage backup and restore for on-premises data. Specific backup and restore facilities are not required in the cloud hosted portion of the application because Windows Azure automatically replicates data, such as the log information persisted in blob storage.
  • Operating System Updates. A service engineer connects to a “golden master” VM and applies operating system and security updates using msnpatch.exe, and then uses an automated deployment process to publish the VMs in Windows Azure.
  • Deployment. Operating system and Internet Information Services (IIS) updates are applied to a differencing disk. This is deployed to Windows Azure staging and a VIP Swap occurs to move it into live production. Additional scripts can be used to push content deployments independently to each running Windows Azure role instance when minor changes are required.
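The health check pages mentioned under Business Continuity and Disaster Recovery can be pictured as a small aggregator that runs several probes and reports a single status for traffic management to act on. The Python sketch below is an assumption-laden illustration, not the team's implementation; `run_health_check`, the probe names, and the choice of HTTP 200 and 503 as status codes are all hypothetical.

```python
from typing import Callable, Dict, Tuple


def run_health_check(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every probe and return (HTTP-style status, per-probe results).

    A probe that returns False or raises is reported as unhealthy, so a lost
    data layer connection fails the whole check instead of crashing the page.
    """
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # any probe error counts as unhealthy
            results[name] = f"error: {exc}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results


if __name__ == "__main__":
    def data_layer_probe() -> bool:
        # Stand-in for a real connectivity test against the on-premises databases.
        raise ConnectionError("database unreachable")

    print(run_health_check({"web": lambda: True, "data_layer": data_layer_probe}))
```

Returning a non-success status when any dependency is unreachable gives an external traffic manager a simple signal for failing over to or from a cluster.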


The ongoing migration has also revealed some useful guidance for the future, which will benefit all designers and developers considering migration of their applications and websites to Windows Azure:

  • Don’t reinvent the wheel – explore and apply known good practices.
  • Consider application and data security. Remember that Windows Azure is a public space.
  • Understand the capabilities and limitations of Windows Azure, outsourcing this process if required. Use the resources available on the Windows Azure portal, forums and user groups.
  • Use available tools to evaluate your code against known Platform as a Service (PaaS) migration challenges.
  • Understand your application and its potential risk areas. These may include server-specific configurations, special networking requirements such as content switching or affinity, support for multiple sites, and connectivity to supporting systems or business layers.
  • Gain operational flexibility by allowing configuration and content to be modified independently from the package or VM that you deploy.
  • Take full advantage of Windows Azure services such as Service Bus and Data Sync, and tools or frameworks such as the patterns & practices Enterprise Library Extensions for Windows Azure.


Finally, keep in mind that, as with any migration project, there will be issues that you only discover when something doesn’t work quite as you anticipated! Some issues that the team came across were:

  • Windows Azure is always in Coordinated Universal Time (UTC), while on-premises services are likely to use local time.
  • Consider if you need to change the page size for web and worker roles based on the size of your role instance and application.
  • Always use the latest development SDK version when developing applications, and consult the Known Issues pages when you upgrade your SDK version.
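The UTC issue in the list above typically surfaces when timestamps written by on-premises systems in local time are compared with timestamps produced in the cloud. A minimal Python sketch of normalizing to UTC before comparing follows; the helper name `to_utc` and the fixed offset parameter are assumptions for the example, and a fixed offset does not account for daylight saving changes.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_naive: datetime, utc_offset_hours: float) -> datetime:
    """Interpret a naive local timestamp at a fixed UTC offset and convert it to UTC."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    return local_naive.replace(tzinfo=local_tz).astimezone(timezone.utc)


if __name__ == "__main__":
    # A log entry written on-premises at 09:00 local time, assumed to be UTC-7.
    on_premises_entry = to_utc(datetime(2011, 6, 1, 9, 0), utc_offset_hours=-7)
    # A log entry written by a cloud-hosted instance, already in UTC.
    cloud_entry = datetime(2011, 6, 1, 16, 0, tzinfo=timezone.utc)
    print(on_premises_entry == cloud_entry)
```

In real code, a named time zone (for example through Python's `zoneinfo` module) is usually preferable to a fixed offset because it handles daylight saving transitions.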

Conclusion

At the time this case study was written, the EPX development team had migrated 40% of TechNet and MSDN traffic to Windows Azure utilizing the design configuration just described. By approaching this as an infrastructure migration, with no core application code or architecture changes, the team reduced the effort and testing required and completed the initial migration within three months of completing the feasibility assessment. Since then, TechNet has been running for over sixty days on Windows Azure and three more sites are lined up for migration. The migration of TechNet traffic will enable EPX to reduce its forecasted server acquisitions by 20%.

Ultimately, through the full migration of all on-premises MSDN and TechNet web front ends to Windows Azure, estimates suggest that Microsoft could save between 18% and 25% on hosting costs. The benefits become even more compelling when considering the reduction in cost and management associated with spikes in capacity. Dynamic scaling capabilities and simple configuration changes can change the number of role instances deployed in Windows Azure.

Furthermore, these benefits have been achieved without adverse effects on performance and service availability to customers. Early reviews of performance at the client for the Windows Azure hosted solution compared to the original on-premises deployment show that the two are statistically equivalent for all pages when using local resources. In other words, Windows Azure may be slightly faster or only slightly slower than the on-premises solution, depending upon the normal variations in Internet traffic.

In terms of reliability, scalability, availability, and minimizing costs, the migration proves that Windows Azure works—and is here to stay.

 

Purush Vankireddy

Purush is Director of Service Engineering for Microsoft Developer Division online properties. He is responsible for managing platforms for high-volume online properties such as http://msdn.microsoft.com and http://technet.microsoft.com, with over 300 million page views a month and over 60 million unique users worldwide. Purush is a longtime Microsoft veteran and has over 14 years’ experience managing online properties. He is passionate about improving the quality of service to customers while optimizing costs year-over-year.