How Microsoft IT Developed a Private Cloud Infrastructure
Technical Case Study
Published: August 2011
Microsoft IT created an efficient, flexible Research and Development (R&D) facility that serves the development and test environments at Microsoft. Learn how Microsoft IT leveraged the flexibility and density of the facility, along with the supporting network, to develop a private cloud infrastructure that uses cutting-edge technology to provide infrastructure as a service and support the needs of the internal businesses.
Microsoft IT wanted to reduce lab space server sprawl and introduce a new level of management and support efficiency. The facility needed to be both efficient and flexible enough to support the research and development needs of the different product groups.
MSIT built an energy-efficient, flexible, high-density facility that meets the needs of the research and development community at Microsoft and is able to host private clouds that provide infrastructure as a service (IaaS).
To reduce lab space server sprawl and introduce a new level of management and support efficiency, Microsoft built a flexible facility that reduced the footprint of private research, development, and test labs spread across the campus. This facility offered a flexible and efficient infrastructure able to meet the varying demands of all the individual product teams within the company. The facility marks a milestone in the cultural shift under way at Microsoft from the traditional model, where product groups managed their own labs in an office building, to a centrally managed and more energy-efficient alternative. Many of the facility's benefits are realized through the economies of scale, remote hosting methodologies, and efficient infrastructure as well as stretching the distance between developers and physical systems. Creating separation from systems with respect to proximity reinforces a new nature of code development being inherently "remote" and in the cloud.
Once the facility was built, Microsoft IT (MSIT) wanted to fully leverage its capabilities to offer a private cloud that provided an elastic and infinitely scalable infrastructure as a service platform. This allowed the R&D community at Microsoft to dynamically scale their application development environments up or down, on demand, paying only for their actual consumption. By building a private cloud with a very large centrally procured resource pool, MSIT was able to realize the benefits of higher density, fewer resources, and lower costs per resource. MSIT built a private cloud that provided virtual infrastructures that the research, development, and test business customers at Microsoft would have been looking to build for themselves. In doing so, MSIT was able to reduce:
- The number of physical systems.
- The sprawl of the systems.
- The cost of procuring and managing the systems.
- The number of resources required to manage the systems.
- The time the systems took to be deployed.
- Variations in the deployment of the systems.
MSIT needed the systems to be available in a predictable amount of time. They needed the systems to look and behave the same way every time. And this had to be accomplished without a great deal of human intervention. This could have been done without virtualization, but it would have been more difficult and not as cost-effective.
Currently, the facility has tens of thousands of virtual machines, representing multiple private clouds. While private clouds represent only a fraction of what is physically hosted within the facility, the number of virtual machines well exceeds the number of physical devices.
This technical case study provides a high-level description of how the facility, the network and fabric, and the private cloud build upon each other's capabilities to provide an infrastructure as a service offering. It also provides some best practices and insights that MSIT developed during the planning and deployment of the private cloud hosted within the R&D facility. This case study should not function as a procedural roadmap, however, because operational environments differ among organizations.
MSIT created an efficient, flexible Research and Development (R&D) facility at Redmond Ridge that serves the development and test environments at Microsoft. The opening of this facility represented a transition point in the company culture because in the past the vast majority of the development and test community at Microsoft was developing products in private lab spaces on campus.
With usable floor space of about 34,000 square feet, the facility replaces more than 180,000 square feet of office space that would have been converted, at a large expense, to support R&D lab space within office buildings. Unlike other lab spaces within Microsoft, the facility was purpose-built as a research and development space, providing a common environment with uniform densities and cooling to provide resiliency across the environment without favoring one particular area of the lab (R&D space). It also features a robust fiber backbone for distribution to the entire environment. This design allows MSIT to deliver stable lab space and provide the flexibility to meet various clients' needs, including those required to host private clouds.
The facility's design criteria fell into three major themes:
- Resiliency over redundancy
- High-density delivery
- Energy efficiency
Resiliency over Redundancy
This concept is present in the design of several important components within the facility. All of the pods are UPS backed. The dynamic nature of the eight-generator system allows for each pod within a cluster to be prioritized as required, based upon current business need such as during a key development or test cycle. The operations staff can dynamically edit the load shed procedure to provide coverage as required by the business. The local utilities provider has provided some level of redundancy by offering dual 25 megawatt power connections into the building from a dedicated power substation.
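The prioritized load shed described above can be pictured as a simple greedy allocation. The following is a minimal sketch, not MSIT's actual procedure: the pod names, priorities, loads, and generator capacity are all hypothetical, and a real procedure would account for many more factors.

```python
from typing import NamedTuple


class Pod(NamedTuple):
    name: str
    priority: int    # lower number = more important to the business right now
    load_kw: float   # hypothetical current draw


def plan_load_shed(pods: list[Pod], available_kw: float) -> tuple[list[str], list[str]]:
    """Return (kept, shed) pod names for the given generator capacity.

    Pods are considered in priority order; a pod stays powered only if
    its load still fits in the remaining capacity.
    """
    kept: list[str] = []
    shed: list[str] = []
    remaining = available_kw
    for pod in sorted(pods, key=lambda p: (p.priority, p.name)):
        if pod.load_kw <= remaining:
            kept.append(pod.name)
            remaining -= pod.load_kw
        else:
            shed.append(pod.name)
    return kept, shed


# Hypothetical pods; editing the priorities changes which pods stay up.
pods = [
    Pod("pod-a", priority=2, load_kw=300.0),
    Pod("pod-b", priority=1, load_kw=400.0),
    Pod("pod-c", priority=3, load_kw=350.0),
]
kept, shed = plan_load_shed(pods, available_kw=750.0)
print("kept:", kept)
print("shed:", shed)
```

Raising a pod's priority during a key development or test cycle is then a data change rather than a change to the procedure itself.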
High-Density Delivery
MSIT delivered higher than industry standard density through increased usable space within racks. The facility is fitted with 52 U racks, each with 51 U of usable space. All racks are equipped with 15 kilowatts (kW) of usable power. There are 48 pods (8 rows of 6 pods), separated into two groups of 24 on either side of a central control/work area.
Energy Efficiency
The building was designed to have a Power Usage Effectiveness (PUE) of 1.3 or lower and has been operating at 1.2. Because MSIT maximized density, the smaller footprint inherently contributed to power savings. A smaller facility requires less lighting, needs less wiring, and has less square footage to maintain with climate control systems. All of the space lighting is controlled by sensor systems that detect the presence of individuals in the work areas and light an area only when someone is present.
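PUE is the ratio of total facility power to the power delivered to IT equipment, so the gap between the design target of 1.3 and the operating value of 1.2 can be expressed as overhead power. The sketch below works this through for an IT load of 1,000 kW, which is a hypothetical figure chosen only to make the arithmetic easy to follow.

```python
def facility_power_kw(it_load_kw: float, pue: float) -> float:
    """Total facility power implied by an IT load and a PUE value."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_load_kw * pue


def overhead_kw(it_load_kw: float, pue: float) -> float:
    """Power spent on everything other than IT equipment (cooling, lighting, losses)."""
    return facility_power_kw(it_load_kw, pue) - it_load_kw


it_load = 1000.0  # hypothetical IT load in kW
design_overhead = overhead_kw(it_load, 1.3)   # design target from the case study
actual_overhead = overhead_kw(it_load, 1.2)   # reported operating value

print(f"Overhead at PUE 1.3: {design_overhead:.0f} kW")
print(f"Overhead at PUE 1.2: {actual_overhead:.0f} kW")
print(f"Difference: {design_overhead - actual_overhead:.0f} kW")
```

For the same IT load, operating at 1.2 instead of 1.3 cuts the non-IT overhead by roughly a third.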
Climate control is a major power-consumption consideration when designing a data facility. MSIT decided to use evaporative coolers, fully contained hot aisles, and a shared cold supply air space that flows around all pods within the room. Positive pressure is kept within the supply aisles, with negative pressure inside the contained hot aisle, further facilitating air flow. Blanking of empty spaces is also used to maintain proper pressure differentials between supply and return spaces. Evaporative coolers are more energy efficient than normal "chillers," and fully contained hot and cold aisles allow for greater efficiency of the cooling system and heat removal from the devices.
Network and Fabric Infrastructure
In determining the best design for a network service for the facility, MSIT took into consideration the shift from "one server-one IP address" to private clouds that offer multiple virtual hosts per server.
Emerging server technologies and Hyper-V® capabilities have indicated an increasing need for MSIT to maintain enough network capacity to stay ahead of the curve. As more services transition into the cloud, MSIT will need to design networks that accommodate the product groups' growing need to test their next generation of software products that operate on ad-hoc schedules.
Accommodating network design in a progressive facility, for an industry in transition, required that MSIT focus on the costs of logical capacity, as well as the future capabilities of routers and switches, and oversubscription rates. Other design criteria for the network infrastructure were that it should be easy to operate and that it also included standards-based interfaces to automate against.
To keep costs down and simplify the network environment, MSIT made some tradeoffs and chose to use fewer devices, but the design still had to provide all of the required capacity, so device selection became even more critical. MSIT made a conscious effort to select products that could deliver on both today's requirements and scale up for future capabilities.
When looking at the product data sheets that described the capabilities of individual hardware devices, MSIT had to understand the difference in performance capabilities as a single device and as part of a larger environment.
Designing the Network Fabric
To consolidate many separate groups into a single facility, MSIT needed to deliver access ports to servers with low oversubscription. The scale and density of the facility at Redmond Ridge impacted the way MSIT needed to plan for the three logical aspects of network equipment:
- Layer 2 MAC address tables
- Layer 3 Address Resolution Protocol/Neighbor Discovery (ARP/ND) Cache
- Control Plane Protocols
MSIT designed and implemented a four-tier network with well-defined roles.
|Tier|Role|
|One|Core|
|Two|Layer 3 Distribution|
|Three|Layer 2 Aggregation|
|Four|Layer 2 Host Access|
Table 1. Four-tier network roles and responsibilities
Starting from the host up to the core, each role has a predefined oversubscription rate to ensure consistency over the entire facility. With the exception of the Layer 2 Host Access, each device in the Core, Layer 3 Distribution, and Layer 2 Aggregation is a modular chassis with enhancements. MSIT has been working with the network hardware vendor to develop ways to further increase capacity over time.
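A per-role oversubscription rate is the ratio of host-facing bandwidth to upstream bandwidth on a device, and the rates multiply as traffic moves from the host toward the core. The case study does not publish MSIT's actual rates or port counts, so the figures in this sketch are hypothetical and serve only to show the calculation.

```python
from math import prod


def oversubscription(downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio of host-facing bandwidth to upstream bandwidth for one tier."""
    if uplink_gbps <= 0:
        raise ValueError("uplink bandwidth must be positive")
    return downlink_gbps / uplink_gbps


# Hypothetical per-device port budgets, listed from the host up toward the core.
tiers = {
    "Layer 2 Host Access": oversubscription(48 * 1, 2 * 10),    # 48 x 1G down, 2 x 10G up
    "Layer 2 Aggregation": oversubscription(24 * 10, 8 * 10),   # 24 x 10G down, 8 x 10G up
    "Layer 3 Distribution": oversubscription(8 * 10, 4 * 10),   # 8 x 10G down, 4 x 10G up
}

# The worst-case host-to-core rate is the product of the per-tier rates.
end_to_end = prod(tiers.values())

for role, ratio in tiers.items():
    print(f"{role}: {ratio:.1f}:1")
print(f"Host to core: {end_to_end:.1f}:1")
```

Fixing the rate per role, as the text describes, keeps the end-to-end figure the same no matter which pod a host lands in.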
Arriving at Logical Constraints
With host virtualization, each server carries anywhere from 8 to 25 virtual hosts, and each virtual host (with the exception of logical ports) drives the network like a stand-alone server. A single Layer 2 domain of a prior generation may have seen 8,000 to 24,000 hosts; today MSIT is seeing 40,000 to 100,000 in a fraction of the space. This, combined with the ongoing IPv6 transition, can triple memory requirements. It also significantly increases CPU load from ARP/ND and control plane traffic, which grows with the number of hosts on a network.
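One way to see how table sizes grow is to count entries per virtual host. The sketch below assumes one MAC address and one IPv4 address per virtual host, plus two IPv6 addresses (link-local and global) once a host is dual-stacked. These per-host counts and the server count are assumptions made for illustration; they are one plausible accounting for a threefold increase, not figures from MSIT.

```python
def neighbor_entries(hosts: int, ipv4_per_host: int = 1, ipv6_per_host: int = 0) -> int:
    """Layer 3 neighbor (ARP plus ND) cache entries needed for a Layer 2 domain."""
    return hosts * (ipv4_per_host + ipv6_per_host)


servers = 4_000       # hypothetical physical server count
vms_per_server = 25   # upper end of the 8-to-25 range in the text

# Assumed: one MAC address table entry per virtual host.
hosts = servers * vms_per_server

ipv4_only = neighbor_entries(hosts)
# Assumed: each dual-stack host adds a link-local and a global IPv6 address.
dual_stack = neighbor_entries(hosts, ipv6_per_host=2)

print(f"Virtual hosts (MAC entries): {hosts:,}")
print(f"ARP entries, IPv4 only:      {ipv4_only:,}")
print(f"ARP + ND entries, dual stack: {dual_stack:,}")
print(f"Growth factor: {dual_stack / ipv4_only:.0f}x")
```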
Delivering for Today While Planning for Tomorrow
MSIT's current business model buys the capacity that they require throughout the depreciation cycle and provides opportunities to scale up in the future. MSIT has formed close partnerships with their network vendors to drive feature enhancements and capacity gains within their respective products.
MSIT made deliberate decisions to design a network with enough logical capacity and oversubscription to realize the full value of the initial investment over its lifetime. They also designed it in such a way to include the possibility of an in-chassis upgrade in the future to further expand its logical capacity.
Building a Private Cloud
The private cloud provides a way for MSIT to supply the business with virtual machines for as long as they are needed with the configurations that are required. The virtual machines are primarily used for development, test, and quality assurance (QA) activities, but sometimes people from other areas of the business also require a virtual machine to host a file server for a couple of weeks or they need a temporary location to store some data because their facility is going offline.
With a private cloud, internal customers no longer need to buy their own servers and contend with deploying and managing them. Alternatively, groups that still want to purchase and run their own systems only need to buy servers for their primary workloads, or steady state capacity, as they can always get additional resources as they are required from the private cloud.
Determining Host Requirements
Before MSIT could build out host machines, they needed to define what it meant to be successful in the delivery of their workloads. Did all of the virtual machines need to look the same? Would they all run the same operating systems? Did they all need the same amount of memory and storage? If the answers to those questions were yes, it would be easy to purchase one server configuration optimized for that workload. But if the answer was no, the underlying platform of the private cloud needed the flexibility to vary the amount of memory, cores, and storage, and to accommodate the different operating systems and networks that would be assigned to the resources.
To get the most efficiency out of deploying in the facility, MSIT purchased fairly homogenous hosts. Having homogenous systems simplifies the management of the cloud components, as long as the systems have sufficient flexibility (storage, network, etc.) to meet the varying virtual machine requirements. This resulted in lower overall operational cost.
Memory, Storage, and Processors
While building the private cloud, MSIT had to decide whether to constrain on memory, storage, or processors to keep costs down. Processors are not currently a limiting factor, but memory and storage can be. Memory is more expensive than storage, so MSIT maximized the usage of memory by putting as many virtual machines on a host as possible.
One important hardware selection criterion was whether to use hosts with onboard storage or blades with array-based storage. Typically, cost ($/raw GB) favors local storage, while flexibility and reliability favor array-based storage. With the high variability of storage demands from virtual machines in a private cloud, sizing local storage becomes more challenging because there is a risk of either stranding memory by not having enough storage or stranding storage by over-purchasing to meet peak demands. With local storage, each node must have a buffer to accommodate all potential requests, whereas on an array, only one buffer is needed.
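The buffer argument above comes down to simple arithmetic: local storage multiplies the buffer by the number of nodes, while an array carries it once. The node count, expected demand, and buffer size below are hypothetical, and the model is deliberately simplified (it assumes a single shared buffer is sufficient on the array).

```python
def local_storage_total_gb(nodes: int, expected_gb: int, buffer_gb: int) -> int:
    """Capacity to purchase when every node carries its own buffer."""
    return nodes * (expected_gb + buffer_gb)


def array_storage_total_gb(nodes: int, expected_gb: int, buffer_gb: int) -> int:
    """Capacity to purchase when all nodes share one buffer on an array."""
    return nodes * expected_gb + buffer_gb


nodes = 20          # hypothetical number of hosts
expected = 2_000    # hypothetical typical demand per host, in GB
buffer = 1_000      # hypothetical headroom for the largest request, in GB

local_total = local_storage_total_gb(nodes, expected, buffer)
array_total = array_storage_total_gb(nodes, expected, buffer)

print(f"Local storage to buy: {local_total:,} GB")
print(f"Array storage to buy: {array_total:,} GB")
print(f"Raw capacity avoided: {local_total - array_total:,} GB")
```

Whether the avoided capacity outweighs the higher cost per raw GB of an array depends on prices that the case study does not give.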
To ensure that the purchased hosts could support memory and storage requirements, MSIT also had to address another challenge. How would they ensure that what they were buying would support, for example, a virtual machine with 2 gigabytes (GB) of memory and 50 GB of storage at the same time it is supporting a virtual machine with 16 GB of memory and a terabyte of storage?
MSIT chose to solve that problem through the use of shared storage; in this case the total buffer required on the arrays was much lower than the sum of the buffers on local storage, reducing the cost advantages of local storage. Since MSIT was setting up the private cloud in a facility that was optimized for density, they deployed blade servers with Fibre Channel arrays providing shared storage via a storage area network (SAN). This allowed every host to serve as a compute node, with storage attached to it as needed. Some costs were associated with this fabric-based deployment, but there were advantages as well. One was the resulting flexibility: if MSIT needs 300 GB of storage for one host and 3 terabytes for another, a single host configuration can support both of these workloads without any wasted storage space.
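With storage drawn from a shared pool, placing a virtual machine reduces to a memory check on the host and a capacity check on the array. The sketch below uses the two virtual machine sizes from the question posed earlier (2 GB with 50 GB, and 16 GB with 1 terabyte); the host size, array size, and all class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VM:
    name: str
    memory_gb: int
    storage_gb: int


@dataclass
class Host:
    name: str
    memory_gb: int
    vms: list = field(default_factory=list)

    @property
    def free_memory_gb(self) -> int:
        return self.memory_gb - sum(vm.memory_gb for vm in self.vms)


@dataclass
class Array:
    capacity_gb: int
    allocated_gb: int = 0

    @property
    def free_gb(self) -> int:
        return self.capacity_gb - self.allocated_gb


def place(vm: VM, hosts: list, array: Array) -> Optional[str]:
    """Place a VM on the first host with enough memory; storage comes from the array."""
    if array.free_gb < vm.storage_gb:
        return None
    for host in hosts:
        if host.free_memory_gb >= vm.memory_gb:
            host.vms.append(vm)
            array.allocated_gb += vm.storage_gb
            return host.name
    return None


hosts = [Host("host-1", memory_gb=64)]   # hypothetical host size
array = Array(capacity_gb=5_000)         # hypothetical shared array

small = place(VM("small", memory_gb=2, storage_gb=50), hosts, array)
large = place(VM("large", memory_gb=16, storage_gb=1_000), hosts, array)
print(small, large, hosts[0].free_memory_gb, array.free_gb)
```

Because the host holds no storage of its own in this model, very different storage requests can share one host configuration.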
MSIT also identified a specific type of very storage-intensive workload, which it ran on discrete servers with a high density of low-cost local storage. Even with some wasted storage capacity, local storage cost less for those workloads.
Optimizing Pods to Host the Private Cloud
Once MSIT determined what their workloads required and decided on configurations for storage, network, and compute, they communicated the requirements to the facilities team. MSIT and the facilities team then worked together to determine what changes would be needed to support those requirements.
Higher density. When MSIT deployed the blade and SAN-based solution, they needed higher density and removed the keyboard, video, and mouse (KVM) and the top of rack (TOR) Ethernet switches, which are included in the blade chassis.
Power needs. The private cloud also had higher power needs in its blade racks than a traditional server rack would have. MSIT replaced the two 30-amp three-phase power strips in each rack with two 50-amp three-phase power strips. This was easily accomplished given the modularity of the bus duct system, which allows for easy replacement of circuit modules within pods.
Network. MSIT needed additional fiber cable between each rack and the networking rack to account for the Fibre Channel storage. Instead of utilizing the building routers, MSIT put a router within the pod with the servers and storage so that the pod owned its own network. MSIT did that for two reasons: complexity and isolation. MSIT needed to host multiple different networks, so rather than dealing with different uplinks going into different places, MSIT aggregated them in the pod and then distributed to the hosts from there. And at the scale that they were going to deploy, it made more sense to dedicate the traffic rather than use a shared router.
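The move from two 30-amp to two 50-amp three-phase power strips can be translated into approximate rack power using the standard three-phase formula, P = √3 × V × I. The case study does not state the supply voltage, so the sketch below assumes 208 V line-to-line, an 80% continuous-load derating, and unity power factor. All three are assumptions, and whether both strips can be loaded at once depends on a redundancy design that is not described here.

```python
from math import sqrt


def three_phase_kw(
    amps: float,
    volts_line_to_line: float = 208.0,  # assumed; not stated in the case study
    derating: float = 0.8,              # assumed continuous-load derating
    power_factor: float = 1.0,          # assumed
) -> float:
    """Approximate usable power of one three-phase circuit, in kW."""
    watts = sqrt(3) * volts_line_to_line * amps * derating * power_factor
    return watts / 1000.0


# Two strips per rack, before and after the change described in the text.
before_kw = 2 * three_phase_kw(30)
after_kw = 2 * three_phase_kw(50)

print(f"Two 30-amp strips: about {before_kw:.1f} kW")
print(f"Two 50-amp strips: about {after_kw:.1f} kW")
print(f"Increase: {after_kw / before_kw:.2f}x")
```

Under any fixed voltage and derating, the capacity scales with the amperage, so the change raises available rack power by a factor of 50/30.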
MSIT can quickly add capacity, enabled by the facility and supported by the choice of homogenous hardware. The back-end tooling that MSIT used to deploy, image, and configure the servers needed to be automated. MSIT authored a number of scripts that automate all of the configuration steps and developed a robust process for quickly adding capacity and building out hosts. The scripts were written in Windows PowerShell® and used error handling. Because they were written in PowerShell, MSIT will be able to leverage System Center Orchestrator 2012 for further levels of deployment automation without rewriting scripts.
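The general shape of a build-out script with error handling is a fixed sequence of steps that stops and reports at the first failure. MSIT's scripts were written in PowerShell and are not reproduced in the case study; the sketch below uses Python purely for illustration, and every step name, host name, and failure in it is hypothetical.

```python
from typing import Callable

Step = tuple[str, Callable[[str], None]]


def run_steps(host: str, steps: list[Step]) -> dict:
    """Run configuration steps in order; stop at the first failure and report it."""
    completed: list[str] = []
    for name, step in steps:
        try:
            step(host)
        except Exception as exc:
            return {"host": host, "ok": False, "completed": completed,
                    "failed": name, "error": str(exc)}
        completed.append(name)
    return {"host": host, "ok": True, "completed": completed,
            "failed": None, "error": None}


# Hypothetical steps; real ones would call imaging and configuration tools.
def apply_image(host: str) -> None:
    pass


def configure_network(host: str) -> None:
    if host == "host-02":  # simulated failure for illustration
        raise RuntimeError("no link on uplink port")


def attach_storage(host: str) -> None:
    pass


STEPS: list[Step] = [
    ("apply_image", apply_image),
    ("configure_network", configure_network),
    ("attach_storage", attach_storage),
]

results = [run_steps(host, STEPS) for host in ("host-01", "host-02")]
for result in results:
    print(result)
```

Running the same ordered steps against every host is what makes homogenous hardware pay off: a failed host is reported with the step that failed, while the others build out identically.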
To address one of the challenges in driving developers to adopt virtual machines in the private cloud, MSIT used industry standard KVM solutions to provide developers with a work experience similar to that of working from a physical server at their desk. Using KVM over IP, developers can start and stop services, access the BIOS, access DVD drives, shut down, restart, and do everything they typically did in a day except for change hardware or replace cables.
Reducing Operational Costs
Homogeny reduces variability in the environment, which lowers support costs. The more complicated a host is made, whether by increasing the number of supported configurations or introducing elements such as shared storage or fabrics, the more it can increase operational costs. Adding things like storage fabric and arrays does add a new operational aspect, the storage administrator. But those costs can be overcome through more efficient use of resources, higher resiliency with redundant fabrics, and with the development of automated processes.
Use Any Host for Any Workload
In the homogenous environment, it didn't make sense to dedicate certain groups of hosts to certain types of networks. With dedicated groups, if the hosts for one network run out of capacity while hosts dedicated to other networks have capacity to spare, hosts have to be reconfigured to meet the demand. To avoid this, MSIT combined all of the networks into a single location. Storage was either pre-allocated or dynamically assigned to the host as needed; for all intents and purposes, each host is a pure compute node. Since every host has all of the networks on it, any host can be used for any workload.
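The benefit of making every host usable for every network can be shown by comparing unmet demand under dedicated and pooled capacity. The network names, capacities, and demands in this sketch are hypothetical, with capacity counted in virtual machine slots.

```python
def unmet_dedicated(capacity: dict, demand: dict) -> int:
    """Unmet VM slots when each network can use only its own hosts."""
    return sum(max(0, demand[net] - capacity.get(net, 0)) for net in demand)


def unmet_pooled(capacity: dict, demand: dict) -> int:
    """Unmet VM slots when any host can serve any network."""
    return max(0, sum(demand.values()) - sum(capacity.values()))


# Hypothetical VM slots per network and current demand.
capacity = {"net-a": 10, "net-b": 10, "net-c": 10}
demand = {"net-a": 14, "net-b": 5, "net-c": 8}

dedicated_shortfall = unmet_dedicated(capacity, demand)
pooled_shortfall = unmet_pooled(capacity, demand)

print(f"Dedicated hosts: {dedicated_shortfall} requests cannot be placed")
print(f"Pooled hosts:    {pooled_shortfall} requests cannot be placed")
```

In the dedicated case the spare slots on the other networks are stranded until hosts are reconfigured; in the pooled case the same hardware absorbs the demand.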
There are cases in the application space where it is necessary for a cloud to span more than one pod. In these cases network traffic is aggregated within the pod, and significant amounts of bandwidth are available between the pods. For example, if there is storage in one pod and compute in another, there needs to be good network fabric between those pods for them to work.
The flexibility of the facility and the network fabric infrastructure provided an ideal location to host the private cloud that MSIT developed to offer IaaS to the business. The design of the facility delivered high density and energy efficiency and the network infrastructure was designed to meet both current and future capacity requirements. Those capabilities helped MSIT successfully build and run the private cloud in a flexible, cost-effective, and efficient environment.
Virtualization reduced operational overhead by reducing the required number of physical systems. Depending on the server role, the ratio of virtual machines to physical hosts can be as high as 8:1. Offering a private cloud that provides virtual machine resources to the research, development, and test teams gave those teams a way to attain the resources they require in a shorter amount of time, with the configuration they require, and for as long as they require, at lower cost than procuring and managing physical systems in individual lab spaces.
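The effect of a consolidation ratio on physical system count is a ceiling division. The sketch below applies the 8:1 ratio mentioned above to a hypothetical demand of 10,000 virtual machines, and assumes for comparison that each virtual machine would otherwise have needed its own physical system.

```python
def hosts_required(vm_count: int, vms_per_host: int) -> int:
    """Physical hosts needed to run vm_count virtual machines (rounded up)."""
    if vms_per_host <= 0:
        raise ValueError("vms_per_host must be positive")
    return -(-vm_count // vms_per_host)  # integer ceiling division


vm_count = 10_000   # hypothetical demand
ratio = 8           # 8:1 ratio from the case study

hosts_needed = hosts_required(vm_count, ratio)
# Assumed baseline: one physical system per workload without virtualization.
systems_avoided = vm_count - hosts_needed

print(f"Hosts needed at {ratio}:1: {hosts_needed:,}")
print(f"Physical systems avoided: {systems_avoided:,}")
```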
For More Information
For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at (800) 933-4750. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the World Wide Web, go to:
© 2011 Microsoft Corporation. All rights reserved.
This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft, Hyper-V, and Windows PowerShell are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.