Cloud Computing: Architecting a Microsoft Private Cloud

08/31/2016

In this first of a four-part series, you’ll learn what a private cloud is, and how hosted Infrastructure as a Service can support that environment.

David Ziembicki and Adam Fazio

There are many definitions for cloud computing, but one of the more concise and widely recognized definitions comes from the National Institute of Standards and Technology (NIST). NIST defines five essential characteristics, three service models and four deployment models. The essential characteristics form the core of the definition. The required characteristics for any solution to be called a true “cloud” solution include:

On-demand self-service
Broad network access
Resource pooling
Rapid elasticity
Measured service

NIST also defines three service models, or what are sometimes called architecture layers:

Infrastructure as a Service (IaaS)
Software as a Service (SaaS)
Platform as a Service (PaaS)

Finally, it defines four deployment models:

Private Cloud
Community Cloud
Public Cloud
Hybrid Cloud

Getting into the Cloud

Microsoft Services has designed, built and implemented a Private Cloud/IaaS solution using Windows Server, Hyper-V and System Center. Our goal throughout this four-part series will be to show how you can integrate and deploy each of the component products as a solution while providing the essential cloud attributes such as elasticity, resource pooling and self-service.

In this first article, we’ll define Private Cloud/IaaS, describe the cloud attributes and datacenter design principles used as requirements, then detail the reference architecture created to meet those requirements. In parts two and three, we’ll describe the detailed design of the reference architecture, each of the layers and products contained within, as well as the process and workflow automation. Finally, in part four we’ll describe the deployment automation created using the Microsoft Deployment Toolkit and Hydration Framework for consistent and repeatable implementations.

For a consistent definition of the cloud, we’ll use the NIST deployment models. We’ll use the term Private Cloud frequently in a variety of contexts without specifying the service model being discussed.

Besides the characteristics described in the NIST definition, we took on several additional requirements for this project:

Resiliency over redundancy
Homogenization and standardization
Resource pooling
Virtualization
Fabric management
Elasticity
Partitioning of shared resources
Cost transparency

A team within Microsoft gathered and defined these principles. The team profiled the Global Foundation Services (GFS) organization that runs our mega-datacenters; MSIT, which runs the internal Microsoft infrastructure and applications; and several large customers who agreed to be part of the research. With the stated definitions and requirements accepted, we moved on to the architecture design phase. Here we further defined the requirements and created an architecture model to achieve them.

Private Cloud/IaaS Reference Architecture

Using an architectural approach described in another of my technical articles, “From Virtualization to Dynamic IT” (The Architecture Journal, June 2010), we decided on the model shown in Figure 1 as the basis for the reference architecture.

Figure 1 The basis for our reference architecture.

Hardware Layer

The hardware layer includes the datacenter facility and mechanical systems, as well as the storage, network and computing infrastructure. Each of these elements must provide enabling management interfaces to interact with higher levels of the architecture. Examples include servers supporting Web Services-Management (WS-Management) and storage arrays providing Windows PowerShell or Storage Management Initiative – Specification (SMI-S) interfaces.

Microsoft developed the Hyper-V Cloud FastTrack program to create private cloud solutions by combining Microsoft software; consolidated guidance; validated configurations from OEM partners for compute, network and storage; and value-added software components. Hewlett-Packard Co., Dell Inc., IBM Corp., Fujitsu, Hitachi Ltd. and NEC Corp. are all FastTrack partners and provide integrated and validated solutions for the hardware layer.

Virtualization Layer

Windows Server 2008 R2 (now with Service Pack 1) and Hyper-V provide the virtualization layer. This lets us deliver compute through virtual machines (VMs), networking through VLANs, and storage through cluster shared volumes and virtual disks. The virtualization layer helps us achieve several of the essential NIST characteristics, such as resource pooling and elasticity. We’re able to share and provision capacity much faster through virtualization.

Automation Layer

The automation layer is the next layer of the stack from bottom to top (see Figure 2). The automation, management and orchestration layers build from the most granular to the widest breadth in terms of IT process automation. The lowest layer—the automation layer—includes technologies like Windows PowerShell 2.0, Windows Management Instrumentation (WMI) and WS-Management. These foundational technologies provide the interface between higher-level management systems and the physical and virtual resources.

Figure 2 The bottom-to-top architecture model used for the Private Cloud model.

Management Layer

The management layer consists of several Microsoft System Center products that leverage automation-layer technologies to perform management tasks such as checking for patch compliance, deploying patches and verifying installation. The management layer provides basic process automation, but is usually limited to one aspect of the server management lifecycle (such as deployment, patching, monitoring, backup and so on).

Orchestration Layer

The orchestration layer is one not typically seen in traditional IT environments, but it’s critical to providing cloud attributes. The orchestration layer binds multiple products, technologies and processes to enable end-to-end IT process automation. While System Center Configuration Manager can automate patch deployment, integrating it with a service-management system or additional third-party products and solutions requires an orchestration layer to coordinate an end-to-end process across multiple products.

For this layer, we use System Center Opalis (soon to be named System Center Orchestrator). Opalis integrates the System Center suite and facilitates integration with a number of third-party and partner solutions. The orchestration layer helps us create workflows or run books that can automate complicated tasks, such as cluster deployment, host patching and VM provisioning.

User Self-Service and Administrator Interfaces

The on-demand or user self-service attribute of the NIST definition is a new concept for many IT organizations. It’s primarily about removing the barriers between users’ needs for IT resources and the delivery of those resources. For example, in some organizations, it may take up to six months from the time a new server is requested for it to be readily available. Process and technology limitations cause the delay.

Self-service capability requires a new interface that lets users request services. This is most commonly manifested in an IT self-service portal. This portal would present users with a service catalog from which they can request items such as a new VM.

In our reference architecture, we define both a self-service interface for consumers and a centralized administrator interface for IT. For the consumer interface, Microsoft provides the System Center Virtual Machine Manager (VMM) Self-Service Portal 2.0, and for custom scenarios and hosters, the Dynamic Datacenter Toolkit for Hosters (DDTK-H). For our solution, we used a custom version of DDTK-H due to some of the required customization and automation. We anticipate using a more out-of-the-box solution from future Microsoft products.

For the administrator interface, we used System Center Service Manager (SCSM) and the System Center interfaces. SCSM is the newest Microsoft System Center product. It provides a configuration management database (CMDB) as well as a robust change-management solution. All common operations in our solution originate as change requests in SCSM. Those trigger automated workflows in Opalis. This is how we ensure proper change management while providing advanced automation.
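
The flow described above, where an operation starts as a change request and only then triggers an automated workflow, can be sketched in a few lines of Python. This is an illustrative model only, not the SCSM or Opalis API: the ChangeRequest class, the RUNBOOKS registry, the runbook decorator and the provision_vm workflow are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ChangeRequest:
    """Hypothetical stand-in for a change request record."""
    request_id: int
    operation: str                      # e.g. "provision_vm"
    parameters: Dict[str, str]
    approved: bool = False
    log: List[str] = field(default_factory=list)


# Hypothetical registry: operation name -> workflow function.
RUNBOOKS: Dict[str, Callable[[ChangeRequest], None]] = {}


def runbook(operation: str):
    """Register a function as the workflow for one operation."""
    def register(func: Callable[[ChangeRequest], None]):
        RUNBOOKS[operation] = func
        return func
    return register


@runbook("provision_vm")
def provision_vm(cr: ChangeRequest) -> None:
    # A real workflow would call out to the management products here.
    cr.log.append(f"provisioned {cr.parameters['vm_name']}")


def submit(cr: ChangeRequest) -> bool:
    """Run the matching workflow only for approved change requests."""
    if not cr.approved:
        cr.log.append("rejected: change request not approved")
        return False
    workflow = RUNBOOKS.get(cr.operation)
    if workflow is None:
        cr.log.append(f"rejected: no runbook for {cr.operation}")
        return False
    workflow(cr)
    return True
```

The point of the sketch is the gate in submit: no workflow runs unless an approved change request exists, so every automated action leaves a change record behind.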

Private Cloud/IaaS Logical Model

One of the key distinctions between a traditional datacenter and server environment and a private cloud is the abstraction of physical resources such as servers, networks and disks. These are placed into higher-level, logical groupings such as resource pools, fault domains, upgrade domains and so on. These logical groupings are mapped to physical infrastructure and help you make intelligent provisioning and management decisions. Based on work done by Microsoft Global Foundation Services, Windows Azure and MSIT, we used a logical model for our reference architecture (see Figure 3).

Figure 3 The logical grouping model for Private Cloud/IaaS.

These are the object definitions:

IaaS Fabric: The fabric is all infrastructure and systems under the scope of control of the reference architecture. The fabric can consist of multiple sites and datacenters.

Datacenter/Site: A physical location or site housing one or more resource pools.

Resource Pool: A resource pool comprises server, network and storage scale-units that share a common hardware and configuration baseline. They don’t share a single point of failure with any other resource pool (other than the facility itself). You could sub-divide a resource pool into further fault domains, with the definition of a fault domain being a group of physical infrastructure pieces with a common configuration that doesn’t share a single point of failure with any other fault domain. For simplicity, a resource pool and a fault domain are equivalent in our solution.

Scale-Unit: A scale-unit is a set of server, network and storage capacity deployed as a single unit. It’s the smallest unit of capacity deployed in the fabric. Depending on the customer size, a scale-unit may be a four-node Hyper-V cluster or a full rack of 64 blade servers. It’s typically sized as the average new capacity required on a quarterly basis. Rather than deploying a single server at a time, deploy a new scale-unit when you need additional capacity, both to fulfill the need and to leave room for growth.

Host Cluster: A host cluster is a group of two to 16 Hyper-V servers in a failover cluster configuration and their associated networks and storage.

Upgrade Domain: An upgrade domain is a set of infrastructure pieces within a resource pool you can maintain, take offline or upgrade without causing any downtime to the VMs or workloads running in the resource pool. In this architecture, node one of every cluster in resource pool one together form an upgrade domain. Because each cluster has a spare node (15 plus one) we can perform maintenance on one node in each cluster with no downtime (VMs are live-migrated off prior to maintenance). So all node ones in resource pool one are defined as upgrade domain one. All node twos are upgrade domain two, and so on (see Figure 4).

Figure 4 A resource pool with its child scale-units.
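
The upgrade domain rule (node one of every cluster is upgrade domain one, node two is upgrade domain two, and so on) is simple to express in code. The following Python sketch assumes a resource pool is represented as a mapping of cluster names to ordered host lists; the cluster and host names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def upgrade_domains(clusters: Dict[str, List[str]]) -> Dict[int, List[str]]:
    """Group hosts by node position across all clusters in a resource pool.

    Node 1 of every cluster lands in domain 1, node 2 in domain 2, etc.
    """
    domains: Dict[int, List[str]] = defaultdict(list)
    for cluster_name in sorted(clusters):
        for position, host in enumerate(clusters[cluster_name], start=1):
            domains[position].append(host)
    return dict(domains)


# Hypothetical resource pool with two 16-node host clusters.
pool = {
    "cluster-a": [f"a-node{n:02d}" for n in range(1, 17)],
    "cluster-b": [f"b-node{n:02d}" for n in range(1, 17)],
}
domains = upgrade_domains(pool)
print(len(domains))   # 16
print(domains[1])     # ['a-node01', 'b-node01']
```

Because each domain holds at most one node per cluster, taking a whole domain offline removes only one node from any given cluster, which is what the spare node is sized to absorb.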

The reason for defining and implementing these containers is so you can automate intelligent provisioning and management. For example, a four-server Web farm must stay available in at least one site if another site fails. To achieve that, simply ensure the provisioning request is spread across two sites and two or more resource pools. The definition of the resource pools and their mapping to physical infrastructure guarantees this separation, and the resulting layout of the VMs makes the service resilient.
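
As a rough illustration of the Web farm example, the sketch below spreads a provisioning request across sites and resource pools. It is one possible placement policy under simple assumptions (round-robin, no capacity checks), not the logic used in the reference implementation; ResourcePool and place_vms are hypothetical names.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, List


@dataclass(frozen=True)
class ResourcePool:
    name: str
    site: str


def place_vms(vm_names: List[str], pools: List[ResourcePool]) -> Dict[str, ResourcePool]:
    """Assign VMs round-robin so consecutive VMs land in different sites."""
    sites = sorted({p.site for p in pools})
    if len(sites) < 2:
        raise ValueError("need resource pools in at least two sites")
    by_site = {s: [p for p in pools if p.site == s] for s in sites}
    # Interleave pools site by site: first pool of each site, then second, ...
    ordered: List[ResourcePool] = []
    depth = max(len(site_pools) for site_pools in by_site.values())
    for i in range(depth):
        for s in sites:
            if i < len(by_site[s]):
                ordered.append(by_site[s][i])
    return dict(zip(vm_names, cycle(ordered)))


pools = [
    ResourcePool("rp1", "site-a"), ResourcePool("rp2", "site-a"),
    ResourcePool("rp3", "site-b"), ResourcePool("rp4", "site-b"),
]
placement = place_vms(["web1", "web2", "web3", "web4"], pools)
for vm, pool in placement.items():
    print(vm, pool.site, pool.name)
```

With four VMs and four pools in two sites, each site ends up with two Web servers in separate resource pools, so losing one pool or one whole site still leaves part of the farm running.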

Experienced System Center users will note that the containers and definitions described here are not in System Center out of the box. We used the extensibility of the SCSM CMDB to define these containers, attributes and relationships. Opalis workflow automation bases its output on these. In the future, with VMM 2012, many of these containers and relationships will come out of the box, albeit with a different naming convention.

Private Cloud/IaaS Reference Implementation

The logical and physical separation of the management platform from the VM hosting platform helps each scale independently (see Figure 5). The center of the diagram in Figure 5 shows the resource pools under the scope of the management system and that the entire solution can be deployed within an existing datacenter.

Figure 5 A logical diagram of how we would implement the architecture.

One of the key elements of the reference implementation is automated deployment, which improves both speed of deployment and consistency of implementation. This matters because Microsoft Services works with such a wide range of partners and customers. For deployment automation, the reference implementation includes the free Microsoft Deployment Toolkit (MDT) and the Microsoft Services Hydration Framework, which provides additional deployment automation on top of MDT.

The next step in the design process was to identify all the necessary areas of detailed design. These include:

Detailed design for each System Center product
Detailed design for the Fabric Management Hosting Infrastructure
Fabric management provisioning
Scale-unit design
Scale-unit provisioning
Workflow design

The reference architecture provides a solution for each of the NIST cloud attributes and an engine for advanced IT automation. In choosing which scenarios to automate, we focused on those with higher complexity, higher cost and higher risk of user error. To that end, the solution automates the following processes:

Fabric Management Installation:

Fabric management Hyper-V Host deployment
Virtualized SQL Cluster deployment
VMM deployment
SCSM deployment
System Center Operations Manager (SCOM) deployment
System Center Configuration Manager (SCCM) deployment
System Center Opalis deployment
Customization and configuration

Scale-Unit (Host Cluster) Provisioning:

Bare-metal OS installation
Hyper-V installation
Cluster configuration

Scale-Unit (Host Cluster) Patching:

Per upgrade domain, orchestrate live migration of VMs off hosts for patching using VMM maintenance mode and SCOM maintenance mode
Orchestrate SCCM to patch hosts and verify patch success
Remove hosts from maintenance mode and move to the next upgrade domain
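
The patching steps above amount to a rolling loop over upgrade domains. The Python sketch below models that loop with plain data and a caller-supplied patch function. It does not call VMM, SCOM or SCCM; the event strings stand in for those actions, and stopping at the first failed domain is an assumption of this sketch rather than documented behavior.

```python
from typing import Callable, Dict, List


def patch_scale_unit(
    domains: Dict[int, List[str]],
    patch_host: Callable[[str], bool],
) -> List[str]:
    """Patch one upgrade domain at a time and return the ordered event log.

    patch_host returns True when the patch was applied and verified.
    """
    events: List[str] = []
    for domain_id in sorted(domains):
        hosts = domains[domain_id]
        for host in hosts:
            # Entering maintenance mode is where VMs would live-migrate off.
            events.append(f"maintenance-on {host}")
        results = {host: patch_host(host) for host in hosts}
        failed = [host for host, ok in results.items() if not ok]
        if failed:
            # Assumption: leave the domain in maintenance mode and stop.
            events.append(f"halt domain {domain_id}: {', '.join(failed)}")
            return events
        for host in hosts:
            events.append(f"maintenance-off {host}")
    return events


example_domains = {1: ["a1", "b1"], 2: ["a2", "b2"]}
for line in patch_scale_unit(example_domains, lambda host: True):
    print(line)
```

Processing domains strictly in sequence means no more than one node per cluster is ever out of service, which matches the spare-node sizing described for the host clusters.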

Host Maintenance:

Orchestrate live migration of VMs off hosts requiring maintenance using VMM maintenance mode and SCOM maintenance mode
Remove hosts from maintenance mode

VM Provisioning:

Provide VM provisioning capability through portal interface
Opalis takes provisioning requests and orchestrates the provisioning of VMs from pre-configured templates
Opalis ensures the VM is created and visible in all System Center products
Opalis installs the SCOM agent in the requested VMs
VMs are presented and manageable from the portal interface
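
The provisioning sequence above can be modeled as a short, ordered function. The Python sketch below uses an in-memory Inventory class as a stand-in for the portal and the System Center products; every name in it is hypothetical, and the real workflow would call the products' own interfaces at each step.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Inventory:
    """Hypothetical in-memory stand-in for the management systems."""
    templates: Set[str]
    vms: Dict[str, str] = field(default_factory=dict)    # VM name -> template
    monitored: Set[str] = field(default_factory=set)     # VMs with an agent
    portal: List[str] = field(default_factory=list)      # VMs shown to users


def provision_vm(inv: Inventory, name: str, template: str) -> None:
    """Create a VM from a template, add monitoring, then expose it."""
    if template not in inv.templates:
        raise ValueError(f"unknown template: {template}")
    if name in inv.vms:
        raise ValueError(f"VM already exists: {name}")
    inv.vms[name] = template      # create from the pre-configured template
    inv.monitored.add(name)       # install the monitoring agent
    inv.portal.append(name)       # present the VM in the portal


inventory = Inventory(templates={"win2008r2-web"})
provision_vm(inventory, "web01", "win2008r2-web")
print(inventory.portal)   # ['web01']
```

Validating the request before any state changes keeps a rejected request from leaving a half-registered VM behind, which is the consistency the workflow is meant to guarantee.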

VM De-Provisioning:

Make requests to de-provision VMs from portal interface
Opalis takes de-provisioning requests and removes the VM from the System Center products and deletes the VM
Opalis deletes the VM’s Active Directory computer account and DNS A-record

In the next part of this series, we’ll delve into detailed design for the fabric-management architecture, including the fabric-management Hyper-V cluster design, the virtualized SQL cluster design and the design of each of the System Center products. We’ll also illustrate the scale-unit design, which is built from 16-node Hyper-V clusters.

David Ziembicki is a solution architect in the Microsoft Public Sector Services CTO organization, focusing on virtualization and private cloud computing. A Microsoft Certified Architect | Infrastructure, Ziembicki has been with Microsoft for five years, leading infrastructure projects at multiple government agencies. He’s a lead architect for Microsoft private cloud and virtualization service offerings, has been a speaker at multiple Microsoft events, and has served as an instructor for multiple virtualization-related training classes. Visit his blog.

Adam Fazio is a Solution Architect in the US Public Sector Services CTO organization with a passion for evolving customers' IT infrastructure from a cost-center to a key strategic asset. With focus on the broad Core Infrastructure Optimization model, specialties include: private cloud, datacenter, virtualization, management & operations, storage, networking, security, directory services, people and process. Check out Adam on the Private Cloud TechNet Blog and Twitter.