Achieving High Availability for Hyper-V
At a Glance:
- Consolidating servers using Hyper-V
- Ensuring high availability of virtual machines
- Setting up a Windows Server 2008 Failover Cluster
Server virtualization is poised to make a significant impact in enterprise IT departments, and Hyper-V with Windows Server 2008 can make it a reality. The consolidation of servers onto fewer physical machines has huge advantages in resource and cost savings, but two key factors need to be considered during the planning process. First, users have increasing expectations regarding the availability of their software, including both line-of-business (LOB) applications and tools such as messaging and collaboration platforms. Second, because each physical server now hosts several workloads, a problem or failure on a single server can have a significantly greater impact on operations. Windows Server 2008 and Hyper-V provide solutions that can be implemented to provide high availability (HA) for virtual machines (VMs) as well as for the workloads being hosted inside the VMs.
Availability means that users can access a system to do their work. With high availability, there is a significant expectation that users will always be able to access the system, because it has been designed and implemented to ensure operational continuity.
High availability for Hyper-V is achieved through the use of the Windows Server 2008 Failover Cluster feature. High availability is impacted by both planned and unplanned downtime, and failover clustering can significantly increase availability of virtual machines in both of these categories.
Virtual machines can be managed by the Failover Cluster, and the Failover Cluster can be used inside of virtual machines to monitor and move the workloads that are hosted in the VM. I'll describe both of these configuration scenarios in more detail, but the majority of this article will be about managing virtual machines. Before you get started, however, you may want to take a look at the "Useful Hyper-V Terms" sidebar.
Hosts and Guests
Because there are multiple operating systems running on a Hyper-V system, it can get challenging to keep clear which layer, or OS, is being discussed. I use the term "guest" to refer to the OS and environment within a Hyper-V VM that is running in a child partition. I use the term "host" to indicate the physical machine, which is being managed by the OS on the Hyper-V parent partition.
Host availability addresses the problems that come with the "putting all your eggs in one basket" scenario that server consolidation can cause. The Windows Server 2008 Failover Cluster can be configured on the Hyper-V parent partition (host) so that Hyper-V child partitions (virtual machines or guests) can be monitored for health and moved between nodes of the cluster. This configuration has the following key advantages:
- If the physical machine that Hyper-V and the VMs are running on needs to be updated, changed, or rebooted, then the VMs can be moved to other nodes of the cluster. The VMs can then be moved back once the physical machine is returned to service.
- If the physical machine that Hyper-V and the VMs are running on fails (perhaps a motherboard failure) or is significantly degraded, the other members of the Windows Failover Cluster will take over the ownership of the VMs and bring them online automatically.
- If the VM fails, it can be restarted on the same Hyper-V server or moved to another Hyper-V server. Since this is detected by the Windows Server Failover Cluster, it will automatically take recovery steps based on the settings in the VM's resource properties. Downtime is minimized due to the detection and recovery automation.
Figure 1 represents what might happen in such situations. First, VM2 resides on Host A, then VM2 is moved to Host B. Note that the node that owns the SAN storage LUN 2 changes from Host A to Host B during this move. To ensure that your high availability solution will meet your availability needs, consider carefully where the VMs will be placed. You need to think about both capacity and performance.
Figure 1 A virtual machine and its storage move to a new host
The capacity of the nodes should be sufficient to host all the VMs and to allow for x number of nodes to fail or be taken out of active cluster participation. (X represents the number of nodes you want the cluster to be able to tolerate losing and still be able to host all of the VMs.) When deciding on capacity, you can choose to have some nodes that don't regularly host VMs, holding them in reserve. Alternatively, you can spread the VMs across all nodes, ensuring that each node has enough extra capacity to be able to successfully take ownership and start the VMs if any x number of nodes were to fail.
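The capacity rule can be sketched as a simple check. The following is a minimal illustration, not a sizing tool: it assumes a single capacity figure per node (for example, memory in GB), uses a first-fit-decreasing heuristic to place the VMs, and tries every combination of x failed nodes. The function names and the example numbers are hypothetical. Because the placement is a heuristic, it can occasionally report "does not fit" for a layout that a smarter placement would handle.

```python
from itertools import combinations


def fits(vm_sizes, capacities):
    """First-fit-decreasing heuristic: True if every VM finds a node with room."""
    free = sorted(capacities, reverse=True)
    for size in sorted(vm_sizes, reverse=True):
        for i, room in enumerate(free):
            if size <= room:
                free[i] = room - size
                break
        else:
            # No surviving node had enough free capacity for this VM.
            return False
    return True


def tolerates_node_loss(node_capacity, vm_sizes, x):
    """True if the VMs can still be placed after ANY x nodes are lost.

    node_capacity: dict mapping node name -> capacity (assumed unit: GB of RAM)
    vm_sizes:      list of per-VM requirements in the same unit
    x:             number of node failures the cluster should tolerate
    """
    nodes = list(node_capacity)
    for failed in combinations(nodes, x):
        survivors = [node_capacity[n] for n in nodes if n not in failed]
        if not fits(vm_sizes, survivors):
            return False
    return True


# Hypothetical three-node cluster with 32 GB available for VMs on each node.
cluster = {"HostA": 32, "HostB": 32, "HostC": 32}
vms = [16, 16, 8, 8, 8]
one_failure_ok = tolerates_node_loss(cluster, vms, 1)
two_failures_ok = tolerates_node_loss(cluster, vms, 2)
```

In this example the cluster can lose any one node and still host every VM, but not any two nodes. A real plan would also account for CPU, storage I/O, and the needs of the parent partition.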
For daily performance reasons, it may be desirable to spread the VMs across all nodes of the cluster. If nodes are held in reserve and don't host any VMs, then the nodes hosting the VMs will have more resources in use and that may reduce the performance of the VMs as well as the management partition. Spreading the VMs across the nodes reduces the load that each is carrying and can provide better performance for the VMs and management partitions. It can, however, make capacity planning more challenging. Management software, such as System Center Virtual Machine Manager 2008, can help by providing calculations on capacity for node failure and VM placement.
Guest Availability focuses on making the workload that is being run inside a VM highly available. Common workloads include file and print servers, IIS, and LOB applications. Analyzing high-availability needs and solutions for workloads inside VMs is very much the same as on standalone servers. The solution will depend on the specific workload.
Some workloads can achieve high availability through Windows Network Load Balancing (NLB), which allows multiple servers to be part of a pool with a common network name. Clients make a connection request using that virtual network name and the connection is made to one of the nodes of the NLB cluster. A typical scenario that uses NLB clustering is building Web farms with IIS, where each individual system has IIS with the same Web pages and access to the same data. NLB provides load balancing as well as the ability to remove servers from membership for maintenance or in case of a problem with the server, and therefore provides a level of high availability. If a Hyper-V VM is running Windows Server 2008 (or an earlier version of Windows Server that includes NLB), the guest can be a member of an NLB cluster with other guests on the same or different Hyper-V host(s).
Guests that are running Windows Server 2008 can use the Windows Failover Cluster feature to provide high availability for their workloads. There are several advantages of using Windows failover clustering inside of a guest (guest clustering):
Workload Health Monitoring The Windows Failover Cluster has a resource monitor that makes calls to the resource DLL associated with the cluster. Each resource has health monitoring that tests the application or service being managed by the resource to ensure that it is operating correctly. These checks are commonly called the isAlive/looksAlive checks. If the resource fails one of these calls, the resource itself will fail. Depending on how its properties are configured, the resource may try to restart the service or application, or it may be moved to another node in the Windows Failover Cluster.
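The recovery behavior described above can be modeled in a few lines. This is only a conceptual sketch, not the cluster's resource monitor or its API: the Resource class, the max_restarts setting, and the poll function are hypothetical names, and a single boolean stands in for the result of whichever health check (looksAlive or isAlive) just ran.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Toy stand-in for a clustered resource and its recovery settings."""
    name: str
    owner: str
    max_restarts: int = 1   # hypothetical policy: restarts allowed before failover
    restarts: int = 0


def poll(resource, check_passed, other_nodes):
    """Apply one health-check result and return the action that was taken."""
    if check_passed:
        return "healthy"
    if resource.restarts < resource.max_restarts:
        # First line of recovery: restart on the node that currently owns it.
        resource.restarts += 1
        return "restart"
    if other_nodes:
        # Restart budget is used up: move the resource to another node.
        resource.owner = other_nodes[0]
        resource.restarts = 0
        return "failover"
    return "failed"


spooler = Resource(name="Print Spooler", owner="GuestNode1")
actions = [
    poll(spooler, True, ["GuestNode2"]),
    poll(spooler, False, ["GuestNode2"]),
    poll(spooler, False, ["GuestNode2"]),
]
```

With max_restarts set to 1, the first failed check leads to a restart on the same node and the second leads to a move to the other node, which mirrors the "restart or move" choice that the resource properties control.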
Virtual Machine Maintenance If the configuration of the VM needs to be changed, or if the OS or software needs to be updated or changed, the workload can be moved to another node of the cluster and the VM either shut down or updated with minimal interruption to the end users.
Host Machine Maintenance If the physical machine hosting a Hyper-V VM needs maintenance or software updates and other members of the Windows Failover Cluster are located on different Hyper-V hosts, the workload in the VM can be moved to another node of the cluster and that VM can be shut down to accommodate the changes or reboots of the physical server.
Virtual or Host Machine Failure If there is a failure of the physical Hyper-V host or the virtual machine guest, the other nodes of the Windows Failover Cluster will detect that the cluster member is no longer responding or participating in the cluster and the surviving nodes will bring online the applications or services that had been running on the failed VM.
Making VMs Highly Available
Configuring a virtual machine to be highly available is as simple as going through the HA Role Wizard under Failover Cluster Management. Hyper-V virtual machines have several key components that need to be considered when they are being managed as highly available. Let's take a look at some of the important concepts and general prerequisites.
Failover Cluster Nodes Each physical server that is part of a failover cluster is called a node. For host clustering, the Failover Cluster service runs in Windows Server 2008 on the parent partition of the Hyper-V system. This allows the VMs that are running in child partitions on the same physical servers to be configured as highly available virtual machines. The virtual machines that are configured for HA will be shown as resources in the Failover Cluster Management console.
HA Storage Highly available virtual machines can be configured to use Virtual Hard Disks (VHDs), passthrough disks, and differencing disks. To enable the movement of virtual machines between failover cluster nodes, there needs to be storage (appearing as disks in Disk Management) that can be accessed by any node that might host the VM and that is managed by the Failover Cluster service. Passthrough disks should be added to the failover cluster as disk resources, and VHD files must be on disks that are added to the failover cluster as disk resources.
Virtual Machine Resource This is a failover cluster resource type that represents the virtual machine. When the virtual machine resource is brought online, a child partition is created by Hyper-V and the OS in the virtual machine is started. The offline function of the virtual machine resource removes the VM from Hyper-V on the node where it was being hosted and the child partition is removed from the Hyper-V host. If the virtual machine is shut down, stopped, or put in saved state, this resource will be put in the offline state.
Virtual Machine Configuration Resource This is a failover cluster resource type that is used to manage the configuration information for a VM. There is one virtual machine configuration resource for each virtual machine. A property of this resource contains the path to the configuration file that contains all the information needed to add the virtual machine to the Hyper-V host. Access to the configuration file is required for a virtual machine resource to start. Because the configuration is managed by a separate resource, a VM resource's configuration can be modified even when the VM is offline.
Virtual Machine Services and Applications Group For a service or application to be made highly available through failover clustering, multiple resources must be hosted on the same failover cluster node. To ensure that these resources are always on the same node, and that they interoperate appropriately, the resources are put into a group that the Windows Server 2008 Failover Cluster refers to as a "Service or Application." The virtual machine resource and the virtual machine configuration resource for a VM are always in the same Services or Applications group. There may also be one or more physical disk (or other storage type) resources containing VHDs, configuration files, or passthrough disks in a Services or Applications group.
Resource Dependencies It is important to ensure that the virtual machine configuration resource is brought online before the virtual machine resource is brought online (started), and that the virtual machine configuration resource is taken offline after the virtual machine resource is taken offline (stopped). Setting the properties of the virtual machine resource so that it is dependent on the virtual machine configuration resource ensures this online/offline order. If a storage resource contains files used by the virtual machine configuration resource or the virtual machine resource, then that resource should also be made dependent on the storage resource(s). For example, if the virtual machine uses VHD files on disk G: and disk H:, the virtual machine resource should be dependent on the configuration file resource and the resource for disk G: and the resource for disk H:.
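The online and offline ordering that dependencies produce is a topological sort. The sketch below models the disk G: and disk H: example with Python's standard graphlib module (Python 3.9 or later). It illustrates the ordering only and does not call any cluster API. Placing the configuration file on disk G: is an assumption made for the example.

```python
from graphlib import TopologicalSorter

# Each resource maps to the set of resources it depends on.
depends_on = {
    "Virtual Machine": {"Virtual Machine Configuration", "Disk G:", "Disk H:"},
    "Virtual Machine Configuration": {"Disk G:"},  # assumed location of the config file
    "Disk G:": set(),
    "Disk H:": set(),
}


def online_order(deps):
    """Dependencies come online before the resources that need them."""
    return list(TopologicalSorter(deps).static_order())


def offline_order(deps):
    """Offline is the reverse: dependents stop before what they depend on."""
    return list(reversed(online_order(deps)))


bring_online = online_order(depends_on)
take_offline = offline_order(depends_on)
```

The relative order of the two disks is not fixed, because neither depends on the other. What the dependencies do fix is that both disks and the configuration resource are online before the virtual machine resource starts, and that the virtual machine resource is the first to go offline.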
Here are the three prerequisites for making Hyper-V virtual machines highly available using the Windows Server 2008 Failover Cluster feature:
- 1. The Windows Server 2008 Failover Cluster feature must be configured for each node of the cluster. For more information on configuring and managing Failover Clusters, see the "Hyper-V Resources" sidebar.
- 2. The Hyper-V role must be installed. Hyper-V updates should be installed and the role configured for each node of the failover cluster (again, see the "Hyper-V Resources" sidebar). Hyper-V has an update package that installs the Hyper-V server components and another one that installs the Hyper-V management console. Once the update is installed for the Hyper-V server components, the role can be added through Server Manager or ServerManagerCMD.
- 3. You need to have shared storage available to the virtual machines. The storage could be managed by the failover cluster as a built-in physical disk resource type, or you could use a third-party solution to manage the shared storage. Of course, the third-party solution must support Windows Server 2008 Failover Clusters.
Now let's see how a high availability solution is set up. The first step is to set up a virtual machine. On one of the nodes of the failover cluster that has the Hyper-V role installed, configure a virtual machine using the Hyper-V Manager (see Figure 2). This could be a new VM that is configured manually, or you could import a pre-existing one. The VHDs should be located on a disk that is managed by the Windows Server 2008 Failover Cluster and that is currently online on the node the VM is being configured on.
Figure 2 Configuring a virtual machine
Now put the virtual machine in the stopped state by shutting down, turning off, or saving state. Only virtual machines in the stopped state can be configured for management by the failover cluster.
Open the Failover Cluster Manager console (shown in Figure 3) on any server that is running the Windows Server 2008 Failover Cluster role, or on a Windows Vista client running the Remote Server Administration Tools (RSAT). Connect to the failover cluster by choosing the Manage a cluster… action and then selecting a node or the cluster name, or by selecting the option to connect to the cluster on the node that the console is running on.
Figure 3 A virtual machine in Failover Cluster Manager
From the Failover Cluster Manager console, select the Configure a Service or Application… action. This will open the High Availability Wizard, which takes you through configuring services, applications, or virtual machines to be managed by the failover cluster. In the Select Service or Application page of the wizard, choose Virtual Machine, and then select Next.
The Select Virtual Machine page will display all the virtual machines that are configured on any node of the failover cluster. Select a virtual machine, then select Next. The Confirmation page of the wizard will show any warnings or errors. In this step, the virtual machine configuration is checked to verify that it can be configured as an HA resource and that the nodes should be able to host it. Selecting Next here adds the virtual machine to the failover cluster as a highly available resource.
The Summary page provides information about the results of adding the virtual machine as a highly available resource, including any warnings. A View Report… button can show you details of the tasks that were done to make the virtual machine highly available, as well as any warnings or errors. Finally, select Finish to close the High Availability Wizard.
As the window in Figure 3 shows, the Failover Cluster Manager console will list an object with the default name of Virtual Machine(x) in the left pane under the cluster name and Services or Applications. Choose the virtual machine from that tree structure and the resources that are part of that Services or Applications group will show in the center pane of the console. If any other virtual machine had its files on the same storage as the one that was chosen, it will also be added to the group. Virtual machine and virtual machine configuration resources will be shown for each virtual machine that was put in the group.
The information pane for the Services or Applications group displays the Status, Alerts, Preferred Owners, and Current Owner information for the group. The Owner node is the node where the virtual machine is currently configured or running. Select the Move Virtual Machine(s) to another node action to have the virtual machine taken offline and then brought online on another node. It's generally a best practice to move a VM to each node that may host it in the failover cluster to verify that the move is successful and the VM will start and run.
Here are some key points to keep in mind when you are setting up virtual machines for high availability:
Storage If virtual machines have VHD files on the same shared disk, even if they are located on different volumes on the same disk, they will be placed in the same Service or Application group. One of the advantages of sharing the disk is that it enables better use of available shared storage space.
However, the trade-off is that anytime a virtual machine is moved, whether due to automated recovery from a problem with the virtual machine or because of an administrator's choice, all the virtual machines in the group will be moved.
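To see which virtual machines will end up moving together, you can treat shared disks as links between VMs and compute the connected groups. The sketch below does this with a small union-find structure. The inventory dictionary and the function name are hypothetical, and the code only models the grouping rule described above rather than querying a cluster.

```python
def group_vms_by_shared_disk(vm_disks):
    """Return sets of VM names that share (directly or indirectly) a disk.

    vm_disks: dict mapping VM name -> set of shared-disk identifiers
    """
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    def union(a, b):
        parent[find(a)] = find(b)

    for vm, disks in vm_disks.items():
        find(("vm", vm))  # register VMs that have no disks listed
        for disk in disks:
            union(("vm", vm), ("disk", disk))

    groups = {}
    for vm in vm_disks:
        groups.setdefault(find(("vm", vm)), set()).add(vm)
    return sorted(groups.values(), key=sorted)


# Hypothetical inventory: VM1 and VM2 keep VHDs on the same shared disk.
inventory = {
    "VM1": {"Disk 1"},
    "VM2": {"Disk 1"},
    "VM3": {"Disk 2"},
}
move_groups = group_vms_by_shared_disk(inventory)
```

Here VM1 and VM2 form one group and will always fail over together, while VM3 can move on its own. Giving each VM its own disk trades storage efficiency for independent movement.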
Drive Letters and GUIDs Volumes can be created without assigned drive letters. Virtual machines can use those volumes and the volumes can be managed by the failover cluster. If a disk resource has volumes that use GUIDs instead of drive letters, the GUID will be shown in Cluster Management. When you are creating virtual machines and specifying the path for the VHDs, it is very important to make sure that the GUID in the path matches the GUID that is showing in Cluster Management for the volume. If it does not match, the virtual machine may not start successfully (online) on other nodes of the failover cluster.
There are several situations that might cause the GUIDs to not match. If a volume was brought online on nodes before they were added as a failover cluster-managed disk resource, the volume can have a different GUID on each node. It's also possible for a volume to have multiple GUIDs on a single node. When a disk is added to a failover cluster as a physical disk resource, the volume GUIDs that are in use on the node that has the disks online will be noted in the properties for the disk resource.
The GUIDs for the volumes will be added to a node when the disk resource is brought online. This ensures that the particular GUID that the failover cluster has noted for the volume is a valid path on any node that brings the disk online. That node may have other GUIDs that are also associated with the same volume. Therefore, a user could find a GUID that is valid for the volume on that node, but it is not the same GUID that the failover cluster is ensuring is used for the volume on other nodes. The symptom of this problem is that a virtual machine resource, usually the Configuration resource, fails to come online and displays an error message indicating that the path is invalid. The path in the error message shows the GUID that is not the cluster-managed GUID for the volume.
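A quick way to catch this problem before a failover does is to compare the volume GUID in each VHD path with the GUIDs shown in Cluster Management. The sketch below assumes you have copied both lists by hand; the function name is hypothetical and the GUID values are made up for illustration. It only parses strings and does not query Hyper-V or the cluster.

```python
import re

# Matches the GUID inside a volume GUID path such as \\?\Volume{...}\folder\disk.vhd
VOLUME_GUID = re.compile(r"Volume\{([0-9A-Fa-f-]{36})\}")


def vhd_paths_with_unmanaged_guid(vhd_paths, cluster_guids):
    """Return the VHD paths whose volume GUID is not one the cluster manages."""
    managed = {guid.lower() for guid in cluster_guids}
    mismatched = []
    for path in vhd_paths:
        match = VOLUME_GUID.search(path)
        if match and match.group(1).lower() not in managed:
            mismatched.append(path)
    return mismatched


# Made-up example values.
cluster_managed = ["11111111-2222-3333-4444-555555555555"]
paths = [
    r"\\?\Volume{11111111-2222-3333-4444-555555555555}\VMs\vm1.vhd",
    r"\\?\Volume{AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE}\VMs\vm2.vhd",
]
suspect_paths = vhd_paths_with_unmanaged_guid(paths, cluster_managed)
```

Any path that is returned uses a GUID that may be valid on the current node but is not the one the failover cluster guarantees on the other nodes, so that VM's path should be corrected.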
Mount Points Volumes that are mounted to a folder in another volume, instead of assigning a drive letter or using a GUID, are valid for use with Hyper-V and failover clusters. Since both the volume that is mounted and the volume that is hosting the mount point must be on the same failover cluster node, you must have all the disks that are part of the mount point in the same Failover Cluster Service or Application group.
If the volumes are on the same disks, obviously this is not an issue. However, it is definitely an issue if the volumes are on different disks. It's equally obvious, but still worth mentioning, that both the volume that is mounted and the host of the mount point must be shared storage that is configured to be managed by the failover cluster.
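The two mount point rules (both disks cluster-managed, both in the same group) can be expressed as a small check. This is a sketch over hand-entered data; the function name, disk names, and group names are hypothetical.

```python
def mount_point_problems(mount_points, disk_group):
    """Check each mount point against the two rules for clustered storage.

    mount_points: list of (host_disk, mounted_disk) pairs
    disk_group:   dict mapping disk name -> Services or Applications group;
                  disks that are not managed by the cluster are simply absent
    """
    problems = []
    for host_disk, mounted_disk in mount_points:
        host_group = disk_group.get(host_disk)
        mounted_group = disk_group.get(mounted_disk)
        if host_group is None or mounted_group is None:
            problems.append((host_disk, mounted_disk, "disk not managed by the cluster"))
        elif host_group != mounted_group:
            problems.append((host_disk, mounted_disk, "disks are in different groups"))
    return problems


# Hypothetical layout: Disk 3 is mounted into a folder on Disk 2,
# but the two disks were placed in different groups.
groups = {"Disk 2": "Virtual Machine(1)", "Disk 3": "Virtual Machine(2)"}
issues = mount_point_problems([("Disk 2", "Disk 3")], groups)
```

An empty result means every mounted volume travels with the volume that hosts its mount point.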
Differencing Disks All VHD files that are part of differencing disks need to be on shared storage in the same Services or Applications group as the VM using the differencing disks. In its simplest configuration, a differencing disk involves two VHDs. One VHD is the parent, and it has a set of data that is used as a base. The other VHD is a child associated with the parent.
When first used, the differencing disk appears just like the parent. If data is located on the parent, it's read from that VHD. All writes happen on the child VHD. If data is located on the child, then a read for that data would reference the child VHD.
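The read and write behavior just described can be modeled with two dictionaries. This is a conceptual toy, not the VHD format: blocks are dictionary keys, and the class name is hypothetical.

```python
class DifferencingDisk:
    """Toy model: the parent is a read-only base, the child holds all changes."""

    def __init__(self, parent_blocks):
        self.parent = dict(parent_blocks)  # never modified after creation
        self.child = {}                    # receives every write

    def write(self, block, data):
        self.child[block] = data

    def read(self, block):
        # Prefer the child's copy; fall back to the parent for untouched blocks.
        if block in self.child:
            return self.child[block]
        return self.parent.get(block)


disk = DifferencingDisk({0: "base OS data", 1: "base app data"})
before = disk.read(0)
disk.write(0, "patched OS data")
after = disk.read(0)
```

After the write, reads of block 0 come from the child while block 1 is still served by the parent, which is why both VHDs must be reachable wherever the VM runs.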
If a VM is somehow configured so that the child VHD is on shared storage but the parent VHD is either not on the shared storage in the same group or on a locally attached storage device, then the VM would fail to start (online) if it is moved to another node. The High Availability Wizard should check to make sure this is configured correctly in the VM and provide an error message if it detects this problem, but it's worth noting the requirement in the case where the VM configuration has changed.
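If the VM configuration changes after the wizard has run, the same requirement can be rechecked by walking the parent chain. The sketch below assumes you supply the parent relationships and the group that each VHD's storage belongs to; all names and paths are hypothetical, and a missing group entry stands for locally attached storage.

```python
def vhds_outside_group(child_vhd, parent_of, storage_group, vm_group):
    """Walk from the child VHD up through its parents and list any VHD
    whose storage is not in the VM's Services or Applications group.

    parent_of:     dict mapping a VHD path -> its parent VHD path
    storage_group: dict mapping a VHD path -> group of the disk it lives on
                   (absent for local, non-clustered storage)
    """
    misplaced = []
    visited = set()
    vhd = child_vhd
    while vhd is not None and vhd not in visited:
        visited.add(vhd)  # guards against an accidental loop in the input
        if storage_group.get(vhd) != vm_group:
            misplaced.append(vhd)
        vhd = parent_of.get(vhd)
    return misplaced


# Hypothetical example: the child is on clustered disk G:, the parent on local C:.
parents = {r"G:\VMs\child.vhd": r"C:\Base\parent.vhd"}
placement = {r"G:\VMs\child.vhd": "Virtual Machine(1)"}
bad_vhds = vhds_outside_group(r"G:\VMs\child.vhd", parents, placement, "Virtual Machine(1)")
```

Any VHD in the result would be unreachable after a move to another node, so it needs to be relocated to shared storage in the VM's group.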
Steven Ekren is a Senior Program Manager with the Windows Server Failover Cluster and High Availability team. Steven spent 12 years with Microsoft Support where he assisted enterprise customers with implementing and troubleshooting Windows Server Failover Clusters and Virtualization technologies including Windows Hyper-V, System Center Virtual Machine Manager, Microsoft Virtual Server, and Microsoft Virtual PC.