Microsoft IT Deploys Windows Server 2008 Failover Clusters for File Services
Technical Case Study
Published: April 10, 2008
Microsoft Information Technology (Microsoft IT) uses failover clustering in the Windows Server® 2008 operating system to support users worldwide. Microsoft IT found the solution easy to plan and deploy, especially because of built-in migration tools. The result is a set of Windows Server 2008 clusters that support more users through increased reliability and features.
Microsoft seeks to provide highly available file services to users worldwide. The solution must keep TCO low while offering high performance and it must be easy to deploy, migrate, and manage.
Microsoft IT chose Windows Server 2008 failover clusters to provide highly available file services that support global users from several different namespaces and directories. Microsoft IT deployed the solution worldwide by using cluster migration tools in Windows Server 2008, and a small group of administrators manages the solution remotely.
The Microsoft IT division supports the daily IT operations of a large global corporation that has demands similar to those of many other organizations of the same size. These demands include the requirement that networks, servers, and applications be available with very little time set aside for maintenance. Supporting a global infrastructure means that administrators work within limited maintenance windows, with the infrastructure in use around the clock in many locations worldwide. Failover clustering in Windows Server 2008 provides the ability to meet these demands while improving on previous cluster technologies.
Within Microsoft IT, several groups support the daily operations of Microsoft Corporation worldwide. One key group is the File Services Utility (FSU) group, which manages resources and provides services from four sites worldwide. With Windows Server 2008 clustered file services operating from Redmond, Dublin, Singapore, and Kawaguchi, FSU provides thousands of worldwide users access to approximately 200 terabytes of data. This data includes all types of file share resources, which typically range in size from 20 gigabytes (GB) to 300 GB.
Each FSU site consists of several Windows Server 2008 cluster nodes that work together to provide redundant access to files for each region. The regions are similar in their structure and server composition to provide a straightforward method for management of cluster nodes. One of the sites has a single pair of servers in an active/passive configuration, as shown in Figure 1. In this configuration, each server is online, but only one server actively supports requests. If a hardware or software failure occurs on the active node, the passive node takes control and resumes file services. This method provides for highly available file services.
Figure 1. Active/passive cluster configuration
The other FSU sites use two active nodes that share a single passive node, also known as an active/active/passive configuration, as shown in Figure 2. This design provides more active resources to users, while a single failover node supports the active nodes. This configuration helps to reduce the cost of server deployments, but it also increases complexity through increased administrative overhead and design requirements.
Figure 2. Active/active/passive cluster configuration
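The trade-off between the two configurations can be illustrated with a small model. The following Python sketch is not part of the Microsoft IT deployment; the node and group names are hypothetical, and the model makes the simplifying assumption that only an idle (passive) node may take over a failed node's resource groups, which a real cluster does not strictly require.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True
    groups: list = field(default_factory=list)  # resource groups hosted here


def fail_over(nodes, failed_name):
    """Move every group from the failed node to a healthy passive node.

    A passive node is modeled as a healthy node that hosts no groups.
    Returns the name of the node that took over.
    """
    failed = next(n for n in nodes if n.name == failed_name)
    failed.healthy = False
    passive = [n for n in nodes if n.healthy and not n.groups]
    if not passive:
        raise RuntimeError("no passive node available to take over")
    target = passive[0]
    target.groups.extend(failed.groups)
    failed.groups.clear()
    return target.name


# Active/active/passive: two active nodes share one passive node.
cluster = [
    Node("node-a", groups=["shares-1"]),
    Node("node-b", groups=["shares-2"]),
    Node("node-c"),  # passive (hypothetical name)
]
took_over = fail_over(cluster, "node-a")
print(took_over, cluster[2].groups)
```

In this model, a second failure (for example, of node-b) raises an error because the shared passive node is already in use. This is one way to picture the additional design requirements of the shared-standby configuration.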
Microsoft IT chose to use Windows Server 2008 failover clusters for global file services for a number of reasons. As with most IT departments, the major considerations for Microsoft IT are availability, flexibility, and performance; the costs of procurement, migration, and maintenance are also important. The Windows Server 2008 clusters provide mission-critical file services that are crucial to the jobs of thousands of users worldwide, so the file services must be available around the clock, every day of the year. Migration is a major concern for many IT groups, with a focus on minimizing impact to users and reducing risk when moving the users from one cluster technology to another. Maintenance is also a key issue; a system that is designed to be easily maintained decreases maintenance effort and therefore the total cost of ownership (TCO) for the file services.
Windows Server 2008 offers the ability to meet all of these challenges and requirements. Some of the key benefits of Windows Server 2008 failover clusters are:
- Volumes under Windows Server 2008 clusters never stay in an unprotected state. Clusters use SCSI-3 persistent reservations.
- The Cluster service no longer requires a domain user account, which removes the need for password updates and associated account maintenance.
- The node names, cluster name, and network names are Active Directory® Objects.
- Kerberos-only authentication is the default authentication mechanism.
- Full support is included for Volume Shadow Copy Service (VSS) for easier backups.
- Windows Server 2008 clusters use SCSI-3 persistent reservations but respect cluster disk signatures that Windows Server 2003 clusters use, enabling easier and safer migrations from previous cluster releases.
The migration to Windows Server 2008 clusters involved managing key areas that are essential to maintaining service. These key areas help to ensure that the process, software, server hardware, and storage hardware are migrated safely. The key areas are:
- Protection of disk resources during migration
- Availability of service during migration
- Clearly defined and reliable rollback plan by keeping one or more Windows Server 2003 cluster nodes in reserve
- Clearly communicated and managed migration process
- Scheduled migrations during hours of least impact to users
The deployment of Windows Server 2008 clusters for file services gives users distributed access to multiple file shares through clusters with a Distributed File System (DFS) namespace. The deployment of clusters in Redmond, Dublin, Singapore, and Kawaguchi provides services globally, around the clock. Having global, centralized file services requires an approach that is demanding for normal computing environments and places emphasis on storage, networking, server, and application configurations. Microsoft IT upgraded the FSU sites from Windows Server 2003 clusters one at a time, and it easily applied lessons learned from the early deployments to the successive migrations. Microsoft IT also reinforced existing best practices during the process.
Multiple vendors support all FSU clusters through Fibre Channel connectivity to both traditional Fibre Channel storage and storage area network (SAN) storage. Each site includes several storage arrays that the FSU team presents to the cluster nodes through Fibre Channel switches from multiple manufacturers. It is notable that Microsoft IT did not require any new storage capacity or storage switch technology for the upgrade and deployment to the Windows Server 2008 clusters. Each server takes advantage of Multipath Input/Output (MPIO) for multipath access to the storage arrays. Although Windows Server 2008 fully supports 4-gigabit-per-second (Gbps) Fibre Channel cards, this deployment primarily used 2-Gbps cards because all servers were re-used from the previous Windows Server 2003 clusters.
The Windows Server 2008 clusters use built-in Gigabit Ethernet ports for user access and private network connections for cluster communications. The result is a cluster platform that, in Microsoft IT's experience, was simple to manage and provision and did not require complex networking configuration. It should be noted, however, that as part of the migration to Windows Server 2008 clusters, Microsoft IT used a phased in-place upgrade process that re-used existing physical network connections in addition to existing IP and namespace resources.
Windows Server 2008 clustering allows file services to be provided to disjoined network namespaces, which means that DFS replicas can be serviced on hosts that are in a different domain. This allows the file service clusters to provide root-level shares to DFS, which can then provide file-level and folder-level access to users in different domains. This is a key advantage for large, global deployments that span multiple Active Directory domains, and it enables users to access the resources that are based in one domain while administrators perform server management from a different domain. This key benefit is valuable for Windows Server 2008 clusters that the FSU group deploys.
Reasons for Choosing Windows Server 2008
The migration to Windows Server 2008 did not require the purchase of any new server, network, or Fibre Channel infrastructure, because of the increased performance and extensive driver support in Windows Server 2008. Microsoft IT migrated clustered file servers to Windows Server 2008 one FSU site at a time around the world. An in-place upgrade helped to speed the migration process by removing the need to procure and deliver new hardware worldwide.
Windows Server 2008 clustering improves on the high-availability capabilities of Windows Server 2003 clustering by adding support for additional features such as dynamic hardware partitioning for certain Windows Server 2008 approved hardware platforms. These features also help to make use of advanced hardware for mission-critical applications.
The FSU group typically does not handle server applications with extremely high disk demands, but users worldwide require performance in an equally demanding way. Users of the file share services expect the same high-performance access to their files and documents whenever they need them. Although this performance is not typically measured in disk I/O metrics, users and enterprise applications are equally demanding of file resources. The Windows Server 2008 clusters enable DFS to serve multiple namespaces and domains, enabling Microsoft IT to provide highly available file services regardless of internal namespace.
Servers have increased in power and reliability in recent years, providing more features for redundancy, such as multiple power supplies and large arrays of redundant disks. Although these new features have reduced the amount of effort that administrators require for certain kinds of maintenance, planned outages are still necessary for software and firmware updates and routine hardware replacement. This deployment focused on the re-use of existing server hardware; however, Windows Server 2008 provides support for dynamic hardware partitioning, which enables the replacement of disk, processor, and random access memory (RAM) without the need to take the server offline. Windows Server 2008 also places an emphasis on reducing the installation footprint for the operating system, installing only the components required for its role. The result is increased performance and reduced maintenance costs for servers on the most basic hardware platforms.
The FSU group uses failover clusters to reduce the downtime that administrators require for server maintenance. In the failover cluster model, one node provides clustered services such as file, print, messaging, or database services, while at least one additional node waits in a passive state to take over if a failure occurs. The process has been highly effective against server failures with past releases of Windows Server, and Windows Server 2008 continues to use these proven methods while providing new features and optimizations. One key benefit to failover clustering is that administrators can take the passive node offline momentarily for updates without affecting users. In this approach, administrators take the passive node offline and update it with new drivers or application updates. That server is not currently serving users, and it can be safely restarted.
Figure 3 illustrates the update process for the active/passive configuration.
Figure 3. Update process for active/passive configuration
After the passive node returns from being restarted, administrators can check it for errors and return it to service in the cluster group. During a maintenance window, the next step is to move the cluster resources from the current active node to the current passive node. In this step, the roles are reversed, with the typical standby node now hosting active connections to cluster resources. The downtime for file share resources is measured in seconds, and many users are not even aware that the change has occurred.
After administrators complete this process, they can update the remaining server without affecting users, because that server is now the passive node. After administrators complete all updates, they can reverse the process to move cluster resources back to the assigned node. Although this step is not required, it is considered a best practice to help clearly define the active and passive nodes during normal production hours. The move of cluster resources back to the original active node still takes only a matter of seconds.
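The update sequence described above can be summarized as a short procedure. The following Python sketch is only an illustration under stated assumptions: the function name, the node names, and the `update` callback are hypothetical, and no real cluster API is called.

```python
def rolling_update(active, passive, update, move_back=True):
    """Return the final active node and the ordered steps for updating
    a two-node active/passive pair.

    `update` is a callable applied to a node name while that node is passive.
    """
    steps = []

    # 1. Update the passive node; it serves no users, so a restart is safe.
    update(passive)
    steps.append(f"update {passive}")

    # 2. Swap roles: a brief move of cluster resources (seconds of downtime).
    active, passive = passive, active
    steps.append(f"move resources to {active}")

    # 3. Update the former active node, which is now passive.
    update(passive)
    steps.append(f"update {passive}")

    # 4. Optional best practice: move resources back to the assigned node.
    if move_back:
        active, passive = passive, active
        steps.append(f"move resources to {active}")

    return active, steps


updated = []  # records the order in which nodes were updated
final_active, steps = rolling_update("node-a", "node-b", updated.append)
for step in steps:
    print(step)
```

The only steps that users can notice are the two resource moves; both node updates happen while the node in question is passive.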
The end result is that downtime for users is limited to the time required to move cluster resources from one node to the other. Administrators can therefore perform in-depth server upgrades and updates without affecting the users or uptime metrics. This approach also provides a key opportunity for performing in-place upgrades: Microsoft IT used this phased in-place upgrade process for the deployment of the Windows Server 2008 failover clusters.
Windows Server 2008 cluster nodes cannot coexist in the same cluster with Windows Server 2003 nodes, so an organization must use a migration approach that focuses on moving cluster resources rather than a prolonged coexistence. Windows Server 2008 clusters support 16 nodes, an increase beyond the previous maximum of eight nodes under Windows Server 2003. The clusters that Microsoft IT deployed did not require additional nodes. Furthermore, an in-place upgrade approach proved to be highly effective and efficient, and it provided a safe and reliable path for rollback to Windows Server 2003 clusters in the event of a problem.
Microsoft IT's general approach was to first move all cluster resources to a single node. Microsoft IT completed this move in a matter of seconds, and users did not notice the change. For example, on a Windows Server 2003 cluster with two active nodes and a single passive node, Microsoft IT moved all resources from one active node to the other. This momentarily created an active/passive/passive configuration. After Microsoft IT successfully moved all resources to a single active node, it shut down the two passive nodes. To reduce the risk of any changes to shared Fibre Channel storage, Microsoft IT masked the Fibre Channel storage on the two shut-down nodes, preventing them from accessing any of the cluster's shared storage. Then, the team installed Windows Server 2008 Enterprise on the two shut-down nodes and configured them as a new active/passive cluster.
As mentioned earlier, Windows Server 2008 and Windows Server 2003 nodes cannot participate in the same cluster. For this reason, Microsoft IT built the Windows Server 2008 cluster separately, and then unmasked Fibre Channel storage to the new nodes and ran the Cluster Validation tests. After Microsoft IT confirmed that the new nodes could access all storage and ran the full Cluster Validation tests successfully, it migrated the Windows Server 2003 cluster resources and applications through the wizard on the Windows Server 2008 nodes.
After Microsoft IT migrated all resources to the new Windows Server 2008 active/passive cluster, it either shut down and migrated the remaining Windows Server 2003 cluster node just as it did for the other two nodes, or retained the node until administrators were comfortable that rollback was unnecessary. It is important to note that if Microsoft IT had encountered an error at any time during the migration, it could have easily reverted resources to the Windows Server 2003 cluster. The clear ability to roll back to the previous environment enables a confident approach to migrating critical company file shares.
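The ordering of the migration steps, and the point up to which rollback remains possible, can be sketched as follows. This Python model is an illustration only: the step names paraphrase the description above, and the `execute` callback is a hypothetical stand-in for the real work performed at each step.

```python
from enum import Enum, auto


class Step(Enum):
    CONSOLIDATE = auto()            # move all resources to one 2003 node
    SHUT_DOWN_PASSIVE = auto()      # shut down the now-passive 2003 nodes
    MASK_STORAGE = auto()           # hide shared storage from those nodes
    INSTALL_2008 = auto()           # install 2008 and build the new cluster
    UNMASK_STORAGE = auto()         # re-present storage to the new nodes
    VALIDATE = auto()               # run the Cluster Validation tests
    MIGRATE_RESOURCES = auto()      # run the migration wizard
    RETIRE_LAST_2003_NODE = auto()  # after this, rollback is no longer simple


PLAN = list(Step)  # Enum members iterate in definition order


def run_migration(execute):
    """Run PLAN in order, stopping at the first step that fails.

    `execute(step)` returns True on success. Rollback stays possible
    until the last Windows Server 2003 node has been retired.
    """
    completed = []
    for step in PLAN:
        if not execute(step):
            break
        completed.append(step)
    rollback_possible = Step.RETIRE_LAST_2003_NODE not in completed
    return completed, rollback_possible


# Example: simulate a failure of the validation tests.
completed, can_roll_back = run_migration(lambda s: s is not Step.VALIDATE)
print([s.name for s in completed], can_roll_back)
```

Because the last Windows Server 2003 node is retired only at the end of the plan, every earlier failure leaves the original cluster node available to resume service.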
Figure 4 illustrates the failover process for the active/passive configuration.
Figure 4. Failover process for active/passive configuration
There are two approaches to migrating clusters from previous releases to Windows Server 2008. Previously, the only viable approach for the demanding environments that clusters serve was to build a complete new set of clustered servers and to provision new storage. This approach required the procurement of additional hardware, even if the current hardware was satisfactory for use. In addition, limited options existed for migrating resources, so administrators had to manually re-create each cluster resource on the new servers. Although this approach is still valid, and many customers continue to use this approach as an opportunity to consolidate or refine what services are offered on Windows Server clusters, Microsoft IT was able to use an in-place upgrade that did not require the purchase of new hardware or storage. The migration also enabled complete regression back to Windows Server 2003 clusters, although Microsoft IT never needed this regression.
As mentioned earlier, Windows Server 2008 clusters have the capability to provide services to multiple namespaces at the same time, which further extends the business value to multiple business groups. In addition, DFS namespaces can be exported, deleted, and re-imported, when all domain controllers are running Windows Server 2008 and Active Directory is running in Windows Server 2008 functional mode. This capability provides for future scalability and supports Access-Based Enumeration (ABE), which enables users to view only shares that they have access to.
Management of the Microsoft cluster resources worldwide is very efficient; only three operations staff manage all of the clusters and the provisioning process for file share requests from users and groups. These three administrators not only handle daily permissions and new provisioning requests, but also handle all server maintenance issues. Through the use of Windows Server 2008 clusters, they can perform updates with little to no impact to users, by means of proven cluster update methodologies.
The relatively small management team manages servers in a highly distributed environment, with servers and storage in the United States, Ireland, Singapore, and Japan. Even in a large infrastructure, remote management of resources requires the server applications to be robust and flexible. The management console for Windows Server 2008 clusters is an excellent fit for these requirements; it enables a small team of administrators to remotely manage many servers worldwide.
The management interface for Windows Server 2008 clusters employs many additional features, from new cluster capabilities to using the Microsoft Management Console (MMC) version 3.0 interface. This integrated interface enables not only management of cluster groups and resources, but also backups by means of VSS and management of DFS. This capability enables administrators to manage a cluster and related resources in a single console rather than opening multiple consoles for different components, as was previously needed. In addition, the administrators can easily access cluster dependency reports and cluster events from the MMC. This greatly improved interface offers a more robust and detailed view for daily cluster administration.
Through the deployment of Windows Server 2008 failover clusters for the Microsoft data centers worldwide, Microsoft IT learned several key lessons.
Windows Server 2008 supports a configuration with multiple active nodes and a single passive node. However, Microsoft IT discovered that migrations required far less planning and effort overall when it deployed simple active/passive pairs, each consisting of a single active node and a single passive node. The later cluster deployments reflect this finding.
Although use of this configuration is a key benefit in terms of simplicity and support, there were no drawbacks to running previous configurations, and Microsoft IT left these configurations in place. In addition, active/passive pairs scale in a common, repeatable way: Microsoft IT can add further pairs in the standard configuration as needed.
The migration wizard for Windows Server 2008 clustering proved to be essential to the smooth migration from Windows Server 2003 clusters to Windows Server 2008. This wizard was the only migration tool that Microsoft IT used for all FSU cluster servers for the migration worldwide, and it proved to be very effective.
One limitation of the cluster migration wizard is that it is unable to migrate resources that are in the core cluster group (Cluster Name, Cluster IP, and Quorum disk) or that have dependencies on the cluster resource name. Because deploying cluster applications that are configured with dependencies on cluster group resources is not a best practice, the migration team accepted this limitation.
Before an organization uses the migration wizard, it should resolve dependencies on cluster resources so that the application resources do not rely on core cluster group resources such as cluster network name. In addition, the core cluster group should contain only resources that the cluster nodes themselves use, and any other resources that might have been placed in the cluster resource group must be reconfigured prior to migration.
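A pre-migration review of this kind can be expressed as a simple check over an inventory of resources. The following Python sketch assumes a hand-built inventory; the function name, the data layout, and the example resource names other than the three core resources named above are hypothetical, and the code does not query a real cluster.

```python
# Core cluster group resources, as named in the text above.
CORE_RESOURCES = {"Cluster Name", "Cluster IP", "Quorum disk"}


def find_blockers(dependencies, core_group_members):
    """Return resources that should be reconfigured before migration.

    dependencies: maps each resource name to the names it depends on.
    core_group_members: names of all resources placed in the core group.
    """
    blockers = {}
    for resource, deps in dependencies.items():
        if resource in CORE_RESOURCES:
            continue  # the core resources themselves are expected here
        if resource in core_group_members:
            blockers[resource] = "placed in the core cluster group"
        elif CORE_RESOURCES & set(deps):
            blockers[resource] = "depends on a core cluster resource"
    return blockers


# Hypothetical inventory for illustration.
inventory = {
    "Cluster Name": ["Cluster IP"],
    "Cluster IP": [],
    "Quorum disk": [],
    "Team Share": ["Cluster Name", "Disk T"],     # relies on the cluster name
    "Legacy Share": ["Disk L"],                   # sits in the core group
    "Project Share": ["Share Network Name", "Disk P"],
}
core_group = {"Cluster Name", "Cluster IP", "Quorum disk", "Legacy Share"}

blockers = find_blockers(inventory, core_group)
for name, reason in blockers.items():
    print(f"{name}: {reason}")
```

In this example, the check flags "Team Share" and "Legacy Share" for reconfiguration and leaves "Project Share" alone, because it depends only on its own network name and disk.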
The technique that the migration team used to move from Windows Server 2003 to Windows Server 2008 proved to be valuable. This technique enabled the in-place upgrade of Windows Server 2003 cluster nodes to Windows Server 2008.
For this process, the migration team moved all resources in the cluster group to a single server. Depending on the environment, this relocation left one or more servers in a passive state. The team then shut down these passive nodes and masked the shared storage from the nodes' Fibre Channel ports. The team then installed Windows Server 2008 on the servers and configured them as a new cluster. After the team completed the server builds, it re-presented (unmasked) the storage to the new cluster, and it used the migration wizard. This approach enabled a full rollback to Windows Server 2003 if needed, because the team took the last Windows Server 2003 node offline only after it successfully migrated all cluster resources to the Windows Server 2008 cluster.
Windows Server 2003 and Windows Server 2008 use different methods for labeling shared storage that clustered resources use. Windows Server 2003 uses cluster disk signatures, whereas Windows Server 2008 uses SCSI-3 persistent reservations. This difference means that Windows Server 2008 clusters do not interfere with Windows Server 2003 disk signatures when the disks are imported to a Windows Server 2008 cluster. Therefore, the disk signatures are at very little risk of change if a need arises to revert to a Windows Server 2003 cluster during a migration.
Windows Server 2008 clustering offers many of the same services as previous releases, but the key differentiator for Microsoft IT was the ability to provide services to multiple namespaces. This feature has enabled Microsoft IT to extend services and resources to groups that were previously unable to use the clusters under Windows Server 2003. As a result, the Windows Server 2008 clusters can serve more users and lower the TCO for the organization. Using centralized, multi-domain shares along with Windows Server 2008 clusters also helps to further centralize server management, reducing the number of resources that are required for monitoring and daily administration.
Windows Server 2008 clusters also provide numerous additional features that extend beyond the file services that FSU provides, so that other groups can use the same benefits.
An organization can use several best practices for deploying Windows Server 2008 clusters. The following practices can help to ease migration efforts and also to maintain a fully supported Windows Server 2008 cluster:
- Storage Masking
As part of the migration from Windows Server 2003 clusters to Windows Server 2008 clusters, an organization should mask the shared storage from the Windows Server 2008-based servers during setup and initial configuration. This practice prevents an administrator from accidentally installing the operating system on a cluster disk resource, or even accessing it and creating a disk lock. Masking and unmasking can be performed on the storage array or with zoning at the Fibre Channel switch. An administrator must unmask storage before running the Windows Server 2008 cluster validation wizard and before using the Windows Server 2008 cluster migration wizard to migrate resources from the Windows Server 2003 cluster node.
- Resource Dependencies
A key component of a successful cluster is clearly mapped and properly configured resource dependencies on Windows Server 2003 prior to migration. When correctly defined, dependencies provide the cluster with key information about the order in which resources should be brought online and about which resource failures require attention or failover.
- Phased Migrations
The approach of phased migrations incorporates incremental upgrades or additions to the environment, rather than moving all resources at once. The migration team at Microsoft benefited from phasing each site's file server cluster to Windows Server 2008 one data center at a time. Through this approach, the migration team learned valuable lessons about migration techniques and applied them to the next site, thereby providing a more gradual and informed migration approach.
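To make the resource dependencies practice more concrete, the following Python sketch shows how a dependency map determines a valid order for bringing resources online. The resource names are hypothetical, and the sketch uses the standard library's `graphlib` module (Python 3.9 or later) rather than any cluster API; it illustrates the idea of dependency ordering, not how the Cluster service is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical file share group: each resource maps to what it depends on.
dependencies = {
    "File Share": {"Network Name", "Physical Disk"},
    "Network Name": {"IP Address"},
    "IP Address": set(),
    "Physical Disk": set(),
}

# static_order() yields each resource only after everything it depends on.
online_order = list(TopologicalSorter(dependencies).static_order())
print(online_order)
```

The exact order of independent resources may vary, but the file share always comes after its network name and disk, and the network name always comes after its IP address. A circular dependency causes `graphlib` to raise `CycleError`, which mirrors why dependencies should be mapped clearly before migration.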
Microsoft IT's FSU group migrated to Windows Server 2008 clusters, meeting the group's key business requirements. The group met the requirement of increased availability through a more robust cluster solution that benefits from new features and optimizations. The business also places a high value on performance, which Windows Server 2008 offers through support for larger file systems, larger amounts of physical memory, more cluster nodes, and improved network performance via Server Message Block (SMB) version 2.0. Finally, the ability of Windows Server 2008 to support multiple namespaces addresses the issue of TCO by extending cluster services to a larger group of users and customers than was possible with the previous solution.
The result is 20 servers worldwide that provide almost 200 terabytes of data to thousands of users while maintaining a highly efficient and virtually transparent upgrade path from previous releases.
For More Information
For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (800) 563-9048. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the World Wide Web, visit any of the following sites:
© 2008 Microsoft Corporation. All rights reserved.
This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft, Active Directory, and Windows Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.