Failover Cluster Step-by-Step Guide: Validating Hardware for a Failover Cluster

Applies To: Windows Server 2008, Windows Server 2008 R2

In Windows Server® 2008 and Windows Server 2008 R2, the way that clusters are qualified is changing significantly with the introduction of the cluster validation wizard. The cluster validation wizard is a feature that is integrated into failover clustering in Windows Server 2008 and Windows Server 2008 R2. With the cluster validation wizard, you can run a set of focused tests on a collection of servers that you intend to use as nodes in a cluster. This cluster validation process tests the underlying hardware and software directly, and individually, to obtain an accurate assessment of how well failover clustering can be supported on a given configuration.

Note

A cluster validation report is required by Microsoft® Customer Support Services (CSS) as a condition of Microsoft supporting a given configuration.

In this guide

  • What defines a supported failover cluster configuration in Windows Server 2008?

  • Previous methods for qualifying clusters

  • What is cluster validation?

  • How to run the cluster validation wizard for a failover cluster

  • Understanding validation results

  • Specific validation scenarios

  • Understanding the validation tests required for your scenario

  • Future of cluster validation

  • Frequently asked questions

What defines a supported failover cluster configuration in Windows Server 2008?

For a failover cluster in Windows Server 2008 or Windows Server 2008 R2 to be considered an officially supported solution by Microsoft Customer Support Services (CSS), the solution must meet the following criteria.

  • All hardware and software components must meet the qualifications for the appropriate logo. For Windows Server 2008, this is the “Certified for Windows Server 2008” logo. For Windows Server 2008 R2, this is the “Certified for Windows Server 2008 R2” logo. For more information about the Logo Programs, see the Microsoft Web site at:

    https://go.microsoft.com/fwlink/?LinkId=111561

  • The fully configured solution (servers, network, and storage) must pass all tests in the validation wizard, which is part of the failover cluster snap-in.

The Microsoft support policy is also described in Knowledge Base article 943984, "The Microsoft Support Policy for Windows Server 2008 Failover Clusters" on the Microsoft Web site at:

https://go.microsoft.com/fwlink/?LinkId=111552

Note

For additional information about how this applies to multi-site or geographically dispersed clusters, see Multi-site or geographically dispersed clusters, later in this guide.

Previous methods for qualifying clusters

Before the introduction of cluster validation in Windows Server 2008, the process for defining whether Microsoft supported a particular cluster solution was handled in a completely different manner. For a proposed solution to qualify, it had to appear in its entirety on the Windows Server Catalog Web site (https://go.microsoft.com/fwlink/?LinkId=111553). Before a proposed solution could appear on this Web site, the vendor was required to run a series of tests (provided by Microsoft) on the hardware and then upload the results to Microsoft. After the vendor verified the solution, it would be listed on the Windows Server Catalog Web site. This offered a method for customers to find out whether a proposed cluster solution was verified to work with a particular operating system.

For more information about this method for qualifying clusters, see Knowledge Base article 309395, "The Microsoft support policy for server clusters, the Hardware Compatibility List, and the Windows Server Catalog" on the Microsoft Web site at:

https://go.microsoft.com/fwlink/?LinkId=112287

This method for qualifying clusters introduced several challenges for hardware partners. Since the entire end-to-end solution had to be qualified, including every component down to the driver and firmware level, if just one component changed after the certification process, the entire solution had to be retested and submitted, because it was then considered to be a different solution.

One common scenario was for a customer to check the Windows Server Catalog Web site, purchase a listed solution, and then, after a few months, find that the manufacturer had released a host bus adapter (HBA) driver that was recommended for installation. This created a quandary: installing the driver meant that the cluster solution was technically unsupported by Microsoft (because it was no longer listed in its entirety on the Windows Server Catalog Web site), but not installing it meant that the issues the new driver corrected remained unaddressed on the system, which could lead to other problems later.

Overall, this situation was not ideal. Something had to change to make the process easier for everyone involved: Microsoft, our customers, and the hardware vendors. That was a primary reason why Microsoft introduced cluster validation in Windows Server 2008.

What is cluster validation?

With the cluster validation wizard, you can run a set of focused tests on a collection of servers that are planned for use as cluster nodes. The cluster validation process tests the underlying hardware and software directly, and individually, to obtain an accurate assessment of how well failover clustering can be supported on a given configuration.

Important

Before you create a failover cluster, we strongly recommend that you run all tests in the cluster validation wizard.

Cluster validation is intended to catch hardware or configuration problems before the cluster goes into production. Cluster validation helps to ensure that the solution you are about to deploy is truly dependable. Cluster validation can also be performed on configured failover clusters as a diagnostic tool.

Considerations for performing cluster validation on an existing cluster

When you perform cluster validation on an already configured cluster, you might not always run all tests. If you include storage tests in the set of tests you run, there are different considerations to keep in mind than if you do not include storage tests. This section outlines the main considerations:

  • Considerations when including storage tests: When cluster validation is performed on an already configured cluster, if the default tests (which include storage tests) are selected, only disk resources that are in an Offline state or are not assigned to a clustered service or application will be used for testing the storage. This builds in a safety mechanism, and the cluster validation wizard warns you when storage tests have been selected but will not run on storage in an Online state, that is, storage used by clustered services or applications. This is by design to avoid disruption to highly available services or applications that depend upon these disk resources being online.

    One scenario where Microsoft CSS may request you to run validation tests on production clusters is when there is a cluster storage failure that could be caused by some underlying storage configuration change or failure. By default, the wizard warns you if storage tests have been selected but will not be run on storage that is online, that is, storage used by clustered services or applications. In this situation, you can run validation tests (including storage tests) by creating or choosing a new logical unit number (LUN) from the same shared storage device and presenting it to all nodes. By testing this LUN, you can avoid disruption to clustered services and applications already online within the cluster and still test the underlying storage subsystem.

    If a failover cluster passed the full set of validation tests and has no future hardware or software changes, then it will continue to be a supported configuration. However, when you perform routine updates to software components such as drivers and firmware, it may be necessary to re-run the validation wizard to ensure that the current configuration of the failover cluster is supported. The following guidelines can help in this process:

    • All components of the storage stack should be identical across all nodes in the cluster. It is required that multipath I/O (MPIO) software and Device Specific Module (DSM) software components be identical. It is recommended that the mass-storage device controllers—that is, the host bus adapter (HBA), HBA drivers, and HBA firmware—that are attached to cluster storage be identical. If you use dissimilar HBAs, you should verify with the storage vendor that you are following their supported or recommended configurations.

    • To minimize impact to highly available applications and services, a best practice is to keep a small LUN available to allow the validation wizard to run tests on available storage without negatively impacting clustered services and applications. This way, if Microsoft CSS requests you to run a full set of cluster validation tests, the wizard will follow the default behavior and run tests on the available storage (the new LUN only).

  • Considerations when not including storage tests: System configuration tests, inventory tests, and network tests have very low overhead, and can be performed without significant effect on servers in a cluster.

    Microsoft CSS may request you to run the cluster validation on a production cluster as part of normal troubleshooting procedures (not focused on storage). In this scenario, you will use the wizard to inventory hardware and software, perform network testing, and validate system configuration. There may be certain scenarios in which only a subset of the full tests is needed. For example, if troubleshooting a problem with networking on a production cluster, Microsoft CSS may request that you run only the hardware and software inventory and the network tests.
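The storage-stack guideline above (MPIO and DSM software must be identical on all nodes; HBAs, HBA drivers, and HBA firmware are recommended to be identical) can be pictured as a simple comparison across nodes. The following is only a minimal sketch under stated assumptions: the inventory dictionary, the component labels, and the version strings are all hypothetical, and the validation wizard performs its own checks independently of anything shown here.

```python
# Component labels are hypothetical; they stand in for whatever your
# own hardware and software inventory records for each node.
REQUIRED_IDENTICAL = ("MPIO", "DSM")
RECOMMENDED_IDENTICAL = ("HBA model", "HBA driver", "HBA firmware")


def compare_storage_stack(inventory):
    """Return (required_mismatches, recommended_mismatches).

    inventory maps a node name to a dict of component -> version string.
    A component is reported when the nodes do not all share one value.
    """
    def mismatched(components):
        found = []
        for component in components:
            versions = {info.get(component) for info in inventory.values()}
            if len(versions) > 1:
                found.append(component)
        return found

    return mismatched(REQUIRED_IDENTICAL), mismatched(RECOMMENDED_IDENTICAL)


if __name__ == "__main__":
    # Hypothetical two-node inventory with one differing HBA driver.
    inventory = {
        "NODE1": {"MPIO": "1.0", "DSM": "2.1", "HBA model": "X",
                  "HBA driver": "5.0", "HBA firmware": "3.2"},
        "NODE2": {"MPIO": "1.0", "DSM": "2.1", "HBA model": "X",
                  "HBA driver": "5.1", "HBA firmware": "3.2"},
    }
    required, recommended = compare_storage_stack(inventory)
    print("Must fix:", required)                 # Must fix: []
    print("Check with vendor:", recommended)     # Check with vendor: ['HBA driver']
```

A mismatch in the second list corresponds to the case in which you should verify the configuration with the storage vendor.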

How to provide a validation report when obtaining support from Microsoft

Microsoft will help you collect the validation report through the Microsoft Support Diagnostic Tool (MSDT), which is the replacement for the MPSReports data collection utility. Microsoft CSS will send the MSDT via e-mail with instructions on how to capture the data. In some situations, Microsoft CSS may request that the contents of the C:\Windows\Cluster\Reports folder be zipped and sent in for analysis. Either method will collect the required cluster validation report.
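If Microsoft CSS asks for the contents of the reports folder, any archiving tool will do. As one possible illustration, the following Python sketch zips every file under that folder. The function name and the output file name are assumptions made for this example, not part of any Microsoft tool.

```python
import zipfile
from pathlib import Path


def zip_reports(reports_dir: Path, archive_path: Path) -> list:
    """Zip every file under reports_dir; return the archived relative paths."""
    archived = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(reports_dir.rglob("*")):
            if path.is_file():
                relative = path.relative_to(reports_dir).as_posix()
                archive.write(path, arcname=relative)
                archived.append(relative)
    return archived


if __name__ == "__main__":
    # Report location named in this guide; adjust if Windows is installed elsewhere.
    source = Path(r"C:\Windows\Cluster\Reports")
    # Hypothetical output location, kept outside the folder being archived.
    target = Path.home() / "ClusterReports.zip"
    if source.is_dir():
        for name in zip_reports(source, target):
            print("added", name)
    else:
        print(f"{source} not found")
```

Writing the archive outside the reports folder avoids including the archive in itself.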

How to run the cluster validation wizard for a failover cluster

To validate a new or existing failover cluster

  1. Identify the server or servers that you want to test and confirm that the failover cluster feature is installed:

    • If the cluster does not yet exist, choose the servers that you want to include in the cluster, and make sure you have installed the failover cluster feature on those servers. To install the feature, on a server running Windows Server 2008 or Windows Server 2008 R2, click Start, click Administrative Tools, click Server Manager, and under Features Summary, click Add Features. Use the Add Features wizard to add the Failover Clustering feature.

    • If the cluster already exists, make sure that you know the name of the cluster or a node in the cluster.

  2. Review network or storage hardware that you want to validate, to confirm that it is connected to the servers. For more information, see https://go.microsoft.com/fwlink/?LinkId=111555.

  3. Decide whether you want to run all or only some of the available validation tests. For detailed information about the tests, see the topics listed in https://go.microsoft.com/fwlink/?LinkId=111554.

    The following guidelines can help you decide whether to run all tests:

    • For a planned cluster with all hardware connected: Run all tests.

    • For a planned cluster with parts of the hardware connected: Run System Configuration tests, Inventory tests, and tests that apply to the hardware that is connected (that is, Network tests if the network is connected or Storage tests if the storage is connected).

    • For a cluster to which you plan to add a server: Run all tests. Before you run them, be sure to connect the networks and storage for all servers that you plan to have in the cluster.

    • For troubleshooting an existing cluster: You might run all tests, although you could run only the tests that relate to the apparent issue.

Important

If a clustered service or application is using a disk when you start the wizard, the wizard will prompt you about whether to take that clustered service or application offline for the purposes of testing. If you choose to take a clustered service or application offline, it will remain offline until the tests finish.

  4. In the failover cluster snap-in, in the console tree, make sure Failover Cluster Management is selected and then, under Management, click Validate a Configuration.

  5. Follow the instructions in the wizard to specify the servers and the tests, and run the tests.

    Note that when you run the cluster validation wizard on unclustered servers, you must enter the names of all the servers you want to test, not just one.

    The Summary page appears after the tests run.

  6. While still on the Summary page, click View Report to view the test results.

    To view the results of the tests after you close the wizard, see SystemRoot\Cluster\Reports\Validation Report date and time.html, where SystemRoot is the folder in which the operating system is installed (for example, C:\Windows).

  7. To view Help topics that will help you interpret the results, click More about cluster validation tests.

    To view Help topics about cluster validation after you close the wizard, in the failover cluster snap-in, click Help, click Help Topics, click the Contents tab, expand the contents for the failover cluster Help, and click Validating a Failover Cluster Configuration.
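Because each run writes a separate report whose file name includes the date and time, the reports folder can accumulate several files. The following sketch shows one way to locate the most recently modified report. It assumes that report file names begin with "Validation Report" and end in ".html", as described in the procedure above; the function name is hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def latest_validation_report(reports_dir: Path) -> Optional[Path]:
    """Return the most recently modified validation report, or None if none exist."""
    candidates = list(reports_dir.glob("Validation Report*.html"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # SystemRoot is normally set on Windows; fall back to the example path from this guide.
    system_root = Path(os.environ.get("SystemRoot", r"C:\Windows"))
    report = latest_validation_report(system_root / "Cluster" / "Reports")
    print(report if report else "No validation report found")
```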

Understanding validation results

After the validation wizard has completed, the Summary Report will display the results. All tests must pass with either a green check mark or in some cases a yellow triangle (warning). The following table shows the symbols in the summary and tells what they mean:

| Symbol | Meaning |
|---|---|
| Green check mark (passed) | The corresponding validation test passed, indicating that this aspect of the cluster can be supported. |
| Yellow triangle (warning) | The corresponding validation test produced a warning, indicating that this aspect of the cluster can be supported, but it might not meet the recommended best practices and should be reviewed. Microsoft CSS might ask you to investigate or address the problem if it appears to be directly linked to the issue that you are troubleshooting. |
| Red X (failed) | The corresponding validation test failed, and this aspect of the cluster is not supported. You must correct the problem before you can create a failover cluster that is supported. |
| Canceled | The corresponding validation test was canceled. This can occur when the test depended on another test that did not complete successfully. |

When looking for problem areas (red Xs or yellow warning triangles), go to the part of the report that summarizes the test results and click an individual test to review the details. Also review the summary statement for information about whether or not the cluster is a supported configuration.

After you take action to correct the problem, you can rerun the wizard as needed to confirm that the configuration passes the tests.

What to do if validation tests fail

In most cases, if any tests in the cluster validation wizard fail, then Microsoft does not consider the solution to be supported. There are exceptions to this rule, such as the case with multi-site (geographically dispersed) clusters where there is no shared storage. In this scenario the expected result of the validation wizard is that the storage tests will fail. This is still a supported solution if the remainder of the tests complete successfully.

The type of test that fails is a guideline to the corrective action to take. For example, if the storage test "List all disks" fails, and subsequent storage tests do not run (because these would also fail), contact the storage vendor to troubleshoot. Similarly, if a network test related to IP addresses fails, consult with your network infrastructure team. Not all warnings or errors indicate a need to call Microsoft CSS. Most of the warnings or errors should result in working with internal teams or with a specific hardware vendor.

For information about correcting failures listed in a validation report, see the previous section, Understanding validation results.

After the issues have been resolved, re-run the cluster validation wizard. For the configuration to be supported, all tests must be run and must complete without failures.

Multi-site or geographically dispersed clusters

Failover cluster solutions that do not have a common shared disk and instead leverage data replication between nodes might not pass the cluster validation "storage" tests. This is a common configuration in cluster solutions where nodes are stretched across geographic regions. If a cluster solution does not require external storage to fail over from one node to another, it does not need to pass the "storage" tests to be a fully supported solution.
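The pass/fail rules in this section, including the storage exception for clusters without shared storage, can be summarized in a few lines of code. The following is a rough sketch only: the data structure is invented for illustration, the wizard does not expose results in this form, and treating a canceled test as blocking is this example's reading of the requirement that all tests complete. It also covers only the validation criterion, not the logo requirement.

```python
from enum import Enum


class Result(Enum):
    PASSED = "passed"
    WARNING = "warning"
    FAILED = "failed"
    CANCELED = "canceled"


def blocking_tests(results, uses_shared_storage=True):
    """Return the names of tests that keep a configuration from being supported.

    results maps (test group, test name) to a Result. Passed and warning
    outcomes are acceptable; failed and canceled outcomes are not.
    """
    blocking = []
    for (group, name), result in results.items():
        if group == "Storage" and not uses_shared_storage:
            # Multi-site exception: storage tests need not pass when the
            # cluster replicates data instead of sharing a disk.
            continue
        if result in (Result.FAILED, Result.CANCELED):
            blocking.append(name)
    return blocking


if __name__ == "__main__":
    # Illustrative results; the network and inventory test names are hypothetical.
    results = {
        ("Storage", "List all disks"): Result.FAILED,
        ("Network", "Example IP address test"): Result.PASSED,
        ("Inventory", "Example driver inventory"): Result.WARNING,
    }
    print(blocking_tests(results))                             # ['List all disks']
    print(blocking_tests(results, uses_shared_storage=False))  # []
```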

For more information on multi-site or geographically dispersed clusters, see the following whitepaper (https://go.microsoft.com/fwlink/?LinkId=112125).

Logos for Windows Server 2008 and Windows Server 2008 R2

Designed for line-of-business and mission-critical applications, the "Certified for Windows Server 2008" and "Certified for Windows Server 2008 R2" logos indicate that the application or hardware has been independently tested to meet the highest bar for stability, security, reliability, availability, Windows operating system fundamentals, and platform compatibility.

Hardware components that can run Windows Server 2008, Windows Server 2008 R2, or both, are eligible to receive the corresponding logo or logos. A logo covers each of the individual server hardware components such as the host bus adapter (HBA) or network adapter, and each associated driver or firmware revision is eligible for the appropriate logo. Components such as routers, hubs, or switches are not eligible to receive a logo.

Specific validation scenarios

The following lists describe scenarios in which validation is needed or useful.

  • Validation before the cluster is configured

    • A set of servers ready to become a failover cluster: This is the most straightforward validation scenario. The hardware components (systems, networks, and storage) are connected, but the systems are not functioning as a cluster. Running tests in this situation has no impact on availability.

    • Cloned or imaged systems: With systems that you have cloned or imaged to different hardware, you must run the cluster validation wizard as you would with any other new cluster. We recommend that you run the wizard just after you connect the hardware components and install the failover cluster feature, before clients begin using the cluster.

    • Virtualized servers: With virtualized servers in a cluster, run the cluster validation wizard as you would with any other new cluster. The requirement for running the wizard is the same regardless of whether you have a "host cluster" (where failover will occur between two physical computers), a "guest cluster" (where failover will occur between guest operating systems all on the same physical computer), or some other configuration that includes one or more virtualized servers.

  • Validation when the cluster has only one node: You might want to run a limited number of validation tests on a single server that you intend to use in a cluster. Some tests cannot be run in this situation: tests that confirm that the software and software updates match between servers, and storage tests that simulate failover between nodes. When you bring one or more additional servers into the configuration, you must run the cluster validation wizard again so that all tests can complete. In other words, you must have at least two nodes in a cluster before you can complete the cluster validation process.

  • Validation after the cluster is configured and in use

    • For confirmation that the cluster is supported, or to rule out configuration problems: If you need support and it is necessary to rule out configuration problems with hardware, drivers, and basic system configuration, Microsoft CSS might require you to provide the report from the cluster validation wizard. If you have not already run the wizard and saved the report, you might need to take the cluster offline to run the wizard. The report shows whether your configuration is supported and can help with troubleshooting the issues on the cluster.

    • Before adding a node: When you add a server to a cluster, we strongly recommend that you start by connecting the server to the cluster networks and storage and then run the cluster validation wizard, specifying both the existing cluster nodes and the new node.

    • When attaching new storage: When you attach new storage to the cluster (different from exposing a new LUN in existing storage), you must run the cluster validation wizard to confirm that the new storage will function correctly. To minimize the impacts to availability, we recommend that you run the wizard after attaching the storage but before beginning to use any of the new LUNs in clustered services or applications.

    • When making changes that affect firmware or drivers: If you want to upgrade or make other changes to the cluster that would require changing the firmware or drivers, you must run the cluster validation wizard to confirm that the new combination of hardware, firmware, drivers, and software supports failover cluster functionality. If the change affects firmware or drivers for the storage, we recommend that you keep a small LUN available (unused by clustered services and applications) so that you can run the storage validation tests without taking your services and applications offline. For more information about running storage tests on a small unused LUN, see Considerations for performing cluster validation on an existing cluster.

    • After restoring a system from backup: After you restore a system from backup, run the cluster validation wizard to confirm that the system can function correctly as part of a cluster. The system is not considered a supported system until the validation tests are run.

Understanding the validation tests required for your scenario

You do not always need to run all tests in the cluster validation wizard when making a change to your cluster. This section lists the kinds of changes you might make to a cluster and the corresponding tests to run.

Important

To begin the process of adding hardware (such as an additional server) to a failover cluster, connect the hardware to the failover cluster. Then run the cluster validation wizard, specifying all servers that you want to include in the cluster. The wizard tests cluster connectivity and failover, not just isolated components (such as individual servers).

Categories of validation tests

  • Full: The complete set of tests. This requires some cluster downtime.

  • Single LUN: The complete set of tests, where you run the storage tests on only one LUN. The LUN might be a small LUN that you set aside for testing purposes, or the witness disk (if your cluster uses a witness disk). This validates the storage subsystem, but not specifically each individual LUN or disk. You can run these validation tests without causing downtime to your clustered services or applications.

  • Omit storage tests: The system configuration, inventory, and network tests, but not the storage tests. You can run these validation tests without causing downtime to your clustered services or applications.

  • None: No validation tests are needed.
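As a quick reference, the four categories above can be written down as data. This is only an illustrative sketch: the class and field names are hypothetical, and the storage scope given for the Full category ("all disks") is an inference from the category descriptions rather than a statement from the wizard itself.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ValidationPlan:
    groups: Tuple[str, ...]        # test groups to select in the wizard
    storage_scope: Optional[str]   # which storage the storage tests touch
    causes_downtime: bool          # whether clustered services go offline


NON_STORAGE = ("System Configuration", "Inventory", "Network")

PLANS = {
    "Full": ValidationPlan(NON_STORAGE + ("Storage",), "all disks", True),
    "Single LUN": ValidationPlan(
        NON_STORAGE + ("Storage",), "one spare LUN or the witness disk", False
    ),
    "Omit storage tests": ValidationPlan(NON_STORAGE, None, False),
    "None": ValidationPlan((), None, False),
}

if __name__ == "__main__":
    for name, plan in PLANS.items():
        groups = ", ".join(plan.groups) or "no tests"
        downtime = "downtime" if plan.causes_downtime else "no downtime"
        print(f"{name}: {groups} ({downtime})")
```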

Server Changes

| Change | Validation tests required |
|---|---|
| Physically replacing or changing a server used in the cluster | Full |
| Adding or removing CPU processors | None |
| Adding or removing RAM memory on a server | None |
| Adding, removing, or replacing a network adapter | Omit storage tests |
| Updating firmware or an existing network driver | Omit storage tests |
| Changing the BIOS settings or firmware version | None |
| Adding or changing peripheral devices other than networking or storage components, such as CD-ROM / DVD drives, tape drives, video cards, sound, and USB devices | None |

Operating System Changes

| Change | Validation tests required |
|---|---|
| Changing from 32-bit to 64-bit operating system on a computer that can run either | Full |
| Applying operating system service pack, software updates, or hotfixes that affect the storage stack | Single LUN |
| Applying software updates or hotfixes that do not affect the storage stack | Omit storage tests |
| Installing an application that has no kernel mode or filter drivers | None |
| Changing or adding new kernel mode drivers | Single LUN |

Cluster Configuration Changes

| Change | Validation tests required |
|---|---|
| Adding a new node to the cluster | Full |
| Adding a new node that uses dissimilar hardware | Full |
| Removing a node from the cluster | None |
| Changing the quorum configuration | None |

Shared Storage Changes

| Change | Validation tests required |
|---|---|
| Changing or adding a storage array | Full |
| Adding another SCSI hardware RAID unit of the same type, as long as that unit uses an HBA that is already in the configuration | Single LUN |
| Making a minor (0.x) revision to the storage firmware | Single LUN |
| Making a major (X.0) revision to the storage firmware | Single LUN |
| Presenting a new disk or LUN to a cluster | Full, but test new LUNs only |

SAN (Switch / Hub) Changes

| Change | Validation tests required |
|---|---|
| Adding or replacing a Fibre Channel switch or hub | Full |
| Changing the number of ports within a switch block | None |
| Making a minor (0.x) revision to the Fibre Channel switch firmware | Single LUN |
| Making a major (X.0) revision to the Fibre Channel switch firmware | Single LUN |
| Changing a switch configuration or zoning | Full, but test changed LUNs only |

Host Bus Adapter (HBA) Changes

| Change | Validation tests required |
|---|---|
| Replacing an HBA (same or different type) | Full |
| Adding a new HBA (same or different type) | Single LUN |
| Changing the HBA firmware or BIOS | Single LUN |
| Changing the HBA driver version | Single LUN |

Multi-path Software Changes

| Change | Validation tests required |
|---|---|
| Changing from single path to multi-path or multi-path to single path | Full |
| Adding a path | Single LUN |
| Removing a path | Single LUN |
| Updating the device specific module (DSM) version | Single LUN |
| Changing to a DSM of a different type, for example, a DSM from a different provider | Single LUN |

Multi-site Cluster Changes

| Change | Validation tests required |
|---|---|
| Modifying the networks that connect the nodes | Omit storage tests |
| Making a minor (0.x) version change in the data replication software | Single LUN |
| Making a major (X.0) version change in the data replication software, or changing to a different type of replication software | Full |

Networking Changes

| Change | Validation tests required |
|---|---|
| Modifying network firmware, software, and/or hardware | Omit storage tests |
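The tables in this section amount to a lookup from a type of change to a test category. The sketch below encodes a small excerpt of those rows and picks the most demanding category when several changes are made at once. Note the assumptions: the ordering of categories from least to most demanding and the rule of taking the maximum are conventions of this example, not stated policy, and the "Full, but test new LUNs only" variants are left out for simplicity.

```python
# Ordered from least to most demanding (an assumption of this sketch).
CATEGORY_ORDER = ["None", "Omit storage tests", "Single LUN", "Full"]

# Excerpt of rows copied from the tables above; extend as needed.
REQUIRED_TESTS = {
    "Changing the quorum configuration": "None",
    "Removing a node from the cluster": "None",
    "Adding, removing, or replacing a network adapter": "Omit storage tests",
    "Changing the HBA driver version": "Single LUN",
    "Adding a new node to the cluster": "Full",
    "Changing or adding a storage array": "Full",
}


def required_category(changes):
    """Return the most demanding category among the listed changes.

    Raises KeyError for a change that is not in REQUIRED_TESTS, so an
    unlisted change is never silently treated as needing no tests.
    """
    categories = [REQUIRED_TESTS[change] for change in changes]
    return max(categories, key=CATEGORY_ORDER.index, default="None")


if __name__ == "__main__":
    planned = [
        "Adding, removing, or replacing a network adapter",
        "Changing the HBA driver version",
    ]
    print(required_category(planned))  # Single LUN
```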

Future of cluster validation

The cluster validation wizard provides an accurate picture of how well failover clustering can be supported on a given configuration. Future updates to the cluster validation wizard may provide additional functionality or correct issues that are discovered. In that case, you might need to re-run the updated cluster validation wizard and pass all tests for your solution to continue to be supported. As a result, some solutions that previously passed might fail. The issues reported by the updated tests will need to be addressed in the same manner as issues identified by the currently available version of the cluster validation wizard. For more information, see Understanding validation results and What to do if validation tests fail, earlier in this document.

Frequently asked questions

  • Q. If a cluster passes all tests in the cluster validation wizard, is it supported?

    A. If all hardware and software components in the cluster meet the qualifications for the "Certified for Windows Server 2008" or "Certified for Windows Server 2008 R2" logo, and the cluster passes the validation tests, then it is considered to be supported by Microsoft CSS for failover clustering.

  • Q. Will failover cluster solutions be listed in the Windows Catalog?

    A. No, Microsoft will not maintain a list of vendor solutions for failover clusters. However, many vendors list recommended failover cluster solutions and components on their Web sites.

  • Q. Does this new support policy also apply to Windows Server 2003?

    A. No, this is for Windows Server 2008 and Windows Server 2008 R2 only. The current support policies for previous versions of Windows will continue as they exist today.

  • Q. How does Microsoft CSS check if the solution has been validated?

    A. The cluster validation wizard generates a simple HTML report that clearly displays whether a solution has passed all tests. This report will be collected as part of the standard diagnostics utility, MSDT.

  • Q. What if I make a change to the cluster configuration, such as adding a node? Does the validation wizard have to be run again?

    A. Yes, the cluster validation wizard should be run any time a change is made to an existing failover cluster, as described in Understanding the validation tests required for your scenario, earlier in this document.