Windows Server 2008 R2: Troubleshooting Failover Clusters

When failure is not an option, configuring failover clusters in Windows Server can help ensure near-continuous availability.

John Marlin

Windows Server has changed over the years, with different versions, different levels of support and different tactics for troubleshooting. The current support policy is that, for a Windows Server 2008 or Windows Server 2008 R2 Failover Clustering solution to be considered officially supported by Microsoft Customer Support Services (CSS), it must meet the following criteria:

  • All hardware and software components must meet the qualifications to receive a “Certified for Windows Server 2008 R2” logo.
  • The fully configured solution must pass the Validate test in Failover Cluster Management.

By ensuring you have a version with official support, you have the best chance of everything working. There can always be issues with hardware vendors, or Microsoft may need to get involved to assist with some configurations, but chances are you should at least be good to get started. Here’s a look at some of the more common issues with Windows Server 2008 R2 Failover Clustering, and how to accurately troubleshoot those problems.

The Changing Cluster

The way Clusters are qualified changed significantly with the introduction of the Cluster Validation wizard, which is integrated into Failover Clustering in Windows Server 2008 and Windows Server 2008 R2. The Cluster Validation wizard lets you run a set of focused tests on a collection of servers that you intend to use as nodes in a Cluster.

This validation process tests the underlying hardware and software directly and individually. This will provide an accurate assessment of how well a given configuration will support Failover Clustering. If you use it on a running Cluster, it can also let you know if you’re meeting best practices. You should run it when you add new hardware or drivers to the Cluster.
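The same validation can also be started from Windows PowerShell with the Test-Cluster cmdlet. The following is a minimal sketch, assuming the FailoverClusters module is installed on the machine you run it from; node1 and node2 are placeholder server names, not names from a real environment.

```powershell
# Load the Failover Clustering cmdlets (needed in a plain PowerShell 2.0 session).
Import-Module FailoverClusters

# Run the validation tests against two prospective nodes.
# node1 and node2 are placeholders; substitute your own server names.
Test-Cluster -Node node1,node2
```

Test-Cluster produces a validation report file. Review it before treating the configuration as ready for production.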

For those who love scripting, Failover Clustering now has Windows PowerShell support. This is something with which you should start becoming more familiar, as CLUSTER.EXE is no longer being updated. If you don’t know what the cmdlets are and what they mean, you can run the command Get-Help *Cluster*. This will give you a list that describes the commands, like this:

Name              Synopsis
----              --------
New-Cluster       Create a new failover cluster. Before you can create a
                  cluster, you must...

If you don’t know how to use the command, you can use Get-Help New-Cluster -Examples to see samples, such as this:

NAME

New-Cluster

SYNOPSIS

Create a new failover cluster. Before you can create a cluster, you
must connect the hardware (servers, networks, and storage), and run
the validation tests.

-------------------------- EXAMPLE 1 --------------------------

C:\PS>New-Cluster -Name cluster1 -Node node1,node2,node3,node4

Name
----
cluster1

Description
-----------
This command creates a four-node cluster named cluster1, using default
settings for IP addressing.
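Note that a plain Windows PowerShell 2.0 session does not load the Failover Clustering cmdlets automatically, so the Get-Help commands above may return nothing until the module is imported. A short sketch of a typical first session, assuming it runs on a cluster node or a machine with the Failover Clustering tools installed:

```powershell
# Make the Failover Clustering cmdlets available in this session.
Import-Module FailoverClusters

# List every cmdlet the module provides.
Get-Command -Module FailoverClusters

# Show the worked examples for a single cmdlet.
Get-Help New-Cluster -Examples
```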

When receiving events in Windows, it’s always a good idea to really understand what these mean. Some are not as descriptive as you’d like. A list of all events you may see, including event descriptions, is available online.

Event Logs Lead the Way

If you do encounter a problem, Cluster Events is one of the first places you should start looking. Any Critical, Error or Warning events are written to the System Event Log. The Informational messages (such as a group going offline, moving a group to another node and so on) are written to the Cluster Operational Channel. You can see these events in Event Viewer / Applications and Services Logs / Microsoft / Windows / FailoverClustering.
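If you prefer the command line to Event Viewer, the same two sources can be queried with Get-WinEvent. This is a sketch rather than a definitive query: it assumes the provider name Microsoft-Windows-FailoverClustering and the matching Operational log name, and it reads only the local node. The limit of 50 events is arbitrary.

```powershell
# Critical (1), Error (2) and Warning (3) cluster events from the System log.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Level        = 1, 2, 3
} -MaxEvents 50

# Informational events from the cluster operational channel.
Get-WinEvent -LogName 'Microsoft-Windows-FailoverClustering/Operational' -MaxEvents 50
```

Get-WinEvent reports an error when no events match the filter, which in this case simply means there is nothing to show. Add -ComputerName to read another node's logs.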

If you aren’t sure what the problem was with a particular Service/Application Group or resource, you can view it in Failover Cluster Management. With a group selected, choose “Show Critical Events for this application.” With a specific resource selected, choose “Show the critical events for this resource.”

This will open the System Event Log and filter for the specific group or resource. It will give you all instances found in the System Event Log for all nodes in the Cluster. This is useful because it shows the events from every node in one location.

Once you’ve identified the resource, you can go to the System Event Logs to see if there are other contributing factors. Don’t be distracted by the symptom; focus on a root cause. For example, if a Network Name or IP Address fails, are there any other network-type events that could contribute to this (TCP/IP stack failures, network card malfunctions and so on)?
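One way to look for contributing factors is to list every Critical, Error and Warning event from any source in a window around the failure. In this sketch the failure time and the 10-minute window are hypothetical values chosen for illustration.

```powershell
# Hypothetical failure time; replace it with the time the resource failed.
$failure = Get-Date '2011-03-04 21:15'

# All Critical, Error and Warning events from any provider,
# from 10 minutes before to 10 minutes after the failure.
Get-WinEvent -FilterHashtable @{
    LogName   = 'System'
    Level     = 1, 2, 3
    StartTime = $failure.AddMinutes(-10)
    EndTime   = $failure.AddMinutes(10)
} |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, ProviderName, Id, LevelDisplayName -AutoSize
```

Sorting by time puts network or storage events that preceded the cluster event directly above it, which makes cause and symptom easier to tell apart.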

Cluster Debug logging has changed to event tracing sessions. There’s no longer a continuously written CLUSTER.LOG. The system now writes to event trace log (ETL) files located in the %WinDir%\System32\winevt\logs folder. From all three of these ETL files, you can generate a single CLUSTER.LOG to view. This is a “snapshot” in time, however. In other words, once you generate a Cluster.log, the Cluster service doesn’t keep writing to that file. Each time you generate one on a node, it overwrites the existing file with the new one.

You can generate logs with the Windows PowerShell command Get-ClusterLog. This contacts every node of the Cluster and creates the file on each node in the %WinDir%\Cluster\Reports folder. Depending on the number of nodes and the size of the files, you may want to consider some additional switches.

Say you have a nine-node Cluster and want to get all the logs. You can use the -Destination switch to have them all generated and copied to a specific location. This will give you a single place to get them. It will also tag the name of the node as part of the filename (for example, Get-ClusterLog -Destination C:\Logs will create Node1_Cluster.log, Node2_Cluster.log and so on in the C:\Logs folder).

If the problem is easily reproducible, consider the -TimeSpan switch (in minutes). Simply reproduce the problem on a node and run Get-ClusterLog -TimeSpan 5 -Node Node1. This will generate a Cluster.log for only Node1 and only capture the last five minutes.
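Collected in one place, the three variations described above look like this. The folder C:\Logs and the node name Node1 come from the examples in the text; the sketch assumes the folder already exists.

```powershell
Import-Module FailoverClusters

# Generate Cluster.log on every node (written to %WinDir%\Cluster\Reports).
Get-ClusterLog

# Generate the logs on every node and copy them all to one folder.
# Assumption: C:\Logs already exists on the machine running the command.
Get-ClusterLog -Destination C:\Logs

# Only Node1, and only the last five minutes of activity.
Get-ClusterLog -Node Node1 -TimeSpan 5
```

The switches can be combined, for example -Destination together with -TimeSpan, to collect a short window from every node into a single folder.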

Here are some tips for this level of troubleshooting:

  • The log is verbose and complex. It should not be the first place to start looking.
  • Make sure it captures at least three days’ worth of data. That way if you have a failure on Friday evening, the data will still be there when you arrive on Monday. By default, each log is 100MB in size. If you need to increase the size, use the Windows PowerShell command Set-ClusterLog -Size 200 (or whatever size in megabytes you specify).
  • Some applications are “noisy” or “chatty” in the logs. You may need to increase the log size if so.
  • The Cluster Debug Log is written in GMT, so you’ll need to convert the times to local time to match when the event actually occurred.
  • Depending on what you want to see, use -Destination or -TimeSpan.
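Two of the tips above, sketched in Windows PowerShell: resizing the log, and converting a GMT timestamp to local time. The timestamp value is hypothetical, and the format string assumes the log writes times as yyyy/MM/dd-HH:mm:ss.fff. Check a line from your own Cluster.log before relying on it.

```powershell
Import-Module FailoverClusters

# Raise the cluster log size to 200MB, then confirm the new setting.
Set-ClusterLog -Size 200
(Get-Cluster).ClusterLogSize

# Hypothetical timestamp copied from a Cluster.log line (recorded in GMT).
$stamp = '2011/03/04-21:15:30.123'

# Parse the text as a UTC time. The format string is an assumption
# about how the log writes its timestamps.
$styles = [Globalization.DateTimeStyles]'AssumeUniversal, AdjustToUniversal'
$utc = [DateTime]::ParseExact(
    $stamp,
    'yyyy/MM/dd-HH:mm:ss.fff',
    [Globalization.CultureInfo]::InvariantCulture,
    $styles)

# Convert to the local time zone of the machine running the command.
$utc.ToLocalTime()
```

The conversion uses the time zone of the machine running the command, so run it on a machine in the same time zone as the node, or account for the difference.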

Next month, we’ll take you through some common troubleshooting scenarios.

John Marlin is a senior support escalation engineer in the Commercial Technical Support Group. He has been with Microsoft for more than 19 years, with the last 14 years focusing on cluster servers.