Configuring failover clusters in Windows Server can help ensure near-consistent availability. Here are several potential troubleshooting scenarios.
Last month, I looked at some of the more common issues with Windows Server 2008 R2 Failover Clustering, and examined how to accurately troubleshoot those problems.
Remember the current support policy is that, for a Windows Server 2008 or Windows Server 2008 R2 Failover Clustering solution to be considered an officially supported solution by Microsoft Customer Support Services (CSS), it must meet the following criteria:
Here are several scenarios that may help expedite or inform your next troubleshooting efforts. These represent some of the more common issues in supported Windows 2008 R2 Failover Clusters, as well as the steps you may need to take to resolve them.
The Cluster Name Object (CNO) is very important, because it’s the common identity of the Cluster.
It’s created automatically by the Create Cluster wizard and has the same name as the Cluster. Through this account, it creates other Cluster Virtual Computer Objects (VCOs) as you configure new services and applications on the Cluster. If you delete the CNO or take permissions away, it can’t create other objects as required by the Cluster until it’s restored or the correct permissions are reinstated.
As with all other objects in Active Directory, there’s an associated objectGUID. This is how Failover Cluster knows you’re dealing with the correct object. If you simply create a new object, a new objectGUID is created as well. What we need to do is restore the correct object so that Failover Cluster can continue with its normal operations.
When troubleshooting this, we need to find out two things from the Cluster resource. From Windows PowerShell, run the command:
Get-ClusterResource "Cluster Name" | Get-ClusterParameter CreatingDC,objectGUID
This will retrieve the needed values. The first parameter we want is the CreatingDC. When Failover Cluster creates the CNO, we note the domain controller (DC) upon which it was created. For any activity we need to do with the Cluster (create VCOs, bring names online and so on), we know to go to this DC to get the object and security. If it isn’t found on that DC or that DC isn’t available, we’ll search any others that respond, but we know to go here first.
The second parameter is the objectGUID to ensure we’re talking about the correct object. For our example, the Cluster Name is CLUSTER1, the Creating DC is DC1 and the objectGUID is 1a3cf049cf79614ebd94670560da6f04, like so:
Object         Name         Value
------         ----         -----
Cluster Name   CreatingDC   \\DC1.domain.com
Cluster Name   ObjectGUID   1a3cf049cf79614ebd94670560da6f04
We’d need to log on to the DC1 machine and run Active Directory Users and Computers. If there’s a current CLUSTER1 object, we can check to see if it has the proper attributes. One note about what you’ll see: the Active Directory attribute editor doesn’t initially show the GUID as it appears here, because it isn’t displaying it in hexadecimal format.
What you’ll initially see is 49f03c1a-79cf-4e61-bd94-670560da6f04. The hexadecimal format reverses the byte pairs within each of the first three groups, which is a little confusing. Take the first four pairs (eight digits) and reverse their order, so 49f03c1a becomes 1a3cf049. Swapping the two pairs in each of the next two groups, 79cf becomes cf79 and 4e61 becomes 614e. The remaining pairs stay the same.
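The pair swapping described above can be checked mechanically. The following is a minimal Python sketch, not part of any Failover Clustering tooling; the two helper names are made up for illustration. It relies on the standard uuid module, whose bytes_le form stores the first three fields in the same reversed byte order.

```python
import uuid


def cluster_hex_to_guid(hex_value: str) -> str:
    """Convert the raw hex string reported by the cluster resource
    into the dashed form that the attribute editor shows first."""
    return str(uuid.UUID(bytes_le=bytes.fromhex(hex_value)))


def guid_to_cluster_hex(guid: str) -> str:
    """Convert a dashed GUID back into the raw hex form."""
    return uuid.UUID(guid).bytes_le.hex()


# Values taken from the example in this article.
print(cluster_hex_to_guid("1a3cf049cf79614ebd94670560da6f04"))
# prints 49f03c1a-79cf-4e61-bd94-670560da6f04
print(guid_to_cluster_hex("49f03c1a-79cf-4e61-bd94-670560da6f04"))
# prints 1a3cf049cf79614ebd94670560da6f04
```

Comparing the converted value with the one the cluster resource reports is a quick way to confirm whether an Active Directory object is the one the Cluster expects.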
You must bring up the properties of the objectGUID in the attribute editor to see it in the hexadecimal format that Failover Clustering sees. If the value doesn’t match, the existing object isn’t the proper one, so we must first delete it to take it out of the picture before restoring the proper one.
There are several ways of restoring the object. We could use an Active Directory Restore, a utility such as ADRESTORE or the new Active Directory Recycle Bin (if running a Windows 2008 R2 DC with an updated schema). Using the new Active Directory Recycle Bin makes things much easier and is the most seamless process for restoring deleted Active Directory objects.
With the Active Directory Recycle Bin, we can search to find the object to restore with the Windows PowerShell command:
Get-ADObject -Filter 'isDeleted -eq $true -and samAccountName -eq "CLUSTER1$"' -IncludeDeletedObjects -Properties * | Format-List samAccountName,objectGUID
That command searches the Active Directory Recycle Bin for any deleted object with the account name CLUSTER1$. It gives us the account name and objectGUID. If there are multiple items, it shows them all. The one we want would be displayed like this:
samAccountName : CLUSTER1$
objectGUID     : 49f03c1a-79cf-4e61-bd94-670560da6f04
Now we need to restore it. After we delete the incorrect one, the Windows PowerShell command to restore it would be:
Restore-ADObject -Identity 49f03c1a-79cf-4e61-bd94-670560da6f04
This will restore the object in the same location (organizational unit, or OU) and keep the same permissions and computer account password known by Active Directory.
This is one of the benefits of the Active Directory Recycle Bin when compared to a utility such as ADRESTORE. Using ADRESTORE, you’d have to reset the password, move the object to the proper OU, repair the object in Failover Clustering and so on.
With the Active Directory Recycle Bin, we simply bring the Cluster Name resource online. This is also a better option than doing a restore of Active Directory, especially if there have been new computer/user objects created, if there are old ones that no longer exist and would have to be deleted again, and so on.
First, let’s quickly recap the definition of Cluster Shared Volumes (CSVs). CSVs simplify the configuration and management of Hyper-V virtual machines (VMs) in Failover Clusters. With CSV on a Failover Cluster that runs Hyper-V, multiple VMs can use the same LUN (disk), yet fail over (or move from node to node) independently of one another. The CSV provides increased flexibility for volumes in clustered storage. For example, you can keep system files separate from data to optimize disk performance, even if the system files and the data are contained within virtual hard disk (VHD) files.
In the properties for all network adapters that carry cluster communication, make sure “Client for Microsoft Networks” and “File and Printer Sharing for Microsoft Networks” are enabled to support Server Message Block (SMB). This is required for CSV. The server is running Windows Server 2008 R2, so it automatically provides the version of SMB that’s required by CSV, which is SMB2. There will be only one preferred CSV communication network, but enabling these settings on multiple networks helps the Cluster have resiliency to respond to failures.
Redirected Access means all I/O operations are going to be “redirected” over the network to another node that has access to the drive. There are basically three reasons why a disk is in Redirected Access mode:
In our scenario, we’ve ruled out Option 1 and Option 2. This leaves us with Option 3. If we look in the System Event Log, we’d see the event “Event ID: 5121” from Failover Clustering.
Here’s the text of that log entry: Cluster Shared Volume ‘Cluster Disk x’ is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network through the node that owns the volume. This may result in degraded performance. If redirected access is turned on for this volume, please turn it off. If redirected access is turned off, please troubleshoot this node’s connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.
Taking that stance, we’d also look right before this event for any hardware-related event. So we’d look for events like 9, 11 or 15 that point to a hardware or communication issue. We’d look in Disk Management to see if we could physically see the disk. In most cases, we’ll see some other errors. Once we correct the problem with the back end, we can bring the disk out of this mode.
Keep in mind that the CSV will remain running as long as at least one node can communicate with the storage. This is why it would be in a “redirected” mode. All writes to the drive are sent to the node that can communicate and the Hyper-V VMs will continue running. There may be a performance hit on those VMs, but they’ll continue to run. So we’ll never really be out of production, which is a good thing.
There’s only one “true” owner of the drive and it’s called the Coordinator Node. Any type of metadata writes to the drive would be done by this node only.
When you open Explorer or Disk Management, it’s going to want to open the drive so it can do any metadata writes (if that’s the intention). Because of this, access to any drive the node doesn’t own gets redirected over the network to the Coordinator Node. This is different from the drive being in “redirected access.”
When troubleshooting this, Failover Cluster Management will show the drive as online. So first you should look to see what events are logged. In the System Event Log, you could see these events from Failover Clustering:
Event ID: 5120
Cluster Shared Volume ‘Cluster Disk x’ is no longer available on this node because of ‘STATUS_BAD_NETWORK_PATH(c00000be).’ All I/O will temporarily be queued until a path to the volume is reestablished.
Event ID: 5142
Cluster Shared Volume ‘Cluster Disk x’ is no longer accessible from this cluster node because of error ‘ERROR_TIMEOUT(1460).’ Please troubleshoot this node’s connectivity to the storage device and network connectivity.
Those events show the node timing out while trying to reach the Coordinator Node over the network. So then you’d look to see if there are any other errors in the System Event Log that point to network connectivity problems between the nodes. If there are, you need to resolve them. Things such as a malfunctioning or disabled network card can cause this.
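The CSV-related events mentioned so far can be gathered into a small lookup for triage. This is only an illustrative Python sketch: CSV_EVENT_HINTS and triage are hypothetical helpers, the hints paraphrase this article rather than any official event reference, and the list of event IDs is assumed to have been exported from the System Event Log by some other means.

```python
# Hints paraphrased from this article; not an official event catalog.
CSV_EVENT_HINTS = {
    5121: "Volume not directly accessible from this node; I/O is redirected. "
          "Look for hardware events just before it and check Disk Management.",
    5120: "Volume no longer available on this node; I/O is queued. "
          "Check network connectivity to the Coordinator Node.",
    5142: "Volume no longer accessible from this node. "
          "Check storage and network connectivity.",
}


def triage(event_ids):
    """Return (event_id, hint) pairs for the IDs this table knows about,
    preserving the order in which they were logged."""
    return [(eid, CSV_EVENT_HINTS[eid]) for eid in event_ids if eid in CSV_EVENT_HINTS]


# Example: a hypothetical sequence of event IDs read from the log.
for eid, hint in triage([7036, 5120, 5142]):
    print(eid, "-", hint)
```

Unknown IDs are skipped rather than treated as errors, since the System Event Log contains many events unrelated to CSV.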
Next, you’d want to check basic network connectivity between the nodes. What you first need to verify is the network over which your CSV traffic is traveling. Failover Clustering chooses the network to use for CSV by lowest metric value. This is different from how Windows identifies the network.
The Failover Cluster Network Fault Tolerance adapter (NETFT) uses its own internal metric system. Each network it detects that has a default gateway is given a metric of 10000, 10100 and so on as it goes along. Networks that don’t have a default gateway start at 1000, 1100 and so on. Using Windows PowerShell, you can run the command Get-ClusterNetwork | FT Name, Metric, Role to see how the NETFT adapter has defined them. You’d see something similar to:
CSV Traffic 1000
With these four networks, the network I’ve identified as the CSV network is called CSV Traffic. The IP address I’m using for it is 184.108.40.206 on Node1 and 220.127.116.11 on Node2, so I would test basic network connectivity with PING between those IP addresses.
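To make the metric numbering concrete, here is a small Python model of it. It is a sketch under stated assumptions, not how NETFT is implemented: the Network class, both functions and every network name except CSV Traffic are hypothetical, the detection order is assumed to be the list order, and the cluster is assumed to prefer the lowest metric for CSV (consistent with CSV Traffic carrying metric 1000 in the example).

```python
from dataclasses import dataclass


@dataclass
class Network:
    name: str
    has_default_gateway: bool


def assign_metrics(networks):
    """Number networks as the article describes: 10000, 10100, ... for
    networks with a default gateway; 1000, 1100, ... for those without."""
    next_with_gateway, next_without_gateway = 10000, 1000
    metrics = {}
    for net in networks:
        if net.has_default_gateway:
            metrics[net.name] = next_with_gateway
            next_with_gateway += 100
        else:
            metrics[net.name] = next_without_gateway
            next_without_gateway += 100
    return metrics


def pick_csv_network(metrics):
    """Assumption: the network with the lowest metric carries CSV traffic."""
    return min(metrics, key=metrics.get)


# Hypothetical four-network layout; only "CSV Traffic" comes from the article.
example = [
    Network("Public", True),
    Network("CSV Traffic", False),
    Network("Private 2", False),
    Network("Backup", True),
]
metrics = assign_metrics(example)
print(metrics)
print(pick_csv_network(metrics))  # prints CSV Traffic
```

On a real cluster, the output of Get-ClusterNetwork remains the authority; the model only shows why a network without a default gateway ends up preferred.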
The next step is to attempt an SMB connection using the IP Addresses. This is just what Failover Clustering is going to do. A simple NET VIEW \\18.104.22.168 will suffice to see if there’s a response. What you should get back is either a list of shares or a message: “There are no entries in the list.”
This indicates that you could make a connection to that share. However, if you get the message “System error 53 has occurred. The network path was not found,” this indicates a TCP/IP configuration problem with the network card.
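When this check has to be repeated across several nodes, a scripted probe can help. The Python sketch below is a weaker test than NET VIEW: it only confirms that a TCP connection to port 445 (the port SMB listens on) can be opened, not that the required client and file-sharing bindings are in place. The function name and the default timeout are assumptions chosen for illustration.

```python
import socket


def smb_port_reachable(host: str, port: int = 445, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout.

    Connection refusals, timeouts and name-resolution failures all
    surface as OSError and are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling smb_port_reachable with the other node’s CSV network address returns False when the path is blocked or nothing is listening, which is the cue to go back to the adapter settings and TCP/IP configuration.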
Having “Client for Microsoft Networks” and “File and Printer Sharing for Microsoft Networks” enabled on the network card is required to use CSV. If they aren’t enabled, you’ll get this problem of Explorer hanging. Select them and you’re good to go.
In Windows Server 2003 Server Cluster and earlier, unchecking these options was the recommended procedure. That’s no longer the case, and you can see how it can break things.
There are a few other factors you’ll need to consider. If your Cluster Nodes are experiencing Resource Host Subsystem (RHS) failures, you must first think about the nature of RHS and what it’s doing. RHS is the Failover Cluster component that does a lot of resource health checking to ensure everything is functioning. For an IP Address, it will ensure it’s on the network stack and that it responds. For disks, it will try to connect to the drive and do a DIR command.
If you experience an RHS crash, you’ll see System Event Log IDs 1230 and 1146. In the event 1230, it will actually identify the resource and the resource DLL it uses. If it crashes, it means the resource is not responding as it should and may be deadlocked. If this were to crash on a disk resource, you’d want to look for disk-related errors or disk latencies. Running a Performance Monitor would be a good place to start. Updating drivers/firmware of the cards or the back end may be something to consider as well.
You’re also going to be doing some user mode detections. Failover Clustering conducts health monitoring from kernel mode to a user mode process to detect when user mode becomes unresponsive or hung. To recover from this condition, clustering will bug-check the box. If it does, you’d see a Stop 0x0000009E. Troubleshooting this would entail reviewing the dump file it creates to look for hangs. You’d also want to have Performance Monitor running and look for anything appearing as hanging, memory leaks and so on.
Failover Clustering is dependent on Windows Management Instrumentation (WMI). If you’re having problems with WMI, you’re going to have problems with Failover Clustering (creating and adding nodes, migrating and so on). Run checks against WMI, such as WBEMTEST.EXE, or even remote WMI scripts.
One script you can attempt from Windows PowerShell is (where NODE1 is the name of the actual node):
Get-WmiObject MSCluster_ResourceGroup -ComputerName NODE1 -Namespace "ROOT\MSCluster"
This will make a WMI connection to the Cluster and give you information about the groups.
If that fails, you have some WMI issues. The WMI services may be stopped, so you may need to restart them. The WMI repository may also be corrupt (from an elevated command prompt, winmgmt /verifyrepository reports whether it’s consistent, and winmgmt /salvagerepository checks it and rebuilds it if it isn’t), and so on.
Here are some troubleshooting points to remember:
Failover Clustering is designed to detect, recover from and report problems. The fact that the Cluster is telling you there is or was a problem does not mean the Cluster caused it. As some people say: “Don’t shoot the messenger.”
John Marlin is a senior support escalation engineer in the Commercial Technical Support Group. He has been with Microsoft for more than 19 years, with the last 14 years focusing on Cluster Servers.