Group and resource failure problems

10/08/2009

Applies To: Windows Server 2003, Windows Server 2003 R2, Windows Server 2003 with SP1, Windows Server 2003 with SP2

What problem are you having?

A resource fails, but is not brought back online.
You cannot bring a resource online.
You cannot bring the default physical disk resource online in Cluster Administrator.
In Disk Management, you do not see the disk for the group that is online on that node.
You are unable to manually move a group, or it does not fail over to another node when it is supposed to.
A group failed over but did not fail back.
The entire group failed and has not restarted.
All nodes are functioning, but resources fail back repeatedly.
The Cluster service does not successfully fail over resources.
You fail over a resource group from one node to another, but it automatically fails back.
The Network Name resource fails when you change to a system locale that is different than the input language used by the Network Name resource.
The Message Queuing resource fails to handle message activity correctly which may result in resource failures.
A third-party resource fails to come online in a mixed-version cluster or while upgrading a cluster.

A resource fails, but is not brought back online.

Cause: A resource may depend on another resource that has failed.

Solution: In the resource Properties dialog box, make sure that the Do not restart check box is clear. If the resource depends on another resource and that resource has failed, confirm that the dependencies are configured correctly.

You cannot bring a resource online.

Cause: The resource is not properly installed.

Solution: Make sure the application or service associated with the resource is properly installed.

Cause: The resource is not properly configured.

Solution: Make sure the properties are set correctly for the resource.

Cause: The resource is not compatible with server clusters.

Solution: Not all applications can be configured to fail over in a cluster. For more information, see Choosing applications to run on a server cluster.

Cause: The resource is generating a specific error.

Solution: Review the system Event Log (look for ClusSvc entries under the Source column) to see if that resource is generating a specific error message.

You cannot bring the default physical disk resource online in Cluster Administrator.

Most cluster configuration problems result from improper configuration of the shared storage bus or from servers that were not restarted.

Cause: You may not have restarted the servers after installing the Cluster service.

Solution: Make sure that you restarted all servers after installing the Cluster service.

When the servers are restarted, the signature of each disk in the cluster storage is read, and the registries are updated with the signature information.

Cause: There may be hardware errors or transport problems.

Solution: Make sure that there are no hardware errors or transport problems.

Using Event Viewer (on the Start menu, under Programs and Administrative Tools (Common)), look in the event log for disk I/O error messages or indications of problems with the communications transport.

Cause: You may not have waited long enough for the registries to be updated.

Solution: Make sure that you waited long enough for the registries to be updated.

Cluster Administrator takes a backup of the registry when it starts up. However, it can take up to a minute after the second server restarts for the disk signatures to be written to the registries. Wait a minute, and then click Refresh.

Cause: One or more adapters on the shared storage bus are configured incorrectly.

Solution: Make sure that the adapters are configured correctly.

Cause: The shared storage bus exceeds the maximum cable length.

Solution: Make sure that the shared storage bus does not exceed the maximum cable length.

Cause: The disk is not supported.

Solution: Make sure that the disk is supported, and that the disk hardware or firmware revision level is current.

Cause: The bus adapter is not supported, or the adapter hardware or firmware revision level is outdated.

Solution: Make sure that the bus adapter is supported, and that the adapter hardware or firmware revision level is current.

Cause: If you move your storage bus adapter to another I/O slot, add or remove bus adapters, or install a new version of the bus adapter driver, the cluster software may not be able to access disks on your shared storage bus.

Solution: To accommodate these changes, make sure that your shared storage bus adapter has been properly reconfigured.

Cause: The operating system is incorrectly configured to access the shared storage bus.

Solution: Verify that the operating system can detect the shared storage bus adapter.

In Disk Management, you do not see the disk for the group that is online on that node.

Where?

Computer Management/Storage/Disk Management

Cause: You may not be looking at the right disks.

Solution: Make sure that you are looking at the right disks.

If you have not labeled your disks or assigned fixed drive letters to them, you may not recognize which disks are part of the cluster and which ones are not. Label your disks in a meaningful manner and assign fixed drive letters to all partitions.

Cause: There may have been hardware problems.

Solution: Make sure that there have not been any hardware problems.

Run Event Viewer and check for disk I/O error messages or indications of hardware problems.

You are unable to manually move a group, or it does not fail over to another node when it is supposed to.

Cause: The failover node may not be designated as a possible owner for all resources in the group that you want to fail over.

Solution: Make sure that the failover node is designated as a possible owner for all resources in the group that you want to fail over.

Check the ownership configuration in the group resource Properties dialog box. If the node is not set as a possible owner for all resources in the group, the node cannot own the group, so failover will not occur. To fix this, make the node a possible owner for all resources in the group.
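
The ownership rule above can be illustrated with a small model. This is only a sketch, not a Cluster service API: the `group` mapping, the node and resource names, and both helper functions are hypothetical, chosen to show why a single resource that omits the node blocks failover of the whole group.

```python
def blocking_resources(possible_owners, node):
    """Return the resources that do not list `node` as a possible owner."""
    return sorted(
        name for name, owners in possible_owners.items() if node not in owners
    )


def can_host_group(possible_owners, node):
    """A node can own the group only if every resource lists it as a possible owner."""
    return not blocking_resources(possible_owners, node)


# Hypothetical group: resource name -> set of possible owner nodes.
group = {
    "Disk Q:": {"NODE1", "NODE2"},
    "IP Address": {"NODE1", "NODE2"},
    "Network Name": {"NODE1"},  # NODE2 is missing from this resource
}

print(can_host_group(group, "NODE1"))
print(blocking_resources(group, "NODE2"))
```

In this made-up group, the resources returned by `blocking_resources` are the ones whose possible owners would need to be corrected before the group could move to the second node.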

Cause: A resource in the group may be continually failing.

Solution: Determine if a resource in the group is continually failing.

If the node can, it will bring the resource back up without failing over the group. If the resource continually fails but does not fail over, make sure that the resource property Restart and affect the group is selected. Also, check the Restart Threshold and Restart Period settings, which are also in the resource Properties dialog box.
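
The interaction between these settings can be pictured with a simplified model. The sketch below rests on an assumption about the counting rule (a resource is restarted on the same node until the number of failures inside the restart period exceeds the threshold); the function name, the return strings, and the example values are hypothetical and are not taken from the Cluster service.

```python
def action_on_failure(failure_times, now, threshold, period_seconds, affect_group):
    """Decide what happens after a failure at time `now` (in seconds).

    `failure_times` lists every failure so far, including the one at `now`.
    Assumed rule: failures older than the restart period are ignored.
    """
    recent = [t for t in failure_times if now - t <= period_seconds]
    if len(recent) <= threshold:
        return "restart on same node"
    if affect_group:
        return "fail over group"
    return "leave resource failed"


# Example values only: four failures within 900 seconds, threshold of 3.
failures = [0, 100, 200, 300]
print(action_on_failure(failures, now=300, threshold=3,
                        period_seconds=900, affect_group=True))
print(action_on_failure(failures, now=300, threshold=3,
                        period_seconds=900, affect_group=False))
```

Under this model, a resource that keeps failing without the group moving corresponds to the last branch, which is why the Restart and affect the group setting is worth checking first.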

A group failed over but did not fail back.

Cause: A group fails back only if the node on which it was running failed and then rejoined the cluster. If the group failed but the node did not, the group fails over to another node but does not fail back to the original node.

Cause: The failback policies of both the group and the resources may not be properly configured.

Solution: Make sure that the Prevent failback check box is clear in the group Properties dialog box. If the Allow failback check box is selected, be sure to wait long enough for the group to fail back. Check these settings for all affected resources within a group. Because groups fail over as a whole, one resource that is prevented from failing back affects the entire group.

Cause: The node to which you want the group to fail back is not configured as the preferred owner of the group.

Solution: Make sure that the node to which you want the group to fail back is configured as the preferred owner of the group. If it is not, the Cluster service leaves the group on the node to which it failed over.
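
The three causes in this section can be combined into a single checklist. The following sketch is a hypothetical helper, not part of any cluster tooling: it only restates the conditions described above, and its parameter names and messages are assumptions made for illustration.

```python
def failback_blockers(node_failed_and_rejoined, prevent_failback,
                      original_node, preferred_owners):
    """List the reasons, taken from the causes above, why a group stays put."""
    blockers = []
    if not node_failed_and_rejoined:
        blockers.append("original node did not fail and rejoin the cluster")
    if prevent_failback:
        blockers.append("Prevent failback is selected")
    if original_node not in preferred_owners:
        blockers.append("original node is not a preferred owner")
    return blockers


# Hypothetical case: the node rejoined, but it is not a preferred owner.
for reason in failback_blockers(True, False, "NODE1", ["NODE2"]):
    print(reason)
```

An empty list means none of the causes listed here applies, in which case it may simply be a matter of waiting for the configured failback window.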

The entire group failed and has not restarted.

Cause: A node is offline.

Solution: Make sure that the node is not offline.

If the node on which the group had been running is offline, check that another node is a possible owner of the group and of every resource in the group.

Cause: The group has failed repeatedly.

Solution: The group may have exceeded its failover threshold or its failover period. Try to bring the resources online individually (following the correct sequence of dependencies) to determine which resource is causing the problem. Or, create a temporary resource group (for testing purposes) and move the resources to it, one at a time.
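
To work out a valid sequence for bringing resources online one at a time, the dependencies can be sorted so that each resource comes after everything it depends on. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the resource names and the dependency map are hypothetical examples, not read from a real cluster.

```python
from graphlib import TopologicalSorter

# Hypothetical group: resource -> set of resources it depends on.
dependencies = {
    "Physical Disk": set(),
    "IP Address": set(),
    "Network Name": {"IP Address"},
    "File Share": {"Physical Disk", "Network Name"},
}

# static_order() yields each resource only after all of its dependencies.
online_order = list(TopologicalSorter(dependencies).static_order())

for step, resource in enumerate(online_order, start=1):
    print(step, resource)
```

Bringing the resources online in this order, and stopping at the first one that fails, narrows the problem down to that resource or to its configuration.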

All nodes are functioning, but resources fail back repeatedly.

Cause: Power may be intermittent or failing.

Solution: Ensure that your power is not intermittent or failing. You can correct this by using an uninterruptible power supply (UPS) or, if possible, by changing power companies.

The Cluster service does not successfully fail over resources.

Cause: The cluster storage device is not properly configured.

Solution: Verify that the cluster storage device is properly configured and that all cables are properly connected.

You fail over a resource group from one node to another, but it automatically fails back.

Cause: One or more resources fail to come online on the new node.

Solution: Use a process of elimination to determine which resource is failing to come online. For more information, see article Q303431, "Explanation of Why Server Clusters Do Not Verify that Resources will Work Properly on All Nodes" in the Microsoft Knowledge Base.

The Network Name resource fails when you change to a system locale that is different than the input language used by the Network Name resource.

Cause: The system locale must be the same on all nodes of a cluster and on the computer used to connect to the cluster.

Solution: Change the system locale. For more information, see Connect to a cluster with Cluster Administrator.

The Message Queuing resource fails to handle message activity correctly which may result in resource failures.

Cause: Each instance of Message Queuing on a server maps 4 MB of the system view space when handling message activity. This results in a default limit of three active, working instances of Message Queuing on a cluster node. In a server cluster with three Message Queuing resources, a node could have four concurrent Message Queuing services running (the service running on the local node plus the three services associated with the Message Queuing resources). In this scenario, message activity could be limited, resulting in resource failures.

Solution: Increase the system view space memory pool on each node of a server cluster with three or more Message Queuing resources. (We also recommend that you increase the system view space memory pool even for nodes running fewer than three Message Queuing resources.)

Open Registry Editor.
Open the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management.
Create a new DWORD value called SystemViewSize.
Calculate and enter the data for this DWORD value using the following formula: (16 + (the number of Message Queuing resources x 4)).

For example, the calculation result for a cluster with three Message Queuing resources is 28.
Reboot each node.
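
The formula in the steps above can be checked with a short calculation. This is a minimal sketch; the function name `system_view_size` is hypothetical, and the code only reproduces the arithmetic given in the procedure.

```python
def system_view_size(message_queuing_resources):
    """Apply the formula 16 + (number of Message Queuing resources x 4)."""
    if message_queuing_resources < 0:
        raise ValueError("the number of resources cannot be negative")
    return 16 + message_queuing_resources * 4


# Values for clusters with one to five Message Queuing resources.
for count in range(1, 6):
    print(count, system_view_size(count))
```

For three Message Queuing resources the function returns 28, which matches the example given in the procedure.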

A third-party resource fails to come online in a mixed-version cluster or while upgrading a cluster.

Cause: If a resource uses a cryptographic provider not supplied by Microsoft to export (encrypt) and import (decrypt) resource data (cluster and cluster application cryptographic checkpoints), the default encryption key lengths may be different in the Windows 2000 and the Windows Server 2003 family operating systems. The result is that the resource might fail to come online and the cluster and event logs might contain cryptographic checkpoint synchronization errors for that resource.

Solution: Use the cluster.exe "CSP" private property to set the key length and effective key length for the third-party cryptographic provider that encrypts and decrypts data for the failing resource type.

Open Command Prompt.
Type cluster ClusterName "CSP"=key_length,effective_key_length:MULTISTR

ClusterName is the name of the cluster, CSP is the name of the cryptographic provider, and key_length and effective_key_length are the key and effective key lengths for the RC2 encryption algorithm, in bits. For more information on using cluster.exe, see Cluster.
Depending on the resource, either bring the resource online or recreate the resource to add the new cryptographic checkpoint.

Note

Review the documentation for your cryptographic provider to obtain valid values for the following RC2 encryption algorithm parameters: key_length and effective_key_length. Also review the cryptographic provider documentation for the correct procedure for adding the cryptographic checkpoint.

For information about how to obtain product support, see Technical support options.
