General administrative problems

Article
10/08/2009

Applies To: Windows Server 2003, Windows Server 2003 R2, Windows Server 2003 with SP1, Windows Server 2003 with SP2

What problem are you having?

The Cluster service fails and the node cannot detect the network.

An IP address added to a group in the cluster fails.

An IP address resource is unresponsive when taken offline; for example, you are unable to query its properties.

You receive the error: "RPC server is unavailable."

Cluster Administrator cannot open a connection to a node.

An application starts but cannot be closed.

A resource group has failed over but will not fail back.

All nodes appear to be functioning correctly, but you cannot access all of the drives from one node.

Cluster Administrator update delays.

Cluster Administrator stops responding when a node fails.

Cannot connect to cluster from recent file list.

Node performance is sluggish and node fails.

The cluster log contains numerous resource informational messages (for example, Entered LooksAlive, Entered Open, Entered Offline).

The Cluster service fails to start and returns an error code of ERROR_SHARING_VIOLATION (32) with event ID 1144 (NM_EVENT_REGISTER_NETWORK_FAILED).

You cannot manually restore the cluster database on a local node by copying the systemroot\cluster\CLUSDB file from another node.

The Cluster service fails and the node cannot detect the network.

In this case, you probably have a configuration problem. Check the following:

Cause: Have you made any configuration changes recently?

Solution: If the node was recently configured, or if you installed a resource that required you to restart the computer, make sure that the node is still properly configured for the network.

Cause: Is the node properly configured?

Solution: Check that the server is properly configured for TCP/IP, and that the appropriate services are running. If the node recently failed, its groups fail over to the other nodes; however, if those nodes are also misconfigured, the failover cannot restore service and client access will fail.

An IP address added to a group in the cluster fails.

Cause: The Internet Protocol (IP) address is not unique.

Solution: The IP address must be different from every other group IP address and every other IP address on the network.

Cause: The IP address is not a static IP address.

Solution: The IP addresses must be statically assigned outside of a DHCP scope, or they must be reserved by the network administrator.
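
The two checks above (uniqueness, and a static address outside the DHCP scope unless reserved) can be sketched in Python. This is a minimal illustration, not a Microsoft tool: the function name, the addresses, and the scope boundaries are all hypothetical, and the lists of in-use and reserved addresses are assumed to come from your own network records.

```python
import ipaddress


def check_cluster_ip(candidate, addresses_in_use, dhcp_start, dhcp_end, reserved=()):
    """Return a list of problems with a proposed IP Address resource value.

    All arguments are supplied by the administrator; nothing is discovered
    from the network. An empty list means neither check found a problem.
    """
    ip = ipaddress.ip_address(candidate)
    in_use = {ipaddress.ip_address(a) for a in addresses_in_use}
    reserved_set = {ipaddress.ip_address(a) for a in reserved}

    problems = []
    # Check 1: the address must not duplicate any other address on the network.
    if ip in in_use:
        problems.append("address is already in use")
    # Check 2: the address must lie outside the DHCP scope, unless it is reserved.
    in_scope = ipaddress.ip_address(dhcp_start) <= ip <= ipaddress.ip_address(dhcp_end)
    if in_scope and ip not in reserved_set:
        problems.append("address is inside the DHCP scope and not reserved")
    return problems


# Hypothetical example: DHCP scope 10.0.0.100 to 10.0.0.200.
print(check_cluster_ip("10.0.0.150", ["10.0.0.10"], "10.0.0.100", "10.0.0.200"))
```

The sketch only compares values you give it; it cannot detect an address conflict that is missing from your records.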

An IP address resource is unresponsive when taken offline; for example, you are unable to query its properties.

Cause: You may not have waited long enough for the resource to go offline.

Solution: If an IP Address resource is unresponsive when taken offline, make sure that you wait long enough for the resource to go offline.

Certain resources take time to go offline. For example, it can take up to three minutes for the IP Address resource to go fully offline.

You receive the error: "RPC server is unavailable."

Cause: The server may not be operational, or the Cluster service and the RPC services may not be running.

Solution: If you receive the error "RPC server is unavailable," make sure the server is operational and that both the Cluster service and the RPC services are running. Also, check the name resolution of the cluster; it is possible that you are using the wrong name or that the name is not being properly resolved by WINS or DNS.
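
As a rough way to test the name-resolution part of this solution, the following Python sketch resolves a cluster name and compares the result with the address you expect. It is an illustration under assumptions: the function name and the sample values are hypothetical, and `socket.gethostbyname` uses whatever resolver the local computer is configured with, so it does not tell you whether WINS or DNS supplied the answer.

```python
import socket


def diagnose_name(name, expected_ip, resolver=socket.gethostbyname):
    """Classify name resolution for a cluster name.

    Returns "unresolved" if the lookup fails, "mismatch" if the name
    resolves to a different address than expected, and "ok" otherwise.
    The resolver argument exists so the logic can be exercised offline.
    """
    try:
        actual = resolver(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return "unresolved"
    return "ok" if actual == expected_ip else "mismatch"


# Hypothetical usage with a stand-in resolver instead of a live lookup.
def fake_resolver(name):
    table = {"webclust1": "10.0.0.50"}
    if name not in table:
        raise socket.gaierror("name not found")
    return table[name]


print(diagnose_name("webclust1", "10.0.0.50", fake_resolver))
```

A result of "unresolved" or "mismatch" points toward a wrong name or a stale WINS or DNS record rather than a stopped service.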

Cluster Administrator cannot open a connection to a node.

Cause: The node may not be running.

Solution: If Cluster Administrator cannot open a connection to a node, make sure that the node is running. If it is, confirm that both the Cluster service and the RPC services are running.
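
One way to confirm that the services are running without opening the Services tool is to read the output of the `sc query` command. The Python sketch below only parses text that you capture yourself; the helper name and the sample output are illustrative, and the short service names ClusSvc (Cluster service) and RpcSs (Remote Procedure Call) are assumptions to verify on your own node.

```python
import re


def service_state(sc_query_output):
    """Return the state word (for example RUNNING) from `sc query` output,
    or None if no STATE line is present."""
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_query_output, re.MULTILINE)
    return match.group(1) if match else None


# Illustrative sample of `sc query ClusSvc` output, not captured from a real node.
SAMPLE_RUNNING = """
SERVICE_NAME: ClusSvc
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 4  RUNNING
                                (STOPPABLE, NOT_PAUSABLE, ACCEPTS_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
"""

SAMPLE_STOPPED = """
SERVICE_NAME: ClusSvc
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 1  STOPPED
"""

print(service_state(SAMPLE_RUNNING))
```

On a node, the text could be obtained by running `sc query ClusSvc` and `sc query RpcSs` at a command prompt and passing each result to `service_state`.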

An application starts but cannot be closed.

Cause: You may not have taken a resource offline using Cluster Administrator.

Solution: When you bring resources online using Cluster Administrator, you must also take those resources offline using Cluster Administrator; do not attempt to close or exit the application from the application interface.

A resource group has failed over but will not fail back.

Cause: The hardware and network configurations may not be valid.

Solution: Make sure that the hardware and network configurations are valid.

If any interconnect fails, failover can occur because the Cluster service does not detect a heartbeat, or it may not even register that the node was ever online. In this case, the Cluster service fails over the resources to the other nodes in the server cluster, but it cannot fail back because that node is still down.

Cause: The resource group may not have been configured to fail back immediately, or you are not troubleshooting the problem within the allowable failback hours for the resource group.

Solution: Make sure that the resource group is configured to fail back immediately, or that you are troubleshooting the problem within the allowable failback hours for the resource group.

A group can be configured to fail back only during specified hours. Often, administrators prevent failback during peak business hours. To check this, use Cluster Administrator to view the failback policy for the group.

Cause: You restarted the node to test the failover policy for the group instead of pressing the reset button.

Solution: Make sure that you press the reset button on the node. The resource group will not fail back to the preferred node if you shut down and then restart the node. For more information on testing failback policies, see Test node failure.
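
The failback-hours rule described above can be modeled with a short Python sketch, which may help when deciding whether a test is being run at a time when failback is permitted. The function and its conventions are assumptions made for illustration: hours are whole numbers from 0 to 23, the end hour is treated as exclusive, a window may wrap past midnight, and `None` stands for a group configured to fail back immediately. Check the actual policy in Cluster Administrator rather than relying on this model.

```python
def failback_allowed(hour, window):
    """Return True if failback may occur at the given local hour (0-23).

    window is None for "fail back immediately", or a (start, end) pair of
    hours. The end hour is treated as exclusive (an assumption).
    """
    if window is None:
        return True  # no time restriction configured
    start, end = window
    if start <= end:
        # Window within a single day, for example 9 to 17.
        return start <= hour < end
    # Window that wraps past midnight, for example 22 to 6.
    return hour >= start or hour < end


# Hypothetical policy: failback permitted only between 22:00 and 06:00.
print(failback_allowed(14, (22, 6)))
```

With that hypothetical policy, a test run at 14:00 would not be expected to fail back until the window opens.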

All nodes appear to be functioning correctly, but you cannot access all of the drives from one node.

Cause: The shared drive may not be functioning.

Solution: Confirm that the shared drive is still functioning.

Try to access the drive from another node. If you can, check the cable between the device and the node that cannot access it. If the cable is not the problem, restart that computer and then try again to access the device. If you still cannot access the drive, check your configuration.

Cause: The drive has completely failed.

Solution: Determine (from another node) whether the drive is functioning at all. You may have to restart the drive (by restarting the computer) or replace the drive.

The hard disk with the resource or a dependency for the resource may have failed. You may have to replace a hard disk. You may also have to reinstall the cluster.

Cluster Administrator update delays.

Cause: If you run Cluster Administrator from a remote computer, it may not display the correct (updated) cluster state when the cluster network name fails over from one node to another node. This can result in Cluster Administrator displaying a node as being online, when it is actually offline.

Solution: To work around this problem, restart Cluster Administrator.

You can avoid this problem by connecting to clusters through node names. However, if the node you are connected to fails, Cluster Administrator stops responding until the RPC connection times out.

Cluster Administrator stops responding when a node fails.

Cause: Cluster Administrator may be slow in doing dynamic updates.

Solution: If Cluster Administrator stops responding when a node fails, make sure that Cluster Administrator is not just slow in doing dynamic updates. If the Cluster service is running on a remaining node, Cluster Administrator is either not responding or is updating very slowly. There are two ways to see if the Cluster service is running on a remaining node:

- Use the TCP/IP Ping utility to ping the cluster name on a remaining node.
- In Control Panel, double-click Services, and check whether the Cluster service is running.

Cannot connect to cluster from recent file list.

Cause: Files listed in the Cluster Administrator recent file list (both on the File menu and in the Open Connection to Cluster dialog box) have the cluster name appended to the path. For example, instead of Webclust1, the recent file list may list C:\Windows\Cluster\Webclust1. This problem occurs when Microsoft Visual C++ version 5.0 is installed.

Solution: To work around this problem, manually type the cluster name when you open the connection.

Node performance is sluggish and node fails.

Cause: The CPU may be overloaded.

Solution: Check that your node is not processor-bound; that is, that the CPU is not running at 100-percent utilization. If you try to run too many resources for the node capacity, you can overload the CPU.

Also, review the size of your paging file. If the paging file is too small, the Cluster service can detect this as a node failure and fail over the groups.

The cluster log contains numerous resource informational messages (for example, Entered LooksAlive, Entered Open, Entered Offline).

Cause: One or more of your Generic Script resources fills the cluster log with multiple copies of Entered LooksAlive, Entered Open, Entered Offline messages.

Solution: When creating a script for a Generic Script resource, do not use the LogInformation method when calling the LooksAlive function. For more information, see the Microsoft Platform Software Development Kit (SDK).

The Cluster service fails to start and returns an error code of ERROR_SHARING_VIOLATION (32) with event ID 1144 (NM_EVENT_REGISTER_NETWORK_FAILED).

Cause: The Internet Assigned Numbers Authority (IANA)-assigned port (3343) used by the cluster network driver (ClusNet) is bound to another process, preventing the Cluster service from starting.

Solution: Use port scanning and process termination utilities to identify and end the process that is bound to port 3343.

To do this:

1. Open Command Prompt.
2. Navigate to the %systemroot%\system32 directory.
3. Type netstat -a -o.

This displays all listening and connected ports and the process ID of each process bound to each port. Port 3343 will appear for each cluster network on the node.

Notes
- The -a option displays all connections and listening ports. Server clusters use UDP, so the ports are normally in listening mode rather than in connections.
- The -o option displays the owning process ID.

4. Type tasklist.

This displays the IDs of all the processes running on the node, including the process ID of the Cluster service (ClusSvc.exe).

5. Type taskkill /pid ID to terminate each process bound to port 3343 whose ID does not match the ID of the Cluster service.
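
The comparison in the steps above (find every process ID bound to port 3343, then exclude the Cluster service) can be sketched in Python. This is an illustration under assumptions: it expects output from `netstat -a -n -o`, where the added -n option shows ports as numbers, the sample text and process IDs are invented, and the function names are hypothetical. It only reports candidate process IDs; it does not terminate anything.

```python
CLUSTER_PORT = 3343


def pids_bound_to_port(netstat_output, port=CLUSTER_PORT):
    """Return the set of process IDs whose local address uses the given port."""
    pids = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with the protocol; the PID is the last column.
        if len(fields) < 4 or fields[0] not in ("TCP", "UDP"):
            continue
        local_address = fields[1]
        if local_address.rsplit(":", 1)[-1] == str(port) and fields[-1].isdigit():
            pids.add(int(fields[-1]))
    return pids


def conflicting_pids(netstat_output, cluster_service_pid, port=CLUSTER_PORT):
    """Return the PIDs bound to the port that are not the Cluster service."""
    return pids_bound_to_port(netstat_output, port) - {cluster_service_pid}


# Invented sample in the layout of `netstat -a -n -o` output.
SAMPLE_NETSTAT = """
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       884
  UDP    10.0.0.1:3343          *:*                                    1520
  UDP    192.168.0.1:3343       *:*                                    2044
"""

# Assume tasklist reported 1520 as the PID of ClusSvc.exe.
print(sorted(conflicting_pids(SAMPLE_NETSTAT, cluster_service_pid=1520)))
```

Any process ID the sketch returns should still be checked against the tasklist output before you use taskkill on it.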

You cannot manually restore the cluster database on a local node by copying the systemroot\cluster\CLUSDB file from another node.

Cause: If the cluster registry hive is already locked and loaded by the Cluster service, the operating system will prevent you from copying a local CLUSDB file or overwriting an existing CLUSDB file on another node.

Solution: Stop the Cluster service. Then unload the HKEY_LOCAL_MACHINE\Cluster hive before restoring the cluster database file.

To do this:

1. Open Command Prompt.
2. Type net stop clussvc to stop the Cluster service.
3. Use the Registry Editor to unload the hive under HKEY_LOCAL_MACHINE\Cluster. For more information, see Unload a hive from the registry.

The operating system will now allow you to copy the CLUSDB file from a node and manually restore it to another node.

For information about how to obtain product support, see Technical support options.
