Node-to-node connectivity problems

10/08/2009

Applies To: Windows Server 2003, Windows Server 2003 R2, Windows Server 2003 with SP1, Windows Server 2003 with SP2

What problem are you having?

I cannot complete a cluster on the first node.
When the resources fail over and the nodes do not detect each other, there is no connectivity between the nodes or with the cluster storage device.
Quorum resource does not start.
Quorum resource fails.
Quorum log becomes corrupted.
Additional node cannot join the cluster.
Nodes cannot connect to the cluster drives.
The cluster quorum disk (containing the quorum resource) becomes disconnected from all nodes in a cluster and you are later unable to add the nodes back to the cluster.

I cannot complete a cluster on the first node.

Cause: Windows Server 2003, Enterprise Edition or Windows Server 2003, Datacenter Edition is incorrectly installed.

Solution: Make sure that Windows Server 2003, Enterprise Edition or Windows Server 2003, Datacenter Edition is correctly installed.

Cause: Your hardware is not listed in the Cluster category on the Microsoft Windows Catalog.

Solution: Make sure that your hardware is listed on the Microsoft Windows Catalog. See the Windows Catalog at Support resources. Search for "Cluster".

If any of the hardware you are using for your cluster is not on this list, consider replacing those components with hardware that is listed.

Cause: You have chosen individual components from the Windows Catalog instead of systems.

Solution: Do not choose individual components of your cluster from the Microsoft Windows Catalog. Microsoft supports only systems chosen from this list.

Cause: Your primary Internet Protocol (IP) address is invalid.

Solution: If the node uses DHCP to obtain noncluster IP addresses, use Ipconfig.exe to verify that you have a valid primary IP address for all network adapters. If the second IP address listed (the subnet mask) is 0.0.0.0, your primary address is invalid.
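The subnet mask check can also be applied to saved Ipconfig.exe output. The following is a minimal sketch, assuming the output has been captured to text and uses the English "Subnet Mask" label under unindented adapter headings; the sample text and the function name are illustrative only.

```python
import re

# Hypothetical sample of captured Ipconfig.exe output (English labels assumed).
SAMPLE_OUTPUT = """\
Windows IP Configuration

Ethernet adapter Public:

   Connection-specific DNS Suffix  . :
   IP Address. . . . . . . . . . . . : 10.0.0.21
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 10.0.0.1

Ethernet adapter Private:

   Connection-specific DNS Suffix  . :
   IP Address. . . . . . . . . . . . : 0.0.0.0
   Subnet Mask . . . . . . . . . . . : 0.0.0.0
   Default Gateway . . . . . . . . . :
"""


def adapters_with_invalid_mask(ipconfig_text):
    """Return the headings of adapters whose subnet mask is 0.0.0.0."""
    invalid = []
    current_adapter = None
    for line in ipconfig_text.splitlines():
        # Adapter headings are unindented and end with a colon.
        if line and not line[0].isspace() and line.rstrip().endswith(":"):
            current_adapter = line.rstrip().rstrip(":")
            continue
        match = re.match(r"\s*Subnet Mask[ .]*:\s*(\S+)", line)
        if match and match.group(1) == "0.0.0.0" and current_adapter:
            invalid.append(current_adapter)
    return invalid


if __name__ == "__main__":
    for adapter in adapters_with_invalid_mask(SAMPLE_OUTPUT):
        print("Invalid primary address on:", adapter)
```

Any adapter the function reports should be treated as having an invalid primary address, as described above.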

When the resources fail over and the nodes do not detect each other, there is no connectivity between the nodes or with the cluster storage device.

Cause: The Remote Procedure Call (RPC) service is not running.

Solution: On each node, use Services in Control Panel to confirm that the RPC service is running.

Cause: The nodes do not have RPC connectivity.

Solution: Verify that the nodes have RPC connectivity.

You can determine this by using a network analyzer (such as Network Monitor), or you can use RPCPing (available on the Microsoft Exchange Server CD).
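As a quick first check before using those tools, you can test whether one node can open a TCP connection to the RPC endpoint mapper on another node, which listens on TCP port 135. The sketch below is not a substitute for RPCPing: a successful connection shows only that the port is reachable, not that RPC calls succeed. The host name node2 is hypothetical.

```python
import socket

RPC_ENDPOINT_MAPPER_PORT = 135  # well-known TCP port of the RPC endpoint mapper


def can_connect(host, port=RPC_ENDPOINT_MAPPER_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # "node2" is a hypothetical node name; replace it with a real one.
    print("RPC endpoint mapper reachable:", can_connect("node2"))
```

If the connection fails, check firewalls and network configuration between the nodes before investigating RPC itself.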

Quorum resource does not start.

Cause: The resource is not physically connected to the server.

Solution: Make sure that the resource is physically connected to the server.

Quorum resource fails.

Cause: The disk on the shared bus holding the quorum resource has failed.

Solution: If the disk on the shared bus holding the quorum resource fails and cannot be brought online, the Cluster service cannot start. To correct this situation, you must use the fixquorum option to start the Cluster service on a single node, and then use Cluster Administrator to configure the Cluster service to use a different disk on the shared bus for the quorum resource.

When fixquorum is specified, the Cluster service starts without a quorum resource, and does not bring the quorum disk online. A node cannot join a cluster when the Cluster service is running with the fixquorum option.

For instructions on how to change the disk that the Cluster service uses for the quorum resource, see Use a different disk for the quorum resource.
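The fixquorum option is passed as a start parameter to the Cluster service, whose service name is clussvc. The following sketch builds the corresponding net start command line and, by default, only returns it without running it. The helper names are illustrative, and the command is meant to be run on a single node only.

```python
import subprocess


def fixquorum_command():
    """Return the argument list that starts the Cluster service with fixquorum."""
    # clussvc is the service name of the Cluster service.
    return ["net", "start", "clussvc", "/fixquorum"]


def start_with_fixquorum(dry_run=True):
    """Return the command; run it as well only when dry_run is False."""
    command = fixquorum_command()
    if not dry_run:
        # Run this on one node only; other nodes cannot join while it is active.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    print(" ".join(start_with_fixquorum()))
```

After the service starts this way, use Cluster Administrator to select a different quorum disk, then restart the Cluster service normally.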

Quorum log becomes corrupted.

Cause: This may occur for a variety of reasons.

Solution: If the quorum log is corrupted, the Cluster service attempts to correct the problem by resetting the log file. In this case, the Cluster service writes the following message in the system log:

The log file [name] was found to be corrupt. An attempt will be made to reset it.

If the quorum log cannot be reset, the Cluster service cannot start.

If the Cluster service fails to detect that the quorum log is corrupted, the Cluster service may fail to start. In this case, there may be an "ERROR_CLUSTERLOG_CORRUPT" message in the system log.

To correct this, you must use the noquorumlogging option when starting the Cluster service to temporarily run the Cluster service without quorum logging. You can then correct the disk corruption and delete the quorum log, as necessary. When noquorumlogging is specified, the Cluster service brings the quorum disk online, but disables quorum logging. You can then run Chkdsk on your quorum disk to detect and correct disk corruption.
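The two steps named above, starting the service with noquorumlogging and then running Chkdsk, can be laid out as a short command plan. This is a sketch under assumptions: the drive letter passed in is whatever letter your quorum disk uses (Q is only an example), and /f is the Chkdsk switch that fixes errors on the disk. The function only returns the commands; it does not run them.

```python
def noquorumlogging_plan(quorum_drive):
    """Return the commands for starting without quorum logging, then Chkdsk."""
    letter = quorum_drive.rstrip(":\\").upper()
    if len(letter) != 1 or not letter.isalpha():
        raise ValueError("expected a drive letter such as 'Q' or 'Q:'")
    return [
        # Brings the quorum disk online but disables quorum logging.
        ["net", "start", "clussvc", "/noquorumlogging"],
        # Detects and corrects disk corruption on the quorum disk.
        ["chkdsk", letter + ":", "/f"],
    ]


if __name__ == "__main__":
    # "Q" is an example drive letter, not a value taken from your cluster.
    for command in noquorumlogging_plan("Q"):
        print(" ".join(command))
```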

For instructions on how to recover from a corrupted quorum log or quorum disk, see Recover from a corrupted quorum log or quorum disk.
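When reviewing exported system log text, the two messages described in this section can be told apart with a simple classifier. This is a minimal sketch that assumes the message wording matches the text quoted above; the category labels and the sample log file path are illustrative, not values produced by the Cluster service.

```python
import re

RESET_ATTEMPT = re.compile(
    r"The log file (?P<name>.+?) was found to be corrupt\. "
    r"An attempt will be made to reset it\."
)
CORRUPT_CODE = "ERROR_CLUSTERLOG_CORRUPT"


def classify_quorum_log_message(message):
    """Return (category, log_file_name) for one system log message."""
    match = RESET_ATTEMPT.search(message)
    if match:
        # The Cluster service detected the corruption and is resetting the log.
        return ("reset-attempted", match.group("name"))
    if CORRUPT_CODE in message:
        # Corruption that the service did not correct on its own.
        return ("corrupt-not-reset", None)
    return ("unrelated", None)


if __name__ == "__main__":
    # The path below is a hypothetical example.
    sample = (r"The log file Q:\MSCS\quolog.log was found to be corrupt. "
              "An attempt will be made to reset it.")
    print(classify_quorum_log_message(sample))
```

A "corrupt-not-reset" result corresponds to the case in which the noquorumlogging option is needed.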

Additional node cannot join the cluster.

Cause: The cluster configuration on the node may not have been completely removed if the node was previously evicted.

Solution: At a command prompt, type cluster [cluster name] node node name /forcecleanup, where node name is the name of the previously evicted node.

When an additional node fails to join a cluster, improper name resolution is often the cause. The problem may exist because of invalid data in the WINS cache. You may also have the wrong binding on the WINS or DNS Services for the additional node.
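To see what the additional node actually resolves, you can look up the cluster name and the first node's name from that node. The sketch below uses the operating system's resolver, so it reflects whichever name resolution mechanism the node is configured to use; it does not inspect the WINS cache or bindings. The names mycluster and node1 are hypothetical.

```python
import socket


def resolve_names(names):
    """Map each name to the IPv4 address it resolves to, or None on failure."""
    results = {}
    for name in names:
        try:
            results[name] = socket.gethostbyname(name)
        except OSError:
            # socket.gaierror (resolution failure) is a subclass of OSError.
            results[name] = None
    return results


if __name__ == "__main__":
    # Hypothetical cluster and node names; replace them with your own.
    for name, address in resolve_names(["mycluster", "node1"]).items():
        print(name, "->", address if address else "not resolved")
```

Compare the results with the addresses you expect. A wrong or missing address points to a name resolution problem rather than a cluster configuration problem.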

If WINS or DNS is functioning correctly on all nodes:

Cause: You may not be using the proper cluster name, node name, or IP address.

Solution: Confirm that you are using the proper cluster name, node name, or IP address.

When joining a cluster, you can specify the cluster name, the computer name of the first node, or the IP address of either the cluster or the first node.

Cause: The Cluster Name resource may not have started.

Solution: Confirm that the Cluster Name resource started.

Use Cluster Administrator on the first node to ensure that the Cluster Name resource is running.

Cause: The Cluster service may not be running on the first node.

Solution: Confirm that the Cluster service is running on the first node and that all resources within the Cluster Group are online before installing a second node.

The Cluster service may not yet have started when you attempted to join the cluster.

Cause: Network connectivity may not exist between the nodes.

Solution: Confirm that network connectivity exists between the nodes.

Make sure TCP/IP is properly configured on all nodes.

Cause: You may not have IP connectivity to the cluster address.

Solution: Confirm that you have IP connectivity to the cluster address and that the IP address is assigned to the correct network.

If you cannot ping the IP address of the cluster, run Cluster Administrator on the first node and ensure the cluster IP Address resource is running. Also, use Cluster Administrator to ensure that the cluster has a valid IP address and subnet mask (click Cluster Group, right-click Cluster IP Address, and click Properties), and that the IP address does not conflict with an IP address that is already in use on the network. If the address is not valid, change it, take the Cluster IP Address resource offline, and then bring it online again. If the IP address is not assigned to the correct network, use Cluster Administrator to correct the problem.

If your cluster nodes use DHCP to obtain noncluster IP addresses, use Ipconfig.exe to verify that you have a valid primary IP address for the networks in question. If the second IP address listed (the subnet mask) is 0.0.0.0, your primary address is invalid.
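The address checks described above (valid address and subnet mask, correct network, no conflict with an address already in use) can be sketched with Python's standard ipaddress module. The function name, its parameters, and the sample addresses are illustrative assumptions; the list of in-use addresses has to come from your own records, because the code does not probe the network.

```python
import ipaddress


def check_cluster_address(address, subnet_mask, expected_network=None, in_use=()):
    """Return a list of problems with a cluster IP address (empty if none)."""
    try:
        ip = ipaddress.IPv4Address(address)
    except ValueError:
        return [f"{address} is not a valid IPv4 address"]
    if subnet_mask == "0.0.0.0":
        return ["subnet mask is 0.0.0.0, so the address is not valid"]
    try:
        network = ipaddress.IPv4Network(f"{address}/{subnet_mask}", strict=False)
    except ValueError:
        return [f"{subnet_mask} is not a valid subnet mask"]

    problems = []
    if ip in (network.network_address, network.broadcast_address):
        problems.append(f"{ip} is the network or broadcast address of {network}")
    if expected_network is not None and ip not in ipaddress.IPv4Network(expected_network):
        problems.append(f"{ip} is not in the expected network {expected_network}")
    if ip in {ipaddress.IPv4Address(a) for a in in_use}:
        problems.append(f"{ip} conflicts with an address already in use")
    return problems


if __name__ == "__main__":
    # Sample values only; substitute the cluster's real settings.
    print(check_cluster_address("10.0.0.5", "255.255.255.0",
                                expected_network="10.0.0.0/24",
                                in_use=["10.0.0.5"]))
```

An empty list means none of these particular problems was found; it does not confirm that the Cluster IP Address resource is online.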

Cause: The network role may have changed from Use for all communications.

Solution: Confirm that the network role has not been changed from Use for all communications.

A network role is initialized to Use for all communications but can be changed by the administrator. After verifying that you have IP connectivity to the cluster address and that the IP address is assigned to the correct network, use Cluster Administrator to confirm that at least one of the networks connected between the nodes is enabled to Use for all communications. To verify the role in Cluster Administrator, open the Networks folder, right-click Network, and then click Properties. Use for all communications enables the network to have both node-to-node communication and client-to-cluster communication.

For more information on networks, see Server cluster networks. For more information on network roles, see Configuring cluster network hardware. To use Cluster Administrator to reset the network's role to Use for all communications, see Enable a network for cluster use.

Nodes cannot connect to the cluster drives.

Cause: The same drive letters may not have been assigned to the cluster drives on all nodes.

Solution: Confirm that the cluster drives are assigned the same drive letters on all nodes.

To do so, run Disk Management on each node and make sure that identical drive letters are assigned to all cluster drives.

Where?

Computer Management/Storage/Disk Management
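If you record the drive letters shown in Disk Management on each node, the comparison can be automated. The following sketch assumes you have entered the assignments by hand into a dictionary; the node names and disk labels are hypothetical, and the code does not query Disk Management itself.

```python
def mismatched_drive_letters(assignments):
    """Find cluster disks whose drive letter differs between nodes.

    assignments maps node name -> {disk label: drive letter}.
    Returns {disk label: {node name: letter or None}} for mismatched disks.
    """
    disks = set()
    for mapping in assignments.values():
        disks.update(mapping)

    mismatches = {}
    for disk in sorted(disks):
        letters = {}
        for node, mapping in assignments.items():
            letter = mapping.get(disk)
            # Normalize "q", "Q:" and "Q" to the same form.
            letters[node] = letter.rstrip(":").upper() if letter else None
        if len(set(letters.values())) > 1:
            mismatches[disk] = letters
    return mismatches


if __name__ == "__main__":
    # Hypothetical assignments recorded from Disk Management on two nodes.
    recorded = {
        "node1": {"quorum": "Q:", "data": "R:"},
        "node2": {"quorum": "Q:", "data": "S:"},
    }
    print(mismatched_drive_letters(recorded))
```

Any disk reported here needs its drive letter changed so that it is identical on all nodes.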

Cause: The node may not be physically connected to the cluster drive.

Solution: Confirm that the node is physically connected to the cluster drive.

If it is not, shut down all nodes and the cluster drive. Connect the nodes to the shared storage bus. Then, start the cluster drive and start the first node. After the Cluster service starts on the first node, start the additional nodes, and attempt to connect to the cluster drive.

Cause: If you are using a shared SCSI bus, the SCSI devices may not have unique IDs.

Solution: Verify that each SCSI device has a unique ID. SCSI controller IDs are preset to seven, so two controllers on the same shared bus conflict until one is changed. Reset one SCSI controller ID to six.
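The uniqueness rule is easy to check once the IDs on the shared bus are written down. The sketch below assumes you have listed each device and its SCSI ID by hand; the device names are hypothetical.

```python
def duplicate_scsi_ids(bus_ids):
    """Return {scsi_id: [device names]} for every ID used more than once.

    bus_ids maps a device name to its SCSI ID on the shared bus.
    """
    devices_by_id = {}
    for device, scsi_id in bus_ids.items():
        devices_by_id.setdefault(scsi_id, []).append(device)
    return {
        scsi_id: sorted(devices)
        for scsi_id, devices in devices_by_id.items()
        if len(devices) > 1
    }


if __name__ == "__main__":
    # Hypothetical shared bus: both controllers still use the preset ID.
    shared_bus = {"node1-controller": 7, "node2-controller": 7, "disk0": 0}
    print(duplicate_scsi_ids(shared_bus))
```

In the sample data both controllers share one ID, which is the conflict that resetting one controller resolves.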

Cause: The controllers on the shared storage bus may not be correctly configured.

Solution: Confirm that the controllers on the shared storage bus are correctly configured (with all cards configured to transfer data at the same rate).

Cause: The devices and controllers may not match.

Solution: Confirm that your devices and controllers match.

For example, do not use a wide connection controller on one node and a narrow connection controller on another node. It is also recommended that all Fibre Channel controllers be homogeneous, so do not use different brands of controllers together.

The cluster quorum disk (containing the quorum resource) becomes disconnected from all nodes in a cluster and you are later unable to add the nodes back to the cluster.

Cause: The cluster configuration on the nodes may not have been completely removed.

Solution #1: Use the cluster.exe node /force[cleanup] command to evict the nodes from the cluster. For more information, see Evict a node from the cluster.

Solution #2: Use the Cluster service fixquorum startup parameter to start the Cluster service. Only one node at a time can be started with this parameter. You cannot join any other nodes to the node started using this parameter. For more information, see Recover from a corrupted quorum log or quorum disk.

For information about how to obtain product support, see Technical support options.
