The Cluster Configuration File on a Node is Corrupt

Applies To: Windows Server 2003, Windows Server 2003 R2, Windows Server 2003 with SP1, Windows Server 2003 with SP2

To understand this topic in context, see the flowchart in Troubleshooting Quorum Resource Problems.

If you try to start the Cluster service with the /fixquorum option on one node at a time, and the attempt fails on one node although it succeeds on another, the cluster configuration file on the node where the Cluster service fails to start might be corrupt. This problem is separate from problems with the quorum resource, but it is covered here because you might discover it while troubleshooting the quorum resource.

Cause

There are a variety of possible causes. Like any file, the cluster configuration file (CLUSDB) on a node can become corrupted.

Solution

First, correct any problems with the quorum resource and get the cluster running again (without the problem node). Analyze the local disk on the problem node and take appropriate corrective steps, as described in the procedure that follows. If the local disk on the node appears to be free of problems or has been fixed, try to join the cluster from that node. In some cases, the node will be able to join, and the cluster configuration file on the node will be re-created automatically. However, if it appears that a corrupt cluster configuration file is preventing the node from joining the cluster, use the procedure in this section to get a good copy of the cluster configuration database from the quorum resource and copy it to the node.

To analyze the state of the local disk on a node that cannot be started with the /fixquorum option

  1. On the node on which you cannot start the Cluster service with the /fixquorum option, try to load and unload the cluster registry hive (collection of keys, subkeys, and values) by carrying out the following steps:

    1. At a command prompt, type:

      regedit

    2. Click HKEY_LOCAL_MACHINE and make sure it is selected.

    3. On the File menu, click Load Hive.

    4. Browse to the systemroot\Cluster folder, and then double-click CLUSDB. When you are prompted for a key name, type cluster.

    5. If you can successfully load the hive, under HKEY_LOCAL_MACHINE, make sure that Cluster is selected and then, on the File menu, click Unload Hive. When prompted for confirmation, click Yes. At this point, you might need to perform additional troubleshooting. One step to take is to compare the cluster hive on this node with the cluster hive on another node. Another step is to make sure the cluster hive is unloaded and then try starting the Cluster service again on the node. A third step is to check the system event log for errors that indicate that incorrect permissions or user rights are preventing the Cluster service account from accessing or loading the local registry hive.

    6. If you cannot load the hive, it is corrupt. Continue to the next step.

  2. If CLUSDB is corrupt, or if you want to check the condition of the local disk, run a disk utility that provides preliminary information about whether the disk appears to have problems. For example, to find out whether the dirty bit is set on the disk (indicating the file system may be in an inconsistent state), you can type the following at a command prompt:

    fsutil dirty query d:

    where d: is the disk you are interested in. Another option is to use the Vrfydsk.exe utility, which is available from Windows Server 2003 Resource Kit Tools at the Microsoft Web site. Using information from the previous step along with information about the age and previous history of the local disk, assess whether the disk probably contains errors and needs to be fixed. To fix the disk, at a command prompt, type:

    chkdsk d: /f

    or

    chkdsk d: /f /r

    where d: is the disk you want to fix. Note that on a large disk volume, this command will take a long time to run, especially if the /r option is used.

  3. If information from the previous steps and other disk symptoms indicate that the local disk is failing, replace the disk, restore from backup, and assess whether further troubleshooting is needed.
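The hive check and the dirty-bit check in the preceding procedure can also be scripted. The following Python sketch is one possible way to do that and is not part of the documented procedure. It assumes that reg.exe is available on the node and run with administrative rights, and that fsutil reports the dirty bit with the phrases "is Dirty" and "is NOT Dirty"; the function names are hypothetical.

```python
import subprocess


def hive_check_commands(system_root, key_name="cluster"):
    """Build the reg.exe commands that load and then unload CLUSDB.

    system_root is the value of systemroot on the node (an input, not
    guessed here); key_name matches the key name typed in Regedit.
    """
    hive_file = system_root.rstrip("\\") + "\\Cluster\\CLUSDB"
    key = "HKLM\\" + key_name
    return [["reg", "load", key, hive_file], ["reg", "unload", key]]


def hive_loads(system_root):
    """Return True if CLUSDB loads as a hive; always unload it afterward."""
    load_cmd, unload_cmd = hive_check_commands(system_root)
    if subprocess.run(load_cmd).returncode != 0:
        return False  # the hive could not be loaded; treat it as corrupt
    subprocess.run(unload_cmd, check=True)
    return True


def volume_is_dirty(fsutil_output):
    """Interpret the text printed by 'fsutil dirty query d:'.

    Assumption: the output contains 'is NOT Dirty' or 'is Dirty'.
    """
    text = fsutil_output.lower()
    if "not dirty" in text:
        return False
    if "is dirty" in text:
        return True
    raise ValueError("unrecognized fsutil output: " + fsutil_output)
```

For example, volume_is_dirty("Volume - d: is NOT Dirty") returns False. The command lists are built separately from the code that runs them so that they can be reviewed before anything touches the registry.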

If the local disk on the node appears to be free of problems or has been fixed, try to join the cluster from that node. In some cases, the node will be able to join, and the cluster configuration file on the node will be re-created automatically. However, if it appears that a corrupt cluster configuration file is preventing the node from joining the cluster, use the following procedure to get a good copy of the cluster configuration database from the quorum resource and copy it to the node.

To replace a corrupt cluster configuration file on a cluster node

  1. In Cluster Administrator, view the functioning nodes in the cluster, and find the node that owns the quorum resource.

  2. From the node that owns the quorum resource, view the files on the quorum resource, and locate the Chkxxx.tmp file.

  3. On the problem node, which is not joined to the cluster at the moment, in the systemroot\cluster folder, locate the CLUSDB file (which you have determined is corrupt) and then rename it.

  4. Copy the Chkxxx.tmp file to the systemroot\cluster folder on the problem node, and then rename that file CLUSDB.

    If the problem has been corrected on the node, you will be able to start the Cluster service with no start parameters.
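The rename-and-copy steps above can be sketched in Python as follows. This is a minimal illustration rather than a supported tool: the function name and the backup file name CLUSDB.corrupt are hypothetical, and the caller must supply the actual paths to the cluster folder on the problem node and to the checkpoint file on the quorum resource. It assumes the Cluster service is stopped on the problem node.

```python
import shutil
from pathlib import Path


def replace_clusdb(cluster_dir, checkpoint_file, backup_name="CLUSDB.corrupt"):
    """Rename the corrupt CLUSDB and put a copy of the checkpoint in its place.

    cluster_dir: the systemroot\\cluster folder on the problem node.
    checkpoint_file: the checkpoint file located on the quorum resource.
    backup_name: hypothetical name given to the renamed corrupt file.
    """
    cluster_dir = Path(cluster_dir)
    checkpoint_file = Path(checkpoint_file)
    clusdb = cluster_dir / "CLUSDB"
    backup = cluster_dir / backup_name

    # Validate before changing anything so a failure leaves the node as it was.
    if not checkpoint_file.is_file():
        raise FileNotFoundError(str(checkpoint_file))
    if backup.exists():
        raise FileExistsError(str(backup))

    # Keep the corrupt file under another name instead of deleting it.
    if clusdb.exists():
        clusdb.rename(backup)

    # Copy (not move) so the checkpoint on the quorum resource stays intact.
    shutil.copyfile(checkpoint_file, clusdb)
    return backup
```

The checkpoint file is copied rather than moved because the running cluster still depends on the copy held on the quorum resource.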