Cluster Troubleshooting

David Libertone

Chapter 9 from Windows NT Cluster Server Guidebook, published by Prentice Hall

Troubleshooting is one of the most aggravating tasks an administrator must perform, but at the same time, one of the most gratifying. All the time invested and mistakes made are forgotten as the adrenaline rush from solving a new problem occurs. As we gloat over our success, we ponder whether we should document our victory or let the same problem challenge the troubleshooting skills of someone else.

In my opinion, troubleshooting should be a methodical process. I have seen many individuals troubleshoot by randomly trying various combinations until the problem goes away. This usually works, but being methodical can narrow down the possible cause of the problem faster. I use what I once heard called "the divide and conquer" method, which involves making one change that eliminates one possible cause of the problem. For example, if a client cannot access a server, you may want to use the "ping" utility to verify potential connectivity between the client and the server. If the ping utility is successful, the network is functional, and you move on to another test. If the test fails, the network may be having a problem. Next, attempt to ping devices between the client and the server such as routers to isolate the segment or router that may be malfunctioning.
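The divide-and-conquer idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hop names are hypothetical, and `probe` stands in for whatever test is used (for example, a wrapper around the ping utility). It walks the path from the client toward the server and reports the first device that does not answer.

```python
from typing import Callable, Optional, Sequence


def first_unreachable(path: Sequence[str],
                      probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first hop on the path that fails the probe, or None.

    `path` lists devices in order from the client to the server
    (routers first, then the server). `probe` is an assumed callable
    that returns True when a device answers, e.g. a ping wrapper.
    """
    for hop in path:
        if not probe(hop):
            # The fault lies at this hop or on the segment leading to it.
            return hop
    return None


# Hypothetical path and canned probe results, for illustration only.
reachable = {"router-a": True, "router-b": False, "server": False}
print(first_unreachable(["router-a", "router-b", "server"],
                        reachable.__getitem__))  # prints router-b
```

Each probe eliminates one possible cause, which is the point of the method: the search stops at the first failing device instead of trying changes at random.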

Successful troubleshooting requires knowledge in many areas, such as NetBIOS, network protocols, network hardware, computer hardware, Windows NT security, and Windows NT performance. To test and eliminate various components as potential problems, it is necessary to know what tool to use. For example, the ping utility tests the ability of two computers to communicate via TCP/IP. The "NET VIEW \\computername" command, however, tests NetBIOS connectivity. If TCP/IP is the only protocol loaded, the "net view" command, if successful, proves that both TCP/IP and NetBIOS are functional.

Portions of this chapter will repeat material from previous chapters. Some new information will also be provided. The goal of this chapter is to give a cluster administrator consolidated information on how to troubleshoot various cluster problems that could arise.

Troubleshooting Tools

Successful troubleshooting begins by using the correct tool for the problem. Windows NT provides many such tools. Each one is useful for testing one or more components of the operating system.

Disk Administrator

A successful installation of Cluster Server requires that both cluster members assign the same drive letters for the partitions on the shared SCSI bus. This is accomplished with the Disk Administrator utility. The same feature of Disk Administrator that allows the administrator to properly configure drive letters also allows drive letter problems to be introduced. If an administrator changes drive letter assignments to a device on the shared SCSI bus, but fails to perform the same configuration on the other cluster member, problems will result. The Disk Administrator utility can be used to view drive letter assignments. If a mismatch in drive letters occurs, use the "Assign drive letter" option under "Tools" to again make the drive letters consistent between cluster members. Disk Administrator can also be used to determine which cluster member has control of a disk. See Figure 9-1. This is useful when there are problems bringing resources, such as the quorum resource, online. The cluster member that has control of the disk will properly display the partition information of the disk. The cluster member that does not have control of the device will display an entry for the device with the text, "Configuration information not available."
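Once the drive letter assignments have been read from Disk Administrator on each cluster member, comparing them is mechanical. The following Python sketch assumes the assignments have been transcribed by hand into dictionaries; the partition labels such as "disk1-part1" are hypothetical identifiers, not something Disk Administrator produces.

```python
from typing import Dict, List, Tuple


def drive_letter_mismatches(node_a: Dict[str, str],
                            node_b: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Compare drive letters per shared partition on two cluster members.

    Each dict maps a partition label (hypothetical, e.g. "disk1-part1")
    to the drive letter that node assigns to it. Returns a list of
    (partition, letter_on_a, letter_on_b) for every difference;
    "?" marks a partition for which a node has no letter recorded.
    """
    mismatches = []
    for partition in sorted(set(node_a) | set(node_b)):
        a = node_a.get(partition, "?")
        b = node_b.get(partition, "?")
        if a != b:
            mismatches.append((partition, a, b))
    return mismatches


# Illustrative data only.
node1 = {"disk1-part1": "E:", "disk2-part1": "F:"}
node2 = {"disk1-part1": "E:", "disk2-part1": "G:"}
print(drive_letter_mismatches(node1, node2))
```

An empty result means the two members agree; any entry returned points at a partition whose letter should be corrected with the "Assign drive letter" option.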

Figure 9-1: Examining Disk configuration with Disk Administrator

Task Manager

The Task Manager program gives the administrator a quick view of the operating system and its current load. Task Manager is invoked by right-clicking on the taskbar and selecting the "Task Manager" option. Task Manager allows three different views of the operating system. The first one is the "Applications" view, which shows what Windows applications are running. This view can be used to terminate a malfunctioning application by highlighting the application and selecting the "End Task" button. This view is the equivalent of the Windows task list.

The second view offered by Task Manager is the "Processes" view. See Figure 9-2. The "Processes" view displays all processes currently running on the system. The amount of activity that is displayed for a process is the amount of activity by the process since it was created. If the Task Manager utility is closed, it has no effect on the counters. For example, the CPU time displayed for a process will be the total amount of CPU time the process has consumed since the process was created. Processes created by the Cluster Server software can be monitored for activity. The processes pertinent to the Cluster are:

  • Clussvc.exe (This is the main cluster service.)

  • Clusprxy.exe

  • Resrcmon.exe

Figure 9-2: Displaying process list with Task Manager

Use the "Processes" display to determine if any of the cluster related processes are logging any CPU activity. The cluster processes will not log any activity unless a request is sent to the process. For example, the clussvc process logs CPU activity when the Cluster Administrator utility is executed and makes a connection to the Cluster Service. A resource monitor logs CPU time when one of the resources it manages has activity. There is other process-related data that can be displayed, such as the number of page faults. To add or remove columns from the window, use the "View" menu option and then the "Select Columns" option. Make changes to the fields displayed by selecting and de-selecting the appropriate fields.
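If a process listing is available as data, the cluster-related entries can be picked out automatically. The Python sketch below assumes the listing has been transcribed or exported by some other means as (image name, CPU seconds) pairs; Task Manager itself is only a viewer, so the input format here is an assumption.

```python
# Image names of the cluster processes listed in this chapter.
CLUSTER_PROCESSES = {"clussvc.exe", "clusprxy.exe", "resrcmon.exe"}


def cluster_process_report(processes):
    """Summarize the cluster-related entries in a process listing.

    `processes` is an iterable of (image_name, cpu_seconds) pairs.
    Returns (cpu_by_name, missing): total CPU seconds per cluster
    process found, and the sorted names of cluster processes that
    do not appear in the listing at all.
    """
    cpu = {}
    for image_name, cpu_seconds in processes:
        key = image_name.lower()
        if key in CLUSTER_PROCESSES:
            # Several resource monitors may run at once; add them up.
            cpu[key] = cpu.get(key, 0.0) + cpu_seconds
    missing = sorted(CLUSTER_PROCESSES - set(cpu))
    return cpu, missing


# Illustrative listing only.
listing = [("Clussvc.exe", 12.0), ("Resrcmon.exe", 3.0),
           ("resrcmon.exe", 1.5), ("explorer.exe", 40.0)]
print(cluster_process_report(listing))
```

A cluster process that is missing from the listing deserves attention before one that merely shows little CPU time, since the cluster processes are normally idle until a request arrives.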

Figure 9-3: Examining system performance with Task Manager

The "Performance" display gives a quick overview of the load on various system resources such as processor and memory consumption. See Figure 9-3. Use this display to determine if the system is being overloaded. If it is, further analysis with Performance Monitor may be necessary.

Useful items in the performance display include the size of the file cache and the number of handles. The file cache keeps certain data read from the disk in memory. If this data is needed again, a disk I/O is saved. When free memory is limited, the size of the file cache is reduced. The impact is more physical disk activity, which is slow compared to retrieving the same data from memory.

The amount of available memory is important. There should always be at least 4 to 6MB of free memory on a system.

Services Option in Control Panel

Figure 9-4: Displaying service list with the Services program

The Services option in Control Panel can be used to verify that the Cluster related services are running. See Figure 9-4. These services include the Cluster Server, the Remote Procedure Call service, and the Time Service. The Cluster Server and RPC Service should have a status of "Started." The exception is the Time Service, which is started by the Cluster Server software when necessary. Services have configuration information that is accessed by double-clicking the specific service. See Figure 9-5. For example, if it becomes necessary to change the password for the account under which the Cluster Server service is running, it must be changed in User Manager and also in the Service properties.

Figure 9-5: Displaying service properties

Services also have the option of accepting one or more startup parameters. The Cluster Service has parameters to fix certain problems that may arise. The parameters are discussed in more detail later. To pass a startup parameter to a service, do the following:

  1. Stop the service by highlighting the service and selecting the "Stop" option.

  2. Enter the data to be supplied to the service in the Startup Parameters box on the Services screen. See Figure 9-4.

  3. Restart the service by selecting the "Start" option.

Startup parameters are not saved. If the service is stopped and started, it will run without the startup parameter. To restart the service with the startup value, the data must be re-entered into the Startup Parameters box.

Figure 9-6: Displaying recent system events with Event Viewer

Event Viewer

The Event Viewer utility displays the Windows NT logging files. It is located in the Administrative Tools group. There are actually three logs that can be viewed through Event Viewer. The Cluster Server software writes messages into the system log. See Figure 9-6.

The cluster administrator should review the logs in Event Viewer regularly, even when there are no noticeable problems. If there is an entry in the event log, the detail can be displayed by double-clicking on the entry. See Figure 9-7.

Figure 9-7: Displaying event details

Net Helpmsg

The command "net helpmsg error-number," when issued from a command prompt, displays the text message of the error number supplied. This is very useful because often, program error handling displays only an error number. See Figure 9-8.

Figure 9-8: Translating an error number

In this example, the error number to be translated is 5. The error text for error number 5 is "Access is Denied." This utility works for most errors encountered.
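When error numbers turn up inside scripts, the same translation can be wrapped in a small helper. The Python sketch below is an assumption-laden convenience, not part of the original toolset: it simply builds and runs the "net helpmsg" command shown above, and it refuses to run anywhere other than Windows.

```python
import subprocess
import sys


def helpmsg_command(error_number: int) -> list:
    """Build the argument list for "net helpmsg <error-number>"."""
    if error_number < 0:
        raise ValueError("error number must be non-negative")
    return ["net", "helpmsg", str(error_number)]


def translate_error(error_number: int) -> str:
    """Run net helpmsg and return whatever text it prints (Windows only)."""
    if not sys.platform.startswith("win"):
        raise OSError("net helpmsg is only available on Windows")
    result = subprocess.run(helpmsg_command(error_number),
                            capture_output=True, text=True, check=False)
    return result.stdout.strip()
```

On a Windows system, `translate_error(5)` would run the same command as the example in Figure 9-8.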

Figure 9-9: Displaying the file shares offered by a network name resource

Net View

The command, "net view \\computername," or, "net view \\network_name," tests the ability to connect to a server with NetBIOS. See Figure 9-9. This is useful in testing whether file share resources are available or testing the validity of any cluster network name resource. If the command fails, it does not prove that NetBIOS is not functioning. The problem could be with NetBIOS name resolution. To test whether NetBIOS name resolution is the problem, issue the command, "net view \\ip_address." This command is new with Windows NT V4.0. If this command works and the first one does not, NetBIOS is functional, but NetBIOS name resolution is not working properly.

Ping Utility

The ping utility is a TCP/IP connection test. Since the cluster server software works only with the TCP/IP protocol, it is the only protocol that must be tested to troubleshoot cluster problems. To test whether TCP/IP is functional between a client and the cluster, or between cluster members, issue the command, "ping host_name," where host_name is the name of the system that is encountering connection problems. If the command is successful, TCP/IP is functional. If the command is not successful, some of the possible problems are:

  • An invalid TCP/IP address, subnet mask, or router entry on the server or client

  • A network failure between the client and the server

  • Host name resolution is not working properly.

To determine if host name resolution is the problem, issue the command, "ping address," where address is the TCP/IP address of the system being tested. If pinging by TCP/IP address is successful, but pinging by name fails, the problem is with host name resolution.
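The two ping tests combine into a simple decision. The Python sketch below only encodes the reasoning of the preceding paragraphs; the function name and the result strings are illustrative, and the two booleans are assumed to come from running the ping commands by hand.

```python
def diagnose_ping(by_name_ok: bool, by_address_ok: bool) -> str:
    """Interpret the results of "ping host_name" and "ping address"."""
    if by_name_ok:
        # A successful ping by name exercises both resolution and TCP/IP.
        return "TCP/IP is functional"
    if by_address_ok:
        # The address answers but the name does not.
        return "host name resolution problem"
    return ("TCP/IP problem: check the address, subnet mask, "
            "router entries, and the network path")


print(diagnose_ping(by_name_ok=False, by_address_ok=True))
# prints host name resolution problem
```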

A slight anomaly occurs when testing network name and IP address resources with the ping utility. In Figure 9-10, the IP address resource of 131.107.2.225 is tested. Also, the network name of MTXNAME, which has a dependency on the same IP address resource, is tested. The test is successful, but notice the TCP/IP address that replies to the test. It is not the TCP/IP address of the resource, but the actual TCP/IP address of the cluster member that currently owns the tested resources.

Figure 9-10: Testing an IP address resource with the "ping" command

Performance Monitor

Performance Monitor is covered extensively in a previous chapter, but there are a couple of counters to mention with regard to troubleshooting.

Network Monitor

Network Monitor is a graphical utility that functions like a network sniffer. It can provide information such as the overall load on a network segment and the source and destination addresses of each packet. This is useful in determining whether data is moving between the client and the server. The network monitor utility is covered more thoroughly in the chapter on performance.

Windows NT Diagnostics

Figure 9-11: Displaying device IRQs with the Diagnostics program

Windows NT Diagnostics has numerous screens to view various operating system-related data. See Figure 9-11. One display I find useful is the Resources display. This tab displays the settings of the devices for which Windows NT has assigned an IRQ and loaded a device driver. If there are problems getting Windows NT to recognize a device, check Windows NT Diagnostics to determine whether another device is using the same hardware settings and is stopping the device in question from being loaded.

Cluster Logging

Figure 9-12: Setting the cluster logging environment table

The Cluster Server can write detailed information to a file regarding significant events or problems. This is known as cluster logging. Cluster logging is not enabled by default, probably due to the extra load it can introduce to the disk. To enable cluster logging, an environment variable must be defined. To define this variable, access the System option from Control Panel and select the Environment tab. There are two sets of environment variables displayed. See Figure 9-12. The system variables are environment variables that are available system-wide. The user environment variables are available only to the current logged-on user. To enable cluster logging, do the following:

  • Select the System Environment area by mouse-clicking anywhere in the System Variables display.

  • Erase the data in the Variable and Data boxes and add the system environment variable CLUSTERLOG with a value that represents the path and filename the Cluster Server software should use to record events and error messages, and select the "Set" button. Verify that the entry appears in the System Variables window.

  • Restart the system for cluster logging to take effect.

The environment variable must be a system environment variable. A user variable will not work because the Cluster Service is usually running under a different user name. Make sure to enable cluster logging on all nodes in the cluster. The cluster log is overwritten each time the cluster member is restarted. The cluster log file has a maximum size of 8MB. If the log file reaches this limit, cluster server will start overwriting the data in the file. The 8MB limit on the log file can be overridden by adding the value ClusterLogSize to the registry key HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\ClusSvc\Parameters. The ClusterLogSize parameter has a type of DWORD, and it should specify the maximum size for the log file.
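A quick way to confirm that the variable is visible is to inspect the environment from a script. The Python sketch below is an illustration only: by default it reads the environment of the current process, which approximates but is not guaranteed to match the system variables seen by the Cluster Service, and the sample path is hypothetical.

```python
import os


def cluster_log_status(environ=None):
    """Report whether CLUSTERLOG is defined in the given environment.

    `environ` is any mapping of variable names to values; it defaults
    to the environment of the current process.
    """
    env = os.environ if environ is None else environ
    path = env.get("CLUSTERLOG", "").strip()
    if not path:
        return "cluster logging disabled: CLUSTERLOG is not set"
    return "cluster logging enabled, log file: " + path


# Hypothetical path, for illustration only.
print(cluster_log_status({"CLUSTERLOG": r"C:\temp\cluster.log"}))
```

Run on each node, a check like this makes it easy to spot a cluster member on which logging was never enabled.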

Cluster Troubleshooting

Troubleshooting on a cluster requires familiarity with both the Windows NT and cluster configurations. Also, knowledge of any applications, such as SQL Server, that are integrated into the cluster is an absolute necessity. There is no way to list every possible problem you will encounter and the resolution to that problem, because every implementation will be slightly different. Also, as clusters become more widely implemented, you may encounter problems that have never occurred before. What follows are some checklists for resolving various problems, along with some of the standard issues and problems that have been prevalent with Cluster Server to date.

Windows NT Configuration Checklist

The following is a basic checklist of configuration rules to determine whether the target Windows NT computers can install and form a Windows NT cluster:

  • Windows NT Server, Enterprise Edition and Service Pack 3 must be installed on both nodes.

  • Both computers must be members of the same domain. Valid configurations include:

    • Both computers are member servers of a domain.

    • Both computers are backup domain controllers in a domain.

    • One computer is the primary domain controller, the other is a backup domain controller.

  • A computer can be a member of only one cluster.

  • Each member must have a common SCSI bus, and a disk not on the shared bus to store the operating system.

  • The shared SCSI device must be formatted NTFS.
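The domain membership rule in the checklist above can be expressed as a small check. The Python sketch below treats the three listed role combinations as the only valid ones, which is an assumption drawn from the wording of the checklist; the role labels "member", "bdc", and "pdc" and the domain names are invented for the example.

```python
# Role combinations named in the checklist, stored as unordered sets.
VALID_ROLE_PAIRS = {
    frozenset(["member"]),       # both are member servers
    frozenset(["bdc"]),          # both are backup domain controllers
    frozenset(["pdc", "bdc"]),   # one PDC and one BDC
}


def domain_roles_allow_cluster(node_a, node_b):
    """Check the domain rule for two prospective cluster members.

    Each node is a (domain, role) pair, where role is one of the
    hypothetical labels "member", "bdc", or "pdc".
    """
    domain_a, role_a = node_a
    domain_b, role_b = node_b
    if domain_a.lower() != domain_b.lower():
        return False  # the computers must be in the same domain
    return frozenset([role_a, role_b]) in VALID_ROLE_PAIRS


print(domain_roles_allow_cluster(("SALES", "pdc"), ("SALES", "bdc")))  # prints True
```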

Windows NT Procedure Checklist

Various tasks that can be performed on a standalone server without severely impacting the operating system can have very negative effects on the cluster. Some of these tasks include:

  • Repartitioning - If the partition scheme of a disk on the shared SCSI bus is changed, make sure both cluster members are rebooted to update their disk information.

  • Repartitioning - Make sure all disk resources are removed before repartitioning the target disk.

  • Computer Names - The name of a cluster member cannot be changed after installing the cluster server software. To change the computer name, the cluster server software must be removed before the name can be changed. Then, reinstall cluster server, joining the already existing cluster.

  • TCP/IP addresses - The TCP/IP address of an IP address resource should not be changed if a Network Name resource has the IP address as a dependency. The network name and IP address are automatically registered with WINS, and unexpected results might occur if the address is changed.

  • Do not modify logical drive letters after cluster server has been installed.

  • Make sure to re-apply Service Pack 3 whenever files are loaded from the Windows NT Server, Enterprise Edition distribution after the original operating system installation. If service packs are not reapplied, problems previously repaired by the service pack can reappear. A more serious potential problem is that an operating system component might cease to function totally. When in doubt, reapply the service pack.

Installation Problems

Installation problems are usually due to the shared SCSI bus configuration.

Cluster Server Installation Fails on First Node

If the cluster installation fails on the first node of the cluster, check the following:

  • Does the cluster name used already exist? A removed or aborted installation may have already registered the cluster name and IP address with a WINS server. Check the WINS database, and if there is an entry for the cluster name, remove it.

  • If the installation fails when the username and password for the Cluster Service are supplied, verify that:

    • the username and password are accurate

    • the account does not have the "User must change password at next logon" box checked

    • the account has the privilege to log on as a service

    • the account has administrative privilege.

  • If the installation fails to display any shared SCSI devices, either the SCSI bus and devices are not configured properly or the Windows NT system drive is on the shared SCSI bus. Cluster Server does not allow the bus that contains the disk on which the operating system is installed to be used as a shared bus.

Cluster Server Installation Fails on Second Node

If the Cluster Server installation fails on the second node, check the following:

  • Is the first node of the cluster running? Verify a successful installation of the first node by running Cluster Administrator.

  • Is the cluster name resource reachable? Open a command prompt on the second node and ping the cluster name and cluster IP address. If this fails, there is a network problem, possibly either with the TCP/IP address or subnet on one of the cluster members, or the cluster address and subnet mask.

  • Is there an outdated entry for the cluster name in the WINS database from a previous installation? If so, delete it.

SCSI device problems

Because the cluster server uses SCSI configurations in a very atypical manner, many problems arise in this area. Most problems appear at hardware configuration time.

SCSI Bus or SCSI Device Not Recognized

If problems occur while attempting to get the SCSI bus or a SCSI device recognized by the hardware, check the following:

  • If the entire bus is not functional, verify the following conditions:

    • Is the SCSI bus properly terminated?

    • Have SCSI cabling specifications, such as distance limitations, been violated?

    • Are both SCSI controllers on the shared bus exactly the same type? It is not guaranteed that two different SCSI controllers will support the shared bus. In fact, I have never been successful with that type of configuration.

    • No two devices, including controllers, can share the same SCSI ID.

    • The SCSI controllers in the cluster members that support the shared SCSI bus must be configured at SCSI IDs 6 and 7.

    • Is the SCSI controller recognized by Windows NT? Use Windows NT Diagnostics to verify that the SCSI controller is known by the operating system. If not, two possible problems are an IRQ conflict or a plug-and-play problem.

    • If there are multiple SCSI controllers in a cluster member, make sure that only the SCSI controller that contains the cluster member's boot disk has its BIOS enabled. If the BIOS is enabled on multiple controllers, very obscure errors can occur. For example, someone received the message, "Not enough disk space," when attempting to load Internet Information Server even though there was over 1GB free on the installation drive.

  • If a specific SCSI device is not functional, check the following conditions:

    • Does the SCSI device have power? External SCSI devices must be powered on before the operating system boots in order to detect their existence.

    • Verify that the SCSI ID does not conflict with another device on the bus. The SCSI IDs assigned to the various devices can usually be viewed with the software used to configure the SCSI controller.

    • If the messages, "Device not Ready," or, "Device timeout," appear after a long delay when the second cluster member is booting, disable the option on the SCSI controller to scan for SCSI devices. The first cluster member has taken control of the SCSI bus, and the second computer is attempting to detect the devices on the bus but is getting blocked by the first system. Disabling the SCSI device scan has no negative impacts on Windows NT or Cluster Server.
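Two of the checks above, unique SCSI IDs and controllers at IDs 6 and 7, lend themselves to a mechanical test once the IDs have been read from the controller configuration software. The Python sketch below assumes those IDs have been collected by hand; the function name and message texts are illustrative.

```python
from collections import Counter


def scsi_id_problems(controller_ids, device_ids):
    """Check the SCSI IDs on the shared bus against the rules above.

    controller_ids: IDs of the two cluster members' controllers.
    device_ids: IDs of the other devices on the shared bus.
    Returns a list of problem descriptions (empty if none are found).
    """
    problems = []
    if sorted(controller_ids) != [6, 7]:
        problems.append("controllers must use SCSI IDs 6 and 7")
    # Count every ID on the bus, controllers included.
    counts = Counter(list(controller_ids) + list(device_ids))
    for scsi_id, n in sorted(counts.items()):
        if n > 1:
            problems.append("SCSI ID %d is used by %d devices" % (scsi_id, n))
    return problems


print(scsi_id_problems([6, 7], [0, 1, 6]))
# prints ['SCSI ID 6 is used by 2 devices']
```

The check cannot detect termination or cabling faults; it only rules out the ID conflicts that are easiest to overlook.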

Cluster Member Connectivity Problems

Disks Do Not Fail Over Successfully

  • Verify that the SCSI bus is connected to both cluster members.

  • Run Disk Administrator on the cluster member the disk will not fail over to and check the configuration. The physical disk should appear in Disk Administrator with the message that configuration information is not available. This at least proves that Windows NT is aware of the disk's existence.

When the Cluster Server software is installed, it displays SCSI disks on all buses other than the system SCSI bus. There can be only one shared SCSI bus, so all other devices are considered to be local. The problem is that Cluster Server does not know which bus will be the shared bus, so it displays all options. The installation by default configures all SCSI devices found on any non-system SCSI bus as a shared disk. It is the responsibility of the administrator to remove devices from the Cluster Server configuration for all except one bus. If this is not done, resources could be defined on disks, and the disks will not fail over between cluster members because they are not on the shared SCSI bus.

Resources are sometimes intentionally configured on local disks, but the application the resource offers must be installed on local disks on both cluster members. For example, if both cluster members have a local drive D:, a resource can be created with its file location as the D: drive. In this case, the application must be installed on the local disk of each cluster member. The cluster concept is used to fail over the availability of the application and not data, since data is on a local disk and is unavailable to be moved between cluster members.

Quorum Resource Fails

If the device that holds the quorum resource fails and cannot be brought online, the Cluster Service will not start. It can be started with a special parameter that starts the Cluster Service without a quorum resource. Then the administrator can use the Cluster Administrator utility to select a new quorum resource. To correct a quorum resource failure, implement the following:

  1. Shut down one cluster member. Only one node should be running.

  2. Use the Services option from Control Panel to stop the Cluster Service if it is running.

  3. In the Startup Parameters box, enter "-fixquorum," then start the Cluster Service.

  4. Use the Cluster Administrator utility to modify the properties of the cluster and select a new quorum resource.

  5. Use the Services option in Control Panel to stop and restart the Cluster Service. This clears the fixquorum parameter that was passed. It is not necessary to clear anything from the Startup Parameters box, because anything entered is not saved.

  6. Reboot the second cluster member.

This works as long as there is more than one physical disk on the shared SCSI bus. The fixquorum parameter does not bring the quorum disk online. Therefore, it is not possible to move the quorum resource from one partition to another on the same disk, since the disk is offline.

Quorum Disk or Quorum Log is Corrupted

If the quorum disk or quorum log becomes corrupted, the cluster server software will attempt to correct the problem by resetting the log file. This can be determined by examining the Windows NT event log and looking for the message, "The log file quolog.log was found to be corrupt." The source of the message is the Cluster Service. If the quorum log cannot be reset, the Cluster Service will fail to start. If the Cluster Server software fails to determine that the quorum log is corrupt and starts, the message, "ERROR_CLUSTERLOG_CORRUPT," will be entered in the cluster log. To correct this problem, do the following:

  1. Use the Services option from Control Panel to stop the Cluster Service if it is started. Do this on both cluster members.

  2. On one node, enter "-noquorumlogging" in the Startup Parameters box for the Cluster Service and start the service. This starts the Cluster Server software without quorum logging, which means that the cluster files on the quorum disk will not be open.

  3. Run a disk repair utility, such as CHKDSK, against the quorum disk. If the disk shows errors, allow CHKDSK to fix them. If CHKDSK reports no errors, the quorum log itself is probably corrupted. Delete the file quolog.log and any temporary files from the MSCS directory on the quorum disk.

  4. Use the Services program to stop and restart the Cluster Service.

The only potential problem with the above procedure is that the quorum log stores cluster configuration changes until they can be communicated to all nodes. When the Cluster Service is configured to start without a quorum log, it is possible that recent configuration changes to the cluster could be lost. But since the quorum log is corrupted anyway, starting the cluster without a quorum log is the best solution.

Second Node Cannot Connect to Shared Devices

When the second cluster member is started, it establishes a connection to the shared devices. This can be verified by running the Disk Administrator utility on the second node. The shared SCSI devices should be included with the caption, "Configuration information is not available." If the shared disks fail to appear:

  • Verify that the drive letters assigned to the drives are the same on both cluster members.

  • Perform all the SCSI device and bus checks discussed previously.

Client-Cluster Connectivity Problems

All communications between clients and the cluster members will occur via TCP/IP. Most connection issues can usually be attributed to TCP/IP addressing or name resolution problems.

Client Cannot Connect to Virtual Servers

A virtual server consists of a TCP/IP address and a network name. If a client is having problems connecting to a virtual server:

  1. Attempt to ping the TCP/IP addresses of both cluster members and the cluster IP address. If the test fails, there is a network problem, possibly TCP/IP addressing.

  2. Attempt to ping the TCP/IP address associated with the IP address resource the virtual server uses. If this test fails, but step 1 is successful, there is a problem with the IP address resource. Check to see if it is online, and make sure that the address has not been changed.

  3. Attempt to ping the network name of the virtual server. If the client is on a different subnet from the cluster members, this will test name resolution mechanisms such as WINS and DNS. If the client on a remote subnet fails this test, verify that a name resolution mechanism is available and that an entry for the virtual server network name exists.

  4. If the client has problems accessing file shares, verify that the user has been granted access to the share and is not getting "Access denied" messages.
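The first three steps form a decision sequence in which each test narrows the fault. The Python sketch below encodes that sequence; the three booleans are assumed to be the results of the ping tests performed by hand, and the names and result strings are illustrative.

```python
def diagnose_virtual_server(members_and_cluster_ip_ok: bool,
                            resource_ip_ok: bool,
                            network_name_ok: bool) -> str:
    """Interpret the ping results from steps 1 to 3, in order."""
    if not members_and_cluster_ip_ok:
        return "network problem, possibly TCP/IP addressing"
    if not resource_ip_ok:
        return ("IP address resource problem: check that it is online "
                "and that the address has not been changed")
    if not network_name_ok:
        return ("name resolution problem: check WINS or DNS for an "
                "entry for the virtual server network name")
    return "connectivity is working: check share permissions (step 4)"


print(diagnose_virtual_server(True, False, False))
```

The order matters: a failure in step 1 makes the results of steps 2 and 3 meaningless, so the function does not look at them in that case.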

Clients Cannot Access a Group That has Failed Over

If a client is using a resource, and the resource fails over to the other cluster member, communications will be temporarily interrupted by the cluster transition. Depending on the application, the client may also need to reconnect manually. If the client cannot reconnect to the resource, verify that it is online. The cluster software has the capability to use two network adapters, one for client access and the other for cluster communications. It is possible that the network adapter used for client access on the second cluster member is misconfigured or not functional. The cluster will be able to fail over the resources on its private network segment, but the resources will be unavailable to clients.

Clients Cannot Access a File Share Resource

If a client cannot access a file share resource, consider the following:

  • File share resources use a virtual server. Perform the troubleshooting for virtual servers discussed earlier.

  • Verify that the user has access permissions to the share.

  • If the message, "Conflicting credentials," appears, the user is attempting to establish connections to the same server using different usernames and passwords. Windows NT does not support that feature.

Group and Resource Failure Problems

Group and resource failure problems occur when the resource and group failure mechanism does not function as expected.

A Resource Fails but is not Brought Back Online

When a resource fails, the cluster server will attempt to restart it unless:

  • The "Don't Restart" option is selected in the Advanced page of the resource properties.

  • A resource dependency is offline.

  • The resource has reached its failure threshold. A resource has a threshold defining how many failures to accept for the resource and when to restart it. If the threshold is reached, and the resource cannot be moved to another cluster member, the resource will go into a "Failed" state and must be brought online manually.

A Group Cannot be Brought Online

When a group is brought online, the Cluster Service attempts to bring all the resources in the group online. If one or more resources cannot be brought online, the group will have a warning symbol next to it denoting this fact. The resource failures must be examined individually. If none of the resources in a group can be brought online, verify access to the disk on the shared SCSI bus that the group uses. Perform all the SCSI device troubleshooting that is necessary.

A Group will not Move or Fail Over to Another Node

If a group will not automatically or manually move to another cluster member, consider the following questions:

  • Can the other node accept all the resources in the group? The cluster member must be configured as a possible owner of every resource in the group.

  • Do the properties of the resources have the "Affect the group" option selected? This option causes a failure of the resource to fail the group over to the other cluster member. Also check the threshold for the resources. The resource threshold defines how many times the resource should be restarted on the same node before it is failed over to another cluster member. There is also a group threshold value that defines how many total resource failures can occur before the group is failed over to another member. For example, assume a group with six resources. If each resource fails twice, no resource has reached the default threshold of three. However, the total of twelve failures does exceed the default group threshold of ten, and the group will be failed over to another cluster member.

  • Is the group failing over to another cluster member, then immediately failing back? The Cluster Server software allows groups to be returned to their preferred owner in the cluster, if one is defined.
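The threshold arithmetic in the six-resource example can be checked with a short sketch. It assumes every resource has "Affect the group" selected, that a resource threshold counts as reached when the failure count equals it, and that the group threshold must be exceeded, as the wording of the example suggests. The function name is hypothetical; the default values of three and ten are the ones quoted in the text.

```python
def group_fails_over(failures_per_resource, resource_threshold=3, group_threshold=10):
    """Return True if this simplified model expects the group to fail over.

    failures_per_resource: one failure count per resource in the group.
    """
    # A single resource reaching its own threshold is enough.
    resource_at_threshold = any(n >= resource_threshold for n in failures_per_resource)
    # Otherwise the total across all resources must exceed the group threshold.
    total_failures = sum(failures_per_resource)
    return resource_at_threshold or total_failures > group_threshold


# Six resources that each failed twice: no resource reaches three,
# but the total of twelve exceeds the group threshold of ten.
print(group_fails_over([2, 2, 2, 2, 2, 2]))
```

If a group stays on a node despite repeated failures, comparing the observed counts against both thresholds in this way shows which limit has not yet been hit.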

A Group Fails Over but will not Fail Back

If a group successfully fails over to another cluster member but does not automatically fail back to the original cluster member, consider the following:

  • Make sure that the Prevent Failback option is not selected in the group properties.

  • If failback is enabled, is it configured to occur only during specific hours of the day? If so, has that time occurred?

  • Are preferred owners defined for the group? The Cluster Server software will fail back groups to their preferred owners only.
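The three failback checks can be combined into one sketch. The function and parameter names are hypothetical, and the handling of the time window (whole hours from 0 to 23, start hour included, end hour excluded, possibly wrapping past midnight) is an assumption made for illustration.

```python
def failback_expected(prevent_failback, preferred_owners, original_node, hour, window=None):
    """Return True if this simplified model expects the group to fail back now.

    window: None for immediate failback, or a (start_hour, end_hour) tuple.
    """
    if prevent_failback:
        return False
    if original_node not in preferred_owners:
        return False  # groups fail back to preferred owners only
    if window is None:
        return True
    start, end = window
    if start <= end:
        return start <= hour < end
    # The window wraps past midnight, for example 22 to 4.
    return hour >= start or hour < end


# Failback allowed only between 22:00 and 04:00; it is currently 23:00.
print(failback_expected(False, ["NODE1"], "NODE1", hour=23, window=(22, 4)))
```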

A Group Fails Immediately When Brought Online

If a group fails immediately when it is brought online, one or more resources are failing to start, reaching their failure threshold, and affecting the group. Instead of bringing the group online, bring the resources online one at a time to determine which resource or resources are the cause of the problem.

General Cluster Issues

The Cluster Service will not Start

If the Cluster Service fails to start, it could be due to a problem with the account used by the Cluster Service. An easy method to verify that the account and password are valid is to log in with them. If the system rejects the login attempt, the password may have been changed and not updated in the service properties. Reset the password in both User Manager and the service properties. If the password is not the problem, make sure that the account has not been locked out by the operating system. Use the User Manager program to verify that the "Account locked out" option is cleared. If it is selected, clear it.

The next most common problem is that the password has expired. This is easily recognized during a login attempt. If Windows NT requests a password change during the login process, this will stop the service from starting, since the service has no capability to respond to the operating system's request. The problem can be resolved by again using the User Manager program. Make sure that the "User must change password at next logon" box is cleared and that the "Password never expires" box is checked for the service account.

The Message, "RPC Server is Unavailable," is Displayed

This message can occur when Cluster Administrator is used to connect to a cluster. Two possible causes are listed below.

  • Has the system just completed rebooting? If so, the Cluster Server software probably has not started yet. Wait a minute or two and try the Cluster Administrator utility again.

  • Attempt to connect to the cluster by TCP/IP address instead of the cluster name when entering Cluster Administrator. If the connection by TCP/IP address is successful, the problem may be in the WINS or DNS databases. Verify that there are no invalid entries for the cluster in the WINS and DNS servers.

Cluster Administrator Fails to Connect to a Node

If the Cluster Administrator utility cannot establish a connection to a node:

  • Make sure that the Cluster Service and RPC Service are both started.

  • Attempt to connect by TCP/IP address. If this succeeds, name resolution is not working.
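The connect-by-address test used in this section and the previous one is a divide-and-conquer step: it separates name resolution from everything else. A minimal sketch of that reasoning follows. The function names and result strings are hypothetical, and resolve_ipv4 only reports what the local resolver returns; it does not query WINS directly.

```python
import socket


def resolve_ipv4(name):
    """Return the IPv4 addresses the local resolver gives for name (empty list on failure)."""
    try:
        infos = socket.getaddrinfo(name, None, socket.AF_INET)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def diagnose(connect_by_name_ok, connect_by_ip_ok):
    """Interpret the two connection attempts described in the text."""
    if connect_by_name_ok:
        return "no connection problem"
    if connect_by_ip_ok:
        return "name resolution problem: check the WINS and DNS entries"
    return "not only name resolution: check that the Cluster Service and RPC Service are started"


# The name fails but the address works, so name resolution is the suspect.
print(diagnose(connect_by_name_ok=False, connect_by_ip_ok=True))
```

Comparing the output of resolve_ipv4 for the cluster name with the address that is known to work can reveal a stale or incorrect entry.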

A Running Application Cannot be Closed

If a Windows application is configured as a generic application resource in the cluster, when the resource is brought online, the application will open on the desktop. If the application is closed on the desktop, the Cluster Server software will automatically restart it and the application will reappear on the desktop. To properly close the application, use the Cluster Administrator utility and take the resource offline.

Troubleshooting by Resource Type

This section provides various tests to perform when troubleshooting a specific resource type, such as a file share resource.

Troubleshooting a Physical Disk Resource

If one or more of the cluster members will not recognize a physical disk resource or bring the disk resource online, check the following items:

  • The disk on the shared SCSI bus should not be repartitioned if the cluster has disk resources referencing the physical disk. To repartition a disk, first remove any disk resources for the disk in Cluster Administrator. This could require that the quorum device be relocated if the disk to be repartitioned currently is the quorum resource.

  • If the disks have been repartitioned, both cluster members must be rebooted to recognize the changes.

  • Make sure drive letters for the disks on the shared SCSI bus are consistent on all cluster members.

  • The Cluster Server software stores disk signatures for the disks on the shared SCSI bus in the registry. For this reason, it is not possible to restore a backup of a Windows NT system running Cluster Server to another computer. The disk signatures will not match and the Cluster Server software cannot access the devices on the shared bus. The new cluster member will need Windows NT and Cluster Server installed. Then any applications which the Cluster Server will offer as resources can be restored.

  • When the second server in a cluster boots, registry information from the existing cluster member is written to the registry of the joining cluster member. This may include updated disk signature information. The registry information should update successfully within 60-90 seconds. If one or two disk signature error messages have been logged, but the cluster is functioning properly, this is probably the cause of the message.
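The drive letter consistency check in the list above amounts to comparing two mappings. The sketch below assumes the drive letter assignments have already been collected by hand from each cluster member and keyed by some disk identifier; the function name and the sample identifiers are hypothetical.

```python
def drive_letter_mismatches(node_a, node_b):
    """Compare two {disk_id: drive_letter} mappings and return the disks that differ.

    The result maps each differing disk to (letter on node A, letter on node B);
    None means the node has no entry for that disk.
    """
    mismatches = {}
    for disk_id in sorted(set(node_a) | set(node_b)):
        letter_a = node_a.get(disk_id)
        letter_b = node_b.get(disk_id)
        if letter_a != letter_b:
            mismatches[disk_id] = (letter_a, letter_b)
    return mismatches


# Hypothetical data: the second shared disk is lettered differently on each node.
node1 = {"shared-disk-1": "Q:", "shared-disk-2": "R:"}
node2 = {"shared-disk-1": "Q:", "shared-disk-2": "S:"}
print(drive_letter_mismatches(node1, node2))
```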

Troubleshooting an IP Address Resource

Although TCP/IP networking can be very complex, an IP address resource is fairly simple because it has no dependencies and the data, which consists of a TCP/IP address and subnet mask, is easy to troubleshoot using standard TCP/IP testing procedures.

The most common problem with IP address resources is misconfigured data for the IP address resource. This can be either the TCP/IP address or the subnet mask. Verify that the subnet mask is correct and that the TCP/IP address is in the proper subnet. If necessary, reconfirm the data with the network administrator, or whoever is responsible for handing out TCP/IP addresses. The one test that can be used for an IP address resource is the ping utility. Use it to test access to the IP address resource from a computer on the same subnet and also from a computer on a remote subnet. If the local test is successful, but the remote test fails, the resource is probably configured with an invalid subnet, assuming the physical network is functional. The Cluster Server software does not complain if incorrect addresses and subnet masks are configured for the IP address resource. In fact, an IP address resource with an address for an entirely different subnet can be configured and brought online successfully. It would be nice if the software compared the IP address settings against the TCP/IP configuration at the operating system level and warned of any discrepancies.
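The comparison that the software does not make can be done by hand or with a few lines of code. The following sketch uses Python's standard ipaddress module to compare the settings of an IP address resource against the address and subnet mask of the node's client-facing adapter. The function name and the sample addresses are hypothetical.

```python
import ipaddress


def check_ip_resource(resource_ip, resource_mask, node_ip, node_mask):
    """Return a list of discrepancies between the resource and the node's adapter."""
    node_net = ipaddress.ip_network(f"{node_ip}/{node_mask}", strict=False)
    resource_net = ipaddress.ip_network(f"{resource_ip}/{resource_mask}", strict=False)
    problems = []
    if resource_net.netmask != node_net.netmask:
        problems.append(
            f"subnet mask {resource_net.netmask} differs from the node's {node_net.netmask}"
        )
    if ipaddress.ip_address(resource_ip) not in node_net:
        problems.append(f"address {resource_ip} is not in the node's subnet {node_net}")
    return problems


# Hypothetical values: the resource address belongs to an entirely different subnet.
print(check_ip_resource("10.0.0.50", "255.255.255.0", "192.168.1.10", "255.255.255.0"))
```

An empty list means the resource settings agree with the adapter; it does not prove that the address is unused or routable, so the ping tests are still needed.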

Troubleshooting a Network Name Resource

To troubleshoot a network name resource, check the following items:

  • Network name resources are used as NetBIOS names and host names. They have a dependency on an IP address resource, so the first check should be that the IP address resource is online.

  • If there is no noticeable problem with the IP address resource, try to ping the network name. If this is successful, TCP/IP is functioning properly. A delay of approximately 60-90 seconds before the ping test is successful shows that the system initiating the test is configured to use a DNS server, and there is no entry in the DNS database for the network name resource. Although there may be entries in a WINS server for the network name, DNS is checked before WINS when a ping test is issued. This can be confusing, because if the "net view \\network_name" command is used to test the network name, it may respond more rapidly because the "net view" command uses NetBIOS name resolution, which does not use DNS until last in its name resolution sequence.

  • If network name resources are constantly created and deleted (perhaps this is a test cluster), another potential problem exists. The cluster server software automatically registers network name resources with the WINS server configured for the cluster member. If the WINS server then replicates its database to other WINS servers, and subsequently the network name is deleted or modified by the cluster administrator, some WINS servers will have wrong information in their databases regarding the network name. Always check the WINS server that is used by the system that is experiencing problems with the network name. If necessary, delete the WINS database entry or force a WINS database replication to occur.
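The resolution delay described above can be measured rather than estimated. The sketch below times a host name lookup through the local resolver; the function name is hypothetical, and the resolver parameter exists only so that a different lookup function can be substituted. It measures host name resolution as a whole and cannot tell which step (DNS, WINS, or another method) answered.

```python
import socket
import time


def timed_lookup(name, resolver=socket.gethostbyname):
    """Resolve name and return (address or None, elapsed seconds)."""
    start = time.monotonic()
    try:
        address = resolver(name)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        address = None
    elapsed = time.monotonic() - start
    return address, elapsed


# Demonstration with a stand-in resolver so the example does not need a network.
address, elapsed = timed_lookup("network_name", resolver=lambda name: "192.168.1.50")
print(address, round(elapsed, 3))
```

A lookup that succeeds only after a long pause is consistent with the missing DNS entry described in the text.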

Troubleshooting a File Share Resource

To research a file share resource problem, check the following items:

  • File share resources have dependencies on a network name and physical disk resource, so the first step in troubleshooting is to check the functionality of these two resources. Make sure both dependent resources are online.

  • If the problem occurs when bringing the file share resource online, verify that the directory exists. Also check the local file security if the directory is on an NTFS formatted partition. If there is no access to the directory with NTFS security, the Cluster Service cannot bring the resource online.

  • If users are encountering problems, such as saving or writing to files, the problem may also be at the NTFS permission level. Even if the file share resource is created with the proper user access, the permissions at the NTFS level can possibly restrict access further. The actual access which users will have to file share resources will be the most restrictive permissions granted to the file share resource and to the files and directories via NTFS.
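The "most restrictive" rule in the last item can be pictured as a set intersection. This is a deliberately simplified model: it ignores explicit deny entries and group membership, and the permission names are plain labels chosen for the example.

```python
def effective_access(share_permissions, ntfs_permissions):
    """Return the access a user actually gets: only what both levels allow."""
    return set(share_permissions) & set(ntfs_permissions)


# The share grants read and write, but NTFS grants only read.
print(effective_access({"read", "write"}, {"read"}))
```

In this example the user can open files through the share but cannot save changes, which matches the symptom described above.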

Troubleshooting a Generic Service Resource

A generic service is a fairly simple resource. The complicated work has already been done in making the program run as a service. If a generic service resource is not functioning properly, check the following items:

  • Is the generic service attempting to run a service that does not support running in a cluster such as DNS, DHCP or WINS? These services cannot be configured as a generic service resource.

  • If the service logs in with a specific account, manually attempt to log in with the account to make sure that the password has not been changed and has not expired.

  • If the generic service resource functions properly on one cluster member, but fails on the other, does the service require information from the registry that may not be getting replicated properly?

Troubleshooting a Generic Application Resource

To troubleshoot a generic application resource, check the following items:

  • A generic application resource does not require any dependent resources, but there is a good chance that it has a dependency on a physical disk resource. Make sure the disk resource is functioning properly.

  • If the application works on one cluster member, but not on another, check to see if the application stores information in the registry. If it does, check the properties of the resource to verify that registry replication has been configured.

  • Since virtually any program can be installed as a generic application resource, this introduces the possibility of configuring a malfunctioning program to run as a cluster resource. Run the program interactively and observe its behavior. Does it open a window? Does it end in error? Does it run to completion and end? If the answer is "yes" to any of these questions, check the following when troubleshooting the resource:

  • If the program is a Windows application, the checkbox, "Allow application to interact with desktop" must be checked. If the box is not selected, and the application is a Windows application, it does not fail when brought online. It will be running in the background, not having been able to open a window.

  • If the application resource continues to restart and eventually goes into a failed state, it could be one of two problems. The cluster server software will restart any resource that fails. If the application ends with an error, this is considered a failure and the application is restarted. If the application ends normally, this is also considered a resource failure and the application is restarted. In either case, the application will eventually reach its failure threshold and either be failed over to another cluster member or placed into a failed state.
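The advice to run the program and observe how it ends can be partly automated. The sketch below starts a command, watches it for a short time, and reports whether it is still running, ended normally, or ended in error. The function name and the watch period are hypothetical, and treating exit code 0 as a normal end is an assumption that holds for most programs but not all.

```python
import subprocess
import sys


def observe(command, watch_seconds=2.0):
    """Start command and report how it behaves within watch_seconds."""
    process = subprocess.Popen(command)
    try:
        exit_code = process.wait(timeout=watch_seconds)
    except subprocess.TimeoutExpired:
        # Still running after the watch period: stop it and report that.
        process.kill()
        process.wait()
        return "still running"
    if exit_code == 0:
        return "ended normally"
    return f"ended in error (exit code {exit_code})"


# A stand-in program that exits immediately with an error code.
print(observe([sys.executable, "-c", "import sys; sys.exit(3)"], watch_seconds=10.0))
```

Because the cluster treats both a normal end and an error as a failure, only a program that reports "still running" is a good candidate for a generic application resource.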

Troubleshooting a Print Spooler Resource

To troubleshoot a print spooler resource, check the following items:

  • A print spooler resource is dependent on a physical disk and network name resource. Verify that these dependent resources are functioning properly, in the same group as the print spooler resource, and are online.

  • Make sure that access to the disk and directory used by the print spooler has not been restricted through NTFS permissions. Also check that the disk that contains the spool directory is not full. This will cause print jobs to hang.

  • Check the LPR port mapping for the print device in question. The LPR port must be created for each cluster member. If the print spooler works from one cluster member, but not the other, this could be the problem.

  • The printer driver must be manually loaded on each cluster member. If the print spooler functions properly on only one of the cluster members, this could also be the problem.
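The free space check on the spool disk can be scripted. The sketch below uses Python's standard shutil.disk_usage call; the function name, the example path, and the 50 MB minimum are hypothetical values chosen for illustration.

```python
import shutil


def spool_space_ok(spool_dir, min_free_bytes=50 * 1024 * 1024):
    """Return (True or False, free bytes) for the disk that holds spool_dir."""
    usage = shutil.disk_usage(spool_dir)
    return usage.free >= min_free_bytes, usage.free


# Check the disk holding the current directory; substitute the spool directory path.
ok, free_bytes = spool_space_ok(".")
print(ok, free_bytes)
```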

Troubleshooting an IIS Virtual Root Resource

To troubleshoot an IIS virtual root resource, check the following items:

  • An IIS virtual root resource has a dependency on an IP address resource. Verify that the IP address resource is functioning properly.

  • If the IIS resource does not work with a domain name, such as www.ucicorp.com, does it work by using the TCP/IP address? If so, the problem is with name resolution and DNS.

  • If the resource is a WWW or FTP virtual root, the allowed access can be Read or Execute. The granted access must be Execute in order for a client to run a program in the directory.

  • If the resource functions properly on only one of the cluster members, verify that the directory used is on a shared disk. If it is a local disk, the directory must exist on both cluster members.

Troubleshooting an SQL Server Resource

To troubleshoot an SQL Server resource, check the following items:

  • Clustering support for SQL Server uses a network name and disk resource as dependencies. Verify that they are functioning and online.

  • When clustering support for SQL Server is installed, it replaces the standard SQL services of MSSQL Server and SQL Executive. Use the Services option from Control Panel to determine whether the original SQL services have been started, and stop them if necessary. The proper method to start and stop a clustered SQL Server is to bring the SQL associated resources online or take them offline.

  • If the SQL virtual server resource functions properly on one of the cluster members, but not another, make sure that the username and password used for the SQL Server service account are identical on both nodes.

Troubleshooting a Distributed Transaction Coordinator Resource

To troubleshoot a distributed transaction coordinator resource, check the following items:

  • Verify that the problem is not with the Transaction Server software or with the database server, such as SQL Server, that the transaction coordinator is accessing.

  • A distributed transaction coordinator resource has dependencies on a disk and a network name resource. Verify that both dependent resources are functioning properly.

Troubleshooting a Message Queue Server Resource

To troubleshoot a message queue server resource, check the following items:

  • Verify that the problem is not with the Message Queue Server software.

  • A message queue server resource has a dependency on a disk and a network name resource. Verify that both of these resources are functioning properly.

  • The message queue server software has various settings that can stall the message queue. This is specific to the software and is not a cluster issue.

Copyright © Prentice Hall, Inc. 1999. All rights reserved.

We at Microsoft Corporation hope that the information in this work is valuable to you. Your use of the information contained in this work, however, is at your sole risk. All information in this work is provided "as-is", without any warranty, whether express or implied, of its accuracy, completeness, fitness for a particular purpose, title or non-infringement, and none of the third-party products or information mentioned in the work are authored, recommended, supported or guaranteed by Microsoft Corporation. Microsoft Corporation shall not be liable for any damages you may sustain by using this information, whether direct, indirect, special, incidental or consequential, even if it has been advised of the possibility of such damages. All prices for products mentioned in this document are subject to change without notice.

International rights = English only.