MS Cluster Server Troubleshooting and Maintenance


Published: May 6, 1999

By Martin Lucas, Microsoft Premier Enterprise Support

On This Page

Abstract
Introduction
Chapter 1: Preinstallation
Chapter 2: Installation Problems
Chapter 3: Post-Installation Problems
Chapter 4: Administrative Issues
Chapter 5: Troubleshooting the shared SCSI bus
Chapter 6: Client Connectivity Problems
Chapter 7: Maintenance
Appendix A: MSCS Event Messages
Appendix B: Using and Reading the Cluster Logfile
Appendix C: Command-Line Administration
For More Information

Abstract

This white paper details troubleshooting and maintenance techniques for Microsoft® Cluster Server version 1.0. Because cluster configurations vary, this document discusses techniques in general terms. Many of these techniques can be applied to different configurations and conditions.

Introduction

This white paper discusses troubleshooting and maintenance techniques for the first implementation of Microsoft® Cluster Server (MSCS) version 1.0. The initial release of the product supports a maximum of two servers in a cluster, often referred to as nodes. Because a cluster can manage many different types of resources, it can be difficult at times for an administrator to determine which component or resource is causing failures. In many cases, MSCS can automatically detect and recover from server or application failures. However, in some cases, it may be necessary to troubleshoot attached resources or applications.

Clustering and Microsoft Cluster Server (MSCS)

The term clustering has been used for many years within the computing industry. Clustering is familiar to many users, but it has a reputation for complexity based on earlier implementations that were large, complex, and sometimes difficult to configure. Earlier clusters were a challenge to maintain without extensive training and an experienced administrator.

Microsoft has extended the capabilities of the Microsoft® Windows NT® Server operating system through the Enterprise Edition. Microsoft® Windows NT® Server, Enterprise Edition, contains Microsoft Cluster Server (MSCS). MSCS adds clustering capabilities to Windows NT, to achieve high availability, easier manageability, and greater scalability.

Chapter 1: Preinstallation

MSCS Hardware Compatibility List (HCL)

[Screen shot: MSCS Setup screen noting the importance of using certified cluster hardware]

The screen shot above, taken from the installation process, notes the importance of using certified hardware for clusters. MSCS uses industry-standard hardware, which allows hardware to be easily added or replaced as needed. Supported configurations use only hardware validated with the MSCS Cluster Hardware Compatibility Test (HCT). These tests go well beyond the standard compatibility testing for Microsoft Windows NT and are quite intensive. Microsoft supports MSCS only when it is used on a validated cluster configuration, and validation is available only for complete configurations as tested together. The MSCS HCL is available on the Microsoft Web site at: https://support.microsoft.com/kb/131900.

Configuring the Hardware

The MSCS installation process relies heavily on properly configured hardware. Therefore, it is important that you configure and test each device before you run the MSCS installation program. A typical cluster configuration consists of two servers, each with two network adapters and local storage, and one or more shared SCSI buses with one or more disks. While it is possible to configure a cluster using only one network adapter in each server, you are strongly encouraged to have a second, isolated network for cluster communications. For clusters to be certified, they must have at least one isolated network for cluster communications. The cluster may also be configured to use the primary, non-isolated network for cluster communications if the isolated network fails. The cluster nodes must communicate with each other on a time-critical basis. Communication between nodes is sometimes referred to as the heartbeat. Because it is important that the heartbeat packets be sent and received in a timely manner, only PCI-based network adapters should be used, because the PCI bus has the highest priority.

[Figure 1: Diagram of a typical two-node cluster configuration with a shared SCSI bus]

The shared SCSI bus consists of a compatible PCI SCSI adapter in each server, with both systems connected to the same SCSI bus. One SCSI host adapter uses the default ID 7, and the other uses ID 6. This ensures that the host adapters have the highest priority on the SCSI bus. The bus is referred to as the shared SCSI bus because both systems are attached to one or more disk devices on the bus, while only one system at a time has access to a given device. MSCS controls exclusive access to each device through the reserve and release commands in the SCSI specification.

Other storage subsystems may be available from system vendors as an alternative to SCSI, which, in some cases, may offer additional speed or flexibility. Some of these storage types may require installation procedures other than those specified in the Microsoft Cluster Server Administrator's Guide. These storage types may also require special drivers or resource DLLs as provided by the manufacturer. If the manufacturer provides installation procedures for Microsoft Cluster Server, use those procedures instead of the generic installation directions provided in the Administrator's Guide.

Installing the Operating System

Before you install Microsoft Windows NT Server, Enterprise Edition, you must decide what role each computer will have in the domain. As the Administrator's Guide indicates, you may install MSCS as a member server or as a domain controller. The following information focuses on performance issues with each configuration:

  • The member server role for each cluster node is a viable solution, but may have a few drawbacks. While member servers do not incur the overhead of performing authentication for other systems in the domain, this configuration remains vulnerable to loss of communication with domain controllers on the network. Node-to-node communications and various registry operations within the cluster require authentication from the domain, and the need for authentication can arise at any time during normal operations. Member servers rely on domain controllers elsewhere on the network for this type of authentication. Lack of connectivity with a domain controller may severely affect performance, and may also cause one or more cluster nodes to stop responding until connection with a domain controller has been re-established. In a worst case scenario, loss of network connectivity with domain controllers may cause complete failure of the cluster.

  • The primary domain controller to backup domain controller (PDC to BDC) configuration is a better alternative than the member server option, because it removes the need for the cluster node to be authenticated by an external source. If an activity requires authentication, either of the nodes can supply it. Thus, authentication is not a failure point as it is in the member server configuration. However, primary domain controllers may require special configuration in a multihomed environment. Additionally, the domain overhead may not be well distributed in this model because one node may have more domain activity than the other one.

  • The BDC to BDC configuration is the most favorable, because it provides authentication regardless of public network status, and the overhead associated with domain activities is balanced between the nodes. Additionally, BDCs are easier to configure in a multihomed environment.

Configuring Network Adapters

In a typical MSCS installation, each server in the cluster (referred to as a node) has at least two network adapters: one adapter configured on the public network for client connections, the other for private communications between cluster nodes. This second interface is called the cluster interconnect. If the cluster interconnect fails, MSCS (if so configured) will automatically attempt to use the public network for communication between cluster nodes. In many two-node installations, the private network uses a crossover cable or an isolated segment. It is important to restrict traffic on this interface to cluster communications only. Additionally, each server should use PCI network adapters. ISA, PCMCIA, or other bus architecture network adapters may compete for the CPU's attention with other, faster PCI devices in the system, and the delays they introduce may cause premature failover of cluster resources. Complete certified systems will likely not have these types of adapters. Keep this in mind if you decide to add adapters to the configuration.

Follow standard Windows NT configuration guidelines for network adapter configuration. For example, each network adapter must have an IP address that is on a different network or subnet. Do not use the same IP address for both network adapters, even though they are connected to two distinctly different physical networks. Each adapter must have a different address, and the addresses cannot be on the same network. Consider the table of addresses in Figure 2 below.

Adapter 1 (Public Network)    Adapter 2 (Private Network)    Valid Combination?
192.168.0.1                   192.168.0.1                    NO
192.168.0.1                   192.168.0.2                    NO
192.168.0.1                   192.168.1.1                    YES
192.168.0.1                   10.0.0.1                       YES

Figure 2

In fact, because the private network is isolated, you can use almost any matching IP address combination you like for this network. If you want to, you can use addresses that the Internet Assigned Numbers Authority (IANA) designates for private use. The private use address ranges are noted in Figure 3.

Address Class    Starting Address    Ending Address
Class A          10.0.0.0            10.255.255.255
Class B          172.16.0.0          172.31.255.255
Class C          192.168.0.0         192.168.255.255

Figure 3

The first and last addresses in each range are designated as the network and broadcast addresses. For example, in the reserved Class C range, the actual range for host addresses is 192.168.0.1 through 192.168.255.254. Use 192.168.0.1 and 192.168.0.2 to keep it simple, because you'll have only two adapters on this isolated network. Do not declare default gateway and WINS server addresses for this network. You may need to consult with your network administrator on use of these addresses, in the event that they are already in use within your enterprise.

When you've obtained the proper addresses for network adapters in each system, use the Network utility in Control Panel to set these options. Use the PING utility from the command prompt to check each network adapter for connectivity with the loopback address (127.0.0.1), the card's own IP address, and the IP address of another system. Before you attempt to install MSCS, make sure that each adapter works properly and can communicate properly on each network. You will find more information on network adapter configuration in the Windows NT Online documentation, the Windows NT Server 4.0 Resource Kit, or in the Microsoft Knowledge Base.
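
For example, from a command prompt on one node, the checks might look like the following, using the sample private-network addresses from Figure 2 (substitute the addresses assigned in your configuration). The first command tests the loopback address, the second the adapter's own address, and the third the matching adapter in the other node:

  rem Test the TCP/IP stack itself
  ping 127.0.0.1
  rem Test this node's private network adapter
  ping 192.168.0.1
  rem Test the private network adapter in the other node
  ping 192.168.0.2

Repeat the same checks for the public network adapters, using their assigned addresses, from each node.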

The following are related Microsoft Knowledge Base articles regarding network adapter configuration, TCP/IP configuration, and related troubleshooting:

164015   Understanding TCP/IP Addressing and Subnetting Basics
102908   How to Troubleshoot TCP/IP Connectivity with Windows NT
151280   TCP/IP Does Not Function After Adding a Second Adapter
174812   Effects of Using Autodetect Setting on Cluster NIC
175767   Expected Behavior of Multiple Adapters on Same Network
170771   Cluster May Fail If IP Address Used from DHCP Server
168567   Clustering Information on IP Address Failover
193890   Recommended WINS Configuration for MSCS
217199   Static WINS Entries Cause the Network Name to Go Offline
201616   Network Card Detection in Microsoft Cluster Server

Configuring the Shared SCSI Bus

In a normal configuration with a single server, the server has a SCSI host adapter that connects directly to one or more SCSI devices, and each end of the SCSI bus has a bus terminator. The terminators help stabilize the signals on the bus and help ensure high-speed data transmission. They also help eliminate line noise.

Configuring Host Adapters

The shared SCSI bus, as used in a Microsoft cluster, differs from most common SCSI implementations in one way: the shared SCSI bus uses two SCSI host adapters. Each cluster node has a separate SCSI host adapter for shared access to this bus, in addition to the other disk controllers that the server uses for local storage (or the operating system). As the SCSI specification requires, each device on the bus must have a different ID number. Therefore, the ID of one of these host adapters must be changed. Typically, this means that one host adapter uses the default ID of 7, while the other adapter uses ID 6.

Note: It is important to use IDs 6 and 7 for the host adapters on the shared bus so that they have priority over other connected devices on the same channel. A cluster may have more than one shared SCSI bus as needed for additional shared storage.

SCSI Cables

SCSI bus failures can be the result of poor-quality cables. Inexpensive cables may be attractive because of the low price, but may not be worth the headache associated with them. An easy way to compare a cheap cable with an expensive one is to hold a cable in each hand, about 10 inches from the connector, and observe the arc of each cable. Higher quality cables do not bend nearly as easily, because they use better shielding and may use a different gauge of wire. If you use the less expensive cables, you may spend more supporting them than it would cost to buy the better quality cables in the first place. This shouldn't be much of a concern for complete systems purchased from a hardware vendor; these certified systems likely have matched cable sets. In the event you ever need to replace one of these cables, consult your hardware vendor.

Some configurations may use standard SCSI cables, while others may use Y cables (or adapters). The Y cables are recommended for the shared SCSI bus. These cables allow bus termination at each end, independent of the host adapters. Some adapters do not continue to provide bus termination when turned off, and also cannot maintain bus termination if they are disconnected for maintenance. Y cables avoid these points of failure and help achieve high availability.

Even with high quality cables, it is important to consider total cable length. Transfer rate, the number of connected SCSI devices, cable quality, and termination may influence the total allowable cable length for the SCSI bus. While it is common knowledge that a standard SCSI bus running at a transfer rate of 5 MB per second may have a maximum total cable length of approximately 6 meters, the maximum length decreases as the transfer rate increases. Most SCSI devices on the market today achieve much higher transfer rates and demand a shorter total cable length. Some manufacturers of complete systems that are certified for MSCS may use differential SCSI with a maximum total cable length of 25 meters. Consider these implications when adding devices to an existing bus or certified system. In some cases, it may be necessary to install another shared SCSI bus.

SCSI Termination

Microsoft recommends active termination for each end of the shared SCSI bus. Passive terminators may not reliably maintain adequate termination under certain conditions. Be sure to have an active terminator at each end of the shared SCSI bus. A SCSI bus has two ends and must have termination on each end. For best results, do not rely on automatic termination provided by host adapters or newer SCSI devices. Avoid duplicate termination and avoid placing termination in the middle of the bus.

Drives, Partitions, and File Systems

Whether you use individual SCSI disk drives on the shared bus, shared hardware RAID arrays, or a combination of both, each disk or logical drive on the shared bus needs to be partitioned and formatted before you install MSCS. The Microsoft Cluster Server Administrator's Guide covers the necessary steps to perform this procedure. In most cases, a drive contains only one partition. Some RAID controllers can present arrays as multiple logical drives or as a single large partition. Rather than a single large partition, you will probably prefer to have a few logical drives for your data: one drive or disk for each group of resources, with one drive designated as the quorum disk.

If you partition drives at the operating system level into multiple partitions, remember that all partitions on a shared disk move together from one node to another. Physical drives are exclusively owned by one node at a time, so all partitions on a shared disk are owned by that node. If you transfer ownership of a drive to another node through MSCS, the partitions move in tandem and cannot be split between nodes. Any partitions on shared drives must be formatted with the NTFS file system, and must not be members of any software-based fault tolerant sets.
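
As a brief sketch of this step, after you create the partition with Disk Administrator on one node (with the other node turned off), the partition can be formatted from a command prompt; the drive letter Y: and the volume label are placeholders:

  rem Format the shared partition with NTFS and assign a descriptive volume label
  format Y: /FS:NTFS /V:ClusterData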

CD-ROM Drives and Tape Drives

Do not connect CD-ROM drives, tape drives, or other devices that are not physical disks to the shared SCSI bus. MSCS version 1.0 supports only non-removable physical disk drives that are listed on the MSCS HCL. The cluster disk driver may or may not recognize other device types. If you attach unsupported devices to the shared bus, they may appear usable to the Windows NT operating system. However, because of SCSI bus arbitration between the two systems and the use of SCSI resets, these devices may experience problems if attached to the shared SCSI bus, and they may also create problems for other devices on the bus. For best results, attach noncluster devices to a separate controller not used by the cluster.

Preinstallation Checklist

Before you install MSCS, there are several items to check to help ensure proper operation and configuration. After proper configuration and testing, most installations of MSCS should complete without error. The following checklist is fairly general. It may not include all possible system options that you need to evaluate before installation:

  • Use only certified hardware as listed on the MSCS Hardware Compatibility List (HCL).

  • Determine which role these servers will play in the domain. Will each server be a domain controller or a member server? Recommended role: backup domain controller (BDC).

  • Install Microsoft Windows NT Server, Enterprise Edition, on both servers.

  • Install Service Pack 3 on each server.

  • Verify cables and termination of the shared SCSI bus.

  • Check drive letter assignment and NTFS formatting of shared drives with only one server turned on at a time.

  • If both systems have ever been allowed to access drives on the shared bus at the same time (without MSCS installed), the drives must be repartitioned and reformatted prior to the next installation. Failure to do so may result in unexpected file system corruption.

  • Ensure that only physical disks or hardware RAID arrays are attached to the shared SCSI bus.

  • Make sure that disks on the shared SCSI bus are not members of any software fault tolerance sets.

  • Check network connectivity with the primary network adapters on each system.

  • Evaluate network connectivity on any secondary network adapters that may be used for private cluster communications.

  • Ensure that the system and application event logs are free of errors and warnings.

  • Make sure that each server is a member of the same domain, and that you have administrative rights to each server.

  • Ensure that each server has a properly sized pagefile and that the paging files reside only on local disks. Do not place pagefiles on any drives attached to the shared SCSI bus.

  • Determine what name you will use for the cluster. This name will be used for administrative purposes within the cluster and must not conflict with any existing names on the network (computer, server, printer, domain, and so forth). This is not a network name for clients to attach to.

  • Obtain a static IP address and subnet mask for the cluster. This address will be associated with the cluster name. You may need additional IP addresses later for groups of resources (virtual servers) within the cluster.

  • Set multi-speed network adapters to a specific speed. Do not use the autodetect setting if available. For more information, see the Microsoft Knowledge Base article 174812.

  • Decide the name of the folder and location for cluster files to be stored on each server. The default location is %WinDir%\Cluster, where %WinDir% is your Windows NT folder.

  • Determine what account the Cluster Service (ClusSvc) will run under. If you need to create a new account for this purpose, do so before installation. Make the domain account a member of the local Administrators group. Though the Domain Admins group may be a member of the Administrators group, this is not sufficient; the account must be a direct member of the Administrators group. Do not place any password restrictions on the account. Also ensure that the account has the Log on as a service and Lock pages in memory rights (one way to grant these rights from a command prompt is sketched after this list).
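
As one hedged example, if the Windows NT Server 4.0 Resource Kit is available, the Ntrights.exe utility can grant these rights from a command prompt; the account name below is only a placeholder, and the same rights can be assigned through User Manager for Domains:

  rem Grant the Log on as a service right to the cluster service account
  ntrights +r SeServiceLogonRight -u MYDOMAIN\ClusSvcAcct
  rem Grant the Lock pages in memory right to the same account
  ntrights +r SeLockMemoryPrivilege -u MYDOMAIN\ClusSvcAcct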

Installation on systems using custom disk hardware

If your hardware uses controllers other than standard SCSI controllers and requires special drivers and custom resource types, use the software and installation instructions provided by the manufacturer. The standard MSCS installation procedure will fail on these systems, because they require additional device drivers and DLLs supplied by the manufacturer. These systems also require special cabling.

Chapter 2: Installation Problems

The installation process for Microsoft Cluster Server (MSCS) is very simple compared to other network server applications, and usually completes in just a few minutes. For a software package that does so much, the speed with which MSCS installs might surprise you. In reality, MSCS is more complex behind the scenes, and installation depends greatly on the compatibility and proper configuration of the system hardware and networks. If the hardware configuration is not acceptable, installation problems are to be expected. After installation, be sure to evaluate the proper operation of the entire cluster before installing additional software.

MSCS Installation Problems with the First Node

Is Hardware Compatible?

It is important to use certified systems for MSCS installations. Use systems and components from the MSCS Hardware Compatibility List (HCL). For many, the main reason for installing a cluster is to achieve high availability of valuable resources. Why compromise availability by using unsupported hardware? Microsoft supports only MSCS installations that use certified complete systems from the MSCS Hardware Compatibility List. If the system fails and you need support, unsupported hardware may compromise high availability.

Is the Shared SCSI Bus Connected and Configured Properly?

MSCS relies heavily on the shared SCSI bus. You must have at least one device on the shared bus to store the quorum logfile and act as the cluster's quorum disk. Access to this disk is vital to the cluster. In the event of a system failure or loss of network communication between nodes, cluster nodes arbitrate for access to the quorum disk to determine which system will take control and make decisions. The quorum logfile holds information regarding configuration changes made within the cluster while another node is offline or unreachable. The installation process requires at least one device on the shared bus for this purpose. A hardware RAID logical partition or separate physical disk drive is sufficient to store the quorum logfile and function as the quorum disk.

To check proper operation of the shared SCSI bus, consult Chapter 5, "Troubleshooting the shared SCSI bus," later in this document.

Install Windows NT Server, Enterprise Edition, and Service Pack 3

MSCS version 1.0 requires Microsoft Windows NT Server, Enterprise Edition, version 4.0 with Service Pack 3 or later. If you add network adapters or other hardware devices and drivers later, it's important to reapply the service pack to ensure that all drivers, DLLs, and system components are of the same version. Hotfixes may require reapplication if they are overwritten. Check with Microsoft Product Support Services or the Microsoft Knowledge Base regarding applied hotfixes, and to determine whether the hotfix needs to be reapplied.

Does the System Disk Have Adequate Free Space to Install the Product?

MSCS requires only a few megabytes to store files on each system. The Setup program prompts for the path to store these files. The path should be to local storage on each server, not to a drive on the shared SCSI bus. Make sure that free space exists on the system disk, both for installation requirements and for normal system operation.

Does the Server Have a Properly Sized System Paging File?

If you've experienced reduced system performance or near system lockup during the installation process, check the Performance tab in the System utility of Control Panel. Make sure the system has acceptable paging file space (the minimum space required is the amount of physical RAM plus 11 MB; for example, a server with 256 MB of RAM needs a paging file of at least 267 MB), and that the system drive has enough free space to hold a memory dump file, should a system crash occur. Also, make sure pagefiles are on local disks only, not on shared drives. Performance Monitor may be a valuable resource for troubleshooting virtual memory problems.

Do Both Servers Belong to the Same Domain?

Both servers in the cluster must have membership in the same domain. Also, the service account that the Cluster Service uses must be the same on both servers. Cluster nodes may be domain controllers or domain member servers. However, if a node functions as a domain member server, a domain controller must be accessible to authenticate the Cluster Service account. This is a requirement for any service that starts using a domain account.

Is the Primary Domain Controller (PDC) Accessible?

During the installation process, Setup must be able to communicate with the PDC. Otherwise, the setup process will fail. Additionally, after setup, the cluster service may not start if domain controllers are unavailable to authenticate the cluster service account. For best results, make sure each system has connectivity with the PDC, and install each node as a backup domain controller in the same domain.

Are You Installing While Logged On as an Administrator?

To install MSCS, you must have administrative rights on each server. For best results, log on to the server with an administrative account before you start Setup.

Do the Drives on the Shared SCSI Bus Appear to Be Functioning Properly?

Devices on the shared SCSI bus must be turned on, configured, and functioning properly. Consult the Microsoft Cluster Server Administrator's Guide for information on testing the drives before setup.

Are Any Errors Listed in the Event Log?

Before you install new software of any kind, it is good practice to check the system and application event logs for errors. This resource can indicate the state of the system before you make configuration changes. Events may be posted to these logs if installation errors or hardware malfunctions occur during the installation process. Attempt to correct any problems you find. Appendix A of this document contains information regarding some events that may be related to MSCS and possible resolutions.

Is the Network Configured and Functioning Properly?

MSCS relies heavily on configured networks for communications between cluster nodes, and for client access. With improper function or configuration, the cluster software cannot function properly. The installation process attempts to validate attached networks and needs to use them during the process. Make sure that the network adapters and TCP/IP protocol are configured properly with correct IP addresses. If necessary, consult with your network administrator for proper addressing.

For best results, use statically assigned addresses and do not rely on DHCP to supply addresses for these servers. Also, make sure you're using the correct network adapter driver. Some adapter drivers may appear to work, because they are similar enough to the actual driver needed but are not an exact match. For example, an OEM or integrated network adapter may use the same chipset as a standard version of the adapter. Use of the same chipset may cause the standard version of the driver to load instead of an OEM supplied driver. Some of these adapters work more reliably with the driver supplied by the OEM, and may not attain acceptable performance if using the standard driver. In some cases, this combination may prevent the adapter from functioning at all, even though no errors appear in the system event log for the adapter.

Cannot Install MSCS on the Second Node

The previous section, "MSCS Installation Problems with the First Node," contains questions you need to ask if installation on the second node fails. Please consult this section first, before you continue with additional troubleshooting questions in this section.

During Installation, Are You Specifying the Same Cluster Name to Join?

When you install the second node, select the Join an Existing Cluster option. The first node you installed must be online, with the Cluster Service running, at the time.

Is the RPC Service Running on Both Systems?

MSCS uses remote procedure calls (RPC) and requires that the RPC service be running on both systems. Check to make sure that the RPC service is running on both systems and that the system event logs on each server do not have any RPC-related errors.
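
A quick way to confirm that the service is running on a node is from a command prompt; Net.exe lists started services by their display names:

  rem List started services and filter for the RPC service
  net start | find "Remote Procedure Call"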

Can Each Node Communicate with One Another Over Configured Networks?

Evaluate network connectivity between systems. If you used the procedures in the preinstallation section of this document, then you've already covered the basics. During installation of the second node, the installation program communicates through the server's primary network and through any other networks that were configured during installation of the first node. Therefore, you should test connectivity again with the IP addresses on these adapters. Additionally, the cluster name and associated IP address you configured earlier will be used. Make sure the Cluster Service is running on the first node and that the cluster name and cluster IP address resources are online and available. Also, make sure that the correct network was specified for the cluster IP address when the first node was installed; otherwise, the Cluster Service may be registering the cluster name on the wrong network. The cluster name resource should be registered on the network that clients will use to connect to the cluster.

Are Both Nodes Connected to the Same Network or Subnet?

Both nodes need to use unique addresses on the same network or subnet. The cluster nodes need to be able to communicate directly, without routers or bridges between them. If the nodes are not directly connected to the same public network, it will not be possible to fail over IP addresses.

Cannot Reinstall MSCS After Node Evicted

If you evict a node from the cluster, it may no longer participate in cluster operations. If you restart the evicted node and have not removed MSCS from it, the node will still attempt to join, and cluster membership will be denied. You must remove MSCS with the Add/Remove Programs utility in Control Panel. This action requires that you restart the system. If you ignore the option to restart, and attempt to reinstall the software anyway, you may receive the following error message:

[Screen shot: MSCS Setup error message]

If you receive this message, restart the affected system and reinstall the MSCS software to join the existing cluster.

Chapter 3: Post-Installation Problems

As you troubleshoot or perform cluster maintenance, it may be possible to keep resources available on one of the two nodes. If you can use at least one of the nodes for resources while troubleshooting, you can keep as many resources as possible available to users during administrative activity. In some cases, it may be desirable to run with some unavailable resources rather than none at all.

The most likely causes for one or all nodes to be down are related to the shared SCSI bus. If only one node is down, check for SCSI-related problems or for communication problems between the nodes; these are the most common sources of node failures.

Entire Cluster Is Down

If the entire cluster is down, try to bring at least one node online. If you can achieve this goal, the effect on users may be substantially reduced. When a node is online, gather event log data or other information that may be helpful to troubleshoot the failure. Check for the existence of a recent Memory.dmp file that may have been created by a recent crash. If necessary, contact Microsoft Product Support Services for assistance with this file.

One Node Is Down

If a single node is unavailable, make sure that resources and groups are available on the other node. If they are, begin troubleshooting the failed node. Try to bring it up and gather error data from the event log or cluster diagnostic logfile.

Applying Service Packs and Hotfixes

If you're applying service packs or hotfixes, avoid applying them to both nodes at the same time, unless otherwise directed by release notes, Knowledge Base articles, or other instructions. Applying the updates to a single node at a time avoids rendering both nodes unavailable at once. More information on this topic may be found in Microsoft Knowledge Base article 174799, "How to Install Service Packs in a Cluster."

One or More Servers Quit Responding

If one or more servers are not responding but have not crashed or otherwise failed, the problem may be related to configuration, software, or driver issues. You can also check the shared SCSI bus or connected disk devices.

If the servers are installed as member servers (non-domain controllers), it is possible that one or both nodes may stop responding if connectivity with domain controllers becomes unavailable. Both the cluster service and other applications use remote procedure calls (RPCs). Many RPC-related operations require domain authentication. As cluster nodes must participate in domain security, it is necessary to have reliable domain authentication available. Check network connectivity with domain controllers and for other network problems. To avoid this potential problem, it is preferred that the nodes be installed as backup domain controllers (BDC). The BDC configuration allows each node to perform authentication for itself despite problems that could exist on a wide area network (WAN).

Cluster Service Will Not Start

There are a variety of conditions that could prevent the Cluster Service (ClusSvc) from starting. Many of these conditions may be the result of configuration or hardware related problems. The first things to check when diagnosing this condition are the items on which the Cluster Service depends. Many of these items are referenced in Chapter 1 of this document. Common causes for this problem, with their error messages, are noted below.

Check the service account under which ClusSvc runs. This domain account needs to be a member of the local Administrators group on each server. The account needs the Log on as a service and Lock pages in memory rights. Make sure the account is not disabled and that password expiration is not a factor. If the failure is because of a problem related to the service account, the Service Control Manager (SCM) will not allow the service to load, much less run. As a result, if you've enabled diagnostic logging for the Cluster Service, no new entries will be written to the log, and a previous logfile may exist. Failures related to the service account may result in Event ID 7000 or Event ID 7013 errors in the event log. In addition, you may receive the following pop-up error message:

Could not start the Cluster Service on \\computername. Error 1069: The service did not start because of a logon failure.

Check to make sure the quorum disk is online and that the shared SCSI bus has proper termination and proper function. If the quorum disk is not accessible during startup, the following popup error message may occur:

Could not start the Cluster Service on \\computername. Error 0021: The device is not ready.

Also, if diagnostic logging for the Cluster Service is enabled, the logfile entries may indicate problems attaching to the disk. See Appendix B for more information and a detailed example of the logfile entries for this condition, Example 1: Quorum Disk Turned Off.

If the Cluster Service is running on the other cluster node, check the cluster logfile (if it is enabled) on that system for indications of whether or not the other node attempted to join the cluster. If the node did try to join the cluster and the request was denied, the logfile may contain details of the event. For example, if you evict a node from the cluster, but do not remove and reinstall MSCS on that node, the request to join will be denied when the server attempts to join the cluster. The following are sample error messages and event messages:

Could not start the Cluster Service on \\computername. Error 5028: Size of job is %1 bytes.

Event ID 1009, Event ID 1063, Event ID 1069, Event ID 1070, Event ID 7023

For examples of logfile entries for this type of failure, see the Example 4: Evicted Node Attempts to Join Existing Cluster section in Appendix B of this document.

If the Cluster Service won't start, check the event log for Events 7000 and 7013. These events may indicate a problem authenticating the Cluster Service account. Make sure the password specified for the Cluster Service account is correct. Also make sure that a domain controller is available to authenticate the account, if the servers are not domain controllers.

Cluster Service Starts but Cluster Administrator Won't Connect

If the Services utility in Control Panel indicates that the service is running, and you cannot connect with Cluster Administrator to administer the cluster, the problem may be related to the Cluster Network Name or to the cluster IP address resources. There may also be RPC-related problems. Check to make sure the RPC Service is running on both nodes. If it is, try to connect to a known running cluster node by the computer name. This is probably the best name to use when troubleshooting to avoid RPC timeout delays during failover of the cluster group. If running Cluster Administrator on the local node, you may specify a period (.) in place of the name when prompted. This will create a local connection and will not require name resolution.

If you can connect through the computer name or ".", check the cluster network name and cluster IP address resources. Make sure that these and other resources in the cluster group are online. These resources may fail if a duplicate name or IP address on the network conflicts with either of these resources. A duplicate IP address on the network may cause the network adapter to shut down. Check the system event log for errors.

Examples of logfile entries for this type of failure may be found in the Example 3: Duplicate Cluster IP Address section in Appendix B of this document.

Group/Resource Failover Problems

The typical reason that a group does not fail over properly is a problem with resources within the group. For example, if you elect to move a group from one node to another, the resources within the group will be taken offline, and ownership of the group will be transferred to the other node. On receiving ownership, the node will attempt to bring resources online, according to dependencies defined for the resources. If resources fail to go online, MSCS attempts again to bring them online. After repeated failures, the failing resource or resources may affect the group and cause the group to transition back to the previous node. Eventually, if failures continue, the group or affected resources may be taken offline. You can configure the number of attempts and allowed failures through resource and group properties.

When you experience problems with group or resource failover, evaluate which resource or resources may be failing. Determine why the resource won't go online. Check resource dependencies for proper configuration and make sure they are available. Also, make sure that the "Possible Owners" list includes both nodes. The "Preferred Owners" list is designed for automatic failback or initial group placement within the cluster. In a two-node cluster, this list should only contain the name of the preferred node for the group, and should not contain multiple entries.

If resource properties do not appear to be part of the problem, check the event log or cluster logfile for details. These files may contain helpful information related to the resource or resources in question.

Physical Disk Resource Problems

Problems with physical disk resources are usually hardware related. Cables, termination, or SCSI host adapter configuration may cause problems with failover, or may cause premature failure of the resource. The system event log may often show events related to physical disk or controller problems. However, some cable or termination problems may not yield such helpful information. It is important to verify the configuration of the shared SCSI bus and attached devices, whenever you detect trouble with one of these devices. Marginal cable connections or cable quality can cause intermittent failures that are difficult to troubleshoot. BIOS or firmware problems might also be factors.

Quorum Resource Failures

If the Cluster Service won't start because of a quorum disk failure, check the corresponding device. If necessary, use the -fixquorum startup option for the Cluster Service, to gain access to the cluster and redesignate the quorum disk. This process may be necessary if you replace a failed drive, or attempt to use a different device in the interim. To view or change the quorum drive settings, right-click the cluster name at the top of the tree, listed on the left portion of the Cluster Administrator window, and select Properties. The Cluster Properties window contains three different tabs, one of which is for the quorum disk. From this tab, you may view or change quorum disk settings. You may also re-designate the quorum resource. More information on this topic may be found in Microsoft Knowledge Base article 172944, "How to Change Quorum Disk Designation."
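
As a sketch of that recovery path, the switch can be passed to the service from a command prompt on one node (see Knowledge Base article 172944 for the complete procedure); while started this way, the node does not bring resources online automatically:

  rem Start the Cluster Service on one node only, bypassing the damaged quorum disk
  net start clussvc /fixquorum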

Failures of the quorum device while the cluster is in operation are usually related to hardware problems, or to configuration of the shared SCSI bus. Use troubleshooting techniques to evaluate proper operation of the shared SCSI bus and attached devices.

[Screen shot: the Quorum tab of the Cluster Properties dialog box]

File Share Won't Go Online

For a file share to reach online status, the resources it depends on must exist and be online, and the path for the share must exist. Permissions on the file share directory must also include at least Read access for the Cluster Service account.
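
To review those permissions from a command prompt, the Cacls.exe utility included with Windows NT can display the directory's access control list; the path below is only an example:

  rem Display the permissions on the file share directory
  cacls Y:\Data

The output should include at least Read access for the Cluster Service account.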

Problems Accessing Drive

If you attempt to access a shared drive through its drive letter, it is possible that you may receive an Incorrect Function error. The error may be a result of the drive not being online on the node from which you are accessing it; the drive may be owned by, and online on, the other cluster node. Check Cluster Administrator for ownership of the resource and online status. If necessary, consult the Physical Disk Resource Problems section of this document. The error could also indicate drive or controller problems.

Chapter 4: Administrative Issues

Cannot Connect to Cluster Through Cluster Administrator

If you try to administer the cluster from a remote workstation, the most common way to do so is to use the network name you defined during the setup process as the Cluster Name. This resource is located in the Cluster Group. Cluster Administrator needs to establish a connection using RPC. If the RPC service has failed on the cluster node that owns the Cluster Group, it will not be possible to connect through the Cluster Name or the name of that computer. Try to connect, instead, using the computer names of each cluster node. If this works, it indicates a problem with either the IP address or Network Name resources in the Cluster Group. There may also be a name resolution problem on the network that prevents access through the Cluster Name.

Failure to connect using the Cluster Name or computer names of either node may indicate problems with the server, with RPC connectivity, or with security. Make sure that you are logged on with an administrative account in the domain, and that the account has access to administer the cluster. Access may be granted to additional accounts by using Cluster Administrator on one of the cluster nodes. For more information on controlling administrative access to the cluster, see "Specifying Which Users Can Administer a Cluster" in the MSCS Administrator's Guide.

If Cluster Administrator cannot connect from the local console of one of the cluster nodes, check to see if the Cluster Service is started. Check the system event log for errors. You may want to enable diagnostic logging for the Cluster Service. If the problem occurs after recently starting the system, wait 30 to 60 seconds for the Cluster Service to start, and then try to run Cluster Administrator again.

Cluster Administrator Loses Connection or Stops Responding on Failover

The Cluster Administrator application uses RPC communications to connect with the cluster. If you use the Cluster Name to establish the connection, Cluster Administrator may appear to stop responding during a failover of the Cluster Group and its resources. This normal delay occurs during the registration of the IP address and network name resources within the group, and the establishment of a new RPC connection. If a problem occurs with the registration of these resources, the process may take extended time until these resources become available. The first RPC connection must time out before the application attempts to establish another connection. As a result, Cluster Administrator may eventually time out if there are problems bringing the IP address or network name resources online within the Cluster Group. In this situation, try to connect using the computer name of one of the cluster nodes, instead of the cluster name. This usually allows a more real-time display of resource and group transitions without delay.

Cannot Move a Group

To move a group from one node to another, you must have administrative rights to run Cluster Administrator. The destination node must be online and the cluster service started. The state of the node must be online and not Paused. In a paused state, the node is a fully active member in the cluster, but cannot own or run groups.

Both cluster nodes should be listed in the Possible Owners list for the resources within the group; otherwise, the group may only be owned by a single node and will not fail over. While this restriction may be intentional in some configurations, in most cases it is a mistake, because it prevents the entire group from failing over. Also, to move a group, resources within the group cannot be in a pending state. To initiate a Move Group request, resources must be in one of the following three states: online, offline, or failed.
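
The same operation is also available from the command line with Cluster.exe (see Appendix C); this is only a sketch, and the cluster, group, and node names are placeholders:

  rem Check the state and owner of the group
  cluster MYCLUSTER group "SomeGroup" /status
  rem Move the group to the other node
  cluster MYCLUSTER group "SomeGroup" /moveto:NODE2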

Cannot Delete a Group

To properly delete a group from the cluster, the group must not contain resources. You may either delete the resources contained within the group, or move them to another group in the cluster.
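
With Cluster.exe, a sketch of the same cleanup might look like the following; the cluster, resource, and group names are placeholders:

  rem Move a remaining resource to another group, then delete the empty group
  cluster MYCLUSTER resource "SomeResource" /moveto:"OtherGroup"
  cluster MYCLUSTER group "SomeGroup" /delete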

Problems Adding, Deleting, or Moving Resources

Adding Resources

Resources are usually easy to add. However, it is important to understand the various resource types and their requirements. Some resource types have prerequisites for other resources that must exist within the same group; as you work with MSCS, you will become more familiar with these requirements. You may find that a resource depends on one or more resources within the same group. Examples might include IP addresses, network names, or physical disks. The resource wizard will typically indicate mandatory requirements for other resources. However, in some cases it is a good idea to add related resources to the dependency list even when they are not strictly required. While Cluster.exe allows the addition of resources and groups, the command-line utility does not impose the dependency and resource property constraints that Cluster Administrator does, because these activities may span multiple commands.

For example, suppose you want to create a network name resource in a new group. If you try to create the network name resource first, the wizard will indicate that it depends on an IP address resource. The wizard lists available resources in the group from which you select. If this is a new group, the list may be empty. Therefore, you will need to create the required IP address resource before you create the network name.

If you create another resource in the group and make it dependent on the network name resource, the resource will not go online without the network name resource in an online state. A good example might be a File Share resource. Thus, the share will not be brought online until the network name is online. Because the network name resource depends on an IP address resource, it would be repetitive to make the share also dependent on the same IP address. The established dependency with the network name implies a dependency on the address. You can think of this as a cascading dependency.

You might ask, "What about the disk where the data will be? Shouldn't the share depend on the existence or online status of the disk?" Yes, you should create a dependency on the physical disk resource, although this dependency is not required. If the resource wizard did impose this requirement, it would imply that the only data source that could be used for a file share is a physical disk resource on the shared SCSI bus. For volatile data, shared storage is the way to go, and a dependency should be created for it. This way, if the disk experiences a momentary failure, the share will be taken offline and restored when the disk becomes available. However, because a dependency on a physical disk resource is not required, the administrator has additional flexibility to use other disk storage for holding data. If you use storage other than a physical disk resource for the share, then for the share to move to the other node, equivalent storage with the same drive letter and the same information must also be available there. Further, there must be some method of data replication or mirroring for this type of storage if the data is volatile. Some third parties may have solutions for this situation. Use of local storage in this manner is not recommended for read/write shares. For read-only information, the two data sources can remain in sync, and problems with out-of-sync data are avoided.

If you use a shared drive for data storage, make sure to establish the dependency with the share and with any other resources that depend on it. Failure to do so may cause erratic or undesired behavior of resources that depend on the disk resource. Some applications or services that rely on the disk may terminate as a result of not having the dependency.

If you use Cluster.exe to create the same resources, note that it is possible to create a network name resource without the required IP address resource. However, the network name will not go online, and attempts to bring it online will generate errors.
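
To illustrate the ordering that the dependencies impose, the following Cluster.exe sketch creates the IP address resource first and then the network name. All names and addresses are placeholders, and the private property names shown are assumptions that can be confirmed by running the /priv switch against an existing resource of the same type:

  rem Create the IP address resource and set its TCP/IP parameters
  cluster MYCLUSTER resource "SomeIP" /create /group:"SomeGroup" /type:"IP Address"
  cluster MYCLUSTER resource "SomeIP" /priv Address=192.168.1.10 SubnetMask=255.255.255.0 Network="Public"
  rem Create the network name resource, make it depend on the IP address, and bring it online
  cluster MYCLUSTER resource "SomeName" /create /group:"SomeGroup" /type:"Network Name"
  cluster MYCLUSTER resource "SomeName" /priv Name=SOMESERVER
  cluster MYCLUSTER resource "SomeName" /adddep:"SomeIP"
  cluster MYCLUSTER resource "SomeName" /online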

Using the Generic Application/Service Resources for Third-Party Applications

While some third-party services may require modification for use within a cluster, many services function normally while controlled by the generic service resource type provided with MSCS. If you have a program that runs as an application on the server's desktop that you want to be highly available, you may be able to use the generic application resource type to control this application within the cluster.

The parameters for each of these generic resource types are similar. However, when planning to have MSCS manage these resources, it is necessary to first be familiar with the software and with the resources that software requires. For example, the software might create a share of some kind for clients to access data. Most applications need access to their installation directory to access DLL or INI files, to access stored data, or, perhaps, to create temporary files. In some cases, it may be wise to install the software on a shared drive in the cluster, so that the software and necessary components may be available to either node, if the group that contains the service moves to another cluster node.

Consider a service called SomeService. Assume this is a third-party service that does something useful. The service requires that a share named SS_SHARE exist, and that it map to a directory called Data beneath the installation directory. The startup mode for the service is set to AUTOMATIC, so that the service starts automatically after the system starts. Normally, the service would be installed to C:\SomeService, and it stores dynamic configuration details in the following registry key:

HKEY_LOCAL_MACHINE\Software\SomeCompany\SomeService

If you wanted to configure MSCS to manage this service and make it available through the cluster, you would probably take the following actions:

  1. Create a group using Cluster Administrator. You might call it SomeGroup to remain consistent with the software naming convention.

  2. Make sure the group has a physical disk resource to store the data and the software, an IP address resource, and a network name resource. For the network name, you might use something like SomeServer, for clients to access the share that will be in the group.

  3. Install the software on the shared drive (drive Y, for example).

  4. Using Cluster Administrator, create a File Share resource named SS_SHARE in the group. Make the file share resource dependent on the physical disk and network name; if either of these resources fails or goes offline, you want the share to follow the state of that resource. Set the path to the Data directory on the shared drive. According to what you know about the software, this should be Y:\SomeService\Data.

  5. Set the startup mode for the service to MANUAL. Because MSCS will be controlling the service, the service does not need to start itself before MSCS has a chance to start and bring the physical disk and other resources online.

  6. Create a generic service resource in the group. The name for the resource should describe what it corresponds to; you might call it SomeService, to match the service name. Allow both cluster nodes as possible owners. Make the resource dependent on the physical disk resource and network name. Specify the service name and any necessary service parameters. Click to select the Use network name for computer name option. This causes the application's API call requesting the computer name to return the network name in the group. Specify that the registry key should be replicated by adding the following line under the Registry Replication tab: Software\SomeCompany\SomeService.

  7. Bring all the resources in the group online and test the service.

  8. If the service works correctly, stop the service by taking the generic service offline.

  9. Move the group to the other node.

  10. Install the service on the other node using the same parameters and installation directory on the shared drive.

  11. Make sure to set the startup mode to MANUAL using the Services utility in Control Panel.

  12. Bring all the required resources and the generic service resource online, and test the service.

Note: If you evict a node from the cluster at any time, and have to completely reinstall a cluster node from the beginning, you will likely need to repeat steps 10 through 12 on the node if you add it back to the cluster. The procedure described here is generic in nature, and may be adaptable to various applications. If you are uncertain how to configure a service in the cluster, contact the application software vendor for more information.
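
For reference, a rough Cluster.exe equivalent of step 6 follows. The cluster, resource, and dependency names are placeholders, the ServiceName property name is an assumption for the Generic Service resource type, and the Use network name for computer name and Registry Replication settings are most easily configured from the resource's property sheets in Cluster Administrator:

  rem Create the generic service resource, add its dependencies, and bring it online
  cluster MYCLUSTER resource "SomeService" /create /group:"SomeGroup" /type:"Generic Service"
  cluster MYCLUSTER resource "SomeService" /priv ServiceName=SomeService
  cluster MYCLUSTER resource "SomeService" /adddep:"Disk Y:"
  cluster MYCLUSTER resource "SomeService" /adddep:"SomeServer"
  cluster MYCLUSTER resource "SomeService" /online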

Applications follow a similar procedure, except that you must substitute the generic application resource type for the generic service resource type used in the above procedure. If you have a simple application that is already installed on both systems, you may adapt the following steps to the procedure previously described (a command-line sketch follows this list):

  1. Create a generic application resource in a group. For this example, we will make Notepad.exe a highly available application.

  2. For the command line, specify c:\WinNT\System32\Notepad.exe (or a different path, depending on your Windows NT installation directory). The path must be the same on both cluster nodes. Be sure to specify the working directory as needed, and select the Allow application to interact with the desktop option so that Notepad.exe is not put in the background.

  3. Skip the Registry Replication tab, because Notepad.exe does not have registry keys requiring replication.

  4. Bring the resource online and notice that it appears on the desktop. Choose Move Group, and the application should appear on the other node's desktop.
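
As a rough command-line equivalent of these steps, the sketch below creates and tests the generic application resource with CLUSTER.EXE. The resource, group, and node names are hypothetical, and the private property names (CommandLine, InteractWithDesktop) are assumptions; confirm them with CLUSTER RESOURCE resourcename /PRIV or the Administrator's Guide.

  REM Hypothetical names; property and option spellings are assumptions.
  CLUSTER RESOURCE "Notepad" /CREATE /GROUP:"SomeGroup" /TYPE:"Generic Application"
  CLUSTER RESOURCE "Notepad" /PRIV CommandLine="c:\WinNT\System32\Notepad.exe"
  CLUSTER RESOURCE "Notepad" /PRIV InteractWithDesktop=1
  REM Bring the resource online, then move the group to test failover.
  CLUSTER RESOURCE "Notepad" /ONLINE
  CLUSTER GROUP "SomeGroup" /MOVETO:NODE2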

Some cluster-aware applications may not require this type of setup and they may have setup wizards to create necessary cluster resources.

Deleting Resources

Some resources may be difficult to delete if any cluster nodes are offline. For example, you may be able to delete an IP address resource if only one cluster node is online. However, if you try to delete a physical disk resource while in this condition, an error message dialog box may appear similar to the following:

Cc723248.erdelres(en-us,TechNet.10).gif

Physical disk resources affect the disk configuration on each node in the cluster and must be dealt with accordingly on each system at the same time. Therefore, all cluster nodes must be online to remove this type of resource from the cluster.

If you attempt to remove a resource on which other resources depend, a dialog box listing the related resources will be displayed. These resources will also be deleted, as they are linked by dependency to the individual resource chosen for removal. To avoid removal of these resources, first change or remove the configured dependencies.

Moving Resources from One Group to Another

To move resources from one group to another, both groups must be owned by the same cluster node. Attempts to move resources between groups with different owners may result in the following pop-up error message:

Cc723248.mscs01(en-us,TechNet.10).gif

This situation is easily corrected by moving one of the groups so that both groups have the same owner. Equally important, the resource to be moved may have dependent resources. If a dependency exists between the resource to be moved and another resource, a prompt may appear that lists the related resources that must move with it:

Cc723248.mscs02(en-us,TechNet.10).gif

Problems moving resources between groups other than those mentioned in this section may be caused by system problems or configuration-related issues. Check event logs or cluster logfiles for more information that may relate to the resource in question.

Chkdsk and Autochk

Disks attached to the shared SCSI bus interact differently with Chkdsk and the companion system startup version of the same program, Autochk. Autochk does not perform Chkdsk operations on shared drives when the system starts, even if the operations are needed. Instead, MSCS performs a file system integrity check on each drive when bringing a physical disk resource online, and automatically launches Chkdsk as necessary.

If you need to run Chkdsk on a drive, consult the following Microsoft Knowledge Base articles:

174617

Chkdsk Runs while Running Microsoft Cluster Server Setup

176970

Chkdsk /f Does Not Run on the Shared Cluster Disk

174797

How to Run CHKDSK on a Shared Drive

Chapter 5: Troubleshooting the shared SCSI bus

Verifying Configuration

For the shared SCSI bus to work correctly, the SCSI host adapters must be configured correctly. As the SCSI specification requires, each device on the bus must have a unique ID number. For proper operation, ensure that the host adapters are each set to a unique ID. For best results, set one adapter to ID 6 and the other adapter to ID 7 to ensure that the host adapters have adequate priority on the bus. Also, make sure that both adapters have the same firmware revision level. Because the shared SCSI bus is not used for booting the operating system, disable the BIOS on each adapter unless otherwise directed by the hardware vendor.

Make sure that you connect only physical disk or hardware RAID devices to the shared bus. Devices other than these, such as tape drives, CD-ROM drives, or removable media devices, should not be used on the shared bus. You may use them on another bus for local storage.

Cables and termination are vital parts of the SCSI bus configuration, and should not be compromised. Cables need to be of high quality and within SCSI specifications. The total cable length on the shared SCSI bus needs to be within specifications. Cables supplied with complete certified systems should be correct for use with the shared SCSI bus. Check for bent pins on SCSI cable connectors and devices, and ensure that each cable is attached firmly.

Correct termination is also important. Terminate the bus at both ends, and use active terminators. Use of SCSI Y cables may allow disconnection of one of the nodes from the shared bus without losing termination. If you have terminators attached to each end of the bus, make sure that the controllers are not trying to also terminate the bus.

Make sure that all devices connected to the bus are rated for the type of controllers used. For example, do not attach differential SCSI devices to a standard SCSI controller. Verify that the controllers can each identify every disk device attached to the bus. Make sure that the configuration of each disk device is correct. Some newer smart devices can automatically terminate the bus or negotiate for SCSI IDs. If the controllers do not support this, configure the drives manually. A mixture of smart devices with others that require manual configuration can lead to problems in some configurations. For best results, configure the devices manually.

Also, make sure that the SCSI controllers on the shared bus are configured correctly and with the same parameters (other than SCSI ID). Differences in data transfer rate or other parameters between the two controllers may cause unpredictable behavior.

Adding Devices to the Shared SCSI Bus

To add disk devices to the shared SCSI bus, you must properly shut down all equipment and both cluster nodes. This is necessary because the SCSI bus may be disconnected while adding the device or devices. Attempting to add devices while the cluster and devices are in use may induce failures or other serious problems that may not be recoverable. Add the new device or devices in the same way you add a device to a standard SCSI bus. This means you must choose a unique SCSI ID for the new device, and ensure that the device configuration is correct for the bus and termination scheme. Verify cable and termination before applying power. Turn on one cluster node, and use Disk Administrator to assign a drive letter and format each new device. Before turning on the other node, create a physical disk resource using Cluster Administrator. After you create the physical disk resource and verify that the resource will go online successfully, turn on the other cluster node and allow it to join the cluster. Allowing both nodes to be online without first creating a disk resource for the new device can lead to file system corruption, as both nodes may have different interpretations of disk structure.

Verifying Cables and Termination

A good procedure for verifying cable and termination integrity is to first use the SCSI host adapter utilities to determine whether the adapter can identify all disk devices on the bus. Perform this check with only one node turned on. Then, turn off that computer and perform the same check on the other system. If this initial check succeeds, the next step is to check drive identification from the operating system level, again with only one of the nodes turned on. If MSCS is already installed, the cluster service must be started for the shared drives to go online. Check to make sure the shared drives go online. If a device fails to come online, the device itself, a cable, or the termination may be at fault.

Chapter 6: Client Connectivity Problems

Clients Have Intermittent Connectivity Based on Group Ownership

If clients successfully connect to clustered resources only when a specific node is the owner, a few possible problems could lead to this condition. Check the system event log on each server for possible errors. Check to make sure that the group has at least one IP address resource and one network name resource, and that clients use one of these to access the resource or resources within the group. If clients connect with any other network name or IP address, they may not be accessing the correct server in the event that ownership of the resources changes. As a result of improper addressing, access to these resources may appear limited to a particular node.

If you are able to confirm that clients use proper addressing for the resource or resources, check the IP address and network name resources to see that they are online. Check network connectivity with the server that owns the resources. For example, try some of the following techniques:

From the server:

PING server's primary adapter IP address (on client network)
PING other server's primary adapter IP address (on client network)
PING IP address of the group
PING Network Name of the group
PING Router/Gateway between client and server (if any)
PING Client IP address

If the tests succeed through the router/gateway check, you have connectivity with the other server and with local addresses, so the problem is probably elsewhere on the network. If every test succeeds except the ping of the client IP address, suspect a client configuration or routing problem.

From the client:

PING Client IP address
PING Router/Gateway between client and server (if any)
PING server's primary adapter IP address (on client network)
PING other server's primary adapter IP address (on client network)
PING IP address of the group
PING Network Name of the group

If the tests from the server all pass, but you experience failures performing tests from the client, there may be client configuration problems. If all tests complete except the test using the network name of the group, there may be a name resolution problem. This may be related to client configuration, or it may be a problem with the client's designated WINS server. These problems may require network administrator intervention.
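
To make these checks repeatable, you can place the addresses in a simple batch file. The sketch below is a minimal example; every address and name in it is a placeholder, so substitute the real adapter addresses, the group's IP address and network name, the gateway, and the client address for your network.

  @echo off
  REM Connectivity sweep; all addresses and names below are placeholders.
  REM 10.1.1.1 = this server, 10.1.1.2 = other server, 10.1.1.10 = group IP address,
  REM SOMESERVER = group network name, 10.1.1.254 = gateway, 10.1.1.50 = client.
  for %%A in (10.1.1.1 10.1.1.2 10.1.1.10 SOMESERVER 10.1.1.254 10.1.1.50) do ping -n 2 %%A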

Clients Do Not Have Any Connectivity with the Cluster

If clients lose connectivity with both cluster nodes, check to make sure that the Cluster Service is running on each node. Check the system event log for possible errors. Check network connectivity between cluster nodes, and with other network devices, by using the procedure in the previous section. If the Cluster Service is running, and there are no apparent connectivity problems between the two servers, there is likely a network or client configuration problem that does not directly involve the cluster. Check to make sure the client uses the TCP/IP protocol, and has a valid IP address on the network. Also, make sure that the client is using the correct network name or IP address to access the cluster.

Clients Have Problems Accessing Data Through a File Share

If clients experience problems accessing cluster file shares, first check the resource and make sure it is online, and that any dependent resources (disks, network names, and so on) are online. Check the system event log for possible errors. Next, check network connectivity between the client and the server that owns the resource. If the data for the share is on a shared drive (using a physical disk resource), make sure that the file share resource has a dependency declared for the physical disk resource. You can reset the file share by toggling the file share resource offline and back online again. Cluster file shares behave essentially the same as standard file shares. So, make sure that clients have appropriate access at both the file system level and the share level. Also, make sure that the server has the proper number of client access licenses loaded for the clients connecting, in the event that the client cannot connect because of insufficient available connections.
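
If you prefer the command line, the same offline/online toggle can be done with CLUSTER.EXE, as in the sketch below; the resource name is the hypothetical share used earlier in this paper, and the option names are assumptions to check against CLUSTER /?.

  REM Reset a file share resource by cycling it (hypothetical resource name).
  CLUSTER RESOURCE "SS_SHARE" /OFFLINE
  CLUSTER RESOURCE "SS_SHARE" /ONLINE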

Clients Cannot Access Cluster Resources Immediately After IP Address Change

If you create a new IP address resource or change the IP address of an existing resource, clients may experience some delay if you use WINS for name resolution on the network. This problem may occur because of delays in replication between WINS servers on the network. MSCS cannot control these delays; you must allow sufficient time for WINS replication to complete. If you suspect there is a WINS database problem, consult your network administrator, or contact Microsoft Product Support Services for TCP/IP support.

Clients Experience Intermittent Access

Network adapter configuration is one possible cause of intermittent access to the cluster, and of premature failover. Adapters set to autosense the network speed can spontaneously redetect it, and network traffic through the adapter may be disrupted during the detection. For best results, set the network speed manually to avoid the recalibration. Also, make sure to use the correct network adapter drivers. Some adapters may require special drivers even though they are detected as a similar device.

Chapter 7: Maintenance

Most maintenance operations within a cluster may be performed with one or more nodes online, and usually without taking the entire cluster offline. This ability allows higher availability of cluster resources.

Installing Service Packs

Microsoft Windows NT service packs can normally be installed on one node at a time and tested before you move resources to that node. This is one advantage of having a cluster: if something goes wrong during the update to one node, the other node is untouched and continues to make resources available. Because there may be exceptions to whether a service pack can be applied to a single node at a time, consult the release notes for the service pack for special instructions when installing on a cluster.

Service Packs and Interoperability Issues

To avoid potential issues or compatibility problems with other applications, check the Microsoft Knowledge Base for articles that may apply. For example, the following articles discuss installation steps or interoperability issues with Windows NT Option Pack, Microsoft SQL Server, and Windows NT Service Pack 4:

218922

Installing NTOP on Cluster Server with SP4

223258

How to install NTOP on MSCS 1.0 with SQL

223259

How to install FTP from NTOP on Microsoft Cluster Server 1.0

191138

How to install Windows NT Option Pack on Cluster Server

Replacing Adapters

Adapter replacement may usually be performed after moving resources and groups to the other node. If replacing a network adapter, ensure the new adapter configuration for TCP/IP exactly matches that of the old adapter. If replacing a SCSI adapter and using Y cables with external termination, it may be possible to disconnect the SCSI adapter without affecting the remaining cluster node. Check with your hardware vendor for proper replacement techniques if you want to attempt replacement without shutting down the entire cluster. This may be possible in some configurations.

Shared Disk Subsystem Replacement

With most clusters, shared disk subsystem replacement may result in the need to shut down the cluster. Check with your manufacturer and with Microsoft Product Support Services for proper procedures. Some replacements may not require much intervention, while others may require adjustments to configuration. Further information on this topic is available in the Microsoft Cluster Server Administrator's Guide and in the Microsoft Knowledge Base.

Emergency Repair Disk

The emergency repair disk (updated with Rdisk.exe) contains vital information about a particular system that you can use to help recover a system that will not start, allowing you to restore a backup if necessary. Update the disk whenever the system configuration changes. It is important to note that the cluster configuration is not stored on the emergency repair disk. The service and driver information for the Cluster Service is stored in the system registry. However, cluster resource and group configuration is stored in a separate registry hive and may be restored from a recent system backup. NTBACKUP will back up this hive when backing up registry files (if selected). Other backup software may or may not include the cluster hive. The file associated with the cluster hive is CLUSDB and is stored with the other cluster files (usually in c:\winnt\cluster). Be sure to check system backups to ensure this hive is included.
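
To update the repair information from a command prompt, you can run Rdisk.exe on each node after configuration changes; the /S option also saves the security account databases to the repair directory, so the update takes longer and uses more disk space.

  REM Update the emergency repair information on this node.
  RDISK /S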

System Backups and Recovery

The configuration for cluster resources and groups is stored in the cluster registry hive. This registry hive may be backed up and restored with NTBackup. Some third-party backup software may not include this registry hive when backing up system registry files. If you rely on a third-party backup solution, verify that it can back up and restore this hive. The registry file for the cluster hive is found in the directory where the cluster software was installed, not on the quorum disk.

As most backup software (at the time of this writing) is not cluster-aware, it may be important to establish a network path to shared data for use in system backups. For example, if you use a local path to the data (example: G:\), and if the node loses ownership of the drive, the backup operation may fail because it cannot reach the data using the local device path. However, if you create a cluster-available share to the disk structure, and map a drive letter to it, the connection may be re-established if ownership of the actual disk changes. Although the ultimate solution would be a fully cluster-aware backup utility, this technique may be a better alternative until such a utility is available.
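
For example, a backup job might connect through a cluster-available share rather than a local drive letter. The sketch below uses the hypothetical server and share names from earlier in this paper; the backup software would then be pointed at drive X: instead of the local path.

  REM Map a drive letter to a cluster file share (hypothetical names).
  NET USE X: \\SOMESERVER\SS_SHARE /PERSISTENT:YES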

What Not to Do on a Cluster Server

Below is a list of things not to do with a cluster. Although other actions may also cause problems, these items are definite words of warning. Article numbers for related Microsoft Knowledge Base articles are noted where applicable.

  • Do not create software fault tolerant sets with shared disks as members. (171052)

  • Do not add resources to the cluster group. (168948)

  • Do not install MSCS with both nodes online and connected to the shared storage at the same time unless MSCS is already installed and running on at least one node.

  • Do not change computer names of either node.

  • Do not use WINS static entries for cluster nodes or cluster addresses. (217199)

  • Do not configure WINS or default gateway addresses for the private interconnect. (193890)

  • Do not attempt to configure cluster resources to use unsupported network protocols or related network services (IPX, NetBEUI, DLC, AppleTalk, Services for Macintosh, and so on). Microsoft Cluster Server works only with the TCP/IP protocol.

  • Do not delete the HKEY_LOCAL_MACHINE\System\Disk registry key while the cluster is running, or if you are using local software fault tolerance.

Appendix A: MSCS Event messages

Event ID 1000

Source:

ClusSvc

Description:

Microsoft Cluster Server suffered an unexpected fatal error at line ### of source module %path%. The error code was 1006.

Problem:

Messages similar to this may occur in the event of a fatal error that may cause the Cluster Service to terminate on the node that experienced the error.

Solution:

Check the system event log and the cluster diagnostic logfile for additional information. It is possible that the cluster service may restart itself after the error. This event message may indicate serious problems that may be related to hardware or other causes.

Event ID 1002

Source:

ClusSvc

Description:

Microsoft Cluster Server handled an unexpected error at line 528 of source module G:\Nt\Private\Cluster\Resmon\Rmapi.c. The error code was 5007.

Problem:

Messages similar to this may occur after installation of Microsoft Cluster Server. If the cluster service starts and successfully forms or joins the cluster, they may be ignored. Otherwise, these errors may indicate a corrupt quorum logfile or other problem.

Solution:

Ignore the error if the cluster appears to be working properly. Otherwise, you may want to try creating a new quorum logfile using the -noquorumlogging or -fixquorum parameters as documented in the Microsoft Cluster Server Administrator's Guide.

Event ID 1006

Source:

ClusSvc

Description:

Microsoft Cluster Server was halted because of a cluster membership or communications error. The error code was 4.

Problem:

An error may have occurred between communicating cluster nodes that affected cluster membership. This error may occur if nodes lose the ability to communicate with each other.

Solution:

Check network adapters and connections between nodes. Check the system event log for errors. There may be a network problem preventing reliable communication between cluster nodes.

Event ID 1007

Source:

ClusSvc

Description:

A new node, "ComputerName", has been added to the cluster.

Information:

The Microsoft Cluster Server Setup program ran on another computer, the setup process completed, and the node was admitted to cluster membership. No action is required.

Event ID 1009

Source:

ClusSvc

Description:

Microsoft Cluster Server could not join an existing cluster and could not form a new cluster. Microsoft Cluster Server has terminated.

Problem:

The cluster service started and attempted to join a cluster. The node may not be a member of an existing cluster because it was evicted by an administrator. After a cluster node has been evicted, the cluster software must be removed and reinstalled if you want the node to rejoin the cluster. In addition, because a cluster with the same name already exists, the node could not form a new cluster under that name.

Solution:

Remove MSCS from the affected node, and reinstall MSCS on that system if desired.

Event ID 1010

Source:

ClusSvc

Description:

Microsoft Cluster Server is shutting down because the current node is not a member of any cluster. Microsoft Cluster Server must be reinstalled to make this node a member of a cluster.

Problem:

The cluster service attempted to run but found that it is not a member of an existing cluster. This may be due to eviction by an administrator or an incomplete attempt to join a cluster. This error indicates a need to remove and reinstall the cluster software.

Solution:

Remove MSCS from the affected node, and reinstall MSCS on that server if desired.

Event ID 1011

Source:

ClusSvc

Description:

Cluster Node "ComputerName" has been evicted from the cluster.

Information:

A cluster administrator evicted the specified node from the cluster.

Event ID 1012

Source:

ClusSvc

Description:

Microsoft Cluster Server did not start because the current version of Windows NT is not correct. Microsoft Cluster Server runs only on Windows NT Server, Enterprise Edition.

Information:

The cluster node must be running the Enterprise Edition version of Windows NT Server, and must have Service Pack 3 or later installed. This error may occur if you force an upgrade using the installation disks, which effectively removes any service packs installed.

Event ID 1015

Source:

ClusSvc

Description:

No checkpoint record was found in the logfile W:\Mscs\Quolog.log; the checkpoint file is invalid or was deleted.

Problem:

The Cluster Service experienced difficulty reading data from the quorum logfile. The logfile could be corrupted.

Solution:

If the Cluster Service fails to start because of this problem, try manually starting the cluster service with the -noquorumlogging parameter. If you need to adjust the quorum disk designation, use the -fixquorum startup parameter when starting the cluster service. Both of these parameters are covered in the MSCS Administrator's Guide.
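
The startup parameter is normally entered in the Startup Parameters box for the Cluster Service in the Services tool in Control Panel. As a hedged sketch, it may also be possible to pass the parameter from a command prompt, assuming NET START forwards it to the ClusSvc service as described in Microsoft Knowledge Base articles on quorum log recovery:

  REM Stop the cluster service, then restart it with a recovery parameter.
  REM Use /fixquorum instead if you need to change the quorum disk designation.
  NET STOP CLUSSVC
  NET START CLUSSVC /NOQUORUMLOGGING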

Event ID 1016

Source:

ClusSvc

Description:

Microsoft Cluster Server failed to obtain a checkpoint from the cluster database for log file W:\Mscs\Quolog.log.

Problem:

The cluster service experienced difficulty establishing a checkpoint for the quorum logfile. The logfile could be corrupt, or there may be a disk problem.

Solution:

You may need to use procedures to recover from a corrupt quorum logfile. You may also need to run Chkdsk on the volume to check for file system corruption.

Event ID 1019

Source:

ClusSvc

Description:

The log file D:\MSCS\Quolog.log was found to be corrupt. An attempt will be made to reset it, or you should use the Cluster Administrator utility to adjust the maximum size.

Problem:

The quorum logfile for the cluster was found to be corrupt. The system will attempt to resolve the problem.

Solution:

The system will attempt to resolve this problem. This error may also indicate that the maximum quorum log size should be increased on the Quorum tab of the cluster properties. You can manually resolve this problem by using the -noquorumlogging parameter.

Event ID 1021

Source:

ClusSvc

Description:

There is insufficient disk space remaining on the quorum device. Please free up some space on the quorum device. If there is no space on the disk for the quorum log files then changes to the cluster registry will be prevented.

Problem:

Available disk space is low on the quorum disk and must be resolved.

Solution:

Remove data or unnecessary files from the quorum disk so that sufficient free space exists for the cluster to operate. If necessary, designate another disk with adequate free space as the quorum device.

Event ID 1022

Source:

ClusSvc

Description:

There is insufficient space left on the quorum device. The Microsoft Cluster Server cannot start.

Problem:

Available disk space is low on the quorum disk and is preventing the startup of the cluster service.

Solution:

Remove data or unnecessary files from the quorum disk so that sufficient free space exists for the cluster to operate. If necessary, use the -fixquorum startup option to start one node. Bring the quorum resource online and adjust free space or designate another disk with adequate free space as the quorum device.

Event ID 1023

Source:

ClusSvc

Description:

The quorum resource was not found. The Microsoft Cluster Server has terminated.

Problem:

The device designated as the quorum resource could not be found. The device may have failed at the hardware level, or the disk resource corresponding to the quorum drive letter may no longer match or no longer exist.

Solution:

Use the -fixquorum startup option for the cluster service. Investigate and resolve the problem with the quorum disk. If necessary, designate another disk as the quorum device and restart the cluster service before starting other nodes.

Event ID 1024

Source:

ClusSvc

Description:

The registry checkpoint for cluster resource "resourcename" could not be restored to registry key registrykeyname. The resource may not function correctly. Make sure that no other processes have open handles to registry keys in this registry subkey.

Problem:

The registry key checkpoint imposed by the cluster service failed because an application or process has an open handle to the registry key or subkey.

Solution:

Close any applications that may have an open handle to the registry key so that it may be replicated as configured with the resource properties. If necessary, contact the application vendor about this problem.

Event ID 1034

Source:

ClusSvc

Description:

The disk associated with cluster disk resource resource name could not be found. The expected signature of the disk was signature. If the disk was removed from the cluster, the resource should be deleted. If the disk was replaced, the resource must be deleted and created again to bring the disk online. If the disk has not been removed or replaced, it may be inaccessible at this time because it is reserved by another cluster node.

Problem:

The cluster service attempted to mount a physical disk resource in the cluster. The cluster disk driver could not locate a disk with this signature. The disk may be offline or may have failed. This error may also occur if the drive has been replaced or reformatted. This error may also occur if another system continues to hold a reservation for the disk.

Solution:

Determine why the disk is offline or non-operational. Check cables, termination, and power for the device. If the drive has failed, replace it, create a new physical disk resource in the same group as the old drive, and then remove the old resource. Restore data from a backup and adjust resource dependencies within the group to point to the new disk resource.

Event ID 1035

Source:

ClusSvc

Description:

Cluster disk resource %1 could not be mounted.

Problem:

The cluster service attempted to mount a disk resource in the cluster and could not complete the operation. This could be due to a file system problem, hardware issue, or drive letter conflict.

Solution:

Check for drive letter conflicts, evidence of file system issues in the system event log, and for hardware problems.

Event ID 1036

Source:

ClusSvc

Description:

Cluster disk resource "resourcename" did not respond to a SCSI inquiry command.

Problem:

The disk did not respond to the issued SCSI command. This usually indicates a hardware problem.

Solution:

Check SCSI bus configuration. Check the configuration of SCSI adapters and devices. This may indicate a misconfigured or a failing device.

Event ID 1037

Source:

ClusSvc

Description:

Cluster disk resource %1 has failed a filesystem check. Please check your disk configuration.

Problem:

The cluster service attempted to mount a disk resource in the cluster. A filesystem check was necessary and failed during the process.

Solution:

Check cables, termination, and device configuration. If the drive has failed, replace the drive and restore data. This may also indicate a need to reformat the partition and restore data from a current backup.

Event ID 1038

Source:

ClusSvc

Description:

Reservation of cluster disk "Disk W:" has been lost. Please check your system and disk configuration.

Problem:

The cluster service had exclusive use of the disk, and lost the reservation of the device on the shared SCSI bus.

Solution:

The disk may have gone offline or failed. Another node may have taken control of the disk or a SCSI bus reset command was issued on the bus that caused a loss of reservation.

Event ID 1040

Source:

ClusSvc

Description:

Cluster generic service "ServiceName" could not be found.

Problem:

The cluster service attempted to bring the specified generic service resource online. The service could not be located and could not be managed by the Cluster Service.

Solution:

Remove the generic service resource if this service is no longer installed. The parameters for the resource may be invalid. Check the generic service resource properties and confirm correct configuration.

Event ID 1041

Source:

ClusSvc

Description:

Cluster generic service "ServiceName" could not be started.

Problem:

The cluster service attempted to bring the specified generic service resource online. The service could not be started at the operating system level.

Solution:

Remove the generic service resource if this service is no longer installed. The parameters for the resource may be invalid. Check the generic service resource properties and confirm correct configuration. Check to make sure the service account has not expired, that it has the correct password, and has necessary rights for the service to start. Check the system event log for any related errors.

Event ID 1042

Source:

ClusSvc

Description:

Cluster generic service "resourcename" failed.

Problem:

The service associated with the mentioned generic service resource failed.

Solution:

Check the generic service properties and service configuration for errors. Check system and application event logs for errors.

Event ID 1043

Source:

ClusSvc

Description:

The NetBIOS interface for "IP Address" resource has failed.

Problem:

The network adapter for the specified IP address resource has experienced a failure. As a result, the IP address is either offline, or the group has moved to a surviving node in the cluster.

Solution:

Check the network adapter and network connection for problems. Resolve the network-related problem.

Event ID 1044

Source:

ClusSvc

Description:

Cluster IP Address resource %1 could not create the required NetBios interface.

Problem:

The cluster service attempted to initialize an IP Address resource and could not establish a context with NetBios.

Solution:

This could be a network adapter or network adapter driver issue. Make sure the adapter is using a current driver and the correct driver for the adapter. If this is an embedded adapter, check with the OEM to determine whether a specific OEM version of the driver is required. If you already have many IP Address resources defined, make sure you have not reached the NetBIOS limit of 64 addresses. If you have IP Address resources that do not need NetBIOS affiliation, use the IP Address private property to disable NetBIOS for the address. This option is available in SP4 and helps to conserve NetBIOS address slots.
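
As a hedged sketch, NetBIOS might be disabled on an IP Address resource from the command line as shown below. The resource name is hypothetical and the private property name (EnableNetBIOS) is an assumption; list the actual private properties with CLUSTER RESOURCE resourcename /PRIV before relying on it.

  REM Assumed property name; verify with: CLUSTER RESOURCE "Some IP Address" /PRIV
  CLUSTER RESOURCE "Some IP Address" /PRIV EnableNetBIOS=0
  REM Cycle the resource so the change takes effect.
  CLUSTER RESOURCE "Some IP Address" /OFFLINE
  CLUSTER RESOURCE "Some IP Address" /ONLINE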

Event ID 1045

Source:

ClusSvc

Description:

Cluster IP address "IP address" could not create the required TCP/IP Interface..

Problem:

The cluster service tried to bring an IP address online. The resource properties may specify an invalid network or malfunctioning adapter. This error may occur if you replace a network adapter with a different model and continue to use the old or inappropriate driver. As a result, the IP address resource cannot be bound to the specified network.

Solution:

Resolve the network adapter problem or change the properties of the IP address resource to reflect the proper network for the resource.

Event ID 1046

Source:

ClusSvc

Description:

Cluster IP Address resource %1 cannot be brought online because the subnet mask parameter is invalid. Please check your network configuration.

Problem:

The cluster service tried to bring an IP address resource online but could not do so. The subnet mask for the resource is either blank or otherwise invalid.

Solution:

Correct the subnet mask for the resource.

Event ID 1047

Source:

ClusSvc

Description:

Cluster IP Address resource %1 cannot be brought online because the IP address parameter is invalid. Please check your network configuration.

Problem:

The cluster service tried to bring an IP address resource online but could not do so. The IP address property contains an invalid value. This may be caused by incorrectly creating the resource through an API or the command line interface.

Solution:

Correct the IP address properties for the resource.

Event ID 1048

Source:

ClusSvc

Description:

Cluster IP address, "IP address," cannot be brought online because the specified adapter name is invalid.

Problem:

The cluster service tried to bring an IP address online. The resource properties may specify an invalid network or a malfunctioning adapter. This error may occur if you replace a network adapter with a different model. As a result, the IP address resource cannot be bound to the specified network.

Solution:

Resolve the network adapter problem or change the properties of the IP address resource to reflect the proper network for the resource.

Event ID 1049

Source:

ClusSvc

Description:

Cluster IP address "IP address" cannot be brought online because the address IP address is already present on the network. Please check your network configuration.

Problem:

The cluster service tried to bring an IP address online. The address is already in use on the network and cannot be registered. Therefore, the resource cannot be brought online.

Solution:

Resolve the IP address conflict, or choose another address for the resource.

Event ID 1050

Source:

ClusSvc

Description:

Cluster Network Name resource %1 cannot be brought online because the name %2 is already present on the network. Please check your network configuration.

Problem:

The cluster service tried to bring a Network Name resource online. The name is already in use on the network and cannot be registered. Therefore, the resource cannot be brought online.

Solution:

Resolve the conflict, or choose another network name.

Event ID 1051

Source:

ClusSvc

Description:

Cluster Network Name resource "resourcename" cannot be brought online because it does not depend on an IP address resource. Please add an IP address dependency.

Problem:

The cluster service attempted to bring the network name resource online, and found that a required dependency was missing.

Solution:

Microsoft Cluster Server requires an IP address dependency for network name resource types. Cluster Administrator presents a pop-up message if you attempt to remove this dependency without specifying another like dependency. To resolve this error, replace the IP address dependency for this resource. Because it is difficult to remove this dependency, Event 1051 may be an indication of problems within the cluster registry. Check other resources for possible dependency problems.

Event ID 1052

Source:

ClusSvc

Description:

Cluster Network Name resource "resourcename" cannot be brought online because the name could not be added to the system.

Problem:

The cluster service attempted to bring the network name resource online but the attempt failed.

Solution:

Check the system event log for errors. Check network adapter configuration and operation. Check TCP/IP configuration and name resolution methods. Check WINS servers for possible database problems or invalid static mappings.

Event ID 1053

Source:

ClusSvc

Description:

Cluster File Share "resourcename" cannot be brought online because the share could not be created.

Problem:

The cluster service attempted to bring the share online but the attempt to create the share failed.

Solution:

Make sure the Server service is started and functioning properly. Check the path for the share. Check ownership and permissions on the directory. Check the system event log for details. Also, if diagnostic logging is enabled, check the log for an entry related to this failure. Use the net helpmsg errornumber command with the error code found in the log entry.
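
For example, if the diagnostic log shows status code 5 for the failed share, the following command translates the code into its text description ("Access is denied."):

  REM Translate a Win32 status code found in the cluster log.
  NET HELPMSG 5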

Event ID 1054

Source:

ClusSvc

Description:

Cluster File Share %1 could not be found.

Problem:

The share corresponding to the named File Share resource was deleted using a mechanism other than Cluster Administrator. This may occur if you select the share with Explorer and choose 'Not Shared'.

Solution:

Delete shares or take them offline via Cluster Administrator or the command line program CLUSTER.EXE.

Event ID 1055

Source:

ClusSvc

Description:

Cluster File Share "sharename" has failed a status check.

Problem:

The cluster service (through resource monitors) periodically monitors the status of cluster resources. In this case, a file share failed a status check. This could mean that someone attempted to delete the share through Windows NT Explorer or Server Manager, instead of through Cluster Administrator. This event could also indicate a problem with the Server service, or access to the shared directory.

Solution:

Check the system event log for errors. Check the cluster diagnostic log (if it is enabled) for status codes that may be related to this event. Check the resource properties for proper configuration. Also, make sure the file share has proper dependencies defined for related resources.

Event ID 1056

Source:

ClusSvc

Description:

The cluster database on the local node is in an invalid state. Please start another node before starting this node.

Problem:

The cluster database on the local node may be in a default state from the installation process and the node has not properly joined with an existing node.

Solution:

Make sure another node of the same cluster is online before starting this node. Upon joining another cluster node, this node will receive an updated copy of the official cluster database, which should alleviate this error.

Event ID 1057

Source:

ClusSvc

Description:

The cluster service CLUSDB could not be opened.

Problem:

The Cluster Service tried to open the CLUSDB registry hive and could not do so. As a result, the cluster service cannot be brought online.

Solution:

Check the cluster installation directory for the existence of a file called CLUSDB. Make sure the registry file is not held open by any applications, and that permissions on the file allow the cluster service access to this file and directory.

Event ID 1058

Source:

ClusSvc

Description:

The Cluster Resource Monitor could not load the DLL %1 for resource type %2.

Problem:

The Cluster Service tried to bring a resource online that requires a specific resource DLL for the resource type. The DLL is either missing, corrupt, or an incompatible version. As a result, the resource cannot be brought online.

Solution:

Check the cluster installation directory for the existence of the named resource DLL. Make sure the DLL exists in the proper directory on both nodes.

Event ID 1059

Source:

ClusSvc

Description:

The Cluster Resource DLL %1 for resource type %2 failed to initialize.

Problem:

The Cluster Service tried to load the named resource DLL and it failed to initialize. The DLL could be corrupt, or an incompatible version. As a result, the resource cannot be brought online.

Solution:

Check the cluster installation directory for the existence of the named resource DLL. Make sure the DLL exists in the proper directory on both nodes and is the proper version. If the DLL is Clusres.dll, this is the default resource DLL that comes with MSCS. Make sure its version and date stamp are the same as or later than the version contained in the service pack in use.

Event ID 1061

Source:

ClusSvc

Description:

Microsoft Cluster Server successfully formed a cluster on this node.

Information:

This informational message indicates that an existing cluster of the same name was not detected on the network, and that this node elected to form the cluster and own access to the quorum disk.

Event ID 1062

Source:

ClusSvc

Description:

Microsoft Cluster Server successfully joined the cluster.

Information:

When the Cluster Service started, it detected an existing cluster on the network and was able to successfully join the cluster. No action needed.

Event ID 1063

Source:

ClusSvc

Description:

Microsoft Cluster Server was successfully stopped.

Information:

The Cluster Service was stopped manually by the administrator.

Event ID 1064

Source:

ClusSvc

Description:

The quorum resource was changed. The old quorum resource could not be marked as obsolete. If there is a partition in time, you may lose changes to your database, because the node that is down will not be able to get to the new quorum resource.

Problem:

The administrator changed the quorum disk designation without all cluster nodes present.

Solution:

When other cluster nodes attempt to join the existing cluster, they may not be able to connect to the quorum disk, and may not participate in the cluster, because their configuration indicates a different quorum device. For any nodes that meet this criterion, you may need to use the -fixquorum option to start the Cluster Service on these nodes and make configuration changes.

Event ID 1065

Source:

ClusSvc

Description:

Cluster resource %1 failed to come online.

Problem:

The cluster service attempted to bring the resource online, but the resource could not reach an online status. The resource may have exhausted the timeout period allotted for the resource to reach an online state.

Solution:

Check any parameters related to the resource and check the event log for details.

Event ID 1066

Source:

ClusSvc

Description:

Cluster disk resource resourcename is corrupted. Running Chkdsk /F to repair problems.

Problem:

The Cluster Service detected corruption on the indicated disk resource and started Chkdsk /f on the volume to repair the structure. The Cluster Service will automatically perform this operation, but only for cluster-defined disk resources (not local disks).

Solution:

Scan the event log for additional errors. The disk corruption could be indicative of other problems. Check related hardware and devices on the shared bus and ensure proper cables and termination. This error may be a symptom of failing hardware or a deteriorating drive.

Event ID 1067

Source:

ClusSvc

Description:

Cluster disk resource %1 has corrupt files. Running Chkdsk /F to repair problems.

Problem:

The Cluster Service detected corruption on the indicated disk resource and started Chkdsk /f on the volume to repair the structure. The Cluster Service will automatically perform this operation, but only for cluster-defined disk resources (not local disks).

Solution:

Scan the event log for additional errors. The disk corruption could be indicative of other problems. Check related hardware and devices on the shared bus and ensure proper cables and termination. This error may be a symptom of failing hardware or a deteriorating drive.

Event ID 1068

Source:

ClusSvc

Description:

The cluster file share resource resourcename failed to start. Error 5.

Problem:

The file share cannot be brought online. The problem may be caused by permissions to the directory or disk in which the directory resides. This may also be related to permission problems within the domain.

Solution:

Check to make sure that the Cluster Service account has rights to the directory to be shared. Make sure a domain controller is accessible on the network. Make sure dependencies for the share and for other resources in the group are set correctly. Error 5 translates to "Access Denied."

Event ID 1069

Source:

ClusSvc

Description:

Cluster resource "Disk G:" failed.

Problem:

The named resource failed and the cluster service logged the event. In this example, a disk resource failed.

Solution:

For disk resources, check the device for proper operation. Check cables, termination, and logfiles on both cluster nodes. For other resources, check resource properties for proper configuration, and check to make sure dependencies are configured correctly. Check the diagnostic log (if it is enabled) for status codes corresponding to the failure.

Event ID 1070

Source:

ClusSvc

Description:

Cluster node attempted to join the cluster but failed with error 5052.

Problem:

The cluster node attempted to join an existing cluster but was unable to complete the process. This problem may occur if the node was previously evicted from the cluster.

Solution:

If the node was previously evicted from the cluster, you must remove and reinstall MSCS on the affected server.

Event ID 1071

Source:

ClusSvc

Description:

Cluster node 2 attempted to join but was refused. Error 5052.

Problem:

Another node attempted to join the cluster and this node refused the request.

Solution:

If the node was previously evicted from the cluster, you must remove and reinstall MSCS on the affected server. Look in Cluster Administrator to see if the other node is listed as a possible cluster member.

Event ID 1073

Source:

ClusSvc

Description:

Microsoft Cluster Server was halted to prevent an inconsistency within the cluster. The error code was 5028.

Problem:

The cluster service on the affected node was halted because of some kind of inconsistency between cluster nodes.

Solution:

Check connectivity between systems. This error may be an indication of configuration or hardware problems.

Event ID 1077

Source:

ClusSvc

Description:

The TCP/IP interface for cluster IP address resourcename has failed.

Problem:

The IP address resource depends on the proper operation of a specific network interface as configured in the resource properties. The network interface failed.

Solution:

Check the system event log for errors. Check the network adapter for proper operation and replace the adapter if necessary. Check to make sure the proper adapter driver is loaded for the device and check for newer versions of the driver.

Event ID 1080

Source:

ClusSvc

Description:

The Microsoft Cluster Server could not write file W:\MSCS\Chk7f5.tmp. The disk may be low on disk space, or some other serious condition exists.

Problem:

The cluster service attempted to create a temporary file in the MSCS directory on the quorum disk. Lack of disk space or other factors prevented successful completion of the operation.

Solution:

Check the quorum drive for available disk space. The file system may be corrupted or the device may be failing. Check file system permissions to ensure that the cluster service account has full access to the drive and directory.

Event ID 1093

Source:

ClusSvc

Description:

Node %1 is not a member of cluster %2. If the name of the node has changed, Microsoft Cluster Server must be reinstalled.

Problem:

The cluster service attempted to start but found that it was not a valid member of the cluster.

Solution:

Microsoft Cluster Server may need to be reinstalled on this node. If this is the result of a server name change, be sure to evict the node from the cluster (from an operational node) prior to reinstallation.

Event ID 1096

Source:

ClusSvc

Description:

Microsoft Cluster Server cannot use network adapter %1 because it does not have a valid IP address assigned to it.

Problem:

The network configuration for the adapter has changed and the cluster service cannot make use of the adapter for the network that was assigned to it.

Solution:

Check the network configuration. If a DHCP address was used for the primary address of the adapter, the address may have been lost. For best results, use a static address.

Event ID 1097

Source:

ClusSvc

Description:

Microsoft Cluster Server did not find any network adapters with valid IP addresses installed in the system. The node will not be able to join a cluster.

Problem:

The network configuration for the system needs to be corrected to match the same connected networks as the other node of the cluster.

Solution:

Check the network configuration and make sure it agrees with the working node of the cluster. Make sure the same networks are accessible from all systems in the cluster.

Event ID 1098

Source:

ClusSvc

Description:

The node is no longer attached to cluster network network_id by adapter adapter. Microsoft Cluster Server will delete network interface interface from the cluster configuration.

Information:

The Cluster Service observed a change in network configuration that might be induced by a change of adapter type or by removal of a network. The network will be removed from the list of available networks.

Event ID 1100

Source:

ClusSvc

Description:

Microsoft Cluster Server discovered that the node is now attached to cluster network network_id by adapter adapter. A new cluster network interface will be added to the cluster configuration.

Information:

The Cluster Service noticed a new network accessible by the cluster nodes, and has added the new network to the list of accessible networks.

Event ID 1102

Source:

ClusSvc

Description:

Microsoft Cluster Server discovered that the node is attached to a new network by adapter adapter. A new network and network interface will be added to the cluster configuration.

Information:

The cluster service noticed the addition of a new network. The network will be added to the list of available networks.

Event ID 1104

Source:

ClusSvc

Description:

Microsoft Cluster Server failed to update the configuration for one of the node's network interfaces. The error code was errorcode.

Problem:

The cluster service attempted to update a cluster node and could not perform the operation.

Solution:

Use the net helpmsg errorcode command to find an explanation of the underlying error. For example, error 1393 indicates that a corrupted disk caused the operation to fail.

Event ID 1105

Source:

ClusSvc

Description:

Microsoft Cluster Server failed to initialize the RPC services. The error code was %1.

Problem:

The cluster service attempted to utilize required RPC services and could not successfully perform the operation.

Solution:

Use the net helpmsg errorcode command to find an explanation of the underlying error. Check the system event log for other RPC related errors or performance problems.

Event ID 1107

Source:

ClusSvc

Description:

Cluster node node name failed to make a connection to the node over network network name. The error code was 1715.

Problem:

The cluster service attempted to connect to another cluster node over a specific network and could not establish a connection. This error is a warning message.

Solution:

Check to make sure that the specified network is available and functioning correctly. If the node experiences this problem, it may try other available networks to establish the desired connection.

Event ID 1109

Source:

ClusSvc

Description:

The node was unable to secure its connection to cluster node %1. The error code was %2. Check that both nodes can communicate with their domain controllers.

Problem:

The cluster service attempted to connect to another cluster node and could not establish a secure connection. This could indicate domain connectivity problems.

Solution:

Check to make sure that the networks are available and functioning correctly. This may be a symptom of larger network problems or domain security issues.

Event ID 1115

Source:

ClusSvc

Description:

An unrecoverable error caused the join of node nodename to the cluster to be aborted. The error code was errorcode.

Problem:

A node attempted to join the cluster but was unable to obtain successful membership.

Solution:

Use the NET HELPMSG errorcode command to obtain further description of the error that prevented the join operation. For example, error code 1393 indicates that a disk structure is corrupted and nonreadable. An error code like this could indicate a corrupted quorum disk.

Event ID 9

Source:

Disk

Description:

The device, \Device\ScsiPort2, did not respond within the timeout period.

Problem:

An I/O request was sent to a SCSI device and was not serviced within acceptable time. The device timeout was logged by this event.

Solution:

You may have a device or controller problem. Check SCSI cables, termination, and adapter configuration. Excessive recurrence of this event message may indicate a serious problem that could indicate potential for data loss or corruption. If necessary, contact your hardware vendor for help troubleshooting this problem.

Event ID 101

Source:

W3SVC

Description:

The server was unable to add the virtual root "/" for the directory "path" because of the following error: The system cannot find the path specified. The data is the error.

Problem:

The World Wide Web Publishing service could not create a virtual root for the IIS Virtual Root resource. The directory path may have been deleted.

Solution:

Re-create or restore the directory and contents. Check the resource properties for the IIS Virtual Root resource and ensure that the path is correct. This problem may occur if you had an IIS Virtual Root resource defined and then uninstalled Microsoft Cluster Server without first deleting the resource. In this case, you may evaluate and change virtual root properties by using the Internet Service Manager.

Event ID 1004

Source:

DHCP

Description:

DHCP IP address lease "IP address" for the card with network address "media access control Address" has been denied.

Problem:

This system uses a DHCP-assigned IP address for a network adapter. The system attempted to renew the leased address and the DHCP server denied the request. The address may already be allocated to another system. The DHCP server may also have a problem. Network connectivity may be affected by this problem.

Solution:

Resolve the problem by correcting DHCP server problems or assigning a static IP address. For best results within a cluster, use statically assigned IP addresses.

Event ID 1005

Source:

DHCP

Description:

DHCP failed to renew a lease for the card with network address "MAC Address." The following error occurred: The semaphore timeout period has expired.

Problem:

This system uses a DHCP-assigned IP address for a network adapter. The system attempted to renew the leased address and was unable to do so. Network operations on this system may be affected.

Solution:

There may be a connectivity problem preventing access to the DHCP server that leased the address, or the DHCP server may be offline. For best results within a cluster, use statically assigned IP addresses.

Event ID 2511

Source:

Server

Description:

The server service was unable to recreate the share "Sharename" because the directory "path" no longer exists.

Problem:

The Server service attempted to create a share using the specified directory path. This problem may occur if you create a share (outside of Cluster Administrator) on a cluster shared device. If the device is not exclusively available to this computer, the server service cannot create the share. Also, the directory may no longer exist or there may be RPC related issues.

Solution:

Correct the problem by creating a shared resource through Cluster Administrator, or correct the problem with the missing directory. Check the dates of the RPC files in the system32 directory and make sure they match those contained in the service pack in use, or in any hotfixes applied.

Event ID 4199

Source:

TCPIP

Description:

The system detected an address conflict for IP address "IP address" with the system having network hardware address "media access control address." Network operations on this system may be disrupted as a result.

Problem:

Another system on the network may be using one of the addresses configured on this computer.

Solution:

Resolve the IP address conflict. Check network adapter configuration and any IP address resources defined within the cluster.
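
For example, one quick way to compare the locally configured addresses with the IP Address resources defined in the cluster is the following sketch, which uses the Cluster.exe utility described in Appendix C. Substitute your own cluster name for mycluster, and note that the resource name Cluster IP Address is only the default name and may differ in your configuration:

IPCONFIG /ALL
CLUSTER mycluster RESOURCE "Cluster IP Address" /Priv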

Event ID 5719

Source:

Netlogon

Description:

No Windows NT Domain controller is available for domain "domain." (This event is expected and can be ignored when booting with the "No Net" hardware profile.) The following error occurred: There are currently no logon servers available to service the logon request.

Problem:

A domain controller for the domain could not be contacted. As a result, proper authentication of accounts could not be completed. This may occur if the network is disconnected or disabled through system configuration.

Solution:

Resolve the connectivity problem with the domain controller and restart the system.

Event ID 7000

Source:

Service Control Manager

Description:

The Cluster Service failed to start because of the following error: The service did not start because of a logon failure.

Problem:

The service control manager attempted to start a service (possibly ClusSvc). It could not authenticate the service account. This error may be seen with Event 7013.

Solution:

The service account could not be authenticated. This may be because of a failure contacting a domain controller, or because account credentials are invalid. Check the service account name and password and ensure that the account is available and that credentials are correct. You may also try running the cluster service from a command prompt (if currently logged on as an administrator) by changing to the %systemroot%\Cluster directory (or where you installed the software) and typing ClusSvc -debug. If the service starts and runs correctly, stop it by pressing CTRL+C and troubleshoot the service account problem. This error may also occur if network connectivity is disabled through the system configuration or hardware profile. Microsoft Cluster Server requires network connectivity.
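
For example, assuming the default installation directory:

CD /D %systemroot%\Cluster
ClusSvc -debug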

Event ID 7013

Source:

Service Control Manager

Description:

Logon attempt with current password failed with the following error: There are currently no logon servers available to service the logon request.

More Info:

The description for this error message may vary somewhat based on the actual error. For example, another error that may be listed in the event detail might be: "Logon Failure: unknown username or bad password."

Problem:

The service control manager attempted to start a service (possibly ClusSvc). It could not authenticate the service account with a domain controller.

Solution:

The service account may be in another domain, or this system may not be a domain controller. It is acceptable for the node not to be a domain controller, but the node needs access to a domain controller in its own domain as well as in the domain that the service account belongs to. Inability to contact the domain controller may be because of a problem with the server, network, or other factors. This problem is not related to the cluster software and must be resolved before you start the cluster software. This error may also occur if network connectivity is disabled through the system configuration or hardware profile. Microsoft Cluster Server requires network connectivity.

Event ID 7023

Source:

Service Control Manager

Description:

The Cluster Server service terminated with the following error: The quorum log could not be created or mounted successfully.

Problem:

The Cluster Service attempted to start but could not gain access to the quorum log on the quorum disk. This may be because of problems gaining access to the disk or problems joining a cluster that has already formed.

Solution:

Check the disk and quorum log for problems. If necessary, check the cluster logfile for more information. Other events in the system event log may also provide additional detail.

Appendix B: Using AND Reading THE Cluster Logfile

CLUSTERLOG Environment Variable

If you set the CLUSTERLOG environment variable, the Cluster Service will create a logfile containing diagnostic information at the path specified. Important events during the operation of the Cluster Service will be logged in this file. Because so many different events occur, the logfile may be somewhat cryptic or hard to read. This appendix gives some hints on how to read the logfile and what items to look for.
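
For example, the following sketch sets the variable for the current command-prompt session before running the Cluster Service interactively in debug mode; the logfile path shown is only an illustration. For the Cluster Service itself to use the variable, define CLUSTERLOG as a system environment variable through Control Panel and then restart the service.

SET CLUSTERLOG=C:\WINNT\Cluster\Cluster.log
CD /D %systemroot%\Cluster
ClusSvc -debug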

Note: Each time you attempt to start the Cluster Service, the log will be cleared and a new logfile started. Each component of MSCS that places an entry in the logfile will indicate itself by abbreviation in square brackets. For example, the Node Manager component would be abbreviated [NM]. Logfile entries will vary from one cluster to another. As a result, other logfiles may vary from excerpts referenced in this document.

Note: Log entry lines in the following sections have been wrapped for space constraints in this document. The lines do not normally wrap.

Operating System Version Number and Service Pack Level

Near the beginning of the logfile, notice the build number of MSCS, followed by the operating system version number and service pack level. If you call for support, engineers may ask for this information:

082::14-21:29:26.625 Cluster Service started - Cluster Version 1.224.
082::14-21:29:26.625   OS Version 4.0.1381 - Service Pack 3.

Cluster Service Startup

Following the version information, some initialization steps occur. Those steps are followed by an attempt to join the cluster, if one node already exists in a running state. If the Cluster Service could not detect any other cluster members, it will attempt to form the cluster. Consider the following log entries:

0b5::12-20:15:23.531 We're initing Ep...
0b5::12-20:15:23.531 [DM]: Initialization
0b5::12-20:15:23.531 [DM] DmpRestartFlusher: Entry
0b5::12-20:15:23.531 [DM] DmpStartFlusher: Entry
0b5::12-20:15:23.531 [DM] DmpStartFlusher: thread created
0b5::12-20:15:23.531 [NMINIT] Initializing the Node Manager...
0b5::12-20:15:23.546 [NMINIT] Local node name = NODEA.
0b5::12-20:15:23.546 [NMINIT] Local node ID = 1.
0b5::12-20:15:23.546 [NM] Creating object for node 1 (NODEA)
0b5::12-20:15:23.546 [NM] node 1 state 1
0b5::12-20:15:23.546 [NM] Initializing networks.
0b5::12-20:15:23.546 [NM] Initializing network interface facilities.
0b5::12-20:15:23.546 [NMINIT] Initialization complete.
0b5::12-20:15:23.546 [FM] Starting worker thread...
0b5::12-20:15:23.546 [API] Initializing
0a9::12-20:15:23.546 [FM] Worker thread running
0b5::12-20:15:23.546 [lm] :LmInitialize Entry. 
0b5::12-20:15:23.546 [lm] :TimerActInitialize Entry. 
0b5::12-20:15:23.546 [CS] Initializing RPC server.
0b5::12-20:15:23.609 [INIT] Attempting to join cluster MDLCLUSTER
0b5::12-20:15:23.609 [JOIN] Spawning thread to connect to sponsor 
192.88.80.114
06c::12-20:15:23.609 [JOIN] Asking 192.88.80.114 to sponsor us.
0b5::12-20:15:23.609 [JOIN] Waiting for all connect threads to terminate.
06c::12-20:15:32.750 [JOIN] Sponsor 192.88.80.114 is not available, 
status=1722.
0b5::12-20:15:32.750 [JOIN] All connect threads have terminated.
0b5::12-20:15:32.750 [JOIN] Unable to connect to any sponsor node.
0b5::12-20:15:32.750 [INIT] Failed to join cluster, status 53
0b5::12-20:15:32.750 [INIT] Attempting to form cluster MDLCLUSTER
0b5::12-20:15:32.750 [Ep]: EpInitPhase1
0b5::12-20:15:32.750 [API] Online read only
04b::12-20:15:32.765 [RM] Main: Initializing.

Note that the cluster service attempts to join the cluster. If it cannot connect with an existing member, the software decides to form the cluster. The next series of steps attempts to form groups and resources necessary to accomplish this task. It is important to note that the cluster service must arbitrate control of the quorum disk.

0b5::12-20:15:32.781 [FM] Creating group a1a13a86-0eaf-11d1
-8427-0000f8034599
0b5::12-20:15:32.781 [FM] Group a1a13a86-0eaf-11
d1-8427-0000f8034599 contains a1a13a87-0eaf-11d1-8427-0000f8034599.
0b5::12-20:15:32.781 [FM] Creating resource a1a13a87-0eaf-
11d1-8427-0000f8034599
0b5::12-20:15:32.781 [FM] FmpAddPossibleEntry adding 
1 to a1a13a87-0eaf-11d1-8427-0000f8034599 possible node list
0b5::12-20:15:32.781 [FMX] Found the quorum 
resource a1a13a87-0eaf-11d1-8427-0000f8034599.
0b5::12-20:15:32.781 [FM] All dependencies for a
1a13a87-0eaf-11d1-8427-0000f8034599 created
0b5::12-20:15:32.781 [FM] arbitrate for quorum 
resource id a1a13a87-0eaf-11d1-8427-0000f8034599.
0b5::12-20:15:32.781 FmpRmCreateResource: 
creating resource a1a13a87-0eaf-11d1-8427-0000f8034599 
in shared resource monitor
0b5::12-20:15:32.812 FmpRmCreateResource: 
created resource a1a13a87-0eaf-11d1-8427-0000f8034599, resid 1363016
0dc::12-20:15:32.828 Physical Disk <Disk D:>: Arbitrate returned status 0.
0b5::12-20:15:32.828 [FM] FmGetQuorumResource successful
0b5::12-20:15:32.828 FmpRmOnlineResource: 
bringing resource a1a13a87-0eaf-11d1-8427-0000f8034599 
(resid 1363016) online.
0b5::12-20:15:32.843 [CP] CppResourceNotify for resource Disk D:
0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker waiting  
type 0 context 8
0b5::12-20:15:32.843 [GUM] Thread 0xb5 UpdateLock wait on Type 0
0b5::12-20:15:32.843 [GUM] DoLockingUpdate successful, lock granted to 1
0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker dispatching seq 388 
type 0 context 8
0b5::12-20:15:32.843 [GUM] GumpDoUnlockingUpdate releasing lock ownership
0b5::12-20:15:32.843 [GUM] GumSendUpdate: completed update seq 388 
type 0 context 8
0b5::12-20:15:32.843 [GUM] GumSendUpdate: Locker waiting  
type 0 context 9
0b5::12-20:15:32.843 [GUM] Thread 0xb5 UpdateLock 
wait on Type 0
0b5::12-20:15:32.843 [GUM] DoLockingUpdate successful,
 lock granted to 1
0b5::12-20:15:32.843 [GUM] GumSendUpdate: 
Locker dispatching seq 389 
type 0 context 9
0b5::12-20:15:32.843 [GUM] GumpDoUnlockingUpdate
 releasing lock ownership
0b5::12-20:15:32.843 [GUM] GumSendUpdate: 
completed update seq 389 
type 0 context 9
0b5::12-20:15:32.843 FmpRmOnlineResource: 
Resource a1a13a87-0eaf-11d1-8427-0000f8034599 pending
0e1::12-20:15:33.359 Physical Disk <Disk D:>: Online, 
created registry watcher thread.
090::12-20:15:33.359 [FM] NotifyCallBackRoutine: enqueuing event
04d::12-20:15:33.359 [FM] WorkerThread, 
processing transition event for a1a13a87-0eaf-11
d1-8427-0000f8034599, oldState = 129, newState = 2.
04d::12-20:15:33.359 [FM] HandleResourceTransition: 
Resource Name = a1a13a87-0eaf-11d1-8427-0000f8034599 
old state=129 new state=2
04d::12-20:15:33.359 [DM] DmpQuoObjNotifyCb:
 Quorum resource is online
04d::12-20:15:33.375 [DM] DmpQuoObjNotifyCb: 
Own quorum resource, try open the quorum log
04d::12-20:15:33.375 [DM] DmpQuoObjNotifyCb: 
the name of the quorum file is D:\MSCS\quolog.log
04d::12-20:15:33.375 [lm] LogCreate : 
Entry FileName=D:\MSCS\quolog.log MaxFileSize=
0x00010000
04d::12-20:15:33.375 [lm] LogpCreate : Entry

In this case, the node forms the cluster group and quorum disk resource, gains control of the disk, and opens the quorum logfile. From here, the cluster performs operations with the logfile, and proceeds to form the cluster. This involves configuring network interfaces and bringing them online.

0b5::12-20:15:33.718 [NM] Beginning form process.
0b5::12-20:15:33.718 [NM] Synchronizing node information.
0b5::12-20:15:33.718 [NM] Creating node objects.
0b5::12-20:15:33.718 [NM] Configuring networks & interfaces.
0b5::12-20:15:33.718 [NM] Synchronizing network information.
0b5::12-20:15:33.718 [NM] Synchronizing interface information.
0b5::12-20:15:33.718 [dm] DmBeginLocalUpdate Entry
0b5::12-20:15:33.718 [dm] DmBeginLocalUpdate Exit, 
pLocalXsaction=0x00151c20 dwError=0x00000000
0b5::12-20:15:33.718 [NM] Setting database 
entry for interface a1a13a7f-0eaf-11d1-8427-0000f8034599
0b5::12-20:15:33.718 [dm] DmCommitLocalUpdate Entry
0b5::12-20:15:33.718 [dm] DmCommitLocalUpdate Exit, 
dwError=0x00000000
0b5::12-20:15:33.718 [dm] DmBeginLocalUpdate Entry
0b5::12-20:15:33.875 [dm] DmBeginLocalUpdate Exit, 
pLocalXsaction=0x00151c20 dwError=0x00000000
0b5::12-20:15:33.875 [NM] Setting database entry 
for interface a1a13a81-0eaf-11d1-8427-0000f8034599
0b5::12-20:15:33.875 [dm] DmCommitLocalUpdate Entry
0b5::12-20:15:33.875 [dm] DmCommitLocalUpdate Exit, 
dwError=0x00000000
0b5::12-20:15:33.875 [NM] Matched 2 networks,
 created 0 new networks.
0b5::12-20:15:33.875 [NM] Resynchronizing network information.
0b5::12-20:15:33.875 [NM] Resynchronizing interface information.
0b5::12-20:15:33.875 [NM] Creating network objects.
0b5::12-20:15:33.875 [NM] 
Creating object for network a1a13a7e-0eaf-11d1-
8427-0000f8034599
0b5::12-20:15:33.875 [NM] 
Creating object for network a1a13a80-0eaf-11d1-
8427-0000f8034599
0b5::12-20:15:33.875 [NM] Creating interface objects.
0b5::12-20:15:33.875 [NM] 
Creating object for interface a1a13a7f-0eaf-11d1-8427-
0000f8034599.
0b5::12-20:15:33.875 [NM] 
Registering network a1a13a7e-0eaf-11d1-8427-
0000f8034599 with
 cluster transport.
0b5::12-20:15:33.875 [NM] 
Registering interfaces for network a1a13a7e-0eaf-11d1-8427-
0000f8034599 with cluster transport.
0b5::12-20:15:33.875 [NM] 
Registering interface a1a13a7f-0eaf-
11d1-8427-0000f8034599 with cluster transport, 
addr 9.9.9.2, endpoint 3003.
0b5::12-20:15:33.890 [NM]
 Instructing cluster transport to bring network a1a13a7e-0eaf-11d1-
 8427-0000f8034599 online.
0b5::12-20:15:33.890 [NM] 
Creating object for interface a1a13a81-0eaf-11d1-
8427-0000f8034599.
0b5::12-20:15:33.890 [NM] 
Registering network a1a13a80-0eaf-11d1-8427-
0000f8034599
 with cluster transport.
0b5::12-20:15:33.890 [NM] 
Registering interfaces for network a1a13a80-0eaf-11d1-8427-
0000f8034599 
with cluster transport.
0b5::12-20:15:33.890 [NM] 
Registering interface a1a13a81-0eaf-11d1-8427-
0000f8034599 
with cluster transport, addr 192.88.80.190, endpoint 3003.
0b5::12-20:15:33.890 [NM] 
Instructing cluster transport to bring network a1a13a80-0eaf-11d1-
8427-0000f8034599 online.

After initializing network interfaces, the cluster will continue formation with the enumeration of cluster nodes. In this case, as a newly formed cluster, the cluster will contain only one node. If this session had been joining an existing cluster, the node enumeration would show two nodes. Next, the cluster will bring the Cluster IP address and Cluster Name resources online.

0b5::12-20:15:34.015 [FM] OnlineGroup: 
setting group state to Online for f901aa29-0eaf-11d1-
8427-0000f8034599
069::12-20:15:34.015 IP address <
Cluster IP address>: Created NBT interface \Device\NetBt_
If6 (instance 355833456).
0b5::12-20:15:34.015 [FM] 
FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427
-0000f8034599 possible node list
0b5::12-20:15:34.015 [FM] 
FmFormNewClusterPhase2 complete.
.
.
.
0b5::12-20:15:34.281 [INIT] Successfully formed a cluster.
09c::12-20:15:34.281 [lm] :ReSyncTimerHandles Entry. 
09c::12-20:15:34.281 [lm] :ReSyncTimerHandles Exit gdwNumHandles=3
0b5::12-20:15:34.281 [INIT] Cluster Started! Original Min WS is
 204800, Max WS is 1413120.
08c::12-20:15:34.296 [CPROXY] clussvc initialized
069::12-20:15:40.421 IP address <Cluster IP Address>: 
IP Address 192.88.80.114 on adapter DC21X41 online
.
.
.
04d::12-20:15:40.421 [FM] OnlineWaitingTree, 
a1a13a84-0eaf-11d1-8427-0000f8034599 
depends on a1a13a83-0eaf-11d1-8427-0000f8034599. Start first
04d::12-20:15:40.421 [FM] OnlineWaitingTree, 
Start resource a1a13a84-0eaf-11d1-8427-0000f8034599
04d::12-20:15:40.421 [FM] OnlineResource: 
a1a13a84-0eaf-11d1-8427-0000f8034599 
depends on a1a13a83-0eaf-11d1-8427-0000f8034599. Bring online first.
04d::12-20:15:40.421 FmpRmOnlineResource: 
bringing resource a1a13a84-0eaf-11d1-8427-0000f8034599
 (resid 1391032) online.
04d::12-20:15:40.421 [CP] CppResourceNotify for resource Cluster Name
04d::12-20:15:40.421 [GUM] GumSendUpdate: Locker waiting  
type 0 context 8
04d::12-20:15:40.437 [GUM] Thread 0x4d UpdateLock wait on Type 0
04d::12-20:15:40.437 [GUM] DoLockingUpdate successful, lock granted to 1
076::12-20:15:40.437 Network Name <Cluster Name>: 
Bringing resource online...
04d::12-20:15:40.437 [GUM] GumSendUpdate: Locker dispatching seq 411 
type 0 context 8
04d::12-20:15:40.437 [GUM] GumpDoUnlockingUpdate
 releasing lock ownership
04d::12-20:15:40.437 [GUM] GumSendUpdate: completed update seq 411 
type 0 context 8
04d::12-20:15:40.437 [GUM] GumSendUpdate: Locker waiting  
type 0 context 11
.
.
.
076::12-20:15:43.515 Network Name <Cluster Name>: 
Registered server name MDLCLUSTER on transport \Device\NetBt_If6.
076::12-20:15:46.578 Network Name <Cluster Name>: 
Registered workstation name MDLCLUSTER on transport \Device\NetBt_If6.
076::12-20:15:46.578 Network Name <Cluster Name>: 
Network Name MDLCLUSTER is now online

Following these steps, the cluster will attempt to bring other resources and groups online. The logfile will continue to increase in size as the cluster service runs. Therefore, it may be a good idea to enable this option when you are having problems, rather than leaving it on for days or weeks at a time.

Logfile Entries for Common Failures

After reviewing a successful startup of the Cluster Service, you may want to examine some errors that may appear because of various failures. The following examples illustrate possible log entries for four different failures.

Example 1: Quorum Disk Turned Off

If the cluster attempts to form and cannot connect to the quorum disk, entries similar to the following may appear in the logfile. Because of the failure, the cluster cannot form, and the Cluster Service terminates.

0b9::14-20:59:42.921 [RM] Main: Initializing.
08f::14-20:59:42.937 [FM] 
Creating group a1a13a86-0eaf-11d1-8427-
0000f8034599
08f::14-20:59:42.937 [FM] 
Group a1a13a86-0eaf-11d1-8427-0000f8034599 contains
 a1a13a87-0eaf-11d1-8427-0000f8034599.
08f::14-20:59:42.937 [FM] 
Creating resource a1a13a87-0eaf-11d1-8427-
0000f8034599
08f::14-20:59:42.937 [FM] 
FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427-
0000f8034599 possible node list
08f::14-20:59:42.937 [FMX] 
Found the quorum resource a1a13a87-0eaf-11d1-8427-
0000f8034599.
08f::14-20:59:42.937 [FM] 
All dependencies for a1a13a87-0eaf-11d1-8427-
0000f8034599 created
08f::14-20:59:42.937 [FM] 
arbitrate for quorum resource id a1a13a87-0eaf-11d1-8427-
0000f8034599.
08f::14-20:59:42.937 
FmpRmCreateResource: 
creating resource a1a13a87-0eaf-11d1-8427-
0000f8034599 in 
shared resource monitor
08f::14-20:59:42.968 FmpRmCreateResource: 
created resource a1a13a87-0eaf-11d1-8427-
0000f8034599, 
resid 1362616
0e9::14-20:59:43.765 Physical Disk <Disk D:>: 
SCSI, error reserving disk, error 21.
0e9::14-20:59:54.125 Physical Disk <Disk D:>: 
SCSI, error reserving disk, error 21.
0e9::14-20:59:54.140 Physical Disk <Disk D:>: 
Arbitrate returned status 21.
08f::14-20:59:54.140 [FM] FmGetQuorumResource
 failed, error 21.
08f::14-20:59:54.140 [INIT] Cleaning up failed form attempt.
08f::14-20:59:54.140 [INIT] Failed to form cluster,
 status 3213068.
08f::14-20:59:54.140 [CS] ClusterInitialize failed 21
08f::14-20:59:54.140 [INIT] The cluster service is shutting down.
08f::14-20:59:54.140 [evt] EvShutdown
08f::14-20:59:54.140 [FM] Shutdown: 
Failover Manager requested to shutdown groups.
08f::14-20:59:54.140 [FM] DestroyGroup: 
destroying a1a13a86-0eaf-11d1-8427-0000f8034599
08f::14-20:59:54.140 [FM] DestroyResource: 
destroying a1a13a87-0eaf-11d1-8427-0000f8034599
08f::14-20:59:54.140 [OM] Deleting object Physical Disk
08f::14-20:59:54.140 [FM] 
Resource a1a13a87-0eaf-11d1-8427-0000f8034599 destroyed.
08f::14-20:59:54.140 [FM] 
Group a1a13a86-0eaf-11d1-8427-0000f8034599 destroyed.
08f::14-20:59:54.140 [Dm] DmShutdown
08f::14-20:59:54.140 [DM] DmpShutdownFlusher: Entry
08f::14-20:59:54.156 [DM] DmpShutdownFlusher: Setting event
062::14-20:59:54.156 [DM] DmpRegistryFlusher: got 0
062::14-20:59:54.156 [DM] DmpRegistryFlusher: exiting
0ca::14-20:59:54.156 [FM] WorkItem, delete resource
 <Disk D:> status 0
0ca::14-20:59:54.156 [OM] 
Deleting object Disk Group 1 (a1a13a86-0eaf-11d1-
8427-0000f8034599)
0e7::14-20:59:54.375 [CPROXY] clussvc terminated, error 0.
0e7::14-20:59:54.375 [CPROXY] Service Stopping...
0b9::14-20:59:54.375 [RM] Going away, Status = 1, Shutdown = 0.
02c::14-20:59:54.375 [RM] 
PollerThread stopping. Shutdown = 1, Status = 0, WaitFailed = 0, 
NotifyEvent address = 196.
0e7::14-20:59:54.375 [CPROXY] Cleaning up
0b9::14-20:59:54.375 [RM] 
RundownResources posting shutdown notification.
0e7::14-20:59:54.375 [CPROXY] Cleanup complete.
0e3::14-20:59:54.375 [RM] NotifyChanges shutting down.
0e7::14-20:59:54.375 [CPROXY] Service Stopped.

Perhaps the most meaningful lines from above are:

0e9::14-20:59:43.765 Physical Disk <Disk D:>: SCSI, 
error reserving disk, error 21.
0e9::14-20:59:54.125 Physical Disk <Disk D:>: SCSI, 
error reserving disk, error 21.
0e9::14-20:59:54.140 Physical Disk <Disk D:>:
 Arbitrate returned status 21.

Note: The error code on these logfile entries is 21. You can issue net helpmsg 21 from the command line and receive the explanation of the error status code. Status code 21 means, "The device is not ready." This indicates a possible problem with the device. In this case, the device was turned off, and the error status correctly indicates the problem.

Example 2: Quorum Disk Failure

In this example, the drive has failed or has been reformatted from the SCSI controller. As a result, the cluster service cannot locate a drive with the specific signature it is looking for.

0b8::14-21:11:46.515 [RM] Main: Initializing.
074::14-21:11:46.531 [FM] 
Creating group a1a13a86-0eaf-11d1-8427-0000f8034599
074::14-21:11:46.531 [FM] 
Group a1a13a86-0eaf-11d1-8427-0000f8034599 contains
 a1a13a87-0eaf-11d1-8427-0000f8034599.
074::14-21:11:46.531 [FM] 
Creating resource a1a13a87-0eaf-11d1-8427-0000f8034599
074::14-21:11:46.531 [FM] 
FmpAddPossibleEntry adding 1 to a1a13a87-0eaf-11d1-8427-
0000f8034599 possible node list
074::14-21:11:46.531 [FMX] 
Found the quorum resource a1a13a87-0eaf-11d1-8427-
0000f8034599.
074::14-21:11:46.531 [FM] 
All dependencies for a1a13a87-0eaf-11d1-8427-0000f8034599 created
074::14-21:11:46.531 [FM] 
arbitrate for quorum resource id a1a13a87-0eaf-11d1-8427-
0000f8034599.
074::14-21:11:46.531 FmpRmCreateResource: 
creating resource a1a13a87-0eaf-11d1-8427-0000f8034599 in 
shared resource monitor
074::14-21:11:46.562 FmpRmCreateResource: 
created resource a1a13a87-0eaf-11d1-8427-0000f8034599, 
resid 1362696
075::14-21:11:46.671 Physical Disk <Disk D:>: 
SCSI,Performing bus rescan.
075::14-21:11:51.843 Physical Disk <Disk D:>: 
SCSI,error attaching to signature 71cd0549, error 2.
075::14-21:11:51.843 Physical Disk <Disk D:>: 
Unable to attach to signature 71cd0549. Error: 2.
074::14-21:11:51.859 [FM] FmGetQuorumResource failed, error 2.
074::14-21:11:51.859 [INIT] Cleaning up failed form attempt.

In this case, the most important logfile entries are:

075::14-21:11:51.843 Physical Disk <Disk D:>: 
SCSI, error attaching to signature 71cd0549, error 2.
075::14-21:11:51.843 Physical Disk <Disk D:>: 
Unable to attach to signature 71cd0549. Error: 2.

Status code 2 means, "The system cannot find the file specified." In this case, the error may mean that the cluster service cannot find the disk or, because of some other problem, cannot locate the quorum logfile that should be on the disk.

Example 3: Duplicate Cluster IP Address

If another computer on the network has the same IP address as the cluster IP address resource, the resource will be prevented from going online. Further, the cluster name will not be registered on the network, because it depends on the IP address resource. Because this is the network name used for cluster administration, you will not be able to administer the cluster by that name during this type of failure. However, you may be able to use the computer name of the cluster node to connect with Cluster Administrator. Additionally, you may be able to connect locally from the console by using the loopback address. The following sample entries are from a cluster logfile during this type of failure:

0b9::14-21:32:59.968 IP Address <Cluster IP Address>: 
The IP address is already in use on the network, status 5057.
0d2::14-21:32:59.984 [FM] NotifyCallBackRoutine: enqueuing event
03e::14-21:32:59.984 [FM] 
WorkerThread, processing transition event for 
a1a13a83-0eaf-11d1-8427-0000f8034599, oldState = 129, newState = 4.03e
.
.
.
03e::14-21:32:59.984 
FmpHandleResourceFailure: 
taking resource a1a13a83-0eaf-11d1-8427-0000f8034599 and dependents offline
03e::14-21:32:59.984 [FM] 
TerminateResource: a1a13a84-0eaf-11d1-8427-0000f8034599 
depends on a1a13a83-0eaf-11d1-8427-0000f8034599. Terminating first
0d3::14-21:32:59.984 Network Name <Cluster Name>: 
Terminating name MDLCLUSTER...
0d3::14-21:32:59.984 Network Name <Cluster Name>: 
Name MDLCLUSTER is already offline.
.
.
.
03e::14-21:33:00.000 FmpRmTerminateResource: 
a1a13a84-0eaf-11d1-8427-0000f8034599 is now offline
0c7::14-21:33:00.000 IP Address <Cluster IP Address>: 
Terminating resource...
0c7::14-21:33:00.000 IP Address <Cluster IP Address>: 
Address 192.88.80.114 on adapter DC21X41 offline.

Example 4: Evicted Node Attempts to Join Existing Cluster

If you evict a node from a cluster, the cluster software on that node must be reinstalled to gain access to the cluster again. If you start the evicted node, and the Cluster Service attempts to join the cluster, entries similar to the following may appear in the cluster logfile:

032::26-16:11:45.109 [INIT] 
Attempting to join cluster MDLCLUSTER
032::26-16:11:45.109 [JOIN] 
Spawning thread to connect to sponsor 192.88.80.115
040::26-16:11:45.109 [JOIN] 
Asking 192.88.80.115 to sponsor us.
032::26-16:11:45.109 [JOIN] 
Spawning thread to connect to sponsor 9.9.9.2
032::26-16:11:45.109 [JOIN] 
Spawning thread to connect to sponsor 192.88.80.190
099::26-16:11:45.109 [JOIN] 
Asking 9.9.9.2 to sponsor us.
032::26-16:11:45.109 [JOIN] 
Spawning thread to connect to sponsor NODEA
098::26-16:11:45.109 [JOIN] 
Asking 192.88.80.190 to sponsor us.
032::26-16:11:45.125 [JOIN] 
Waiting for all connect threads to terminate.
092::26-16:11:45.125 [JOIN] 
Asking NODEA to sponsor us.
040::26-16:12:18.640 [JOIN] 
Sponsor 192.88.80.115 is not available (JoinVersion), status=1722.
098::26-16:12:18.640 [JOIN] 
Sponsor 192.88.80.190 is not available (JoinVersion), status=1722.
099::26-16:12:18.640 [JOIN] 
Sponsor 9.9.9.2 is not available (JoinVersion), status=1722.
098::26-16:12:18.640 [JOIN] 
JoinVersion data for sponsor 157.57.224.190 is invalid, status 1722.
099::26-16:12:18.640 [JOIN] 
JoinVersion data for sponsor 9.9.9.2 is invalid, status 1722.
040::26-16:12:18.640 [JOIN] 
JoinVersion data for sponsor 157.58.80.115 is invalid, status 1722.
092::26-16:12:18.703 [JOIN] 
Sponsor NODEA is not available (JoinVersion), status=1722.
092::26-16:12:18.703 [JOIN] 
JoinVersion data for sponsor NODEA is invalid, status 1722.
032::26-16:12:18.703 [JOIN] 
All connect threads have terminated.
032::26-16:12:18.703 [JOIN] 
Unable to connect to any sponsor node.
032::26-16:12:18.703 [INIT] 
Failed to join cluster, status 0
032::26-16:12:18.703 [INIT] 
Attempting to form cluster MDLCLUSTER
.
.
.
032::26-16:12:18.734 [FM] 
arbitrate for quorum resource id 24acc093-1e28-11d1-9e5d-0000f8034599.
032::26-16:12:18.734 [FM] 
FmpQueryResourceInfo:initialize the resource with the registry information
032::26-16:12:18.734 FmpRmCreateResource: 
creating resource 24acc093-1e28-11d1-9e5d-0000f8034599 in shared 
resource monitor
032::26-16:12:18.765 FmpRmCreateResource: 
created resource 24acc093-1e28-11d1-9e5d-0000f8034599, resid 1360000
06d::26-16:12:18.812 
Physical Disk <Disk G:>: SCSI, error attaching to signature b2320a9b, error 2.
06d::26-16:12:18.812 
Physical Disk <Disk G:>: Unable to attach to signature b2320a9b. Error: 2.
032::26-16:12:18.812 [FM] 
FmGetQuorumResource failed, error 2.
032::26-16:12:18.812 [INIT] Cleaning up failed form attempt.
032::26-16:12:18.812 [INIT] Failed to form cluster, status 2.
032::26-16:12:18.828 [CS] ClusterInitialize failed 2

The node attempts to join the existing cluster, but has invalid credentials, because it was previously evicted. Therefore, the existing node refuses to communicate with it. The node may attempt to form its own version of the cluster, but cannot gain control of the quorum disk, because the existing cluster node maintains ownership. Examination of the logfile on the existing cluster node reveals that the Cluster Service posted entries to reflect the failed attempt to join:

0c4::29-18:13:31.035 [NMJOIN] Processing request by node 2 to
 begin joining.
0c4::29-18:13:31.035 [NMJOIN] Node 2 is not a member of this
 cluster. Cannot join.

Appendix C: Command-Line Administration

You can perform many of the administrative tasks for MSCS from the Windows NT command prompt, without using the provided graphical interface. While the graphical method provides easier administration and status of cluster resources at a glance, MSCS does provide the capability to issue most administrative commands without the graphical interface. This capability opens up interesting possibilities for batch files, scheduled commands, and other techniques through which many tasks may be automated.
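
For example, a scheduled command might move a group to the other node ahead of planned maintenance every Friday evening. The following is only a sketch: it assumes the Schedule service is running, that Cluster.exe is in the path of the account used by the Schedule service, and that the cluster, group, and node names shown exist in your environment:

AT 22:00 /EVERY:F "CLUSTER mycluster GROUP mygroup /MoveTo:NodeB /Wait:120"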

Using Cluster.exe

Cluster.exe is a companion program and is installed with Cluster Administrator. While the Microsoft Cluster Server Administrator's Guide details basic syntax for this utility, the intention of this section is to complement the existing documentation and to offer examples. All examples in this section assume a cluster name of MYCLUSTER, installed in the domain called MYDOMAIN, with NODEA and NODEB as servers in the cluster. All examples are given as a single command line.

Note: Specify any names that contain spaces within quotation marks.

Basic Syntax

With the exception of the cluster /? command, which returns basic syntax for the command, every command line uses the syntax:

CLUSTER [cluster name] /option

To test connectivity with a cluster, or to ensure you can use Cluster.exe, try the simple command in the next section to check the version number (/version).

Cluster Commands

Version Number

To check the version number of your cluster, use a command similar to the following:

CLUSTER mycluster /version

If your cluster were named MYCLUSTER, the above command would return the version information for the product.

Listing Clusters in the Domain

To list all clusters within a single domain, use a command including the /list option like this:

CLUSTER mycluster /LIST:mydomain

Node Commands

All commands directed toward a specific cluster node must use the following syntax:

CLUSTER [cluster name] NODE [node name] /option

Node Status

To obtain the status of a particular cluster node, use the /status command. For example:

CLUSTER mycluster NODE NodeA /Status

The node name is optional only for the /status command, so the following command will report the status of all nodes in the cluster:

CLUSTER mycluster NODE /Status

Pause or Resume

The pause option allows the cluster service to continue running and communicating in the cluster. However, a paused node cannot own groups or resources. For example, to pause a node, use the /pause switch:

CLUSTER mycluster NODE NodeB /Pause

One use of this command is to transfer groups to another node while you perform some other task on the paused node, such as running a backup or disk defragmentation utility. To resume the node, simply use the /resume switch instead:

CLUSTER mycluster NODE NodeB /Resume

Evict a Node

The evict option removes the ability of a node to participate in the cluster. In other words, the cluster node loses membership rights in the cluster. The only way to grant membership rights again to the evicted node is to:

  1. Remove the cluster software from the evicted node through Add/Remove Programs in Control Panel.

  2. Restart the node.

  3. Reinstall MSCS on the previously evicted node through the MSCS Setup program.

To perform this action, use a command similar to the following:

CLUSTER mycluster NODE NodeB /Evict

Changing Node Properties

A cluster node has only one property that may be changed with Cluster.exe: the node description. The following example illustrates how to change it:

CLUSTER mycluster NODE NodeA /Properties Description="The best node in MyCluster."

A good use for changing this node property might be in environments with multiple administrators. For example, you pause a node to run a large application on it and want the node description to reflect this. The field serves as a reminder to yourself and to other administrators of why the node was paused, and that someone may want to /resume the node later. You might also include the /pause and /resume commands in a batch file that pauses the node, performs the designated task, and then resumes the node, as in the sketch below.
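
The following batch file is a minimal sketch of that idea; the cluster name, node name, description text, and placement of the designated task are assumptions and would need to be adjusted for your environment:

@ECHO OFF
REM Pause the node and record the reason in its description.
CLUSTER mycluster NODE NodeB /Pause
CLUSTER mycluster NODE NodeB /Properties Description="Paused for weekly backup"
REM Run the designated task here, such as a backup or defragmentation utility.
REM Restore the description and resume the node when the task completes.
CLUSTER mycluster NODE NodeB /Properties Description="Available"
CLUSTER mycluster NODE NodeB /Resume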

Group Commands

All group commands use the syntax:

CLUSTER [cluster name] GROUP [group name] /option

Group Status

To obtain the status of a group, you may use the /status option. This is the only group option for which the group name is optional; if you omit the group name, the status of all groups will be displayed. Another status option (/node) will display group status by node.

Example 1: Status of all groups:

CLUSTER mycluster GROUP /Status

Example 2: Status of all groups owned by a specific node:

CLUSTER mycluster GROUP /Status /Node:nodea

Example 3: Status of a specific group:

CLUSTER mycluster GROUP "Cluster Group"

Create a New Group

It is easy to create a new group from the command line.

The following example creates a group called mygroup:

CLUSTER mycluster GROUP mygroup /create

Delete a Group

Just as simple as the /create option, the /delete option removes a group from the command line. However, the group must be empty before it can be deleted.

CLUSTER mycluster GROUP mygroup /delete

Rename a Group

To rename a group, use the following syntax:

CLUSTER mycluster GROUP mygroup /rename:yourgroup

Move, Online, and Offline Group Commands

The move group command may be used to transfer ownership of a group and its resources to another node. By design, the move command must take the group offline and bring it online on the other node. Further, a timeout value (number of seconds) may be supplied to specify the time to wait before cancellation of the move request. By default, Cluster.exe waits indefinitely until the group reaches the desired state.

Examples:

CLUSTER mycluster GROUP mygroup /MoveTo:Nodeb /wait:120
CLUSTER mycluster GROUP mygroup /Offline
CLUSTER mycluster GROUP mygroup /Online

Group Properties

Use the /property option to display or set group properties. Documentation on common properties for groups may be found in the Microsoft Cluster Server Administrator's Guide. One additional property not documented is LoadBalState. This property is not used in MSCS version 1.0, and is reserved for future use.

Examples:

CLUSTER mycluster GROUP mygroup /Properties
CLUSTER mycluster GROUP mygroup /Properties Description="My favorite group"

Preferred Owner

You may specify a preferred owner for a group. The preferred owner is the node on which you prefer the group to run. If a node fails, the remaining node takes over the groups from the failed node. If you set the failback option at the group level, a group can fail back to its preferred owner when that node becomes available again. A group does not fail back if a preferred owner is not specified. MSCS version 1.0 is limited to two nodes in a cluster. For best results, specify no more than one preferred owner. In future releases, this property may use a list of more than one preferred owner.

Example: To list the preferred owner for a group, type:

CLUSTER mycluster GROUP mygroup /Listowner

Example: To specify the preferred owner list, type:

CLUSTER mycluster GROUP mygroup /Setowners:Nodea

Resource Commands

Resource Status

To list the status of resources or a particular resource, you can use the /status option. Note the following examples:

CLUSTER mycluster RESOURCE /Status
CLUSTER mycluster RESOURCE myshare /Status

Create a New Resource

To create a new resource, use the /create option.

Note: The /create option allows creation of resources in an incomplete state. To avoid errors, specify all required parameters for the resource and set additional resource properties as appropriate with subsequent commands.

Example: Command sequence to add a file share resource

CLUSTER mycluster RESOURCE 
myshare /Create /Group:mygroup /Type:"File Share"
CLUSTER mycluster RESOURCE 
myshare /PrivProp ShareName="myshare"
CLUSTER mycluster RESOURCE 
myshare /PrivProp Path="w:\myshare"
CLUSTER mycluster RESOURCE 
myshare /PrivProp Maxusers=-1
CLUSTER mycluster RESOURCE 
myshare /AddDependency:"Disk W"

Note: Command lines in the example above have been wrapped for space constraints in this document. The lines do not normally wrap.

Simulating Resource Failure

You can simulate resource failure in a cluster from the command line by using the /fail option for a resource. This option is similar to using the Initiate Failure command from Cluster Administrator. The command assumes that the resource is already online.

Example:

CLUSTER mycluster RESOURCE myshare /Fail

Online/Offline Resource Commands

The /online and /offline resource commands work very much the same way as the corresponding group commands, and also may use the /wait option to specify a time limit (in seconds) for the operation to complete.

Examples:

CLUSTER mycluster RESOURCE myshare /Offline
CLUSTER mycluster RESOURCE myshare /Online

Dependencies

Resource dependency relationships may be listed or changed from the command line. To add or remove a dependency, you must know the name of the resource to be added or removed as a dependency.

Examples:

CLUSTER mycluster RESOURCE myshare /ListDependencies
CLUSTER mycluster RESOURCE myshare /AddDependency:"Disk W:"
CLUSTER mycluster RESOURCE myshare /RemoveDependency:"Disk W:"

Example Batch Job

The following example takes an existing group, Mygroup, and creates resources within the group, including an IP address resource and a network name resource, and then initiates a resource failure to test failover. During the process, it uses various reporting commands to obtain the status of the group and resources. This example shows the output from all commands given. The commands in this example work, but may require minor alteration to match the cluster, group, resource, network, and IP address names in your environment if you choose to use them.

Note: The LoadBal properties reported in the example are reserved for future use. The EnableNetBIOS property for the IP address resource is a Service Pack 4 addition and must be set to 1 for the resource to be a valid dependency of a network name resource.

C:\>REM Get group status
C:\>CLUSTER mycluster GROUP mygroup /status
Listing status for resource group 'mygroup':
Group                Node            Status
-------------------- --------------- ------
mygroup              NodeA           Online
C:\>REM Create the IP Address resource: myip
C:\>CLUSTER mycluster RESOURCE myip /create /Group:mygroup /Type:"Ip Address"
Creating resource 'myip'...
Resource             Group                Node            Status
-------------------- -------------------- --------------- ------
myip                 mygroup              NodeA           Offline
C:\>REM Define the IP Address parameters
C:\>CLUSTER mycluster RESOURCE myip /priv network:client
C:\>CLUSTER mycluster RESOURCE myip /priv address:157.57.152.23
C:\>REM Redundant. Subnet mask should already be same as network uses.
C:\>CLUSTER mycluster RESOURCE myip /priv subnetmask:255.255.252.0
C:\>CLUSTER mycluster RESOURCE myip /priv EnableNetBIOS:1
C:\>REM Check the status
C:\>CLUSTER mycluster RESOURCE myip /Stat
Listing status for resource 'myip':
Resource             Group                Node            Status
-------------------- -------------------- --------------- ------
myip                 mygroup              NodeA           Offline
C:\>REM View the properties
C:\>CLUSTER mycluster RESOURCE myip /prop
Listing properties for 'myip':
R Name                             Value
--------------------------------- -------------------------------
R Name                             myip
  Type                             IP Address
  Description
  DebugPrefix
  SeparateMonitor                  0 (0x0)
  PersistentState                  0 (0x0)
  LooksAlivePollInterval           5000 (0x1388)
  IsAlivePollInterval              60000 (0xea60)
  RestartAction                    2 (0x2)
  RestartThreshold                 3 (0x3)
  RestartPeriod                    900000 (0xdbba0)
  PendingTimeout                   180000 (0x2bf20)
  LoadBalStartupInterval           300000 (0x493e0)
  LoadBalSampleInterval            10000 (0x2710)
  LoadBalAnalysisInterval          300000 (0x493e0)
  LoadBalMinProcessorUnits         0 (0x0)
  LoadBalMinMemoryUnits            0 (0x0)
C:\>REM View the private properties
C:\>CLUSTER mycluster RESOURCE myip /priv
Listing private properties for 'myip':
R Name                             Value
--------------------------------- -------------------------------
  Network                          Client
  Address                          157.57.152.23
  SubnetMask                       255.255.252.0
  EnableNetBIOS                    1 (0x1)
C:\>REM Bring online and wait 60 sec. for completion
C:\>CLUSTER mycluster RESOURCE myip /Online /Wait:60
Bringing resource 'myip' online...
Resource             Group                Node            Status
-------------------- -------------------- --------------- ------
myip                 mygroup              NodeA           Online
C:\>REM Check the status again.
C:\>CLUSTER mycluster RESOURCE myip /Stat
Listing status for resource 'myip':
Resource             Group                Node            Status
-------------------- -------------------- --------------- ------
myip                 mygroup              NodeA           Online
C:\>REM Define a network name resource
C:\>CLUSTER mycluster RESOURCE mynetname /Create /
Group:mygroup /Type:"Network Name"
Creating resource 'mynetname'...
Resource             Group                Node            Status
-------------------- -------------------- --------------- ------
mynetname            mygroup              NodeA           Offline
C:\>CLUSTER mycluster RESOURCE mynetname /priv Name:"mynetname"
C:\>CLUSTER mycluster RESOURCE mynetname /Adddependency:myip
Making resource 'mynetname' depend on resource 'myip'...
C:\>REM Status check
C:\>CLUSTER mycluster RESOURCE mynetname /Stat
Listing status for resource 'mynetname':
Resource             Group                Node            Status
-------------------- -------------------- --------------- ------
mynetname            mygroup              NodeA           Offline
C:\>REM Bring the network name online
C:\>CLUSTER mycluster RESOURCE mynetname /Online /Wait:60
Bringing resource 'mynetname' online...
Resource             Group                Node            Status
-------------------- -------------------- --------------- ------
mynetname            mygroup              NodeA           Online
C:\>REM Status check
C:\>CLUSTER mycluster Group mygroup /stat
Listing status for resource group 'mygroup':
Group                Node            Status
-------------------- --------------- ------
mygroup              NodeA           Online
C:\>REM Let's simulate a failure of the IP address
C:\>CLUSTER mycluster RESOURCE myip /Fail
Failing resource 'myip'...
Resource             Group                Node            Status
-------------------- -------------------- --------------- ------
myip                 mygroup              NodeA           Online Pending
C:\>REM Get group status
C:\>CLUSTER mycluster GROUP mygroup /status
Listing status for resource group 'mygroup':
Group                Node            Status
-------------------- --------------- ------
mygroup              NodeA           Online

For More Information

For the latest information about Windows NT Server, check out Microsoft TechNet, visit https://www.microsoft.com/backofficeserver/, or visit the Windows NT Server Forum on the Microsoft Network (GO WORD: MSNTS).

For the latest information on Windows NT Server, Enterprise Edition, and Microsoft Cluster Server, use the following links:

https://www.microsoft.com/ntserver/default.asp

https://support.microsoft.com/ph/3194

Alpha AXP is a trademark of Digital Equipment Corporation.
DEC is a trademark of Digital Equipment Corporation.
Intel is a registered trademark of Intel Corporation.
IBM is a registered trademark of International Business Machines Corporation.
PowerPC is a trademark of International Business Machines Corporation.
MIPS is a registered trademark of MIPS Computer Systems, Inc.