Techniques for Tracking the Source of a Problem

It's often helpful to go to the end of the log and work your way backward to find the oldest failure that caused your problem. However, the true cause of the failure might not be the first apparent cause you find when working backward. The initial failure can cause side effects that trigger other failures.

You can also use the following for tracking operations in the cluster log:

  • Timestamps

  • GUIDs

  • Resources

  • Groups

  • Process and thread IDs

  • GUM sequence numbers

  • The gdwQuoBlockingResources global variable counts of shared locks

Timestamps

Scan the timestamps for breaks of one second or more to find periods of inactivity in the Cluster service. A break might indicate that all threads of activity have completed.

For example, consider the following entries from the logging of a cluster form operation and cluster join operation.

378.328::1999/06/09-18:00:52.006 [FM] FmpPropagateResourceState:

resource 76d0bf88-5e2c-44b9-9bdb-3eb677487afb online event.

378.328::1999/06/09-18:00:52.006 [FM] FmpRmWorkItemHandler: Exit

378.3a8::1999/06/09-18:00:52.006 [GUM] GumSendUpdate: Locker waiting

type 0 context 11

378.3a8::1999/06/09-18:00:52.006 [GUM] Thread 0x3a8 UpdateLock wait on

Type 0

378.3a8::1999/06/09-18:00:52.006 [GUM] DoLockingUpdate successful, lock

granted to 1

378.3a8::1999/06/09-18:00:52.006 [GUM] GumSendUpdate: Locker dispatching

seq 1673 type 0 context 11

378.3a8::1999/06/09-18:00:52.006 [GUM] GumpDoUnlockingUpdate releasing

lock ownership

378.3a8::1999/06/09-18:00:52.006 [GUM] GumSendUpdate: completed update

seq 1673 type 0 context 11

378.3c8::1999/06/09-18:00:53.053 [GUM] GumSendUpdate: Locker waiting

type 1 context 4098

The first two entries report the Failover Manager's initiation of global update, and the second-to-last entry reports the completion of the update. Note the break in time between this entry and the last entry.

Occasionally, timestamps violate chronological order. For example, note that in the following example from a Microsoft® Windows NT® 4.0 cluster, the entry with the timestamp 01-20:43:09.257 ("[GUM] Thread 0x7f UpdateLock wait on Type 1") appears out of order.

07b::01-20:43:09.304 [GUM] GumSendUpdate: completed update seq 1191

type 1 context 0

07f::01-20:43:09.257 [GUM] Thread 0x7f UpdateLock wait on Type 1

07f::01-20:43:09.304 [GUM] DoLockingUpdate successful, lock granted to 2

This happened because the cluster log obtains the system time and then tries to acquire access to the log file (also known as acquiring the print lock). In this case, thread 07b acquired the print lock before thread 07f, which had the time but not the print lock.

To find why 07f didn't acquire the print lock when it got the time, scan for the 07f thread ID. The following two entries show that 07f was interrupted by 0ae in the middle of a GUM update:

07f::01-20:43:09.257 [GUM] GumSendUpdate: Locker waiting type 1

context 4098

0ae::01-20:43:09.265 [GUM] GumSendUpdate: Locker waiting type 0

context 8

GUIDs, Resources, and Groups

Because the Cluster service often doesn't include the resource's display name in entries written to the cluster log, you need to track resource GUIDs as well as display names. The entry reporting an offline call for a given resource reliably includes both the display name and the GUID, so it is a good resource for correlating a resource's GUID and display name.

For example, the following entry tells you that resource "254ef0e5-1937-11d3-b3fe-00a0c986aa14" is the Cluster IP Address resource:

378.130::1999/06/09-18:02:24.189 [FM] FmpOfflineResource: Offline

resource <Cluster IP Address> <254ef0e5-1937-11d3-b3fe-00a0c986aa14>

Another dependable correlation of a resource's display name and GUID is the entry the Failover Manager [FM] writes during resource creation, which names the resource. The following is an example:

378.380::1999/06/09-18:00:51.209 [FM] Name for Resource

191c151b-a8e5-45a6-8330-d38669845e66 is 'testwinfile'.

The Cluster service's model for bringing resources online is to bring everything online as quickly as possible. If there are hundreds of resources, the Cluster service tries to bring them all online simultaneously.

Creating Groups and Resources

The following set of entries log successful creation of a group and a resource at startup of the Cluster service. In both cases, you must track the object by its GUID. Note, however, the entry that gives the display "Name for" the group or resource. Note also that the Failover Manager is the component responsible for creating and initializing groups and resources at startup.

In the first two entries, the Failover Manager creates the group in its memory and initializes it with the values for the group in the cluster registry key.

378.380::1999/06/09-18:00:51.209 [FM] Creating group

d103ed17-f7bc-4ff9-8a64-e692f4ecb663

378.380::1999/06/09-18:00:51.209 [FM] Initializing group

d103ed17-f7bc-4ff9-8a64-e692f4ecb663 from the registry.

The following is the "Name for" entry for the group.

378.380::1999/06/09-18:00:51.209 [FM] Name for Group

d103ed17-f7bc-4ff9-8a64-e692f4ecb663 is 'test group'.

The Failover Manager creates the list of the nodes that are the group's preferred owners.

378.380::1999/06/09-18:00:51.209 [FM] Group

d103ed17-f7bc-4ff9-8a64-e692f4ecb663 preferred owner 1.

378.380::1999/06/09-18:00:51.209 [FM] Group

d103ed17-f7bc-4ff9-8a64-e692f4ecb663 preferred owner 2.

The Failover Manager creates the list of the resources that the group owns.

378.380::1999/06/09-18:00:51.209 [FM] Group

d103ed17-f7bc-4ff9-8a64-e692f4ecb663 contains Resource

191c151b-a8e5-45a6-8330-d38669845e66 .

Having created the group, the Failover Manager creates resource 191c151b-a8e5-45a6-8330-d38669845e66 and then initializes it with its values from the cluster database (the cluster key in the registry).

378.380::1999/06/09-18:00:51.209 [FM] Creating resource

191c151b-a8e5-45a6-8330-d38669845e66

378.380::1999/06/09-18:00:51.209 [FM] Initializing resource

191c151b-a8e5-45a6-8330-d38669845e66 from the registry.

The Failover Manager creates the list of possible owners for the resource.

378.380::1999/06/09-18:00:51.209 [FM] FmpAddPossibleEntry: adding node 1

as possible host for resource 191c151b-a8e5-45a6-8330-d38669845e66.

378.380::1999/06/09-18:00:51.209 [FM] FmpAddPossibleEntry: adding node 2

as possible host for resource 191c151b-a8e5-45a6-8330-d38669845e66.

The Failover Manager notes that the resource's dependencies have already been created.

378.380::1999/06/09-18:00:51.209 [FM] All dependencies for resource

191c151b-a8e5-45a6-8330-d38669845e66 created.

Process and Thread IDs

Process and thread IDs help you trace activities and interactions. For example, in the section "Joining a Cluster (Unsuccessful Attempt)" later in the chapter, each thread spawned to connect to the sponsoring node has a distinct thread ID.

When an instance of an executable file (such as the Cluster service and Resource Monitor) starts, it starts a new process and, therefore, gets a new process ID. Knowing this, you can examine the activity with which a process ID is associated in order to deduce which executable file's process it represents.

Tracing process and thread IDs is especially helpful for finding out why a particular process is taking too long.

One caveat, however: Related events can have different thread IDs in the cluster log, even though they report activity that occurs as a single logical operation.

GUM Updates and Sequence Numbers

One of the most frequently repeated patterns you'll find in cluster logs is that of a global update, performed by the Global Update Manager (GUM) when there is a change to the cluster database. The following group of entries is a typical global update.

The GUM uses a unique sequence number for each transaction, incrementing the sequence number by one for each new update. In this group of entries, "1657" is the sequence number. Note that this update propagates two changes: a membership change (type 2 context 17) and a resource-state change (type 0 context 8). For context numbers of GUM updates, see "Context Numbers" later in this chapter.

378.380::1999/06/09-18:00:50.897 [GUM] GumSendUpdate: Locker waiting

type 2 context 17

378.380::1999/06/09-18:00:50.897 [GUM] Thread 0x380 UpdateLock wait on

Type 2

378.380::1999/06/09-18:00:50.897 [GUM] DoLockingUpdate successful, lock

granted to 1

378.380::1999/06/09-18:00:50.897 [GUM] GumSendUpdate: Locker dispatching

seq 1657 type 2 context 17

378.380::1999/06/09-18:00:50.897 [dm] DmBeginLocalUpdate Entry

378.380::1999/06/09-18:00:50.897 [dm] DmBeginLocalUpdate Exit,

pLocalXsaction=0x000a0dd8 dwError=0x00000000

378.380::1999/06/09-18:00:47.928 [GUM] GumSendUpdate: Locker dispatching

seq 1657 type 0 context 8

378.380::1999/06/09-18:00:47.928 [GUM] GumpDoUnlockingUpdate releasing

lock ownership

378.380::1999/06/09-18:00:47.928 [GUM] GumSendUpdate: completed update

seq 1657 type 0 context 8

GUM sequence numbers are useful for correlating the activities of two or more nodes in a cluster. For instance, a "Locker dispatching seq xxx " message has a corresponding "s_GumQueueLockingUpdate: dispatching seq xxx " in the other node's log, where " xxx " represents the same sequence number. This can help you synchronize time periods between two logs. In the following pair of examples, note the continuity of the sequence number ("seq 1732"). Note also that the timestamps for the two nodes are not exactly synchronized: The timestamp for the "GumSendUpdate: Dispatching" entry on the sending node is later than the timestamp for the "GumQueueLockingUpdate: Dispatching" entry on the receiving node. Thus, you need the sequence number to correlate the update between the two nodes.

The following entries were logged on the node issuing a GUM update:

378.390::1999/06/09-18:16:29.671 [GUM] GumSendUpdate: Dispatching

seq 1732 type 2 context 17 to node 2

.

.

.

378.390::1999/06/09-18:16:29.718 [GUM] GumSendUpdate: completed update

seq 1732 type 2 context 17

.

.

.

378.390::1999/06/09-18:16:29.671 [GUM] GumSendUpdate: Dispatching

seq 1732 type 2 context 17 to node 2

The following entries were logged on the node receiving the same GUM update:

388.55c::1999/06/09-18:16:29.600 s_GumQueueLockingUpdate: dispatching

seq 1732 type 2 context 17

.

.

.

388.55c::1999/06/09-18:16:29.662 s_GumQueueLockingUpdate: completed

update seq 1732 type 2 context 17 result 0

Shared Locks and gdwQuoBlockingResources

When something in your cluster is preventing the Quorum resource from being taken offline (as in a failover or move of the Quorum resource from one node to the other), you can use the gdwQuoBlockingResources count of shared locks to identify the root cause of the problem.

A lock (called the quorum lock) prevents access to the Quorum resource by components other than the holder of the lock; there are two modes for the lock on the Quorum resource:

  • Shared lock
    Dependent resources acquire the quorum lock in shared mode when they need to access the quorum device and therefore need the quorum device to stay online. A shared lock prevents the Cluster service from taking the Quorum resource offline.

  • Exclusive lock
    The Cluster service acquires the quorum lock in exclusive mode when the Quorum resource is going to go offline. The information in the Quorum resource can only be modified by a thread that acquired the quorum lock in exclusive mode.

Clearly, acquisition of an exclusive lock must not supersede existing shared locks. That is, the Cluster service cannot acquire an exclusive lock and take the Quorum resource offline while components are in the middle of operations that require the Quorum resource to be online.

On the other hand, after existing operations that require access to the quorum device have completed their tasks, the Cluster service must have an opportunity to take the Quorum resource offline.

The Cluster service solves this problem with the gdwQuoBlockingResources global variable and a rule. The gdwQuoBlockingResources global variable is a count of the number of existing shared locks, which functions as follows:

  • When a process acquires a shared lock, the gdwQuoBlockingResources count increases by one.

  • When the process releases the shared lock, the gdwQuoBlockingResources count decreases by one.

  • When the gdwQuoBlockingResources count reaches zero, the Cluster service acquires the exclusive lock and is free to take the Quorum resource offline.

The rule is that a request for an exclusive lock is queued behind existing shared locks.

If there is a request to take the Quorum resource offline and the count is not zero, the InterlockedIncrement and InterlockedDecrement entries include the count of currently shared locks.

The following "InterlockedIncrement/InterlockedDecrement" pair of entries bracket the acquisition of shared locks by the Network Name resource as the resource is taken offline. If the resource could not be taken offline, preventing the gdwQuoBlockingResources count from reaching zero, you would find the problem with the resource here, between the "InterlockedIncrement/InterlockedDecrement" pair of entries.

388.4e8::1999/06/09-20:20:56.640 [FM] FmpRmOfflineResource:

InterlockedIncrement on gdwQuoBlockingResources for resource

254ef0e6-1937-11d3-b3fe-00a0c986aa14

4b4.450::1999/06/09-20:20:56.906 Network Name <Cluster Name>: Taking

resource offline...

4b4.450::1999/06/09-20:20:56.906 Network Name <Cluster Name>: Offline of

resource continuing...

4b4.450::1999/06/09-20:20:57.078 Network Name <Cluster Name>: Failed to

delete server name CLUSTER1, status 2114.

4b4.450::1999/06/09-20:20:57.078 Network Name <Cluster Name>: Failed to

delete server name CLUSTER1, status 2114.

4b4.450::1999/06/09-20:20:57.078 Network Name <Cluster Name>: Deleted

workstation name CLUSTER1 from transport 0.

4b4.450::1999/06/09-20:20:57.187 Network Name <Cluster Name>: Resource

is now offline

388.4e8::1999/06/09-20:20:57.187 [FM] FmpRmOfflineResource:

InterlockedDecrement on gdwQuoBlockingResources for resource

254ef0e6-1937-11d3-b3fe-00a0c986aa14

The next section discusses what the cluster log reveals about cluster activities.