Intracluster Network Connection Is Broken

This section shows cluster log output from two nodes, recorded when the intracluster network connection is broken. In this cluster, the intracluster network connection is a single point of failure.

Log from Node 2

Node 2 owns the quorum resource when the intracluster network connection is broken.

The following entry indicates a loss of communications between nodes. In response, the Cluster service holds I/O.

00000534.00000500::1999/10/21-23:09:01.999 [NM] Holding I/O.

.

.

.

00000534.0000057c::1999/10/21-23:09:02.108 [NM] Checking if we own the quorum resource.

.

.

.

00000534.0000057c::1999/10/21-23:09:02.124 [FM] Successfully arbitrated quorum resource a83b4084-3391-4618-890e-8794d4df923b.

.

.

.

00000534.00000500::1999/10/21-23:09:04.905 [ClMsg] Received interface unreachable event for node 1 network 1

00000534.00000500::1999/10/21-23:09:04.905 [ClMsg] Received interface unreachable event for node 1 network 2

00000534.0000052c::1999/10/21-23:09:04.905 [NM] Communication was lost with interface 0bd641f7-7d8c-4d94-9279-d461846b299b (node: NODE1, network: clients(1))

.

.

.

00000534.0000052c::1999/10/21-23:09:04.905 [NM] Communication was lost with interface ddda464e-7c6d-4439-b27b-cd0da7957162 (node: NODE1, network: interconnect)

.

.

.

00000534.0000057c::1999/10/21-23:09:09.123 [NM] Resuming I/O.

.

.

.

00000534.0000057c::1999/10/21-23:09:09.123 [EP] Nodes down event received

.

.

.

00000534.00000464::1999/10/21-23:09:09.139 [DM] DmpEventHandler - Node is down, turn quorum logging on...
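The length of time node 2 held I/O can be read from the timestamps of the "Holding I/O" and "Resuming I/O" entries. The following is a minimal Python sketch, not part of the Cluster service or its tools. It assumes that each entry begins with two 8-digit hexadecimal fields (taken here to be the process ID and thread ID), followed by "::", a timestamp, and the message, and that each entry occupies a single line in the log file. The names ENTRY_RE and parse_entry are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: <pid>.<tid>::<yyyy/mm/dd-hh:mm:ss.mmm> <message>
ENTRY_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]{8})\.(?P<tid>[0-9a-fA-F]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<msg>.*)$"
)

def parse_entry(line):
    """Return (pid, tid, timestamp, component, text), or None if the line does not match."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y/%m/%d-%H:%M:%S.%f")
    msg = m.group("msg")
    # Split off a leading component tag such as [NM] or [FM], if present.
    tag = re.match(r"\[(\w+)\]\s*(.*)", msg)
    component, text = (tag.group(1), tag.group(2)) if tag else (None, msg)
    return int(m.group("pid"), 16), int(m.group("tid"), 16), ts, component, text

hold = parse_entry("00000534.00000500::1999/10/21-23:09:01.999 [NM] Holding I/O.")
resume = parse_entry("00000534.0000057c::1999/10/21-23:09:09.123 [NM] Resuming I/O.")

# Difference between the two timestamps, in seconds.
held_seconds = (resume[2] - hold[2]).total_seconds()
print(f"I/O held for {held_seconds:.3f} s")  # prints "I/O held for 7.124 s"
```

For the entries shown above, I/O was held for roughly seven seconds while node 2 confirmed its ownership of the quorum resource and marked node 1 as down.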

Log from Node 1

The following log entries are from node 1 and were generated by the same occurrence: the loss of the "interconnect" connection for cluster communications. The first indications are the entries that establish that the network interface is unavailable. Because the interface is unavailable, node 1 declares the other node to be down, which triggers the regroup events marked [RGP] below.

00000404.000004e4::1999/10/21-23:10:38.039 [ClMsg] Received interface unreachable event for node 2 network 2

00000404.00000590::1999/10/21-23:10:38.039 [NM] Communication was lost with interface 198ffe74-b7b9-41e5-b95a-25f618eb0c43 (node: NODE2, network: interconnect)

00000404.000004e4::1999/10/21-23:10:42.914 [ClMsg] Received node down event for node 2, epoch 0

.

.

.

00000404.00000374::1999/10/21-23:10:46.711 [NM] Checking if we own the quorum resource.

In the following entry, "error 1" means "Incorrect Function." The Cluster service on this node could not access the quorum disk before asserting a reservation; the entry reports a failed write to sector 12. This is because the other node had reserved the quorum disk after the successful bus reset noted several entries earlier.

00000388.0000059c::1999/10/21-23:10:50.351 Physical Disk <Disk E:>: [DiskArb]Failed to write (sector 12), error 1.
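When scanning a long log, the trailing error code can be extracted and matched against a lookup table. The following is a minimal Python sketch under stated assumptions: the table contains only the one code discussed in this section (Win32 error 1, ERROR_INVALID_FUNCTION), the pattern matches only entries that end in "error N", and the names WIN32_ERRORS, CODE_RE, and describe_error are hypothetical.

```python
import re

# Only the code discussed in this section; extend the table as needed.
WIN32_ERRORS = {1: "ERROR_INVALID_FUNCTION"}

# Matches a trailing "error <number>" with an optional final period.
CODE_RE = re.compile(r"\berror\s+(\d+)\.?\s*$")

def describe_error(entry):
    """Return (code, name) for an entry ending in 'error N', else None."""
    m = CODE_RE.search(entry)
    if m is None:
        return None
    code = int(m.group(1))
    return code, WIN32_ERRORS.get(code, "unknown")

write_entry = (
    "00000388.0000059c::1999/10/21-23:10:50.351 Physical Disk <Disk E:>: "
    "[DiskArb]Failed to write (sector 12), error 1."
)
print(describe_error(write_entry))
```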

In the following entry, "status 1" means that arbitration for drive E:, the quorum disk, failed. That is, the other node successfully defended its reservation on the disk. The two entries after it also report the failure.

00000388.0000059c::1999/10/21-23:10:50.351 Physical Disk <Disk E:>: [DiskArb]Arbitrate returned status 1.

.

.

.

00000404.00000374::1999/10/21-23:10:50.351 [FM] Failed to arbitrate quorum resource a83b4084-3391-4618-890e-8794d4df923b, error 1.

.

.

.

00000404.00000374::1999/10/21-23:10:50.351 [RGP] Node 1: REGROUP ERROR: arbitration failed.

Because arbitration has failed and the nodes are partitioned, the Cluster service on this node shuts down and stops participating in the cluster.

00000404.00000374::1999/10/21-23:10:50.351 [NM] Halting this node due to membership or communications error. Halt code = 1000

.

.

.

00000388.000003cc::1999/10/21-23:10:51.117 [RM] Going away, Status = 1, Shutdown = 0.

.

.

.

00000388.0000052c::1999/10/21-23:10:51.148 [RM] NotifyChanges shutting down.
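The two logs in this section can be told apart by a few message strings: the node that keeps the quorum resource logs a successful arbitration, while the node that loses logs a failed arbitration followed by a halt. The following is a minimal Python sketch that classifies a log excerpt on that basis. It relies only on message text quoted in this section and assumes each entry occupies a single line; the function name arbitration_outcome and its return strings are hypothetical.

```python
def arbitration_outcome(entries):
    """Classify a node's log excerpt by the arbitration messages it contains."""
    won = any("Successfully arbitrated quorum resource" in e for e in entries)
    lost = any("Failed to arbitrate quorum resource" in e for e in entries)
    halted = any("Halting this node" in e for e in entries)
    if lost:
        return "lost arbitration, halted" if halted else "lost arbitration"
    if won:
        return "owns quorum"
    return "no arbitration result"

node2 = [
    "00000534.0000057c::1999/10/21-23:09:02.108 [NM] Checking if we own the quorum resource.",
    "00000534.0000057c::1999/10/21-23:09:02.124 [FM] Successfully arbitrated "
    "quorum resource a83b4084-3391-4618-890e-8794d4df923b.",
]
node1 = [
    "00000404.00000374::1999/10/21-23:10:50.351 [FM] Failed to arbitrate "
    "quorum resource a83b4084-3391-4618-890e-8794d4df923b, error 1.",
    "00000404.00000374::1999/10/21-23:10:50.351 [NM] Halting this node due to "
    "membership or communications error. Halt code = 1000",
]

print("Node 2:", arbitration_outcome(node2))
print("Node 1:", arbitration_outcome(node1))
```

A check like this is only a first pass. The surrounding entries, such as the [DiskArb] and [RGP] messages, are still needed to understand why arbitration failed.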