Cluster Continuous Replication Recovery Behavior

 

Applies to: Exchange Server 2007 SP3, Exchange Server 2007 SP2, Exchange Server 2007 SP1, Exchange Server 2007

Topic Last Modified: 2007-10-29

Cluster continuous replication (CCR) provides full redundancy of both the data and the services that provide access to the data. Full redundancy enables rapid recovery in cases where a shared copy of mailbox data would not allow rapid recovery.

CCR recovery behavior can be separated into two types of outages:

  • Scheduled outages   Scheduled outages are initiated by the administrator. A scheduled outage can be used to recover from a failure detected by the monitoring system or to perform some administrative task, such as hardware maintenance, software installation, or software updates.

  • Unscheduled outages   Unscheduled outages are initiated by the system in response to a detected failure. The Windows Cluster service detects these failures and triggers the recovery.

The following table describes the expected recovery actions for a variety of failures. Some failures require the administrator to initiate the recovery, and other failures are automatically handled by the Exchange clustering solution.

Recovery actions for failures

Description: Operating system stop error; detected operating system hang (stops responding); complete power failure of a node; unrecoverable failure of the processor chip, motherboard, or backplane; or complete communication failure for a node
Action: Automatic failover to the passive node, if available. The administrator can also choose to force an automatic mount regardless of data loss if recovery has not occurred within a configured amount of time. If no databases are mounted after the failover and the original active node comes back online with all its storage operational, the missing logs are copied and the databases are mounted automatically.
Comments: For a passive node to be available, it must be possible to establish a quorum after the failure. This means that the remaining node must be able to access the file share quorum. Alternatively, a majority of the nodes in the cluster must be operational and able to communicate with each other.

Description: Total storage failure on the active server
Action: Storage failures are reported to and through the monitoring system. The administrator can recover the storage or initiate a scheduled outage to the passive node.
Comments: This failure would be reported as a failure of all databases.

Description: Data center failure
Action: If the active node in the primary data center fails, the clustered mailbox server automatically fails over to the passive node in the second data center.
Comments: Other Exchange, directory services, networking services, and servers must be recovered to continue to deliver mail access. Mail data is available and current within a few minutes.

Description: Operating system drive failure
Action: No automatic recovery action. The failure is not detected by Exchange unless the operating system fails. Detection is based on apparent failures rather than root cause.
Comments: Operating system drive failure is reported by operating system monitoring services and may cause the operating system to fail.

Description: Operating system drive out of space
Action: Automatic failover to the passive node, if available.
Comments: This failure is reported to and through the monitoring services. If automatic recovery does not or cannot occur, the recovery action for this scenario is determined by the administrator.

Description: Complete failure of the cluster's public network
Action: No automatic recovery action.
Comments: If the public network is lost, the IP Address resources enter a failed state. After the public network issue is addressed, the resources can be brought back online.

Description: Loss of cluster quorum
Action: The clustered mailbox servers and the cluster quorum go offline.
Comments: This scenario results in no service if a quorum cannot be formed.

Description: Information store failure
Action: Automatic restart of the Information store resource. If the Information store resource fails during a restart, a failover is triggered.
Comments: After repeated failures, the administrator can try to manually move the clustered mailbox server to the passive node in an attempt to bring it online.

Description: Application (binary files) drive failure
Action: No automatic recovery action.
Comments: Generally, this scenario results in other failures that are reported to and through monitoring services and are actionable by the administrator. The recovery action for this scenario is determined by the administrator.

Description: Application (binary files) drive out of space
Action: No automatic recovery action.
Comments: This failure is reported to and through the monitoring services. The recovery action for this scenario is determined by the administrator.

Description: Complete loss of a database or storage group, or complete failure of a database
Action: Automatic attempt to remount the affected databases. If this attempt fails, the database remains in a failed state, but no failover of the clustered mailbox server occurs.
Comments: The storage group or database either is dismounted due to software failure or corruption, or has failed because of hardware failures. For example, a storage group does a forced dismount of all databases when its log directory is not available. The administrator determines the corrective action.

Description: Partial failure of a storage group or database (some data unavailable), or initial database mount failure
Action: No automatic recovery action.
Comments: Partial failure means that some corruption has been reported, but the corruption did not force a dismount of the storage group or database. If a database does not mount at startup, no action is taken and monitoring reports the failure. When this is detected, the Mailbox server generates events that can be reported by the monitoring services. Monitoring also detects and reports dismounted databases.

Description: Corrupted log detected for storage group
Action: No automatic recovery action. The copy goes into a broken condition and must be reseeded.
Comments: Monitoring reports this condition.

Description: Database or transaction log drive out of space
Action: No automatic recovery action. The databases in the storage group are dismounted.
Comments: The lack of free drive space is reported through the monitoring system. The administrator determines the corrective action.
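
The quorum condition described in the comments for node failures can be illustrated with a small vote-counting sketch. This is a simplified model, not the Windows Cluster service's actual algorithm: it assumes that each node and the file share each hold one vote and that a strict majority of votes is required. The function name `has_quorum` and its parameters are hypothetical.

```python
def has_quorum(nodes_up: int, total_nodes: int, witness_reachable: bool) -> bool:
    """Return True if the surviving votes form a strict majority.

    Assumption (not from the Exchange documentation): every node holds one
    vote and the file share holds one additional vote.
    """
    if not 0 <= nodes_up <= total_nodes:
        raise ValueError("nodes_up must be between 0 and total_nodes")
    total_votes = total_nodes + 1  # all nodes plus the file share
    votes = nodes_up + (1 if witness_reachable else 0)
    return votes * 2 > total_votes


# Two-node CCR cluster: the surviving node can reach the file share.
print(has_quorum(nodes_up=1, total_nodes=2, witness_reachable=True))
# Two-node CCR cluster: the surviving node cannot reach the file share.
print(has_quorum(nodes_up=1, total_nodes=2, witness_reachable=False))
```

Under this model, a single surviving node in a two-node cluster keeps quorum only while it can reach the file share, which matches the condition for the passive node being available.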

The administrator has configuration control over unscheduled outage failure recovery. For more information about scheduled and unscheduled outages, see Scheduled and Unscheduled Outages.
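
For monitoring or runbook tooling, the table above can be condensed into a lookup that tells an operator whether a failure triggers an automatic action. The sketch below is only an illustration: the short failure keys, the `AutoAction` enum, and the `needs_administrator` helper are hypothetical names, and each entry condenses the corresponding Action cell rather than reproducing it.

```python
from enum import Enum


class AutoAction(Enum):
    FAILOVER = "automatic failover to the passive node"
    RESTART = "automatic restart of the resource"
    REMOUNT = "automatic attempt to remount databases"
    NONE = "no automatic recovery action"


# Hypothetical keys; values condense the Action column of the table.
RECOVERY_ACTIONS = {
    "node_failure": AutoAction.FAILOVER,
    "total_storage_failure": AutoAction.NONE,
    "data_center_failure": AutoAction.FAILOVER,
    "os_drive_failure": AutoAction.NONE,
    "os_drive_out_of_space": AutoAction.FAILOVER,
    "public_network_failure": AutoAction.NONE,
    "loss_of_quorum": AutoAction.NONE,
    "information_store_failure": AutoAction.RESTART,
    "application_drive_failure": AutoAction.NONE,
    "application_drive_out_of_space": AutoAction.NONE,
    "complete_database_failure": AutoAction.REMOUNT,
    "partial_database_failure": AutoAction.NONE,
    "corrupted_log": AutoAction.NONE,
    "database_drive_out_of_space": AutoAction.NONE,
}


def needs_administrator(failure: str) -> bool:
    """Return True when the table lists no automatic recovery for the failure."""
    return RECOVERY_ACTIONS[failure] is AutoAction.NONE


print(needs_administrator("corrupted_log"))
print(RECOVERY_ACTIONS["information_store_failure"].value)
```

Even when an automatic action exists, the table notes cases where it may not succeed (for example, when no passive node is available), so a lookup like this is a starting point for triage rather than a guarantee of recovery.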

 
© 2014 Microsoft