Scheduled and Unscheduled Outages
Applies to: Exchange Server 2007 SP3, Exchange Server 2007 SP2, Exchange Server 2007 SP1, Exchange Server 2007
Topic Last Modified: 2007-10-29
Scheduled and unscheduled outages are the two forms of outages in a cluster continuous replication (CCR) environment. Scheduled outages are explicitly initiated by an administrator to either recover from a failure or to perform a maintenance operation. Unscheduled outages refer to unexpected events that result in the unavailability of service, data, or both. CCR is designed to handle both scheduled and unscheduled outages.
CCR allows you to schedule an extended system outage of a specific node without an extended outage of the clustered mailbox server (CMS). CCR scheduled outage functionality is designed to make sure that all log data on the active node is successfully copied to the passive node. As a result, a scheduled outage that completes successfully involves no data loss, even though replication occurs asynchronously. By contrast, a failure and the resulting failover can leave very recent log data unavailable to the passive node at the time it is brought online.
In a CCR environment, only one node can be taken offline at a time. Taking more than one node offline will result in an interruption in service. At any specific time, either the computer hosting the file share for the Majority Node Set (MNS) quorum with file share witness or the passive node in the failover cluster can be taken offline for hardware and software maintenance, updates, and repairs. We recommend that you never take a node down without first checking whether it is active or hosting resources. You can determine if it is hosting any resources using Cluster Administrator. You can check node status by running the Get-ClusteredMailboxServerStatus cmdlet in the Exchange Management Shell. For more information about viewing the status of a CMS, see How to View the Status of a Clustered Mailbox Server.
Note: The CCR scheduled outage support is not integrated with the Windows Server shutdown process. You must move the CMS to a different node before you shut down the active node. For detailed steps that explain how to move a CMS from one node to another, see How to Move a Clustered Mailbox Server in a CCR Environment.
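The one-node-at-a-time rule above can be pictured with a small model. This is a minimal Python sketch, not an Exchange or cluster API: the participant labels (`active`, `passive`, `witness`) and the function `safe_to_take_offline` are hypothetical names introduced for illustration, and the sketch assumes a two-node CCR cluster with a file share witness.

```python
from typing import Set

# The three quorum participants in a two-node CCR cluster that uses a
# Majority Node Set quorum with file share witness (labels are hypothetical).
PARTICIPANTS = frozenset({"active", "passive", "witness"})


def safe_to_take_offline(target: str, online: Set[str]) -> bool:
    """Return True if `target` can be taken offline without interrupting service."""
    if target not in PARTICIPANTS:
        raise ValueError(f"unknown participant: {target}")
    if target == "active":
        # The CMS must first be moved to the other node; shutting down the
        # active node directly is not a supported scheduled outage.
        return False
    # Only one participant may be down at a time, so all three must be
    # online before the passive node or the witness host is taken offline.
    return set(online) == PARTICIPANTS


if __name__ == "__main__":
    everything_up = {"active", "passive", "witness"}
    print(safe_to_take_offline("passive", everything_up))
    print(safe_to_take_offline("active", everything_up))
```

In practice the inputs to such a check would come from Cluster Administrator or the Get-ClusteredMailboxServerStatus cmdlet, as described above.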
Scheduled outages are explicitly initiated by an administrator to either recover from a failure or to perform a maintenance operation. A scheduled outage allows the system to move the CMS from the active node to the passive node (thereby making the passive node the new active node) and mount the replicated databases and storage groups. After being mounted, the databases become the source of all subsequent updates for any further replication. The two copies switch replication roles: the copy that previously received and applied database changes now produces them, and the copy that previously produced them now receives and applies them.
Because CCR uses asynchronous replication, the transition of the active CMS from one node in the cluster to another node in the cluster requires coordination between the cluster and replication support. CCR implements this coordination. The administrator initiates a scheduled outage by using the Move-ClusteredMailboxServer cmdlet in the Exchange Management Shell.
Note: Performing this operation results in a brief interruption in service. In addition, any backups of any storage groups on the CMS are stopped.
The Move-ClusteredMailboxServer cmdlet verifies that the passive node has a valid and healthy copy. In addition, it checks to make sure that the copy is relatively current. If the copy is not relatively current, the outage could be extended while replication completes. If these checks are successful, the transition is initiated. In the absence of a failure during the move, the task completes when the CMS is running on the selected node, and all databases are mounted. Failures can occur during this process that can prevent the transition, or affect whether all databases are automatically mounted. If this occurs, the unscheduled outage behavior takes over.
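The checks described above can be sketched as a small model. The following Python is illustrative only and does not reproduce the cmdlet's implementation: the `PassiveCopy` type, the `precheck_move` function, and the `max_logs_behind` threshold are all hypothetical, because the source does not state how "relatively current" is measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PassiveCopy:
    """Hypothetical snapshot of one storage group copy on the passive node."""
    name: str
    healthy: bool      # the copy is valid and healthy
    logs_behind: int   # logs not yet copied from the active node


def precheck_move(copies: List[PassiveCopy], max_logs_behind: int = 10) -> List[str]:
    """Return the reasons a move should not start; an empty list means proceed.

    max_logs_behind is an assumed stand-in for "relatively current".
    """
    problems = []
    for copy in copies:
        if not copy.healthy:
            problems.append(f"{copy.name}: copy is not healthy")
        elif copy.logs_behind > max_logs_behind:
            problems.append(f"{copy.name}: copy is {copy.logs_behind} logs behind")
    return problems


if __name__ == "__main__":
    copies = [PassiveCopy("SG1", True, 2), PassiveCopy("SG2", True, 40)]
    for problem in precheck_move(copies):
        print(problem)
```

Returning a list of reasons rather than a single Boolean mirrors the idea that each storage group copy is evaluated on its own before the transition is initiated.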
Under some conditions, scheduled outages are used to recover partially failed servers. An example is a server with a corrupted database or log files. In this case, the logic to send logs through the replication system blocks the Move-ClusteredMailboxServer cmdlet. The administrator has a simple option to manage this scenario. The administrator dismounts the problematic databases and issues the Move-ClusteredMailboxServer command with an option that attempts to copy logs associated with dismounted databases, but does not fail the move if all logs cannot be copied. The result is that the recovery, even of a corrupted storage group, is easily accomplished with the Move-ClusteredMailboxServer cmdlet.
The Move-ClusteredMailboxServer cmdlet allows an administrator to record the reason for initiating the move. This reason is placed in the event log. The command also forces the administrator to specify the node that is to host the CMS. This prevents administrators from erroneously moving the CMS when it is already correctly hosted.
The Cluster.exe command-line management interface and the Cluster Administrator graphical user interface (GUI) both include the ability to move a CMS. Using these methods triggers replication flushing logic. However, we recommend that you do not use these methods for the following reasons:
These methods do not validate the health or state of the passive copy. Thus, their use can result in an extended outage while the node performs the operations necessary to make the database mountable.
These methods may also leave a database offline indefinitely because the replication is in a broken condition.
The process for restoring a CCR environment after an active node's scheduled outage is to restart the node. The replication is automatically started at system startup. There are two cases to consider:
Scheduled outage was completely successful, showing no failures during the transition associated with the scheduled outage, and all databases automatically came online. In this case, the administrator performed the scheduled outage in a way that made sure both nodes had consistent storage groups and databases. The result is the node can come up and immediately continue to replicate. The copy can be brought current by replaying logs. No special action is required.
Scheduled outage was only partially successful, or some databases were corrupted prior to the scheduled outage. In this case, the scheduled outage was unable to make sure that all logs on the source were made available to the target before it mounted its databases. Typically, this situation occurs as a result of a failure either before the scheduled outage or late in the scheduled outage operation. Therefore, the source and target databases are not consistent. In some cases, CCR can automatically recover from the inconsistencies. If this is the case, replication starts and processes any available logs. If the replication is unable to automatically recover, it marks the copy as broken and generates an event indicating the problem. Assuming the storage is viable, the primary recovery action is to reseed the copy. For more information about the procedure to correct these issues, see How to Seed a Cluster Continuous Replication Copy.
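The two cases above reduce to a short decision sequence. This Python sketch only restates that sequence; the `RecoveryAction` enum and the `action_after_restart` function are hypothetical names, and the two Boolean inputs stand in for conditions that CCR determines internally.

```python
from enum import Enum


class RecoveryAction(Enum):
    RESUME_REPLICATION = "replication continues; no special action is required"
    AUTO_RECOVER = "replication starts and processes any available logs"
    RESEED = "copy is marked as broken; reseed the copy"


def action_after_restart(copies_consistent: bool, auto_recoverable: bool) -> RecoveryAction:
    """Pick the outcome for a node restarted after a scheduled outage."""
    if copies_consistent:
        # Case 1: the scheduled outage was completely successful.
        return RecoveryAction.RESUME_REPLICATION
    if auto_recoverable:
        # Case 2a: the copies diverged, but CCR can recover automatically.
        return RecoveryAction.AUTO_RECOVER
    # Case 2b: an event is generated and, assuming the storage is viable,
    # the primary recovery action is to reseed.
    return RecoveryAction.RESEED


if __name__ == "__main__":
    print(action_after_restart(False, False).value)
```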
Unscheduled outages occur as an automatic system response to some kinds of failures. CCR focuses automatic recovery on those failures for which there is a high degree of confidence that availability will be improved and for which most environments would want automatic recovery.
An unscheduled outage allows the system to activate the Mailbox server on the passive node, thus making it active, and mount the replicated databases and storage groups. After being mounted, the databases become the source of all subsequent updates for any further replication. The two copies switch replication roles: the copy that previously received and applied database changes now produces them, and the copy that previously produced them now receives and applies them.
Because CCR uses asynchronous replication, an unscheduled outage means that some data loss will occur. At a minimum, the logs being actively written by the active server are not available to the recovery activities. CCR addresses this issue by providing administrative control of the failover behavior and providing a feature to help reclaim the bulk of the data that would likely be affected.
When a failover occurs, CCR always activates the Mailbox server on the remaining passive node. The administrative controls that CCR provides determine whether databases are then mounted on this newly active node. The default setting is Best availability. With this setting, the system automatically mounts all databases that are synchronized with the previously active production database, and it tolerates more variation between the two copies than the other settings do. Good availability brings a database online if the last generated log was replicated during the time it took to generate a new log. Lossless guarantees that the copy is not brought online unless it can be confirmed that there will be no data loss. If Lossless is used, automatic recovery occurs only when the original server is operational again and all log data is available and not corrupted.
Note: The Lossless setting can easily result in extended outages after a failure. Administrators can use the Lossless setting, and then make an explicit decision about whether to mount databases.
If one or more databases are in a condition where the setting does not automatically mount them, the administrator can still explicitly decide to mount the copy with its available content. The administrator must check the state of the copy, and then issue two commands. The first command informs the replication engine that this copy should be made to be a replication source (source of changes), that is, the copy should be mountable. The second command mounts the database.
For additional information about recovering from corruption or failures when CCR is enabled, see Managing Cluster Continuous Replication.
Administrative controls are provided to control behavior in case of an unscheduled outage. CCR provides an attribute for Mailbox servers that you can use to control unscheduled outage recovery behavior. The attribute, AutoDatabaseMountDial, has three possible values: Lossless, Good availability, and Best availability.
Lossless: Lossless is zero logs lost. When the attribute is set to Lossless, under most circumstances the system waits for the failed node to come back online before databases are mounted. Even then the failed system must return with all logs accessible and not corrupted. After the failure, the passive node is made active, and the Microsoft Exchange Information Store service is brought online. It checks to determine whether the databases can be mounted without any data loss. If they can, the databases are mounted. If not, the system periodically attempts to copy the logs. If the server returns with its logs intact, this attempt will eventually succeed, and the databases will mount. If the server returns without its logs intact, the remaining logs will not be available, and the affected databases will not mount.
Note: At any time after the failover is complete, an administrator can intercede and decide to mount using the databases and logs available on the previously passive node. This task is done using a simple two-step process. Presumably, the administrator making the decision to intercede bases the decision on an analysis of the amount of time required to get the original server operational. The consistency of the replication between the two nodes at the time of the failure, and the urgency of the clients to gain access to their server, are some factors that are part of this analysis.
Good availability: Good availability is three logs lost. Good availability provides fully automatic recovery when replication is operating normally and replicating logs at the rate they are being generated.
Best availability: Best availability is six logs lost and is the default setting. Best availability operates similarly to Good availability, but it allows automatic recovery when the replication experiences slightly more latency. Thus, the new active node might be slightly farther behind the state of the old active node after the failover, thereby increasing the likelihood that database divergence occurs, which requires a full reseed to correct.
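The three values above can be summarized as thresholds on the number of lost logs. The following Python sketch is a model of that reading, not the product's implementation: the enum member names and the `mounts_automatically` function are hypothetical, and treating each value as an inclusive maximum (`logs_lost <= threshold`) is an assumption drawn from the "zero", "three", and "six logs lost" wording.

```python
from enum import Enum


class AutoDatabaseMountDial(Enum):
    """Dial values, mapped to the number of lost logs each one is described by."""
    LOSSLESS = 0
    GOOD_AVAILABILITY = 3
    BEST_AVAILABILITY = 6  # the default setting


def mounts_automatically(dial: AutoDatabaseMountDial, logs_lost: int) -> bool:
    """Return True if a database would mount automatically after a failover.

    Assumption: the dial value is an inclusive upper bound on lost logs.
    """
    if logs_lost < 0:
        raise ValueError("logs_lost cannot be negative")
    return logs_lost <= dial.value


if __name__ == "__main__":
    for dial in AutoDatabaseMountDial:
        print(dial.name, mounts_automatically(dial, logs_lost=4))
```

Under this model, a database that fails the check is not lost; it simply waits for an administrator to decide whether to mount it with its available content.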
For more information about outage management behavior, see How to Tune Failover and Mount Settings for Cluster Continuous Replication.