Monitoring Continuous Replication

01/23/2017

Microsoft Exchange Server 2007 will reach end of support on April 11, 2017. To stay supported, you will need to upgrade. For more information, see Resources to help you upgrade your Office 2007 servers and clients.

Applies to: Exchange Server 2007 SP1, Exchange Server 2007 SP2, Exchange Server 2007 SP3

Microsoft Exchange Server 2007 Service Pack 1 (SP1) introduces new and improved capabilities for monitoring continuous replication environments. These changes improve upon the cluster reporting features in the release to manufacturing (RTM) version of Microsoft Exchange Server 2007 and include additional functionality designed for proactive monitoring of continuous replication environments. Specifically, Exchange 2007 SP1 introduces enhancements to the Get-StorageGroupCopyStatus cmdlet, adds a new cmdlet called Test-ReplicationHealth, and provides greater visibility into the loss window covered by the transport dumpster. In addition to using these cmdlets to monitor the health of continuous replication, you can also use several performance counters that are published by the Microsoft Exchange Replication service.

Improvements to the Get-StorageGroupCopyStatus Cmdlet in SP1

In Exchange 2007 RTM, there are several conditions where the status reported by Get-StorageGroupCopyStatus and the continuous replication performance counters are inaccurate or misleading:

A storage group that is not active (for example, not changing) can report its status as healthy when it might not be healthy. This situation occurs because the unhealthy condition is not detected until a log is replayed.
During replication initialization, the replication status is being evaluated and may not be accurate. When initialization completes, the status is updated.
The value of the LastLogGenerated field can be wrong when the database in the storage group is dismounted.
When there are one or more missing log files in the middle of a log stream, the passive copy continues to try to recover, causing the replication status to switch between failed and healthy states. When this happens, the replay and copy queues continue to grow.
Under rare conditions, a log can be successfully verified but still fail to replay. In this situation, the system will alternate between failed and healthy states during its attempts to recover. When this happens, the replay and copy queues continue to grow.

Exchange 2007 RTM uses the cluster database and the registry for communication between the Microsoft Exchange Replication service and Exchange management tasks. Because this communication is asynchronous, the status that the management tasks read can lag behind the actual state of replication, which produces the unreliable status described earlier.

In Exchange 2007 SP1, the preceding issues have been resolved by a redesign of the underlying mechanism used for communication between the Microsoft Exchange Replication service and Exchange management tasks. Instead of using the Cluster service or the registry, the management tasks now communicate directly with the Microsoft Exchange Replication service using remote procedure calls (RPCs).

In addition, the Get-StorageGroupCopyStatus cmdlet has been enhanced with the addition of new status information:

The Get-StorageGroupCopyStatus cmdlet reports a SummaryCopyStatus of ServiceDown when the Microsoft Exchange Replication service on the target computer is not network accessible.
The Get-StorageGroupCopyStatus cmdlet reports a SummaryCopyStatus of Initializing when the Microsoft Exchange Replication service on the target computer has not completed its initial startup checks. A new performance counter has also been created to represent this status as a Boolean.
The Get-StorageGroupCopyStatus cmdlet reports a SummaryCopyStatus of Synchronizing when it has not completed an incremental reseed.

The new states for the SummaryCopyStatus value are visible only when you use the Exchange 2007 SP1 version of the Exchange management tools. When you use the Exchange 2007 RTM version of the Exchange management tools, the status for any of the preceding states will be reported as Failed.
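The difference between the two tool versions can be sketched as a simple mapping. The following Python snippet is only an illustration of the behavior described above, not code from Exchange; the function name status_shown_by_rtm_tools is hypothetical, and the status strings are assumed to match the SummaryCopyStatus values as displayed.

```python
# Status values that only the SP1 management tools can display,
# according to the text above.
SP1_ONLY_STATUSES = {"ServiceDown", "Initializing", "Synchronizing"}


def status_shown_by_rtm_tools(summary_copy_status: str) -> str:
    """Return the status an RTM-version tool would report (hypothetical helper).

    The RTM tools do not know the SP1-only states, so each of them
    is reported as "Failed". Any other status is passed through unchanged.
    """
    if summary_copy_status in SP1_ONLY_STATUSES:
        return "Failed"
    return summary_copy_status
```

For example, a copy that the SP1 tools show as Synchronizing would appear as Failed in the RTM tools, so a Failed status seen from an RTM management workstation may be worth rechecking with the SP1 tools.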

Test-ReplicationHealth Cmdlet

Exchange 2007 SP1 introduces a new cmdlet called Test-ReplicationHealth, which is designed for proactive monitoring of continuous replication and the continuous replication pipeline. The cmdlet is designed to run locally on a Mailbox server to check the status of replication in a local continuous replication (LCR), cluster continuous replication (CCR), or standby continuous replication (SCR) environment. It is also designed to be tightly integrated with the Microsoft Operations Manager (MOM) Management Pack to provide simple, accurate information about the health of continuous replication for the Mailbox server. The checks run in order of severity, with the most critical tests first. If one of these checks fails, it is assumed that the less critical tests would also fail or are not relevant.
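The severity-ordered approach can be illustrated with a short Python sketch. This is not how the cmdlet is implemented; it only shows the general pattern of running checks from most to least critical and treating the first failure as the result that matters. The function name first_failed_check and the sample check results are hypothetical, and whether the cmdlet actually stops at the first failure is not stated in this topic.

```python
from typing import Callable, List, Optional, Tuple

# A check is a (name, function) pair; the function returns True on success.
Check = Tuple[str, Callable[[], bool]]


def first_failed_check(checks: List[Check]) -> Optional[str]:
    """Run checks in the given order and return the first failing name.

    The list is assumed to be ordered from most to least critical.
    Returns None when every check passes.
    """
    for name, run in checks:
        if not run():
            # Less critical checks are assumed to fail too or to be irrelevant.
            return name
    return None


# Hypothetical results, ordered from most to least critical.
example_checks: List[Check] = [
    ("ReplayService", lambda: True),
    ("SGCopyFailed", lambda: False),
    ("SGCopyQueueLength", lambda: True),
]
```

With the sample data, first_failed_check(example_checks) reports SGCopyFailed and never evaluates SGCopyQueueLength.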

The Test-ReplicationHealth cmdlet checks all aspects of replication, Cluster services, and storage group replication and replay status to provide a complete overview of the replication system. Specifically, when run on a node in the cluster, the Test-ReplicationHealth cmdlet performs the tests described in the following table.

Tests performed by the Test-ReplicationHealth cmdlet

Test	Description
Passive node status (PassiveNodeUp)	Verifies that the passive node has a status of Up when used in a CCR environment.
Cluster network status (ClusterNetwork)	Verifies that all cluster-managed networks found on the local node are running.
Quorum group state (QuorumGroup)	Verifies that the cluster group containing the quorum resource is healthy.
File share quorum state (FileShareQuorum)	Verifies that the value of the FileSharePath used by the Majority Node Set quorum with file share witness is reachable.
Clustered mailbox server group state (CmsGroup)	Verifies that the clustered mailbox server is healthy by confirming that all resources in the group are online.
Node state (NodePaused)	Verifies that neither of the nodes in the cluster is in a paused state.
DNS registration status (DnsRegistrationStatus)	Verifies that all cluster-managed network interfaces that have Require DNS registration to succeed set have passed Domain Name System (DNS) registration.
Replication service status (ReplayService)	Verifies that the Microsoft Exchange Replication service on the local node is healthy.
Databases mounted after failover (DBMountedFailover)	Checks to see if any databases are dismounted or failed after a failover has occurred. This test only checks for databases that have failed as a result of a failover.
Storage group copy suspended (SGCopySuspended)	Checks to see if continuous replication has been suspended for any storage groups on the clustered mailbox server.
Storage group copy failed (SGCopyFailed)	Checks to see if any storage group copies exist that are in a Failed state.
Storage group initializing (SGInitializing)	Checks to see if any storage groups are in the Initializing state.
Storage group copy queue length (SGCopyQueueLength)	Checks to see if any storage group has a replication copy queue length greater than best practice thresholds. Currently, these thresholds are: Warning: queue length is 3–5 log files. Failure: queue length is 6 or more log files.
Storage group replay queue length (SGReplayQueueLength)	Checks to see if any storage group has a replication replay queue length greater than best practice thresholds. Currently, these thresholds are: Warning: queue length is 30–59 log files. Failure: queue length is 60 or more log files.
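The queue length thresholds in the table can be expressed as a small classification routine. The following Python sketch uses the threshold values listed above; the function names and the "OK" label for lengths below the warning threshold are assumptions made for illustration and are not output values of the cmdlet.

```python
def classify_queue(length: int, warning_at: int, failure_at: int) -> str:
    """Classify a queue length against warning and failure thresholds."""
    if length < 0:
        raise ValueError("queue length cannot be negative")
    if length >= failure_at:
        return "Failure"
    if length >= warning_at:
        return "Warning"
    return "OK"  # assumed label for a length below the warning threshold


def classify_copy_queue(length: int) -> str:
    """Copy queue: warning at 3-5 log files, failure at 6 or more."""
    return classify_queue(length, warning_at=3, failure_at=6)


def classify_replay_queue(length: int) -> str:
    """Replay queue: warning at 30-59 log files, failure at 60 or more."""
    return classify_queue(length, warning_at=30, failure_at=60)
```

Keeping the thresholds as parameters makes it easy to adjust them if the best practice values change, which the word "currently" in the table suggests is possible.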

Monitoring Context for Test-ReplicationHealth

The Test-ReplicationHealth cmdlet includes a parameter called MonitoringContext, which you can use to include monitoring events and performance counters in the results of the task. This parameter is used by the Management Pack for MOM. The two possible values for this parameter are $true and $false. If you specify $true, the results include monitoring events and performance counters in addition to the information about services.

If monitoring context is specified, only the following checks are verified on an active node:

PassiveNodeUp
ClusterNetwork
QuorumGroup
FileShareQuorum
CmsGroup
NodePaused
DnsRegistrationStatus
ReplayService
DBMountedFailover

If monitoring context is specified, only the following checks are verified on a passive node:

ClusterNetwork
DnsRegistrationStatus
ReplayService
SGCopySuspended
SGCopyFailed
SGInitializing
SGCopyQueueLength
SGReplayQueueLength
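The two lists above can be captured as data, which makes it easy to see which checks apply to a given node role. The following Python sketch simply restates the lists; the names ACTIVE_NODE_CHECKS, PASSIVE_NODE_CHECKS, checks_for, and shared_checks are hypothetical and exist only for this illustration.

```python
from typing import List, Tuple

# Checks verified on an active node when monitoring context is specified.
ACTIVE_NODE_CHECKS: Tuple[str, ...] = (
    "PassiveNodeUp", "ClusterNetwork", "QuorumGroup", "FileShareQuorum",
    "CmsGroup", "NodePaused", "DnsRegistrationStatus", "ReplayService",
    "DBMountedFailover",
)

# Checks verified on a passive node when monitoring context is specified.
PASSIVE_NODE_CHECKS: Tuple[str, ...] = (
    "ClusterNetwork", "DnsRegistrationStatus", "ReplayService",
    "SGCopySuspended", "SGCopyFailed", "SGInitializing",
    "SGCopyQueueLength", "SGReplayQueueLength",
)


def checks_for(role: str) -> Tuple[str, ...]:
    """Return the checks for a node role, which must be 'active' or 'passive'."""
    if role == "active":
        return ACTIVE_NODE_CHECKS
    if role == "passive":
        return PASSIVE_NODE_CHECKS
    raise ValueError("role must be 'active' or 'passive'")


def shared_checks() -> List[str]:
    """Return the checks that appear in both lists, in active-node order."""
    return [c for c in ACTIVE_NODE_CHECKS if c in PASSIVE_NODE_CHECKS]
```

Only ClusterNetwork, DnsRegistrationStatus, and ReplayService appear in both lists. The storage group checks appear only in the passive node list, and the cluster group and failover checks appear only in the active node list.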

Performance Counters Published by the Microsoft Exchange Replication Service

The Microsoft Exchange Replication service provides performance counters that can be used to monitor the health of replication in both LCR and CCR environments. We recommend collecting and evaluating the counters discussed later in this topic to monitor and troubleshoot performance-related issues.

Recommended Microsoft Exchange Replication Service Performance Counters

The Microsoft Exchange Replication service creates an instance of the counters in the following table for each storage group copy. This enables you to independently monitor the health and performance of each storage group. You can monitor the health and status of each storage group by monitoring the ReplayQueueLength and CopyQueueLength counters under the MSExchange Replication performance object.

Note

As mentioned earlier, the Get-StorageGroupCopyStatus cmdlet also displays the values of these counters.

Counter name	Counter description
Copy Queue Exceeds Mount Threshold (CCR only)	Indicates if the copy queue length is greater than the threshold specified by the auto database mount dial. In a CCR environment, the value for this counter will be 1 if the auto database mount dial threshold is exceeded. The value will always be 0 in an LCR environment.
CopyGenerationNumber	Indicates the generation sequence number of the last log file that has been copied.
CopyNotificationGenerationNumber	Indicates the generation sequence number of the last log file known to the Microsoft Exchange Replication service.
CopyQueueLength	Indicates the number of log files waiting to be copied and inspected.
Failed	With a value of 1, indicates that continuous replication is in a Failed state for the selected instance (storage group). A value of 0 indicates that continuous replication is not in a Failed state.
Initializing	With a value of 1, indicates that continuous replication is in an Initializing state for the selected instance (storage group). This state indicates that the storage group copy is performing initial startup checks or that the Microsoft Exchange Replication service is performing an incremental reseed. A value of 0 indicates that continuous replication is not in an Initializing state.
InspectorGenerationNumber	Indicates the generation sequence number of the last log file that was inspected.
ReplayBatchSize	Indicates the number of log files that have been replayed together.
ReplayGenerationNumber	Indicates the generation sequence number of the last log file that was replayed successfully.
ReplayGenerationsComplete	Indicates the number of log files replayed in the current batch.
ReplayGenerationsPerMinute	Indicates the rate of replay (in log generations per minute) for the current batch.
ReplayGenerationsRemaining	Indicates the number of log generations remaining to be replayed in the current batch.
ReplayNotificationGenerationNumber	Indicates the generation sequence number of the last log file known to the Microsoft Exchange Replication service.
ReplayQueueLength	Indicates the number of log files waiting to be replayed.
Suspended	With a value of 1, indicates that continuous replication activity is suspended. Suspended means that log files are not being copied or replayed into the passive copy. A value of 0 indicates that continuous replication activity is not suspended.
TruncatedGenerationNumber	Indicates the generation sequence number of the last log file truncated by the Microsoft Exchange Replication service.
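The Boolean counters in the preceding table lend themselves to a simple per-instance check. The following Python sketch assumes that the counter values for one storage group copy have already been collected into a dictionary keyed by counter name; how the values are collected is outside the scope of the example, and the function name raised_flags is hypothetical.

```python
from typing import Dict, List

# Counters from the table that use 1 to signal a condition and 0 otherwise.
BOOLEAN_STATE_COUNTERS = ("Failed", "Initializing", "Suspended")


def raised_flags(counters: Dict[str, int]) -> List[str]:
    """Return the names of the Boolean state counters whose value is 1.

    'counters' holds the values for a single instance (storage group copy).
    Counters missing from the dictionary are treated as 0, which is an
    assumption made to keep the sketch short.
    """
    return [name for name in BOOLEAN_STATE_COUNTERS if counters.get(name, 0) == 1]


# Hypothetical sample values for one storage group copy.
sample = {
    "Failed": 0,
    "Initializing": 0,
    "Suspended": 1,
    "CopyQueueLength": 4,
    "ReplayQueueLength": 12,
}
```

With the sample values, raised_flags(sample) returns a list containing only Suspended. The queue length counters are numeric rather than Boolean, so they are better compared against thresholds than treated as flags.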

In addition to the counters listed in the preceding table, a counter called Seeding Finished % is published under the MSExchange Replica Seeder performance object. This counter indicates the percentage of seeding that has finished. Its value ranges from 0 to 100 percent, and it is published only for storage groups that are in the process of being seeded.