Standby Continuous Replication: Site Resilience with Standby Clustering

Applies to: Exchange Server 2007 SP1, Exchange Server 2007 SP2, Exchange Server 2007 SP3

This topic details how an organization, Contoso, Ltd., uses standby continuous replication (SCR) in a site resilience scenario. In this scenario, the primary datacenter fails and Contoso, Ltd. decides to activate the secondary datacenter. After the secondary datacenter is activated, the primary datacenter is reconfigured and eventually restored as the primary datacenter in a controlled switch.

Contoso, Ltd. has two datacenters: a primary datacenter referred to as Active Directory SITEA and a backup datacenter referred to as Active Directory SITEB. SITEA is located in Portland, Oregon, and SITEB is located in San Jose, California.

SITEA contains the following infrastructure components:

  • Directory server, DC1, which also provides secure, Active Directory-integrated Domain Name System (DNS) services

  • Client Access server, CAS1

  • Hub Transport server, HUB1

  • Clustered mailbox server, EXCMS1, running in a two-node single copy cluster (SCC), EXCLUS1, which contains NODEA and NODEB. Note that:

    • The nodes in the failover cluster are running on the Windows Server 2008 operating system, and the failover cluster is configured with a Node and Disk Majority quorum.

    • The DNS Time to Live (TTL) value for the clustered mailbox server's Network Name resource is configured for five minutes. This was configured by running the following command, and then stopping and starting the clustered mailbox server.

      Cluster.exe res "Network Name (EXCMS1)" /priv HostRecordTTL=300
      

      Note

      The HostRecordTTL property is available only on failover clusters running Windows Server 2008. This command cannot be used on failover clusters running on the Windows Server 2003 operating system.
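
The TTL setting can be checked before the clustered mailbox server is restarted. The following is a minimal sketch, assuming it is run from the Exchange Management Shell on a node of EXCLUS1 running Windows Server 2008. The stop reason text is an illustrative placeholder, not a value taken from this scenario.

```powershell
# List the private properties of the Network Name resource.
# After the change, HostRecordTTL should be shown as 300 (seconds).
Cluster.exe res "Network Name (EXCMS1)" /priv

# Stop and start the clustered mailbox server so the new TTL is used.
# The -StopReason text below is an illustrative placeholder.
Stop-ClusteredMailboxServer -Identity EXCMS1 -StopReason "Apply HostRecordTTL change" -Confirm:$false
Start-ClusteredMailboxServer -Identity EXCMS1
```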

SITEB contains the following infrastructure components:

  • Directory server, DC2, which also provides secure, Active Directory-integrated DNS services

  • Client Access server, CAS2

  • Hub Transport server, HUB2

  • Single-node cluster, DRCLUS1, which will be used as a standby failover cluster. Note that:

    • The node in this cluster, NODEC, is the only node in the standby failover cluster, and it is running Windows Server 2008. The standby failover cluster is configured with a Node Majority quorum.

      Note

      On failover clusters running Windows Server 2003, the single-node standby failover cluster would be configured with a local quorum.

    • The passive Mailbox server role is installed on NODEC using the same installation path as the nodes in EXCLUS1. This is necessary to be able to perform a server recovery using Setup /RecoverCMS as part of the SCR target activation process. To use the server recovery process, the installation path for Exchange Server must be the same on the SCR source computer and the SCR target computer. If Exchange Server is installed into %ProgramFiles%\Microsoft\Exchange Server on the SCR source computer, it must also be installed into %ProgramFiles%\Microsoft\Exchange Server on all computers that will be SCR targets for the SCR source server. If these install paths do not match, the server recovery process will fail because the install path in the registry will not match the value for the msExchInstallPath attribute of the Mailbox server object in Active Directory.

    • Using the Active Directory Users and Computers snap-in, the computer account for DRCLUS1 is configured with full permissions on the EXCMS1 computer object. This is done so that the computer account for EXCMS1 can be reset by DRCLUS1 in the event that SITEB is activated because of a failure of SITEA.

      Note

      This step is used only on failover clusters running Windows Server 2008. This is because the Cluster service runs in the security context of the local system. This step is not needed on failover clusters running Windows Server 2003 when the same Cluster service account is used for both failover clusters.
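
Because a mismatched installation path causes the server recovery to fail, it can be worth comparing the paths before a disaster occurs. The following is a rough sketch only. The registry key and value name are assumptions about where Exchange 2007 records its install path, and the remote lookup assumes the Remote Registry service is running on each node. Verify both on your own servers.

```powershell
# Assumption: Exchange 2007 records its install path in this registry value.
# Confirm the key and value name on your own servers before relying on it.
$keyPath   = "SOFTWARE\Microsoft\Exchange\v8.0\Setup"
$valueName = "MsiInstallPath"

$paths = @{}
foreach ($node in "NODEA", "NODEB", "NODEC") {
    # Needs the Remote Registry service and administrative rights on each node.
    $hklm = [Microsoft.Win32.RegistryKey]::OpenRemoteBaseKey([Microsoft.Win32.RegistryHive]::LocalMachine, $node)
    $key = $hklm.OpenSubKey($keyPath)
    if ($key -eq $null) {
        Write-Warning "$node : registry key not found"
        continue
    }
    # Normalize case and trailing backslash so equal paths compare equal.
    $paths[$node] = ([string]$key.GetValue($valueName)).TrimEnd("\").ToLower()
    $key.Close()
    $hklm.Close()
}

# Show what was found, then warn if more than one distinct path exists.
$paths
if (@($paths.Values | Sort-Object -Unique).Count -gt 1) {
    Write-Warning "Install paths differ; Setup /RecoverCMS is expected to fail."
}
```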

All servers in both Active Directory sites are configured to use the Active Directory-integrated DNS servers. The Active Directory replication interval for both Active Directory sites is configured for 15 minutes.

SCR is configured so that transaction log files are being replicated from three storage groups on EXCMS1 to SCR targets on NODEC. This was configured using the following commands:

Enable-StorageGroupCopy EXCMS1\SG1 -StandbyMachine NODEC
Enable-StorageGroupCopy EXCMS1\SG2 -StandbyMachine NODEC
Enable-StorageGroupCopy EXCMS1\SG3 -StandbyMachine NODEC

Note

To simultaneously enable all of the storage groups on EXCMS1 for SCR, run the following command: Get-StorageGroup -Server EXCMS1 | Enable-StorageGroupCopy -StandbyMachine NODEC

The health and status of SCR for each storage group was verified using the Test-ReplicationHealth and Get-StorageGroupCopyStatus cmdlets in the Exchange Management Shell. For example:

Get-StorageGroupCopyStatus EXCMS1\SG1 -StandbyMachine NODEC
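
To check all three storage groups in one pass, the same cmdlet can be run in a loop. This is a minimal sketch that assumes the storage group names used in this scenario. The property list is a selection chosen for illustration, and running Test-ReplicationHealth locally on NODEC is an assumption about where the check is performed.

```powershell
# Storage group names follow the scenario (SG1 through SG3 on EXCMS1).
foreach ($sg in "SG1", "SG2", "SG3") {
    Get-StorageGroupCopyStatus -Identity "EXCMS1\$sg" -StandbyMachine NODEC |
        Format-List Identity, SummaryCopyStatus, CopyQueueLength, ReplayQueueLength, LastInspectedLogTime
}

# Check the local replication components (assumed to be run on NODEC).
Test-ReplicationHealth
```

A copy queue or replay queue length that keeps growing between runs suggests that log files are not being copied or replayed on the SCR target.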

Moves of the clustered mailbox server have been verified as working as expected, as have backups and log truncation.

Primary Site Failure and Backup Site Activation

Suddenly, and without any warning, a major earthquake occurs in Portland. Although no people are seriously injured, there is so much damage to SITEA that critical utility services, such as power, water, and natural gas, have to be disconnected. Because it might be months before SITEA can be used by Contoso, Ltd., the decision is made to perform a manual activation of SITEB and have all messaging data and services provided from that site.

Activation of SITEB begins with verification of directory services and DNS resolution. Because SITEB already contains a directory server that is also hosting Active Directory-integrated DNS, these services are found to be healthy, current, and largely unaffected by the outage of SITEA. After directory services and DNS have been verified, the next step is to begin activation of the SCR targets and a recovery of the clustered mailbox server. This is accomplished by performing the following steps, in order:

  1. The Exchange Management Shell is opened on NODEC, and the following commands are run to prepare the SCR targets for mounting.

    Restore-StorageGroupCopy -Identity EXCMS1\SG1 -StandbyMachine NODEC -Force
    Restore-StorageGroupCopy -Identity EXCMS1\SG2 -StandbyMachine NODEC -Force
    Restore-StorageGroupCopy -Identity EXCMS1\SG3 -StandbyMachine NODEC -Force
    

    Important

    The Force parameter must always be specified when the SCR source is unavailable.

    Note

    An alternative to running three separate Restore-StorageGroupCopy commands as shown in Step 1 is to use a new script included with Microsoft Exchange Server 2007 Service Pack 1 (SP1) called GetSCRSources.ps1 and pipeline the output of the script to a single Restore-StorageGroupCopy task as follows: GetSCRSources | Restore-StorageGroupCopy -StandbyMachine $env:ComputerName -Force

  2. Use the DNS Management snap-in to remove the existing DNS record for EXCMS1 from DNS.

    Note

    This step is used only on failover clusters running Windows Server 2008. This is because the Cluster service runs in the security context of the local system. This step is not needed on failover clusters running Windows Server 2003 when the same Cluster service account is used for both failover clusters.

  3. SCR is disabled for all storage groups to prepare for the /RecoverCMS process. If SCR is not disabled for all storage groups, Setup /RecoverCMS fails. You can disable SCR by using the following commands:

    Disable-StorageGroupCopy -Identity EXCMS1\SG1 -StandbyMachine NODEC -Confirm:$False
    Disable-StorageGroupCopy -Identity EXCMS1\SG2 -StandbyMachine NODEC -Confirm:$False
    Disable-StorageGroupCopy -Identity EXCMS1\SG3 -StandbyMachine NODEC -Confirm:$False
    
  4. The clustered mailbox server (EXCMS1) is recovered by using the /RecoverCMS option for Setup. The recovery takes place on NODEC using the following command to perform the recovery:

    Setup.com /RecoverCMS /CMSName:EXCMS1 /CMSIPAddress:<IPAddress>
    

    Note the following:

    • The value for CMSIPAddress in the preceding command would likely be an IP address that is different from the original IP address for EXCMS1. This is because EXCMS1 is being recovered out-of-site. That is, it was originally hosted in SITEA, and it is being recovered to SITEB.

    • Setup /RecoverCMS will finish successfully after DNS replication has occurred and the recovery server's (NODEC) DNS cache is flushed. If Setup fails, you should use NSLookup from both the primary domain controller (PDC) and NODEC to verify that the correct IP address resolves to NODEC, and then rerun Setup after verification.

    • On failover clusters running Windows Server 2003, the computer object for EXCMS1 is reset during the Setup /RecoverCMS process. This reset needs to be replicated to the local Active Directory site for Setup to finish successfully. If the PDC is not in the local Active Directory site, make sure there are functioning Active Directory site links between the PDC and the local Active Directory site.

    • After the Setup recovery process has completed, EXCMS1 has been recovered to SITEB and it is now hosted on NODEC in a single-node SCC called DRCLUS1.

    • As a result of the clustered mailbox server recovery operation, the DNS TTL for EXCMS1 has reverted to the default value of 20 minutes. It should be set back to five minutes by running the following command, and then stopping and starting the clustered mailbox server.

      Cluster.exe res "Network Name (EXCMS1)" /priv HostRecordTTL=300
      
  5. Next, the databases in the three storage groups are mounted using the Mount-Database cmdlet.

  6. A recovery operation for other server roles (specifically, Client Access and Hub Transport) does not need to be performed in this scenario because CAS2 and HUB2 already exist in SITEB.

    Note

    Although no Client Access server needs to be recovered, the external URLs still need to be reconfigured to point to the Client Access servers in SITEB.

    Note

    This example scenario does not include recovering Edge Transport servers. If Edge Transport servers existed at SITEA and were also lost during the site failure, new Edge Transport servers would need to be brought online in SITEB and the Mail Exchanger (MX) DNS records for the Contoso Simple Mail Transfer Protocol (SMTP) domains would need to be updated to point to the new Edge Transport servers.

  7. If the Contoso organization includes additional Active Directory sites, messages will be queuing to the primary Active Directory site. After the site membership information for EXCMS1 has replicated to all other Active Directory sites, the SMTP queues holding messages destined for the primary site can be manually retried. (In the absence of a manual retry, the transport engine will automatically try the queue again in 12 hours.) This will re-categorize the messages. After the messages have been re-categorized, they will be delivered to EXCMS1 in SITEB.
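
The Exchange Management Shell portions of the activation steps above can be sketched as follows. This is an illustrative outline rather than a script to run unattended: the parts must be run separately because Setup /RecoverCMS (Step 4) happens between them. Using HUB2 as the server whose queues are retried is an assumption, because the scenario does not name the Hub Transport servers in other Active Directory sites.

```powershell
# Run on NODEC. Storage group names follow the scenario.
$storageGroups = "SG1", "SG2", "SG3"

# Step 1: make the SCR copies mountable. -Force is required because
# the SCR source in SITEA is unavailable.
foreach ($sg in $storageGroups) {
    Restore-StorageGroupCopy -Identity "EXCMS1\$sg" -StandbyMachine NODEC -Force
}

# Step 3: disable SCR so that Setup /RecoverCMS does not fail.
foreach ($sg in $storageGroups) {
    Disable-StorageGroupCopy -Identity "EXCMS1\$sg" -StandbyMachine NODEC -Confirm:$false
}

# Step 4 (Setup.com /RecoverCMS) is run at this point, outside this sketch.

# Step 5: mount the mailbox databases after EXCMS1 is online on NODEC.
Get-MailboxDatabase -Server EXCMS1 | Mount-Database

# Step 7: retry queues that are in a Retry state.
# HUB2 is an assumed example; substitute the Hub Transport server that holds the queues.
Retry-Queue -Filter { Status -eq "Retry" } -Server HUB2 -Resubmit $true
```

Get-MailboxDatabase returns only mailbox databases, so any public folder databases on EXCMS1 would need to be mounted separately.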

At this point, as long as the DNS servers used by clients and other servers have the correct IP address for EXCMS1, all clients should be able to access their mailboxes using their original access methods (for example, Microsoft Outlook Web Access, Exchange ActiveSync, and Microsoft Office Outlook).
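
To confirm that a given computer resolves EXCMS1 to its new address, the local resolver cache can be cleared and the SITEB DNS server queried directly. This is a minimal sketch that assumes DC2 is reachable by name from the computer where it is run.

```powershell
# Clear the local resolver cache so a stale record for EXCMS1 is not reused.
ipconfig /flushdns

# Ask DC2 (the SITEB DNS server) directly for the current EXCMS1 address.
nslookup EXCMS1 DC2
```

The address returned should match the value that was passed to /CMSIPAddress during the recovery.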

Primary Site Reconfiguration

Because the secondary site (SITEB) is now acting as the primary datacenter, the original primary datacenter (SITEA) must be reconfigured so that services now hosted in SITEB do not conflict with services that are brought up after SITEA is ready to be activated for use again. SITEA should be brought online in an administrator-controlled manner using the following steps:

  1. Bring directory services and DNS name resolution services online first, and do so by bringing DC1 online.

  2. After DC1 is online, bring CAS1 and HUB1 online.

    Note that after HUB1 has been brought online, an administrator should verify that any messages in its queues are delivered. If messages are stuck in the queues, they can be submitted again by using the following command.

    Retry-Queue -Identity <QueueIdentity> -Resubmit $True
    
  3. Bring online the nodes that are hosting the cluster EXCLUS1. For purposes of this scenario, NODEA is brought online first, and then NODEB.

  4. When both nodes are brought online, the clustered mailbox server will remain in an offline state. This includes all resources that make up the clustered mailbox server, most notably its Network Name resource. That resource cannot come online because EXCMS1 is already online in SITEB and using the same network name. Bringing EXCMS1 online on NODEA or NODEB would result in a name collision on the network.

  5. On the node that currently owns the resource group containing the clustered mailbox server, an administrator must clear the clustered mailbox server and its resources from the failover cluster. To do this, the administrator first removes any non-Exchange resources from the cluster group containing the clustered mailbox server. Then, the administrator runs the following command on NODEA:

    Setup.com /ClearLocalCMS /CMSName:EXCMS1
    

    Note that:

    • After the clustered mailbox server and its resources have been cleared from EXCLUS1, we recommend using Cluster Administrator or the Failover Cluster Management snap-in to verify that all clustered mailbox server resources have been removed.

    • EXCLUS1 is now a failover cluster with two passive nodes, NODEA and NODEB, which each have the passive Mailbox server role installed. At this point, there is no clustered mailbox server in EXCLUS1.

  6. Because the cluster nodes are running Windows Server 2008, after you run Setup /ClearLocalCMS, the virtual computer object (VCO) will be disabled. To re-enable the VCO, click Start, point to All Programs, point to Administrative Tools, and then click Active Directory Users and Computers. Expand the domain, expand Computers, right-click the EXCMS1 VCO, and then click Enable Account.

  7. To prepare for a controlled switch from SITEB to SITEA, NODEA will be made an SCR target for the storage groups hosted on EXCMS1 in SITEB. This is done by using the following commands on NODEC:

    Enable-StorageGroupCopy EXCMS1\SG1 -StandbyMachine NODEA
    Enable-StorageGroupCopy EXCMS1\SG2 -StandbyMachine NODEA
    Enable-StorageGroupCopy EXCMS1\SG3 -StandbyMachine NODEA
    

    Note

    If the storage used by the original failover cluster was unaffected by the failure of SITEA and if the original databases and their transaction logs for the three storage groups still exist on NODEA, it may be possible to use them for continuous replication purposes without having to perform a full reseed for each storage group on NODEA. If the existing files are not usable, or if circular logging was configured for the original clustered mailbox server, a full reseed for each storage group must be performed by running the Update-StorageGroupCopy cmdlet.
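
If a full reseed is needed, it can be sketched as follows. This is an outline under stated assumptions, not a verified procedure: it assumes the commands are run on the SCR target (NODEA), that replication must be suspended before seeding, and that the existing files on NODEA can safely be discarded.

```powershell
# Assumed to be run on NODEA, the new SCR target.
foreach ($sg in "SG1", "SG2", "SG3") {
    $id = "EXCMS1\$sg"

    # Pause replication to NODEA before replacing its copy.
    Suspend-StorageGroupCopy -Identity $id -StandbyMachine NODEA -Confirm:$false

    # Reseed the copy. -DeleteExistingFiles removes the old database and
    # log files on NODEA, so use it only if those files are not needed.
    Update-StorageGroupCopy -Identity $id -StandbyMachine NODEA -DeleteExistingFiles

    # Check the state of the copy after seeding.
    Get-StorageGroupCopyStatus -Identity $id -StandbyMachine NODEA |
        Format-List Identity, SummaryCopyStatus, CopyQueueLength
}
```

If a copy still reports a suspended state afterward, Resume-StorageGroupCopy with the same -Identity and -StandbyMachine values can be used to restart replication.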

An SCC is being used for example purposes in this scenario. If the recovery scenario instead uses a clustered mailbox server in a cluster continuous replication (CCR) environment, an additional step is recommended in which the passive node is also prepared for the controlled switch by staging it with the databases and log files. This step is done purely for optimization purposes, to eliminate the need to seed the passive storage groups after the CCR environment is moved back to SITEA. This task is performed by one of two methods:

  • Suspending continuous replication for all three SCR targets, and then copying the storage group files and databases from NODEA to the appropriate locations on NODEB.

  • Enabling NODEB as an SCR target of EXCMS1.
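
The two staging methods can be sketched as follows. This is a hedged outline: the file copy itself is left as a placeholder because the storage group paths are not given in this scenario, and resuming replication to NODEA after the copy is an assumption about how the first method would be completed. The suspend comment is illustrative.

```powershell
$storageGroups = "SG1", "SG2", "SG3"

# Method 1: pause replication to NODEA while files are copied to NODEB.
foreach ($sg in $storageGroups) {
    Suspend-StorageGroupCopy -Identity "EXCMS1\$sg" -StandbyMachine NODEA -SuspendComment "Staging NODEB" -Confirm:$false
}

# ... copy the storage group files and databases from NODEA to NODEB here ...

# Assumed follow-up: resume replication to NODEA after the copy finishes.
foreach ($sg in $storageGroups) {
    Resume-StorageGroupCopy -Identity "EXCMS1\$sg" -StandbyMachine NODEA
}

# Method 2 (alternative): let SCR stage NODEB directly.
# foreach ($sg in $storageGroups) {
#     Enable-StorageGroupCopy -Identity "EXCMS1\$sg" -StandbyMachine NODEB
# }
```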

Controlled Switch to Original Primary Site

After SITEA has been approved for usage, a manual and controlled switch of data and services from SITEB to SITEA can be performed. The steps that are performed to accomplish the controlled switch are effectively the reverse of the steps that were performed to activate SITEB:

  1. The first step is to dismount all of the databases on EXCMS1. This is done to stop transaction log file generation and prepare for activation of the SCR targets on NODEA. Databases can be dismounted using the Exchange Management Console, or by using the Dismount-Database cmdlet in the Exchange Management Shell.

  2. On NODEA, an administrator prepares all of the storage groups for mounting. This task is performed by running the following commands:

    Restore-StorageGroupCopy -Identity EXCMS1\SG1 -StandbyMachine NODEA
    Restore-StorageGroupCopy -Identity EXCMS1\SG2 -StandbyMachine NODEA
    Restore-StorageGroupCopy -Identity EXCMS1\SG3 -StandbyMachine NODEA
    

    Note that:

    • In the preceding three commands, the Force parameter is not used because the SCR source server is available. Because the Force parameter is not used, the task will automatically attempt to copy all of the log files from the SCR source.

    • After each task has completed, the administrator should verify that all log files for each storage group have been copied to NODEA, and that SCR has been disabled.

    • If NODEB was also configured as an SCR target, it needs to be disabled and restored before proceeding. In this scenario, we recommend running Restore-StorageGroupCopy first on NODEB, and then on NODEA, and then running Setup /RecoverCMS on NODEA.

  3. The clustered mailbox server (EXCMS1) on DRCLUS1 should be stopped. This task should be performed from NODEC by using the Manage Clustered Mailbox Server wizard in the Exchange Management Console, or by using the Stop-ClusteredMailboxServer cmdlet in the Exchange Management Shell.

  4. The A record for EXCMS1 should be removed from DNS using the DNS Management snap-in.

    Note

    The A record for EXCMS1 needs to be removed only when the failover cluster is running on Windows Server 2008. If the failover cluster is running on Windows Server 2003, this step does not need to be performed.

  5. The clustered mailbox server (EXCMS1) is recovered by using the /RecoverCMS option for Setup. The recovery takes place on NODEA using the following command to perform the recovery:

    Setup.com /RecoverCMS /CMSName:EXCMS1 /CMSIPAddress:<IPAddress>
    

    Note that:

    • The value for CMSIPAddress in the preceding command would likely be the original IP address for EXCMS1. This is because EXCMS1 is being recovered back to its original location.

    • After the Setup recovery process has completed, EXCMS1 has been recovered to SITEA and it is now hosted on NODEA in a two-node SCC called EXCLUS1.

    Note

    Once again, an SCC is being used for example purposes in this scenario. If the recovery scenario instead uses a clustered mailbox server in a CCR environment, additional steps may need to be performed. The /RecoverCMS operation suspends continuous replication, in this case, from NODEA to NODEB. An administrator must run Resume-StorageGroupCopy for the storage groups to again establish replication and replay activities. Then, the administrator should verify that replication activity has resumed. If the staging of NODEB as described earlier was not successful, the passive copies of the storage groups will need to be reseeded.

    • As a result of the clustered mailbox server recovery operation, the DNS TTL for EXCMS1 has reverted to the default value of 20 minutes. It should be set back to five minutes by running the following command, and then stopping and starting the clustered mailbox server:

      Cluster.exe res "Network Name (EXCMS1)" /priv HostRecordTTL=300
      
  6. The databases in the three storage groups are mounted using the Mount-Database cmdlet.

  7. A recovery operation for other server roles (namely, Client Access and Hub Transport) does not need to be performed in this scenario because CAS1 and HUB1 already exist in SITEA.

    Note

    Although no Client Access server needs to be recovered, the external URLs still need to be reconfigured to point to the Client Access servers in SITEA.

    Note

    This example scenario does not include recovering Edge Transport servers. If Edge Transport servers are in use, the Mail Exchanger (MX) DNS records for the Contoso SMTP domains would need to be updated to point to the correct Edge Transport servers.

  8. If the Contoso organization includes additional Active Directory sites, messages will be queuing at the primary Active Directory site. After the site membership information for EXCMS1 has replicated to all other Active Directory sites, the SMTP queues holding messages destined for the primary site can be manually retried. (In the absence of a manual retry, the transport engine will automatically try the queue again in 12 hours.) This will re-categorize the messages. After the messages have been re-categorized, they will be delivered to EXCMS1 in SITEA.

  9. At this point, as long as the DNS servers used by clients and other servers have the correct IP address for EXCMS1, all clients should be able to access their mailboxes using their original access methods (for example, Outlook Web Access, Exchange ActiveSync, and Outlook). In addition, after the DNS changes have replicated to SITEB, and after site membership of EXCMS1 has been replicated, the Hub Transport servers will route messages to the correct Active Directory site. An administrator can also manually force a resubmission of messages that may be in any queues on HUB1 or HUB2. This task can be performed by running the following command:

    Retry-Queue -Identity <QueueIdentity> -Resubmit $True
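
The Exchange Management Shell portions of the first three steps of the controlled switch can be sketched as follows. This is an illustrative outline under the scenario's naming. Each part is run on the computer named in its comment, and the stop reason text is a placeholder.

```powershell
$storageGroups = "SG1", "SG2", "SG3"

# Step 1: dismount the mailbox databases on EXCMS1 to stop log generation.
Get-MailboxDatabase -Server EXCMS1 | Dismount-Database -Confirm:$false

# Step 2 (run on NODEA): prepare the SCR copies for mounting. -Force is
# omitted because the SCR source is available, so the task attempts to
# copy the remaining log files from the source.
foreach ($sg in $storageGroups) {
    Restore-StorageGroupCopy -Identity "EXCMS1\$sg" -StandbyMachine NODEA
}

# Step 3 (run on NODEC): stop the clustered mailbox server on DRCLUS1.
# The -StopReason text is an illustrative placeholder.
Stop-ClusteredMailboxServer -Identity EXCMS1 -StopReason "Controlled switch to SITEA" -Confirm:$false
```

The remaining steps (removing the DNS record, running Setup /RecoverCMS on NODEA, and mounting the databases) follow as described in the list above.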
    

Reconfiguration of Backup Site

After a manual, controlled switch from SITEB to SITEA has been completed, SITEB can be returned to its operational status as a backup datacenter. This includes clearing the standby clustered mailbox server from the failover cluster in SITEB and re-enabling NODEC as an SCR target for the three storage groups on EXCMS1. This is accomplished by performing the following steps, in order:

  1. During the controlled switch, the clustered mailbox server running on DRCLUS1 in SITEB was stopped so that it could be brought up on EXCLUS1 in SITEA. After EXCMS1 is successfully put back into production on EXCLUS1, its configuration information needs to be removed from DRCLUS1. The configuration information can be cleared, and EXCMS1 can be completely removed from DRCLUS1 by removing any non-Exchange resources from the cluster group containing the clustered mailbox server, and then running the following command:

    Setup.com /ClearLocalCMS /CMSName:EXCMS1
    
  2. Because the cluster node is running Windows Server 2008, after you run Setup /ClearLocalCMS, the VCO will be disabled. To re-enable the VCO, click Start, point to All Programs, point to Administrative Tools, and then click Active Directory Users and Computers. Expand the domain, expand Computers, right-click the EXCMS1 VCO, and then click Enable Account.

  3. After the clustered mailbox server and its resources have been cleared from DRCLUS1, an administrator should use Cluster Administrator or the Failover Cluster Management snap-in to verify that all clustered mailbox server resources have been removed.

  4. SCR is configured so that transaction log files are being replicated from three storage groups on EXCMS1 to SCR targets on NODEC. This is configured using the following commands:

    Enable-StorageGroupCopy EXCMS1\SG1 -StandbyMachine NODEC
    Enable-StorageGroupCopy EXCMS1\SG2 -StandbyMachine NODEC
    Enable-StorageGroupCopy EXCMS1\SG3 -StandbyMachine NODEC
    

    Note

    Before enabling SCR to replicate the storage groups from EXCMS1 to NODEC, you must ensure that no storage group or database path conflicts exist. You must also ensure that any old and unneeded storage group and database files have been removed from the original paths.

  5. The health and status of SCR for each storage group is then verified using the Test-ReplicationHealth and Get-StorageGroupCopyStatus cmdlets. Moves of the clustered mailbox server between nodes, as well as backups and log truncation operations, should also be verified as working as expected. After all verifications are complete, the primary datacenter and the secondary datacenter are back to their original operating modes in terms of the Exchange 2007 messaging system.
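
The move verification in the last step can be sketched as follows. This is a minimal example that assumes EXCMS1 is currently online on NODEA and that a brief interruption is acceptable while it moves. The move comment text is an illustrative placeholder.

```powershell
# Move the clustered mailbox server to the other node of EXCLUS1.
# The -MoveComment text is an illustrative placeholder.
Move-ClusteredMailboxServer -Identity EXCMS1 -TargetMachine NODEB -MoveComment "Post-switch verification"

# Confirm that EXCMS1 is online and see which node now hosts it.
Get-ClusteredMailboxServerStatus -Identity EXCMS1

# Confirm that SCR to NODEC is still healthy after the move.
foreach ($sg in "SG1", "SG2", "SG3") {
    Get-StorageGroupCopyStatus -Identity "EXCMS1\$sg" -StandbyMachine NODEC |
        Format-List Identity, SummaryCopyStatus
}
```

Running the same move with -TargetMachine NODEA afterward returns EXCMS1 to its original node.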