Site Resilience Configurations

Microsoft Exchange Server 2007 will reach end of support on April 11, 2017. To stay supported, you will need to upgrade. For more information, see Resources to help you upgrade your Office 2007 servers and clients.

 

Applies to: Exchange Server 2007, Exchange Server 2007 SP1, Exchange Server 2007 SP2, Exchange Server 2007 SP3

In recent years, more businesses have recognized that messaging is fundamental to their success. For many organizations, the messaging system must be part of the business continuity plan, and site resilience must be designed into the messaging service deployment. Many site resilient solutions involve deploying backup hardware in a second datacenter, which raises the following basic questions:

  • What level of service is required after the primary datacenter fails?

  • Do users need their data or just messaging services?

  • How rapidly is data required?

  • How many users must be supported?

  • How will users access their data?

  • What is the standby datacenter activation service level agreement (SLA)?

  • How is service moved back to the primary datacenter?

  • Are the resources dedicated to the site resilience solution?

By answering these questions, you begin to shape your site resilience messaging solution. A core requirement of recovery from site failure is to create a solution that gets the necessary messaging data to a backup datacenter that hosts the messaging service.

This topic provides details about several site resilient configurations for the release to manufacturing (RTM) version of Microsoft Exchange Server 2007 and Exchange 2007 Service Pack 1 (SP1). Before you begin to consider site resilience solutions, we recommend that you become familiar with the following terms:

  • Stretch cluster   A cluster configuration in which the nodes of the cluster reside in more than one datacenter. Also known as a geographically dispersed cluster.

  • Database portability   Administrative task that allows mailboxes to be retargeted at a different server when their host database is moved.

  • Stretched Active Directory site   Active Directory directory service site that contains computers from more than one datacenter (for example, an Active Directory site that spans multiple physical locations).

  • Active Directory site membership   The Active Directory site to which a computer belongs, based on the computer's primary IP address. Changing the IP address, or changing which Active Directory site contains that IP address, changes the computer's Active Directory site membership.

  • Production datacenter   The datacenter hosting the active servers of a service and its associated infrastructure.

  • Hot backup datacenter   A backup datacenter that is immediately ready to take ownership of the service and continue its delivery. No special configuration is required to run the service at this location.

  • Warm backup datacenter   A backup datacenter that has servers available to take ownership of the service for the production datacenter. Activation of the service in this datacenter requires manual intervention.

  • Cold backup datacenter   A backup datacenter that has the capacity and potentially the infrastructure to take ownership of the service. Significant effort is required before the service is operational in the datacenter.

  • Dedicated   Servers that are designated to only support the users of the primary datacenter.

  • Non-Dedicated   Servers that are supporting the users of the primary datacenter, as well as users in other locations.

Terms such as production, warm, and dedicated can be combined to describe a site resilient deployment. For example, a production datacenter that is backed up by a dedicated and largely configured backup datacenter would be called Production:Warm (Dedicated).
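
The naming convention can be illustrated with a small Python sketch. The class and enum names are hypothetical and exist only for illustration; they are not part of Exchange.

```python
from dataclasses import dataclass
from enum import Enum


class Readiness(Enum):
    """Datacenter roles taken from the list of terms above."""
    PRODUCTION = "Production"
    HOT = "Hot"
    WARM = "Warm"
    COLD = "Cold"


@dataclass(frozen=True)
class Deployment:
    """Hypothetical model of a site resilient deployment (illustration only)."""
    primary: Readiness
    backup: Readiness
    dedicated: bool

    def label(self) -> str:
        # Combine the terms in the form used in this topic,
        # for example "Production:Warm (Dedicated)".
        usage = "Dedicated" if self.dedicated else "Non-Dedicated"
        return f"{self.primary.value}:{self.backup.value} ({usage})"


warm_dedicated = Deployment(Readiness.PRODUCTION, Readiness.WARM, dedicated=True)
print(warm_dedicated.label())
```

The same model yields the other labels used later in this topic, such as Production:Production (Non-Dedicated), by changing the backup role and the dedicated flag.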

Features that Support Site Resilience

Several Exchange 2007 features can be used as building blocks for a site resilience solution:

  • Stretch clusters, which can be used to replicate data or simplify activation of the backup datacenter.

  • Database portability, which can be used to activate replicated data.

  • Stretched Active Directory sites, which can be used to support stretched clusters or to enable a backup datacenter.

  • Changing a computer's Active Directory site membership, which can be performed as part of activating a backup datacenter.

  • Regular tape backups in conjunction with offsite storage, which can be used to recover mailbox data in the backup datacenter.

In addition, third-party products offer data replication, which can be used to transfer data to a backup datacenter. These products can be used in conjunction with stand-alone servers, recovery clusters, or a stretched single copy cluster (SCC). In these configurations, data from the primary server or cluster is replicated to a second server or cluster configuration in a second datacenter. When a site failure occurs, the cluster or server in the second datacenter is manually activated.

In Exchange 2007 SP1, a new feature called standby continuous replication (SCR) has been added, which is specifically designed for site resilience scenarios. As its name implies, SCR is designed for scenarios that use or enable the use of standby recovery servers. SCR extends the existing continuous replication features found in Exchange 2007 RTM and enables new data availability scenarios for Mailbox servers running Exchange 2007 SP1. SCR uses the same log shipping and replay technology used by local continuous replication (LCR) and cluster continuous replication (CCR) to provide added deployment options and configurations.

SCR enables a separation of high availability (which comprises service and data availability) and site resilience. For example, SCR can be combined with CCR to replicate storage groups locally in a primary datacenter (using CCR for high availability) and remotely in a secondary or backup datacenter (using SCR for site resilience). The secondary datacenter could contain a passive node in a failover cluster that hosts the SCR targets. This type of cluster is called a standby cluster because it does not contain any clustered mailbox servers, but it can be quickly provisioned with a replacement clustered mailbox server in a recovery scenario. If the primary datacenter fails or is otherwise lost, the SCR targets can be quickly activated on the standby cluster that hosts them.

For more information about SCR, see Standby Continuous Replication.

Solutions to Achieve Site Resilience

An organization can consider several site resilience solutions. The remainder of this topic provides information about the following site resilience solutions:

  • Production:Cold (Dedicated)

  • Production:Warm (Dedicated)

  • Production:Warm (Non-Dedicated) with two Active Directory sites

  • Production:Production (Non-Dedicated) with one Active Directory site

The solutions described in this topic assume the complete messaging infrastructure is lost when the production datacenter fails. The backup datacenter must have Internet connectivity and all necessary services to host Exchange. In addition, your activation processes should be scripted and regularly tested.

Production:Cold (Dedicated)

The most basic messaging site resiliency solution is one where the organization has contracts in place for hardware and facilities, but it does not have an active backup datacenter. All mailbox data is regularly backed up and moved off site. Active Directory data is handled in a similar way. Activating the site resilience solution requires that hardware be acquired and deployed. To shorten the overall outage time, the organization can have rapid delivery contracts with hardware vendors for the critical pieces of hardware.

A variation of this solution is to establish a relationship with a disaster recovery vendor who can make the hardware available from a pool that the vendor maintains. This type of relationship may permit the backup data to be maintained at the vendor's location to shorten recovery time. Dedicated storage at the vendor's location can serve as the replication target for mailbox and Active Directory data.

For simplicity, it is likely that the deployed configuration will eventually look similar to the production environment, or at least to part of it. In the midst of a recovery process like this, it is best to work with as many familiar technologies and dependencies as possible.

Production:Warm (Dedicated)

In the Production:Warm (Dedicated) recovery model, the production datacenter has a designated backup datacenter with dedicated equipment. The dedicated equipment is used when the production datacenter becomes unavailable. As previously mentioned, the backup datacenter is not automatically activated. The administrator must manually trigger its activation. When triggered, the activation reconfigures the dedicated backup equipment and infrastructure to provide the messaging service. The following figure illustrates a Production:Warm (Dedicated) configuration.

[Figure: Example of a Production:Warm (Dedicated) deployment]

The preceding figure shows the production datacenter (A) hosting Edge Transport, Hub Transport, Client Access, and Mailbox server roles. The warm backup datacenter (B) has dedicated backup servers for each role and for Active Directory. The figure illustrates that simple redundancy is used for all server roles except the Mailbox server role. Mailbox redundancy is handled by a cluster or a standby server configuration with an appropriate replication solution.

The possible mailbox redundancy solutions are:

  • Cluster continuous replication (CCR) in a stretch cluster configuration   CCR uses log shipping to create and manage a second copy of the mailbox data. In a stretch configuration, the two-node CCR cluster has one node in each datacenter. In this configuration, the Windows Cluster service requires subnets that are stretched between the two locations. The stretch cluster allows the clustered mailbox server to fail over simply by registering its assigned IP address again on the node in the other datacenter.

  • Single copy cluster (SCC) with synchronous partner replication   The partner replication allows the system to have two copies of the Mailbox server data. As with CCR, a stretched subnet is required for cluster failover to be successful.

  • Standby cluster with partner replication   Mailbox data is replicated to a second cluster in the backup datacenter, and the server disaster recovery process is used to restore service. Replication can be synchronous or asynchronous. No stretch cluster is required, and there is no stretched subnet requirement.

  • Standby server with partner replication   Mailbox data is replicated to a second server in the backup datacenter, and either database portability or the server disaster recovery process is used to restore service. Replication can be synchronous or asynchronous. No clustering is required, and there is no stretched subnet requirement.

  • Local continuous replication (LCR) with second copy hosted in second datacenter   This isn’t a preferred solution, but it may be sufficient for some organizations. In this configuration, Internet SCSI (iSCSI)-based storage is used to store the passive copy of the data. The network characteristics of the connection must allow the passive copy to remain reasonably consistent with the active copy. In this configuration, LCR is unavailable for rapid local activation because it is unlikely that network latency and bandwidth will support client access.

The preceding figure illustrates one of the clustered solutions, which is why the Mailbox server is shown in the production datacenter's Active Directory site. In a clustered solution, the networks on each node in the cluster must be on the same subnet. In a non-clustered solution, a single subnet is recommended but not required; you can use a different subnet if necessary.
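
The same-subnet requirement for clustered solutions can be checked with a short Python sketch. The node addresses and the prefix length are hypothetical values chosen for illustration, and the function is not part of any Exchange or Windows tool.

```python
import ipaddress


def same_subnet(node_addresses, prefix_length):
    """Return True if every node address falls in one subnet of the given prefix length."""
    networks = {
        ipaddress.ip_interface(f"{address}/{prefix_length}").network
        for address in node_addresses
    }
    return len(networks) == 1


# Hypothetical addresses for NodeA (Datacenter A) and NodeB (Datacenter B).
stretched = same_subnet(["10.1.0.11", "10.1.0.12"], 24)
separate = same_subnet(["10.1.0.11", "10.2.0.12"], 24)
print("stretched subnet:", stretched)
print("separate subnets:", separate)
```

A pair of nodes that fails this check would need a stretched subnet between the datacenters, or one of the non-clustered solutions listed above.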

Assuming a clustered solution is used, the normal course of operations would be as follows:

  1. All incoming Internet mail would flow through the Edge Transport server in Datacenter A.

  2. All mail destined for Mailbox servers in Active Directory site Redmond-Prod would be processed by the Hub Transport servers in Redmond-Prod.

  3. The clustered mailbox servers in Active Directory site Redmond-Prod would be hosted on their configured nodes in Datacenter A or Datacenter B. NodeA and NodeB are part of Redmond-Prod and are serviced by the Redmond-Prod Hub Transport and Client Access servers.

  4. Because CCR supports only two nodes, the second node must be in Datacenter B. This means that an active node failure in Datacenter A forces the clustered mailbox server to move to Datacenter B; in this case, it will still be serviced by the Hub Transport servers and Client Access servers in Datacenter A.

  5. An SCC with three servers and two copies of the data can be configured so that a failure causes the clustered mailbox server to remain in Datacenter A instead of failing over to Datacenter B. However, if the failure is storage-based, it is still necessary to activate the passive node in Datacenter B.

The network bandwidth requirements between the two datacenters have three driving factors:

  • Cluster service latency requirements   The Cluster service requires no greater than a half-second round-trip time between the cluster nodes.

  • Bandwidth requirements for replication   CCR requires less bandwidth than most third-party replication solutions because CCR replication is based on log shipping and not database copying. The bandwidth required by a CCR solution depends on a variety of factors that are typically unique to each environment, and the requirements include bandwidth for the following:

    • Log shipping

    • File system notifications, which are how the Microsoft Exchange Replication service knows when a new log file is ready for shipping

    • Directory server traffic

    • Client traffic, if the clients are not located in the same physical location as the clustered mailbox server

    • Cluster heartbeat traffic

    • Cluster database updates

    • Any other applications that use the network

  • Hub Transport and Client Access servers require LAN communication between themselves and the Mailbox servers they serve   For Client Access servers, this requirement is more important because they serve online users. Mailbox access to domain controllers can flow over a wide area network (WAN) connection, and its latency affects online MAPI access.
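
As a rough illustration of the first two factors, the following Python sketch checks a link against the half-second round-trip limit and estimates the average bandwidth needed for log shipping. It assumes the 1 MB transaction log file size used by Exchange 2007. The function names, the sample figures, and the other_traffic_mbps allowance are hypothetical, and the estimate ignores peaks, reseeding, and protocol overhead.

```python
MAX_CLUSTER_RTT_SECONDS = 0.5       # Cluster service round-trip limit stated above
LOG_FILE_SIZE_BYTES = 1024 * 1024   # assumed Exchange 2007 log file size (1 MB)


def log_shipping_mbps(logs_per_hour):
    """Average bandwidth, in megabits per second, needed to ship the generated logs."""
    bits_per_hour = logs_per_hour * LOG_FILE_SIZE_BYTES * 8
    return bits_per_hour / 3600 / 1_000_000


def link_is_adequate(rtt_seconds, link_mbps, logs_per_hour, other_traffic_mbps=0.0):
    """Check both the latency limit and the average bandwidth demand on the link."""
    latency_ok = rtt_seconds <= MAX_CLUSTER_RTT_SECONDS
    demand_mbps = log_shipping_mbps(logs_per_hour) + other_traffic_mbps
    return latency_ok and demand_mbps <= link_mbps


# Hypothetical environment: 3,600 logs per hour over a 10 Mbps link with 40 ms RTT.
print(round(log_shipping_mbps(3600), 2), "Mbps for log shipping")
print("adequate:", link_is_adequate(0.04, 10, 3600, other_traffic_mbps=1.0))
```

The other_traffic_mbps parameter stands in for the remaining items in the list (directory, client, heartbeat, and cluster database traffic), which have to be measured in each environment.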

The latency and bandwidth requirements may decrease when a non-clustered solution is deployed. The network requirements for replication remain and are significant. However, the majority of the other requirements are not present unless you envision activating the backup Mailbox server without the complete failure of Datacenter A.

When the production datacenter fails, the administrator can restore mail flow and messaging services by doing one of the following:

  • Moving the Mailbox servers in the backup datacenter into the Active Directory site Redmond-DR.

  • Moving the Hub Transport, Client Access, and directory servers in the backup datacenter into the Active Directory site Redmond-Prod.

The second option is the recommended strategy because it minimizes the impact on other parts of the environment. For example, Exchange servers in any branch offices do not need to change their perceived routing for queued mail. They simply connect when the correct servers are up and available.
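
The effect of moving servers between Active Directory sites can be sketched in Python as a lookup from subnets to sites. This is a simplified model, not an Active Directory API: the subnets and server addresses are hypothetical, and the sketch assumes that the most specific matching subnet determines site membership.

```python
import ipaddress


def site_for(address, subnet_to_site):
    """Return the site of the most specific subnet containing the address, or None."""
    ip = ipaddress.ip_address(address)
    matches = [
        (network.prefixlen, site)
        for network, site in subnet_to_site.items()
        if ip in network
    ]
    if not matches:
        return None
    # The longest prefix is the most specific match.
    return max(matches, key=lambda match: match[0])[1]


# Hypothetical subnet assignments before activation.
subnets = {
    ipaddress.ip_network("10.1.0.0/24"): "Redmond-Prod",
    ipaddress.ip_network("10.2.0.0/24"): "Redmond-DR",
}
hub_b = "10.2.0.21"  # hypothetical Hub Transport server in the backup datacenter
before = site_for(hub_b, subnets)

# Activation: add the server's own address to Redmond-Prod as a host subnet.
subnets[ipaddress.ip_network("10.2.0.21/32")] = "Redmond-Prod"
after = site_for(hub_b, subnets)
print(before, "->", after)
```

Only the servers whose addresses are added to Redmond-Prod change membership; other addresses in the backup subnet remain in Redmond-DR.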

The activation of Datacenter B follows these high-level steps:

  1. The network infrastructure is brought online.

  2. The Active Directory infrastructure is brought online.

  3. The remaining Mailbox server is brought online. This step may involve forcing the cluster to come online with the single remaining server.

  4. The Active Directory site Redmond-Prod is updated with the IP addresses of the Hub Transport, Client Access, and directory servers in Redmond-DR.

  5. The MX record for the organization's domains is updated with the IP address of the Edge Transport server in Datacenter B.

  6. The newly moved Client Access server is added to a Network Load Balancing (NLB) configuration.

  7. Datacenter A messaging service is restored in Datacenter B.

When Datacenter A is available, Datacenter B can be deactivated using these high-level steps:

  1. Datacenter A individual servers are brought online. When migrating back, allow these servers to come online normally; they will participate in providing the service unless their Exchange services are manually stopped or disabled.

  2. Allow the Hub Transport servers in Datacenter B to drain their queues, and then take them offline.

  3. Take Client Access servers in Datacenter B out of the NLB configuration. Clients then connect through the servers in Datacenter A.

  4. The MX record for the organization's domains is updated with the IP address of the Edge Transport server in Datacenter A.

  5. Perform any required networking infrastructure updates.

  6. Move the clustered mailbox servers to Datacenter A.

  7. Update Active Directory site Redmond-DR with the IP addresses of the servers that were moved during the activation.

  8. Datacenter A messaging service is restored.

As with any site failure solution, the activation of both the production and backup datacenters should be scripted and tested regularly. Using a clustered solution for the Mailbox server decreases activation times for the backup datacenter. Other solutions may depend on DNS and Active Directory replication, which can delay the point at which mail flow resumes and clients are able to access their mailboxes.
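
One way to approach scripting is to express the activation steps as an ordered runbook that can be rehearsed in a dry run. The Python sketch below is only a skeleton under that assumption: the runner, the placeholder action, and the step wording are hypothetical, and each placeholder would need to be replaced with real automation for the environment.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Run steps in order and stop at the first failure; return the completed step names."""
    completed = []
    for name, action in steps:
        if dry_run:
            print(f"[dry run] {name}")
            completed.append(name)
            continue
        print(f"[run] {name}")
        if not action():
            print(f"[failed] {name}; stopping so an operator can investigate")
            break
        completed.append(name)
    return completed


def placeholder() -> bool:
    # Stand-in for real automation; returns True to signal success.
    return True


# High-level activation steps for Datacenter B, paraphrased from this topic.
activation_steps: List[Step] = [
    ("Bring the network infrastructure online", placeholder),
    ("Bring the Active Directory infrastructure online", placeholder),
    ("Bring the remaining Mailbox server online", placeholder),
    ("Update site Redmond-Prod with the backup server IP addresses", placeholder),
    ("Update the MX record to the Edge Transport server in Datacenter B", placeholder),
    ("Add the moved Client Access server to the NLB configuration", placeholder),
]

rehearsed = run_runbook(activation_steps)
```

Stopping at the first failed step keeps later steps, such as the MX record change, from running against a partially activated datacenter.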

The Production:Warm (Dedicated) solution has the advantage that the dedicated computers provide a predictable level of service.

Production:Warm (Non-Dedicated) with Two Active Directory Sites

In the Production:Warm (Dedicated) configuration, the Edge Transport, Hub Transport, and Client Access servers in the backup datacenter are dedicated as standby resources for Datacenter A. That configuration represents a significant hardware investment that is not being fully used. An alternative model is represented in the following figure.

[Figure: Example of a Production:Warm (Non-Dedicated) deployment]

Production:Warm (Non-Dedicated) requires the administrator to manually trigger activation of the backup datacenter. When triggered, the activation process reconfigures some equipment and infrastructure in the backup datacenter to take over messaging service for the users of Datacenter A.

As with the Production:Warm (Dedicated) solution, there are two Active Directory sites in the Production:Warm (Non-Dedicated) solution. But unlike the Production:Warm (Dedicated) solution, each Active Directory site spans both datacenters. The dedicated resources in the backup datacenter have become redundant servers for a different production configuration in the backup datacenter. This approach makes these resources available for normal use, thereby creating two production datacenters that are effectively a backup for each other.

For example, as shown in the preceding figure, when Datacenter A fails, Hub Transport server 4, Client Access server 4, and Global Catalog server 4 are added to Active Directory site Redmond and, in conjunction with Redmond NodeB, deliver the messaging service to the users of Datacenter A. After the site failure, the two production environments run at reduced capacity and reduced redundancy compared with their normal state. This configuration is acceptable as long as their ongoing load can be supported. For example, all Internet mail now goes through the Edge Transport server in Datacenter B. To support an extended datacenter outage, the business can have vendor contracts that rapidly provide additional hardware when requested. The added hardware could then be used to restore redundancy or add capacity.

The normal operation of the Redmond and Dublin Active Directory site deployments would be the same for this solution as it is for the Production:Warm (Dedicated) solution. Similarly, the network bandwidth between the two locations would have the same driving factors, except that both Redmond and Dublin servers need to be concurrently supported.

Activation of the backup datacenter is done by doing one of the following:

  • Moving the active node and clustered mailbox server to the operating datacenter’s Active Directory site.

  • Moving Hub Transport, Client Access, and directory servers in the backup datacenter into the failed datacenter’s Active Directory site.

The recommended activation solution is to move the Hub Transport and Client Access servers into the failed datacenter’s Active Directory site. This solution results in the simplest and least disruptive activation.

In this solution, the recovery of the Datacenter A messaging service is accomplished by these high-level steps:

  1. The network infrastructure is brought online. It is possible that no network infrastructure changes are required because Internet mail is already being received by Datacenter B.

  2. The Active Directory infrastructure for Datacenter A is brought online (Active Directory site Redmond).

  3. The remaining Mailbox server is brought online. This step may involve forcing the cluster to come online with the single remaining server.

  4. The Active Directory site Redmond is updated with the IP addresses of Hub Transport server 4, Client Access server 4, and Global Catalog server 4.

  5. Client Access server 4 is added to the NLB configuration for Redmond.

  6. Datacenter A messaging service is restored.

When Datacenter A is available, Datacenter B can be restored to its normal configuration using these high-level steps:

  1. Datacenter A individual servers are brought online. When migrating back, allow these servers to come online normally; they will participate in providing the service unless their Exchange services are manually stopped or disabled.

  2. Allow Hub Transport server 4 to drain its queues, and then take it offline.

  3. Take Client Access server 4 out of the NLB configuration. Clients will still be able to connect to the servers in Datacenter A.

  4. Perform any required networking infrastructure updates.

  5. Move the clustered mailbox server to Datacenter A.

  6. Update Active Directory site Dublin with the IP addresses of the servers that were moved during the activation.

  7. Both datacenters are restored to their original condition.

As with any site failure solution, the activation of both the production and backup datacenters should be scripted and tested regularly. Using a clustering solution for the Mailbox server decreases activation times for the backup datacenter. Other mailbox solutions may depend on DNS and Active Directory replication, which can delay the point at which mail flow resumes and clients are able to access their mailboxes.

This solution allows the servers used for site resilience to be applied to normal operation. This may decrease the cost of the site resilience solution, but it risks not being able to sustain the complete system load when required. For example, if the load on the Hub Transport servers in Datacenter B grows to 80 percent of their capacity, and the Datacenter A load needs more than the remaining 20 percent, activating Datacenter B as the backup for Datacenter A will exceed Hub Transport capacity. With this solution, administrators must track system utilization over time to make sure that the solution remains viable. If the load increases, you will need to acquire and deploy new hardware.
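
The capacity risk can be made concrete with a small Python sketch. The load and capacity figures are hypothetical (for example, messages per minute handled by the Hub Transport servers), and the function names are illustrative only.

```python
def utilization_after_activation(local_load, failed_site_load, capacity):
    """Fraction of capacity in use once the surviving datacenter absorbs the failed site's load."""
    return (local_load + failed_site_load) / capacity


def can_absorb(local_load, failed_site_load, capacity):
    """True if the combined load fits within the surviving datacenter's capacity."""
    return utilization_after_activation(local_load, failed_site_load, capacity) <= 1.0


# Hypothetical figures: Datacenter B already runs at 80 percent of a capacity of 1,000,
# and Datacenter A contributes a load of 300.
overloaded = utilization_after_activation(800, 300, 1000)
print(f"utilization after activation: {overloaded:.0%}")
print("viable at 80 percent local load:", can_absorb(800, 300, 1000))
print("viable at 50 percent local load:", can_absorb(500, 300, 1000))
```

Running a check like this against measured utilization on a regular schedule is one way to notice when new hardware is needed before a site failure exposes the shortfall.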

Production:Production (Non-Dedicated) with One Active Directory Site

Organizations that need a solution capable of supporting automatic activation of a backup site must deploy a Production:Production (Non-Dedicated) solution. This solution deploys redundant servers in a single Active Directory site that spans both datacenters, as illustrated in the following figure.

[Figure: Example of a Production:Production (Non-Dedicated) deployment]

This solution deploys the resources of both datacenters into a single Active Directory site. Any resource in the site may be used to serve almost any request. For example, an Edge Transport server in Datacenter A may use a Hub Transport server in Datacenter B to deliver a message to a user whose mailbox is on a clustered mailbox server that is hosted in Datacenter A, so the message crosses the link between the datacenters twice. Similarly, by default there is no locality of reference for Active Directory traffic. For these reasons, this solution is not recommended.

Activation of the backup datacenter is similar to recovery of multiple server failures. Recovery from activation simply requires restoring service on the failed servers. As with the previously discussed non-dedicated solutions, poor capacity management can result in the load exceeding the service's capacity after a datacenter failure. Administrators must make sure that the solution can support the expected load after a datacenter failure. Failure to do proper capacity management can result in a complete messaging service failure after a single datacenter failure.