New High Availability and Site Resilience Functionality

Article
07/23/2014

Applies to: Exchange Server 2010 SP2

Microsoft Exchange Server 2010 reduces the cost and complexity of deploying an e-mail solution that provides the highest levels of server availability and site resilience. Building on the native replication capabilities introduced in Exchange Server 2007, the new high availability architecture in Exchange 2010 provides a simplified, unified framework for both high availability and disaster recovery. Exchange 2010 integrates high availability into the core architecture of Exchange, enabling customers of all sizes and in all segments to economically deploy a messaging continuity service in their organization.

Lessons Learned from Exchange Server 2007

Exchange 2007 decreased the costs of high availability and made site resilience much more economical by introducing new technologies such as local continuous replication (LCR), cluster continuous replication (CCR), and standby continuous replication (SCR). Still, some challenges remained:

Some administrators were intimidated by the complexity of Windows failover clustering.
Achieving a high level of uptime could require a high level of administrator intervention.
Each type of continuous replication was managed differently and separately.
Recovering from a failure of a single database on a large Mailbox server could result in a temporary disruption of service to all users on the Mailbox server.
Site resilience solutions were not seamless.
The transport dumpster feature of the Hub Transport server could only protect messages destined for mailboxes in an LCR or CCR environment. If a Hub Transport server failed while processing messages and couldn't be recovered, data loss could result.

Exchange 2010 includes significant core changes that integrate high availability deep in its architecture, making it even less costly and easier to deploy and maintain than Exchange 2007 for all customers. Organizations can now deploy a fully redundant Exchange organization with just two servers. Customers benefit from automatic, database-level failover capabilities without having to become experts in Windows failover clustering. Moreover, you can add site resilience to your existing high availability deployments with less complexity.

Exchange 2007 introduced many new architectural changes designed to make deploying high availability and site resilience solutions for Exchange faster and simpler. These improvements included an integrated Setup experience, optimized out-of-box configuration settings, and the ability to manage most aspects of the high availability solution using native Exchange management tools.

Still, management of an Exchange 2007 high availability solution required administrators to master some clustering concepts, such as the concept of moving network identities and managing cluster resources. In addition, when troubleshooting issues related to a clustered Mailbox server, administrators had to use Exchange tools and cluster tools to review and correlate logs and events from two different sources: one from Exchange and one from the cluster.

Two other limiting aspects of the Exchange 2007 architecture have also been re-evaluated and re-engineered based on customer feedback:

Clustered Exchange 2007 servers required dedicated hardware. Only the Mailbox server role could be installed on a node in the cluster. This meant that a minimum of four Exchange servers were required to achieve full redundancy of the primary components of a deployment, that is, the core server roles (Mailbox, Hub Transport, and Client Access).
In Exchange 2007, failover of a clustered Mailbox server occurred at the server level. As a result, if a single database failed, the administrator had to either fail over the entire clustered Mailbox server to another node in the cluster (which resulted in brief downtime for all users on the server, and not just those users with a mailbox on the affected database), or leave the users on the failed database offline (potentially for hours) while restoring the database from backup.

Mailbox Resiliency

Exchange 2010 has been re-engineered around the concept of mailbox resiliency, in which the architecture has changed so that automatic failover protection is now provided at the individual mailbox database level instead of at the server level. In Exchange 2010, this is known as database mobility. As a result of this and other database cache architectural changes, failover actions now complete much faster than in previous versions of Exchange. For example, failover of a clustered Mailbox server in a CCR environment running Exchange 2007 with Service Pack 2 (SP2) completes in about two minutes. By comparison, failover of a mailbox database in an Exchange 2010 environment completes in 30 seconds or less (measured from the time when the failure is detected to when a database copy is mounted, assuming an available copy that's healthy and up to date with log replay). The combination of database-level failovers and significantly faster failover times dramatically improves an organization's overall uptime.
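As a rough illustration of the uptime effect, the following Python sketch compares cumulative failover downtime using the two failover times quoted above. The number of failovers per year is a hypothetical input chosen for the example, not a figure from this topic, and the Exchange 2010 value uses the 30-second upper bound.

```python
# Failover durations quoted in this topic, in seconds.
EXCHANGE_2007_CCR_FAILOVER_S = 120  # "about two minutes"
EXCHANGE_2010_DB_FAILOVER_S = 30    # "30 seconds or less" (upper bound)


def yearly_failover_downtime(failovers_per_year, seconds_per_failover):
    """Return the total failover downtime, in seconds, for one year."""
    return failovers_per_year * seconds_per_failover


# Hypothetical assumption for illustration: 12 failovers per year.
failovers = 12
downtime_2007 = yearly_failover_downtime(failovers, EXCHANGE_2007_CCR_FAILOVER_S)
downtime_2010 = yearly_failover_downtime(failovers, EXCHANGE_2010_DB_FAILOVER_S)
print(downtime_2007, downtime_2010)  # 1440 360
```

The comparison understates the difference, because an Exchange 2007 failover affected every user on the clustered Mailbox server, whereas an Exchange 2010 failover affects only the users of the failed database.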

The mailbox resiliency architecture built into Exchange 2010 provides new benefits for organizations and their messaging administrators:

Multiple server roles can coexist on servers that provide high availability. This enables small organizations to deploy a two-server configuration that provides redundancy of mailbox data and service, while also providing redundant Client Access and Hub Transport services.
An administrator no longer needs to build a failover cluster to achieve high availability. Failover clusters are now created by Exchange 2010 in a way that's invisible to the administrator. Unlike clusters in previous versions of Exchange, which used an Exchange-provided cluster resource DLL named ExRes.dll, Exchange 2010 no longer needs or uses a cluster resource DLL. Exchange 2010 isn't a clustered application, and it uses only a small portion of the failover cluster components, namely, its heartbeat capabilities and the cluster database, to provide database mobility.
Administrators can add high availability to their Exchange 2010 environment after Exchange has been deployed, without having to uninstall Exchange and then redeploy in a highly available configuration.
Exchange 2010 provides a view of the event stream that coalesces and combines the events from the operating system with the events from Exchange.
Because storage group objects no longer exist in Exchange 2010, and because mailbox databases are portable across all Exchange 2010 Mailbox servers, it's easy to move databases when needed.

For more information, see High Availability and Site Resilience.

Flexible Mailbox Protection

Exchange 2010 includes several new features and core changes that, when deployed and configured correctly, can provide flexible mailbox protection that eliminates the need to make traditional backups of your data. Using the high availability features built into Exchange 2010 to minimize downtime and data loss in the event of a disaster can also reduce the total cost of ownership of the messaging system. By combining these features with other built-in features, such as Legal Hold, organizations can reduce or eliminate their dependency on traditional point-in-time backups and realize the cost savings of doing so.

In addition to determining whether Exchange 2010 enables you to move away from traditional point-in-time backups, we also recommend that you evaluate the cost of your current backup infrastructure. Consider the cost of end-user downtime and data loss when attempting to recover from a disaster using your existing backup infrastructure. Also, include hardware, installation and license costs, as well as the management cost associated with recovering data and maintaining the backups. Depending on the requirements of your organization, it is quite likely that a pure Exchange 2010 environment with at least three mailbox database copies will provide lower total cost of ownership than one with backups.

For more information about flexible mailbox protection, see Understanding Backup, Restore and Disaster Recovery.

Changes to High Availability from Previous Versions of Exchange

Exchange 2010 includes many changes to its core architecture. Exchange 2010 combines the key availability and resilience features of CCR and SCR into a single high availability solution that handles both onsite data replication and offsite data replication. Mailbox servers can be defined as part of a database availability group (DAG) to provide automatic recovery at the individual mailbox database level instead of at the server level. Each mailbox database can have up to 16 copies. Other new high availability concepts are introduced in Exchange 2010, such as database mobility and incremental deployment. The concepts of an organization without backups and RAID are also being introduced in Exchange 2010.

To summarize, the key aspects to data and service availability for the Mailbox server role and mailbox databases are:

Exchange 2010 uses an enhanced version of the same continuous replication technology introduced in Exchange 2007. For more information, see Changes to Continuous Replication from Exchange Server 2007 later in this topic.
Storage groups no longer exist in Exchange 2010. Instead, there are simply mailbox databases, mailbox database copies, and public folder databases. The primary management interface for Exchange databases has moved within the Exchange Management Console from the Mailbox node under Server Configuration to the Mailbox node under Organization Configuration.
Some Windows Failover Clustering technology is used by Exchange 2010, but it's now completely managed by Exchange. Administrators don't need to install, build, or configure any aspects of failover clustering when deploying highly available Mailbox servers.
Each Mailbox server can host as many as 100 databases, and each database can have as many as 16 copies.
In addition to the transport dumpster feature, a new Hub Transport server feature named shadow redundancy has been added. Shadow redundancy provides redundancy for messages for the entire time they're in transit. The solution involves a technique similar to the transport dumpster. With shadow redundancy, the deletion of a message from the transport database is delayed until the transport server verifies that all of the next hops for that message have completed delivery. If any of the next hops fail before reporting back successful delivery, the message is resubmitted for delivery to that next hop. For more information about shadow redundancy, see Understanding Shadow Redundancy.

Incremental Deployment

In previous versions of Exchange, service availability for the Mailbox server role was achieved by deploying Exchange in a Windows failover cluster. To deploy Exchange in a cluster, you had to first build a failover cluster, and then install the Exchange program files. This process created a special Mailbox server called a clustered Mailbox server (or Exchange Virtual Server in previous versions of Exchange). If you had already installed the Exchange program files on a non-clustered server and you decided you wanted a clustered Mailbox server, you had to build a cluster using new hardware, or remove Exchange from the existing server, install failover clustering, and reinstall Exchange.

Exchange 2010 introduces the concept of incremental deployment, which enables you to deploy service and data availability for all Mailbox servers and databases after Exchange is installed. Service and data redundancy is achieved by using new features in Exchange 2010 such as DAGs and database copies.

Database Availability Groups

A DAG is a set of up to 16 Mailbox servers that provide automatic database-level recovery from failures that affect individual databases. Any server in a DAG can host a copy of a mailbox database from any other server in the DAG. When a server is added to a DAG, it works with the other servers in the DAG to provide automatic recovery from failures that affect mailbox databases, such as a disk failure or server failure.
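The membership and copy limits described in this topic can be sketched as a small in-memory model. The Dag class and its method names below are hypothetical and exist only for illustration; they aren't part of any Exchange API. The sketch also assumes that every hosted copy, active or passive, counts toward the per-server database limit.

```python
MAX_DAG_MEMBERS = 16            # "a set of up to 16 Mailbox servers"
MAX_COPIES_PER_DATABASE = 16    # "each database can have as many as 16 copies"
MAX_DATABASES_PER_SERVER = 100  # "each Mailbox server can host as many as 100 databases"


class Dag:
    """Hypothetical in-memory model of a DAG, for illustration only."""

    def __init__(self, name):
        self.name = name
        self.members = set()
        self.copies = {}  # database name -> list of servers hosting a copy

    def add_member(self, server):
        if server not in self.members and len(self.members) >= MAX_DAG_MEMBERS:
            raise ValueError("a DAG can contain at most 16 Mailbox servers")
        self.members.add(server)

    def add_copy(self, database, server):
        # Any member can host a copy of a database from any other member.
        if server not in self.members:
            raise ValueError(f"{server} is not a member of {self.name}")
        hosts = self.copies.get(database, [])
        if server in hosts:
            raise ValueError(f"{server} already hosts a copy of {database}")
        if len(hosts) >= MAX_COPIES_PER_DATABASE:
            raise ValueError("a database can have at most 16 copies")
        hosted = sum(1 for servers in self.copies.values() if server in servers)
        if hosted >= MAX_DATABASES_PER_SERVER:
            raise ValueError("a Mailbox server can host at most 100 databases")
        self.copies[database] = hosts + [server]


dag = Dag("DAG1")
for server in ("EX1", "EX2"):
    dag.add_member(server)
dag.add_copy("DB1", "EX1")
dag.add_copy("DB1", "EX2")
print(dag.copies)  # {'DB1': ['EX1', 'EX2']}
```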

For more information about DAGs, see Understanding Database Availability Groups.

Mailbox Database Copies

The high availability and site resilience features first introduced in Exchange 2007 are used in Exchange 2010 to create and maintain database copies, thereby enabling you to achieve your availability goals in Exchange 2010. Exchange 2010 also introduces the new concept of database mobility, which provides Exchange-managed, database-level failovers.

Database mobility disconnects databases from servers, adds support for up to 16 copies of a single database, and provides a native experience for adding database copies to a database. In Exchange 2007, a feature called database portability also enabled you to move a mailbox database between servers. A key distinction between database portability and database mobility, however, is that with database mobility, all copies of a database have the same GUID.

Other key characteristics of database mobility are:

Because storage groups have been removed from Exchange 2010, continuous replication now operates at the database level. In Exchange 2010, transaction logs are replicated to one or more Mailbox servers and replayed into a copy of a mailbox database that's stored on those servers.
A failover is an automatic activation process that can occur at either the database level or at the server level. A switchover is a manual activation process that you can perform at the database, server, or data center (site) level.
Database names for Exchange 2010 must be unique within the Exchange organization.
When a mailbox database has been configured with one or more database copies, the full path for all database copies must be identical on all Mailbox servers that host a copy.
Any mailbox database copy (the active or any passive copy) can be backed up using an Exchange-aware Volume Shadow Copy Service (VSS)-based backup application.
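Two of the rules above, unique database names and identical copy paths, lend themselves to a simple pre-deployment check. The following Python sketch is one possible way to express them. The function names are hypothetical, and comparing names and paths without regard to case is an assumption of the sketch rather than something this topic states.

```python
import ntpath


def copy_paths_match(paths_by_server):
    """Return True if every server uses the same full path for the database copy."""
    # ntpath.normcase lowercases the path and normalizes slashes (Windows-style).
    normalized = {ntpath.normcase(path) for path in paths_by_server.values()}
    return len(normalized) <= 1


def duplicate_database_names(names):
    """Return the database names that occur more than once in the organization."""
    seen, duplicates = set(), set()
    for name in names:
        key = name.lower()  # assumption: names are compared case-insensitively
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return sorted(duplicates)


print(copy_paths_match({"EX1": r"D:\DB1\DB1.edb", "EX2": r"E:\DB1\DB1.edb"}))  # False
print(duplicate_database_names(["DB1", "DB2", "db1"]))  # ['db1']
```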

For more information about mailbox database copies, see Understanding Mailbox Database Copies.

Changes to Continuous Replication from Exchange Server 2007

The underlying continuous replication technology previously found in CCR and SCR remains in Exchange 2010, and it's been further evolved to support new high availability features such as database copies, database mobility, and DAGs. Some of these new architectural changes are briefly described as follows:

Because storage groups have been removed from Exchange 2010, continuous replication now operates at the database level. Exchange 2010 still uses an Extensible Storage Engine (ESE) database that produces transaction logs that are replicated to one or more other locations and replayed into one or more copies of a mailbox database.
Because the log replay functionality that was performed by the Microsoft Exchange Replication service in Exchange 2007 has been moved into the Exchange 2010 version of the Microsoft Exchange Information Store service (store.exe), the performance hit associated with failovers and switchovers (because a new database cache was put into use) no longer exists. When a failover or switchover occurs, the activated database has a warm cache that's ready for use.
Log shipping and seeding no longer uses Server Message Block (SMB) for data transfer. Exchange 2010 continuous replication uses a single administrator-defined TCP port for data transfer. In addition, Exchange 2010 includes built-in options for network encryption and compression for the data stream.
Log shipping no longer uses a pull model, where the passive copy pulls closed log files from the active copy. Instead, the active copy pushes the log files to each configured passive copy.
Seeding is no longer restricted to using only the active copy of the database. Passive copies of mailbox databases can now be specified as sources for database copy seeding and reseeding.
Database copies are for mailbox databases only. For redundancy and high availability of public folder databases, we recommend that you use public folder replication. Unlike CCR, where multiple copies of a public folder database couldn't exist in the same cluster, each DAG member can host a public folder database, and you can use public folder replication to replicate public folders between public folder databases hosted on DAG members.
The LogReplayer component of the Microsoft Exchange Replication service includes new logic to suspend log replay if the copy queue length increases beyond a specific threshold. If the number of logs in the copy queue is greater than the number of log files that have been copied to the passive database copy, but not inspected by the passive copy, then the Microsoft Exchange Replication service will suspend log replay for the passive copy and log Warning event 4110 in the event log. When the number of log files in the copy queue drops below the number of non-inspected copied log files, the Microsoft Exchange Replication service will resume replay for the passive copy and log Informational event 4111 in the event log.
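The replay suspension rule in the last item can be read as a small state machine. The following Python sketch models only the comparison and the two event IDs named above. The class and parameter names are hypothetical, and the sketch assumes that the state doesn't change when the two counts are equal, which this topic doesn't specify.

```python
class ReplayThrottle:
    """Hypothetical model of the log replay suspend/resume rule."""

    def __init__(self):
        self.suspended = False
        self.events = []  # event IDs logged so far

    def update(self, copy_queue_length, uninspected_copied_logs):
        """Apply the rule to the current counts and return the suspended state."""
        if not self.suspended and copy_queue_length > uninspected_copied_logs:
            self.suspended = True
            self.events.append(4110)  # Warning: log replay suspended
        elif self.suspended and copy_queue_length < uninspected_copied_logs:
            self.suspended = False
            self.events.append(4111)  # Informational: log replay resumed
        return self.suspended


throttle = ReplayThrottle()
throttle.update(copy_queue_length=10, uninspected_copied_logs=3)  # suspends replay
throttle.update(copy_queue_length=2, uninspected_copied_logs=3)   # resumes replay
print(throttle.events)  # [4110, 4111]
```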

Several concepts used in Exchange 2007 continuous replication also remain in Exchange 2010. These include the concepts of failover management, divergence, the use of the auto database mount dial, and the use of public and private networks.

Changes to Routing Behavior When Hub Transport and Mailbox are Co-Located in a DAG

When the Hub Transport server is co-located with a Mailbox server that is a member of a DAG, there are changes in routing behavior to ensure that the resiliency features in both server roles will provide the necessary protection for messages sent and received by users on that server. The Hub Transport server role was modified so that it now attempts to re-route a message for a local Mailbox server to another Hub Transport server in the same site if the Hub Transport server is also a DAG member and it has a copy of the mailbox database mounted locally. This extra hop was added in order to put the message in the transport dumpster on a different Hub Transport server.

For example, EX1 hosts the Hub Transport and Mailbox server roles and is a member of a DAG. When a message arrives in transport for EX1 that is destined for a recipient whose mailbox is also on EX1, transport will re-route the message to another Hub Transport server in the site (for example, EX2), and that server will deliver the message to the mailbox on EX1.

There is a second, similar behavior change with respect to the Microsoft Exchange Mail Submission service. This service was modified so that it prefers not to submit messages to a local Hub Transport server when the server that hosts the Mailbox and Hub Transport server roles is a member of a DAG. In this scenario, the behavior of transport is to load balance submission requests across other Hub Transport servers in the same Active Directory site, and fall back to the local Hub Transport server if there are no other available Hub Transport servers in the same site.
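The two routing preferences for a co-located DAG member can be sketched as follows. The function names are hypothetical. The sketch uses random selection as a stand-in for load balancing and assumes that delivery, like submission, falls back to the local Hub Transport server when no other server is available in the site. This topic states that fallback only for submission.

```python
import random


def choose_delivery_hub(local_hub, other_hubs_in_site, database_mounted_locally):
    """Pick the Hub Transport server that delivers a message on a DAG member.

    If the recipient's database is mounted locally, prefer another Hub
    Transport server in the site so the message lands in that server's
    transport dumpster.
    """
    if database_mounted_locally and other_hubs_in_site:
        return other_hubs_in_site[0]
    return local_hub  # assumption: fall back to the local server


def choose_submission_hub(local_hub, other_hubs_in_site, rng=random):
    """Pick the Hub Transport server for mail submission on a DAG member."""
    if other_hubs_in_site:
        # Random choice stands in for load balancing across the other servers.
        return rng.choice(other_hubs_in_site)
    return local_hub  # no other Hub Transport server is available in the site


print(choose_delivery_hub("EX1", ["EX2"], database_mounted_locally=True))  # EX2
print(choose_submission_hub("EX1", []))  # EX1
```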

End-to-End Availability

Exchange 2010 also includes many features designed to increase end-to-end availability of the system. These features include:

Transport resilience
Online move mailbox
Exchange native data protection
Incremental resync
Third Party Replication API

Transport Resilience

Exchange 2007 introduced the transport dumpster feature of the Hub Transport server. The transport dumpster maintains a queue of messages that were delivered to recipients whose mailbox was in a CCR (and in Exchange 2007 SP1, in an LCR) environment. This feature was designed to help protect against data loss by providing an administrator with the option to have a clustered Mailbox server automatically come online on another node with a limited amount of data loss. This is referred to as a lossy failover. When a lossy failover occurred, the system automatically re-delivered the recent e-mail messages sent to users on the failed clustered Mailbox server, by using the transport dumpster where the e-mail messages were still stored. Although this solution helped to minimize the amount of data lost in a lossy failover, the solution only protected from data loss within a site, and it didn't provide protection for messages in transit.

Exchange 2010 introduces core architectural changes that address both issues. Because DAGs can be stretched across Active Directory sites, it's possible for an individual mailbox database to move between Active Directory sites. Because of this design change, the transport dumpster re-delivery request upon a lossy database failover is now issued to Hub Transport servers in both the database's original and new Active Directory sites.

One other significant change to the transport dumpster is that it now receives feedback from the replication pipeline. When messages in the transport dumpster have been replicated to all mailbox database copies, they're removed from the transport dumpster. This ensures that only non-replicated data is held in the transport dumpster.

In addition to the transport dumpster feature, a new Hub Transport server feature named shadow redundancy has been added. Shadow redundancy provides redundancy for messages for the entire time they're in transit. The solution involves a technique similar to the transport dumpster. With shadow redundancy, the deletion of a message from the transport database is delayed until the transport server verifies that all of the next hops for that message have completed delivery. If any of the next hops fail before reporting back successful delivery, the message is resubmitted for delivery to that next hop. For more information about shadow redundancy, see Understanding Shadow Redundancy.
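The delayed-deletion idea behind shadow redundancy can be sketched as a small bookkeeping structure. The ShadowQueue class below is hypothetical and greatly simplified. It tracks only which next hops have not yet confirmed delivery, and it doesn't model the transport database or any network protocol.

```python
class ShadowQueue:
    """Hypothetical model: keep a message until every next hop confirms delivery."""

    def __init__(self):
        self.held = {}  # message ID -> set of next hops that haven't confirmed

    def relay(self, message_id, next_hops):
        """Record a relayed message and the hops that must confirm it."""
        self.held[message_id] = set(next_hops)

    def confirm(self, message_id, hop):
        """Mark one hop as delivered and drop the message once all hops confirm."""
        waiting = self.held.get(message_id)
        if waiting is None:
            return
        waiting.discard(hop)
        if not waiting:
            del self.held[message_id]  # deletion was delayed until this point

    def messages_to_resubmit(self, failed_hop):
        """Return the IDs of held messages still waiting on the failed hop."""
        return sorted(m for m, hops in self.held.items() if failed_hop in hops)


queue = ShadowQueue()
queue.relay("msg-1", ["HUB2", "HUB3"])
queue.relay("msg-2", ["HUB3"])
queue.confirm("msg-1", "HUB3")
print(queue.messages_to_resubmit("HUB3"))  # ['msg-2']
```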

Online Move Mailbox

Exchange 2010 includes a new feature that enables you to move mailboxes asynchronously. In Exchange 2007, when you used the Move-Mailbox cmdlet to move a mailbox, the cmdlet logged on to both the source database and the target database and moved the content from one mailbox to the other mailbox. There were several disadvantages to having the cmdlet perform the move operation:

Mailbox moves typically took hours to complete, and during the move, users weren't able to access their mailbox.
If the command window used to run the Move-Mailbox cmdlet was closed, the move was terminated and had to be restarted from the beginning.
The computer used to perform the move participated in the data transfer. If an administrator ran the cmdlet from their workstation, the mailbox data would flow from the source server to the administrator's workstation and then to the target server.

The new move request cmdlets in Exchange 2010 can be used to perform asynchronous moves. Unlike Exchange 2007, the cmdlets don't perform the actual move. The move is performed by the Microsoft Exchange Mailbox Replication Service, a new service that runs on the Client Access server. The New-MoveRequest cmdlet sends requests to the Mailbox Replication Service. For more information about online move mailbox, see Understanding Move Requests.
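The difference between the two models is that the cmdlet now only queues a request, and a separate service performs the work later. The following Python sketch illustrates that separation. The class, the function, and the status strings are hypothetical stand-ins and don't reproduce the behavior of the New-MoveRequest cmdlet or the Mailbox Replication Service.

```python
from collections import deque


class MoveService:
    """Hypothetical stand-in for a service that performs queued mailbox moves."""

    def __init__(self):
        self.queue = deque()

    def submit(self, mailbox, target_database):
        """Queue a move request and return immediately."""
        request = {"mailbox": mailbox, "target": target_database, "status": "Queued"}
        self.queue.append(request)
        return request

    def process_next(self):
        """Perform the oldest queued move (the data copy itself is not modeled)."""
        if not self.queue:
            return None
        request = self.queue.popleft()
        request["status"] = "Completed"
        return request


def new_move_request(service, mailbox, target_database):
    """Model of the cmdlet: it only sends the request and does no data transfer."""
    return service.submit(mailbox, target_database)


service = MoveService()
request = new_move_request(service, "user1", "DB2")
print(request["status"])  # Queued
service.process_next()    # runs later, independently of the caller's session
print(request["status"])  # Completed
```

Because the request lives in the service's queue rather than in the caller's session, closing the command window after submitting the request doesn't cancel the move in this model.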

Exchange Native Data Protection

There are several changes to the core architecture of Exchange 2010 that have a direct effect on how you protect your mailbox databases and the mailboxes they contain.

One significant change is the removal of storage groups. In Exchange 2010, each database is associated with a single log stream, represented by a series of 1 megabyte (MB) log files. Each server can host a maximum of 100 databases.

Another significant change for Exchange 2010 is that databases are no longer closely tied to a specific Mailbox server. Database mobility expands the system's use of continuous replication by replicating a database to multiple different servers. This provides better protection of the database and increased availability. In the case of failures, the other servers that have copies of the database can mount the database.

The ability to have multiple copies of a database hosted on multiple servers means that if you have a sufficient number of database copies, you can use these copies as your backups. For more information on this strategy, see Understanding Backup, Restore and Disaster Recovery.

Incremental Resync

Exchange 2007 introduced the concepts of lost log resilience (LLR) and incremental reseed. LLR is an internal component of ESE that enables an Exchange mailbox database to be recovered and mounted even if one or more of the most recently generated transaction log files have been lost or damaged. LLR works by delaying writes to the database file until the specified number of log generations have been created. The length of time that writes are delayed depends on how quickly logs are being generated.

Note

LLR is hard-coded to one log file for all Exchange 2010 mailbox databases.
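The delayed-write behavior of LLR can be illustrated with a simplified model. The class below is hypothetical and ignores nearly everything that ESE actually does. It shows only that an update reaches the database file after the configured number of newer log generations exists, using a depth of one log file as described in the note above.

```python
class LlrDatabase:
    """Hypothetical model of delayed database writes under LLR."""

    def __init__(self, llr_depth=1):
        self.llr_depth = llr_depth
        self.current_generation = 1
        self.pending = []   # (generation, update) logged but not yet written
        self.database = []  # updates that have been written to the database file

    def log_update(self, update):
        """Record an update in the current log generation only."""
        self.pending.append((self.current_generation, update))

    def roll_log(self):
        """Start the next log generation and write the updates that are now eligible."""
        self.current_generation += 1
        cutoff = self.current_generation - self.llr_depth
        self.database += [u for g, u in self.pending if g <= cutoff]
        self.pending = [(g, u) for g, u in self.pending if g > cutoff]


db = LlrDatabase(llr_depth=1)
db.log_update("page 7")  # recorded in generation 1
db.roll_log()            # generation 2 now exists, so generation 1 updates are written
db.log_update("page 9")  # recorded in generation 2, not yet in the database file
print(db.database)       # ['page 7']
```

In this model, losing the most recent log file (generation 2) leaves a database file that contains nothing from the lost log, which is why the database can still be mounted.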

Incremental reseed provided the ability to correct divergences in the transaction log stream between a source and target storage group, by relying on the delayed replay capabilities of LLR. Incremental reseed didn't provide a means to correct divergences in the passive copy of a database after divergent logs had been replayed, which forced the need for a complete reseed.

In Exchange 2010, incremental resync is the new name for the feature that automatically corrects divergences in database copies under the following conditions:

After an automatic failover for all of the configured copies of a database
When a new copy is enabled and some database and log files already exist at the copy location
When replication is resumed following a suspension or restarting of the Microsoft Exchange Replication service

When divergence between an active database and a copy of that database is detected, incremental resync performs the following tasks:

It searches historically in the log file stream to locate the point of divergence.
It locates the changed database pages on the diverged copy.
It reads the changed pages from the active copy and then copies the necessary log files from the active copy.
It applies the database page changes to the diverged copy.
It runs recovery on the diverged copy and replays the necessary log files into the database copy.
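The first step, searching backward through the log stream for the point of divergence, can be sketched as follows. The sketch represents each log generation by a hypothetical identifier (for example, a checksum) and assumes that once two streams diverge, every later generation differs as well. The database page handling in the later steps isn't modeled.

```python
def find_divergence_point(active_logs, copy_logs):
    """Return the index of the first generation at which the two streams differ.

    The search starts at the newest generation that both streams have and
    walks backward until it finds a matching log. If the result equals the
    length of the shorter stream, no divergence was found in the overlap.
    """
    generation = min(len(active_logs), len(copy_logs)) - 1
    while generation >= 0 and active_logs[generation] != copy_logs[generation]:
        generation -= 1
    return generation + 1


def logs_to_copy(active_logs, copy_logs):
    """Return the active copy's logs from the point of divergence onward."""
    return active_logs[find_divergence_point(active_logs, copy_logs):]


# Hypothetical log identifiers for illustration.
active = ["a1", "b2", "c3", "d4"]
diverged = ["a1", "b2", "x9", "y8"]
print(find_divergence_point(active, diverged))  # 2
print(logs_to_copy(active, diverged))           # ['c3', 'd4']
```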

Third Party Replication API

Exchange 2010 also includes a new Third Party Replication API that enables organizations to use third-party synchronous replication solutions instead of the built-in continuous replication feature. For information about partner products for Exchange 2010, see the Exchange 2010 Partners Web site. If you're a partner seeking information on the Third Party Replication API, please contact your Microsoft representative.

Features Cut from Exchange Server 2007

The following features in Exchange 2007 and Exchange 2007 SP1 no longer exist in Exchange 2010. Their replacements are noted in the table.

Feature	Replacement
Cluster continuous replication (CCR)	Database availability groups and mailbox database copies
Standby continuous replication (SCR)	Database availability groups and mailbox database copies
Local continuous replication (LCR)	Database availability groups and mailbox database copies
Single copy clusters (SCC)	Database availability groups and mailbox database copies; built-in third-party synchronous API available to replace third-party data replication used with SCC
Clustered Mailbox servers	Database availability groups and mailbox database copies
Storage groups	Databases
Recovery Storage Group	Recovery database