Understanding Database Availability Groups

07/23/2014

Applies to: Exchange Server 2010 SP3, Exchange Server 2010 SP2

A database availability group (DAG) is the base component of the high availability and site resilience framework built into Microsoft Exchange Server 2010. A DAG is a group of up to 16 Mailbox servers that hosts a set of databases and provides automatic database-level recovery from failures that affect individual servers or databases.

A DAG is a boundary for mailbox database replication, database and server switchovers, failovers, and an internal Exchange 2010 component called Active Manager. Active Manager, which runs on every server in a DAG, manages switchovers and failovers. For more information about Active Manager, see Understanding Active Manager.

Any server in a DAG can host a copy of a mailbox database from any other server in the DAG. When a server is added to a DAG, it works with the other servers in the DAG to provide automatic recovery from failures that affect mailbox databases, such as a disk failure or server failure.

Contents

Database Availability Group Lifecycle

Using a Database Availability Group for High Availability

Using a Database Availability Group for Site Resilience

Client Experience When Using Database Availability Groups

Database Availability Group Lifecycle

DAGs leverage a feature of Exchange 2010 known as incremental deployment, which is the ability to deploy service and data availability for all Mailbox servers and databases after Exchange is installed. After you deploy Exchange 2010, you can create a DAG, add Mailbox servers to the DAG, and then replicate mailbox databases between the DAG members.

Note

Creating a DAG that contains a combination of physical Mailbox servers and virtualized Mailbox servers is supported, provided that the servers and solution comply with the Exchange 2010 System Requirements. As with all Exchange high availability configurations, you must ensure that all Mailbox servers in the DAG are sized appropriately to handle the necessary workload during scheduled or unscheduled outages.

A DAG is created by using the New-DatabaseAvailabilityGroup cmdlet. A DAG is initially created as an empty object in Active Directory. This directory object is used to store relevant information about the DAG, such as server membership information. When you add the first server to a DAG, a failover cluster is automatically created for the DAG. This failover cluster is used exclusively by the DAG, and the cluster must be dedicated to the DAG. Use of the cluster for any other purpose isn't supported.

In addition to a failover cluster being created, the infrastructure that monitors the servers for network or server failures is initiated. The failover cluster heartbeat mechanism and cluster database are then used to track and manage information about the DAG that can change quickly, such as database mount status, replication status, and last mounted location.

During creation, the DAG is given a unique name, and either assigned one or more static IP addresses or configured to use Dynamic Host Configuration Protocol (DHCP). You can specify a single IP address or a comma-separated list of IP addresses by using the DatabaseAvailabilityGroupIPAddresses parameter.

This example shows a DAG that will have three servers. Two servers (EX1 and EX2) are on the same subnet (10.0.0.0), and the third server (EX3) is on a different subnet (192.168.0.0).

New-DatabaseAvailabilityGroup -Name DAG1 -DatabaseAvailabilityGroupIPAddresses 10.0.0.5,192.168.0.5
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EX1
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EX2
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EX3

Note

Setting the DatabaseAvailabilityGroupIPAddresses parameter to 0.0.0.0 configures the DAG's underlying cluster to use DHCP for its IP address resources.

The cluster for DAG1 is created when EX1 is added to the DAG. During cluster creation, the Add-DatabaseAvailabilityGroupServer cmdlet retrieves the IP addresses configured for the DAG and ignores the ones that don't match any of the subnets found on EX1. In this example, the cluster for DAG1 is created with an IP address of 10.0.0.5, and 192.168.0.5 is ignored.

Then, EX2 is added, and the Add-DatabaseAvailabilityGroupServer cmdlet again retrieves the IP addresses configured for the DAG. There are no changes to the cluster's IP addresses because EX2 is on the same subnet as EX1.

Then, EX3 is added, and the Add-DatabaseAvailabilityGroupServer cmdlet again retrieves the IP addresses configured for the DAG. Because a subnet matching 192.168.0.5 is present on EX3, the 192.168.0.5 address is added as an IP address resource in the cluster group. In addition, an OR dependency for the Network Name resource for each IP address resource is automatically configured. The 192.168.0.5 address will be used by the cluster when the cluster group moves to EX3.
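To confirm the result of the sequence above, the DAG object can be inspected from the Exchange Management Shell. The following is a minimal sketch that assumes the DAG1 example from this section; the property names in the Format-List call are the ones the cmdlet is expected to return in Exchange 2010, so verify them in your environment.

```powershell
# Show DAG membership and the IP addresses configured for the DAG.
# DAG1 is the example name used earlier in this section.
Get-DatabaseAvailabilityGroup -Identity DAG1 |
    Format-List Name,Servers,DatabaseAvailabilityGroupIpv4Addresses
```

After EX3 is added, the Servers list would be expected to contain EX1, EX2, and EX3, with both 10.0.0.5 and 192.168.0.5 listed as configured addresses.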

Windows failover clustering registers the IP addresses for the cluster in the Domain Name System (DNS) when the Network Name resource is brought online. In addition, a cluster name object (CNO) is created in Active Directory. The name, IP addresses and CNO for the cluster are used only internally by the system to secure the DAG and for internal communication purposes. Administrators and end users don't need to interface with or connect to the DAG name or IP address for any reason.

In addition to a name and one or more IP addresses, the DAG is also configured to use a witness server and a witness directory. The witness server and witness directory are either automatically specified by the system, or they can be manually specified by the administrator.
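When the witness is specified manually, the assignment can look like the following sketch. The server name HUB1 and the directory C:\DAG1 are placeholder values chosen for illustration; they are not defaults taken from this article.

```powershell
# Assign a witness server and witness directory to an existing DAG.
# HUB1 and C:\DAG1 are hypothetical placeholder values.
Set-DatabaseAvailabilityGroup -Identity DAG1 -WitnessServer HUB1 -WitnessDirectory C:\DAG1

# Review the witness settings currently stored on the DAG object.
Get-DatabaseAvailabilityGroup -Identity DAG1 |
    Format-List Name,WitnessServer,WitnessDirectory
```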

By default, a DAG is designed to use the built-in continuous replication feature to replicate mailbox databases among servers in the DAG. If you're using third-party data replication that supports the Third Party Replication API in Exchange 2010, you must create the DAG in third-party replication mode by using the New-DatabaseAvailabilityGroup cmdlet with the ThirdPartyReplication parameter. After this mode is enabled, it can't be disabled.
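A DAG in third-party replication mode could be created along the following lines. The name DAG3 is a placeholder, and because the mode can't be disabled afterward, this is a sketch to adapt rather than a command to run as-is.

```powershell
# Create a DAG in third-party replication mode.
# DAG3 is a hypothetical name; this setting is permanent for the DAG.
New-DatabaseAvailabilityGroup -Name DAG3 -ThirdPartyReplication Enabled
```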

After the DAG is created, Mailbox servers can be added to the DAG. When the first server is added to the DAG, a cluster is formed for use by the DAG. DAGs make limited use of Windows failover clustering technology, such as the cluster heartbeat, cluster networks, and the cluster database (for storing data that changes, such as database state changes from active to passive or vice versa, or from mounted to dismounted and vice versa). As each subsequent server is added to the DAG, it's joined to the underlying cluster, the cluster's quorum model is automatically adjusted by the system, and the server is added to the DAG object in Active Directory.

After Mailbox servers are added to a DAG, you can configure a variety of DAG properties, such as whether to use network encryption or network compression for database replication within the DAG. You can also configure DAG networks and create additional DAG networks.
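The property and network changes mentioned above might be applied as in this sketch. The network name ReplicationNet01 and the subnet 10.0.1.0/24 are assumptions made for illustration and don't appear elsewhere in this article.

```powershell
# Enable encryption and compression for replication traffic in the DAG.
Set-DatabaseAvailabilityGroup -Identity DAG1 -NetworkEncryption Enabled -NetworkCompression Enabled

# Create an additional DAG network dedicated to replication.
# ReplicationNet01 and 10.0.1.0/24 are hypothetical values.
New-DatabaseAvailabilityGroupNetwork -DatabaseAvailabilityGroup DAG1 -Name ReplicationNet01 -Subnets 10.0.1.0/24 -ReplicationEnabled:$true
```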

After you add members to a DAG and configure the DAG, the active mailbox databases on each server can be replicated to the other DAG members. After you create mailbox database copies, you can monitor the health and status of the copies using a variety of built-in monitoring tools. In addition, you can perform database and server switchovers.
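As an illustration of creating and monitoring a copy, the following sketch adds a copy of a database named DB1 to EX2 and then checks its health. The database name, target server, and activation preference are example values only.

```powershell
# Create a passive copy of DB1 on EX2 (example names).
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer EX2 -ActivationPreference 2

# Check the status and queue lengths for every copy of DB1.
Get-MailboxDatabaseCopyStatus -Identity DB1 |
    Format-List Name,Status,CopyQueueLength,ReplayQueueLength,ContentIndexState

# Run the built-in replication health checks against EX2.
Test-ReplicationHealth -Identity EX2
```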

For more information about creating DAGs, managing DAG membership, configuring DAG properties, creating and monitoring mailbox database copies, and performing switchovers, see Managing High Availability and Site Resilience.

Database Availability Group Quorum Models

Underneath every DAG is a Windows failover cluster. Failover clusters use the concept of quorum, which uses a consensus of voters to ensure that only one subset of the cluster members (which could mean all members or a majority of members) is functioning at one time. Quorum isn't a new concept for Exchange 2010. Highly available Mailbox servers in previous versions of Exchange also use failover clustering and its concept of quorum. Quorum represents a shared view of members and resources, and the term quorum is also used to describe the physical data that represents the configuration within the cluster that is shared between all cluster members. As a result, all DAGs require their underlying failover cluster to have quorum. If the cluster loses quorum, all DAG operations terminate and all mounted databases hosted in the DAG will dismount. In this event, administrator intervention will be required to correct the quorum problem and restore DAG operations.

Quorum is important to ensure consistency, to act as a tie-breaker to avoid partitioning, and to ensure cluster responsiveness:

Ensuring consistency: A primary requirement for a Windows failover cluster is that each of the members always has a view of the cluster that's consistent with the other members. The cluster hive acts as the definitive repository for all configuration information relating to the cluster. If the cluster hive can't be loaded locally on a DAG member, the Cluster service doesn't start, because it isn't able to guarantee that the member meets the requirement of having a view of the cluster that's consistent with the other members.
Acting as a tie-breaker: A quorum witness resource is used in DAGs with an even number of members to avoid split brain syndrome scenarios and to make sure that only one collection of the members in the DAG is considered official. When the witness server is needed for quorum, any member of the DAG that can communicate with the witness server can place a Server Message Block (SMB) lock on the witness server's witness.log file. The DAG member that locks the witness server (referred to as the locking node) retains an additional vote for quorum purposes. The DAG members in contact with the locking node are in the majority and maintain quorum. Any DAG members that can't contact the locking node are in the minority and therefore lose quorum.
Ensuring responsiveness: To ensure responsiveness, the quorum model makes sure that, whenever the cluster is running, enough members of the distributed system are operational and communicative, and at least one replica of the cluster's current state can be guaranteed. No additional time is required to bring members into communication or to determine whether a specific replica is guaranteed.

DAGs with an even number of members use the failover cluster's Node and File Share Majority quorum mode, which employs an external witness server that acts as a tie-breaker. In this quorum mode, each DAG member gets a vote. In addition, the witness server is used to provide one DAG member with a weighted vote (for example, it gets two votes instead of one). The cluster quorum data is stored by default on the system disk of each member of the DAG, and is kept consistent across those disks. The witness server doesn't store a copy of the cluster quorum data; instead, a file on the witness server is used to keep track of which member has the most updated copy of the data. In this mode, a majority of the voters (the DAG members plus the witness server) must be operational and able to communicate with each other to maintain quorum. If a majority of the voters can't communicate with each other, the DAG's underlying cluster loses quorum, and the DAG will require administrator intervention to become operational again.

DAGs with an odd number of members use the failover cluster's Node Majority quorum mode. In this mode, each member gets a vote, and each member's local system disk is used to store the cluster quorum data. If the configuration of the DAG changes, that change is reflected across the different disks. The change is only considered to have been committed and made persistent if that change is made to the disks on half the members (rounding down) plus one. For example, in a five-member DAG, the change must be made on two plus one members, or three members total.

Quorum requires a majority of voters to be able to communicate with each other. Consider a DAG that has four members. Because this DAG has an even number of members, an external witness server is used to provide one of the cluster members with a fifth, tie-breaking vote. To maintain a majority of voters (and therefore quorum), at least three voters must be able to communicate with each other. At any time, a maximum of two voters can be offline without disrupting service and data access. If three or more voters are offline, the DAG loses quorum, and service and data access will be disrupted until you resolve the problem.
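The voting arithmetic in this section can be checked with a small helper. Get-DagQuorumSummary is a hypothetical function written for this illustration, not an Exchange cmdlet; it models only the vote counts described above (one vote per member, plus one witness vote when the member count is even) and doesn't query a real cluster.

```powershell
# Hypothetical helper: models the quorum vote counts described in this section.
function Get-DagQuorumSummary {
    param(
        [Parameter(Mandatory = $true)]
        [ValidateRange(1, 16)]
        [int]$MemberCount
    )

    # An even member count uses Node and File Share Majority, which adds a witness vote.
    $usesWitness = ($MemberCount % 2) -eq 0
    $voters = if ($usesWitness) { $MemberCount + 1 } else { $MemberCount }
    $mode = if ($usesWitness) { 'Node and File Share Majority' } else { 'Node Majority' }

    # A majority is half the voters (rounding down) plus one.
    $majority = [int]([math]::Floor($voters / 2) + 1)

    New-Object PSObject -Property @{
        Members          = $MemberCount
        QuorumMode       = $mode
        Voters           = $voters
        VotesForQuorum   = $majority
        MaxVotersOffline = $voters - $majority
    }
}

# Four members plus the witness: 5 voters, 3 needed for quorum, at most 2 offline.
Get-DagQuorumSummary -MemberCount 4

# Five members, no witness vote: 5 voters, 3 needed for quorum.
Get-DagQuorumSummary -MemberCount 5
```

Both calls report three votes needed for quorum, which matches the four-member and five-member examples in the text.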


Using a Database Availability Group for High Availability

To illustrate how a DAG can provide high availability for your mailbox databases, consider the following example, which uses a DAG with five members. This DAG is illustrated in the following figure.

DAG with five members


In the preceding figure, the green databases are active mailbox database copies and the blue databases are passive mailbox database copies. In this example, the database copies aren't mirrored across each server, but rather spread across multiple servers. This ensures that no two servers in the DAG have the same set of database copies, providing the DAG with greater resilience to failures, including failures that occur while other components are unavailable as a result of regular maintenance.

Consider the following scenario, using the preceding example DAG, which illustrates resilience to multiple database and server failures.

Initially, all databases and servers are healthy. You need to install some operating system updates on EX2. You perform a server switchover, which activates the copy of DB4 on another Mailbox server. A server switchover moves all active mailbox database copies from their current server to one or more other Mailbox servers in the DAG in preparation for a scheduled outage for the current server. You can perform a server switchover quickly by running the following command in the Exchange Management Shell.

Move-ActiveMailboxDatabase -Server EX2

In this example, there's only one active mailbox database on EX2 (DB4), so only one active mailbox database copy is moved. By omitting the ActivateOnServer parameter in the preceding command, you chose to have the system select the best possible new active copy, and the system chose the copy on EX5, as shown in the following figure.
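If you prefer to choose the target yourself rather than let the system select it, the move can name the destination explicitly. This sketch assumes you want DB4 activated on EX5, the same outcome the system chose in the example.

```powershell
# Move the active copy of DB4 to a specific DAG member instead of letting the system choose.
Move-ActiveMailboxDatabase DB4 -ActivateOnServer EX5

# Confirm that EX2 no longer hosts any mounted (active) copies before starting maintenance.
Get-MailboxDatabaseCopyStatus -Server EX2 | Format-List Name,Status
```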

DAG with a server offline for maintenance


While you perform maintenance on EX2, EX3 experiences a catastrophic hardware failure and goes offline. Prior to going offline, EX3 hosted the active copy of DB2. To recover from the failure, the system automatically activates the copy of DB2 that's hosted on EX1 within 30 seconds. This is illustrated in the following figure.

DAG with a server offline for maintenance and a failed server


After the scheduled maintenance is completed for EX2, you bring the server online. As soon as EX2 is available, the other members of the DAG are notified, and the copies of DB1, DB4, and DB5 hosted on EX2 are automatically synchronized with the active copy of each database. This is illustrated in the following figure.

DAG with a restored server synchronizing its database copies


After the failed hardware component in EX3 is replaced with a new component, EX3 is brought online. After EX3 is available, the other members of the DAG are notified, and the copies of DB2, DB3, and DB4 hosted on EX3 are automatically synchronized with the active copy of each database. This is illustrated in the following figure.

DAG with a repaired server synchronizing its database copies



Using a Database Availability Group for Site Resilience

In addition to providing high availability within a datacenter, a DAG can also be extended to one or more datacenters in a configuration that provides site resilience for one or multiple datacenters. In the preceding example figures, the DAG is located in a single datacenter and single Active Directory site. Incremental deployment can be used to extend this DAG to a second datacenter (and a second Active Directory site) by deploying a Mailbox server and the necessary supporting resources (one or more Active Directory servers, and one or more Hub Transport and Client Access servers). The Mailbox server is then added to the DAG, as illustrated in the following figure.

DAG extended across two Active Directory sites


In this example, a passive copy of each active database in the Redmond datacenter is configured on EX6 in the Dublin datacenter. However, there are many other examples of DAG configurations that provide site resilience. For example:

Instead of hosting only passive database copies, EX6 could host all active copies, or it could host a mixture of active and passive copies.
In addition to EX6, multiple DAG members could be deployed in the Dublin datacenter, providing protection against additional failures. This configuration also provides additional capacity, so that if the Redmond datacenter fails, the Dublin datacenter can support a much larger user population.
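The extension to the Dublin datacenter described in this section could be scripted roughly as follows. The DAG name DAG1 and the database names DB1 through DB5 are assumptions carried over from the earlier examples; the article doesn't name the DAG in this topology. If the DAG uses static IP addresses, an address on the Dublin subnet would also need to be configured for the DAG first.

```powershell
# Add the Dublin Mailbox server to the existing DAG (DAG1 is an assumed name).
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EX6

# Create a passive copy of each database on EX6 (database names are assumed).
'DB1','DB2','DB3','DB4','DB5' | ForEach-Object {
    Add-MailboxDatabaseCopy -Identity $_ -MailboxServer EX6
}
```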

Using Multiple Database Availability Groups for Site Resilience

In the preceding example, a single DAG extends across multiple datacenters, providing site resilience for either or both datacenters. When using a single DAG to provide site resilience in an environment where each datacenter to which you extend the DAG has an active user population, there is a single point of failure in the wide area network (WAN) connection. This is because quorum requires a majority of the voters to be active and able to communicate with each other.

In the preceding example, the majority of voters are located in the Redmond datacenter. If the Dublin datacenter hosts active mailbox databases, and it has a local user population, a WAN outage would result in a messaging service outage for the Dublin users. When WAN connectivity breaks, only the DAG members in the Redmond datacenter retain quorum and continue providing messaging service.

To eliminate the WAN as a single point of failure when you need to provide site resilience for multiple datacenters that each have an active user population, you should deploy multiple DAGs, where each DAG has a majority of voters in a separate datacenter. When a WAN outage occurs, replication will be blocked until connectivity is restored. Users will have messaging service, because each DAG continues to service its local user population.


Client Experience When Using Database Availability Groups

DAGs can be used to provide both high availability and site resilience. The client experience when using a DAG depends on the type and version of the client and the protocol used by the client to access mailbox data. For example, if a cross-site database failover occurs, the behavior and reconnection logic used by a POP3 or IMAP4 client is different from the behavior and reconnection logic used by a Microsoft Outlook 2010 client.

The following sections describe the client behavior and logic in various scenarios. The behavior described assumes that:

The environment contains a single Client Access server array in each Active Directory site, and each site contains at least two Client Access servers.
An appropriate hardware-based or software-based load balancer is installed and configured in front of the Client Access server array.
Proper namespace and certificate planning and configuration are complete, including the necessary DNS records.

Microsoft Outlook Behavior and Logic

Generally, all versions of Outlook behave the same for database failovers that occur within a single datacenter and single Active Directory site. Unlike previous versions of Exchange, in Exchange 2010, Outlook no longer connects directly to the Exchange store on the Mailbox server. Instead, Outlook (and any other MAPI client) connects to the RPC Client Access and Address Book services on the Client Access server role, and the user's Outlook is configured to connect to the Client Access server array, which then connects the client to an individual Client Access server. This abstraction of the Outlook connection away from the Mailbox server provides the following benefits:

When a database failover occurs, Outlook remains connected to the same server in the Client Access server array. When this occurs, the Active Manager client running on the Client Access server learns which DAG member hosts the active database copy from the DAG's Active Manager. Then, the Client Access server connects to that Mailbox server, and Outlook indicates it's connected to the Exchange server.
If one of the Client Access servers in the Client Access server array becomes unavailable because of a scheduled or unscheduled outage, the remaining Client Access servers in that array handle the client load. Because Outlook is configured to connect to the Client Access server array and not an individual Client Access server, Client Access server array members can individually experience failures or be manually taken offline without affecting the user's Outlook profile. The array can be reconfigured automatically (for example, based on monitoring performed by the load balancer solution in front of the array), or you can reconfigure it manually.

All versions of Outlook also behave the same for datacenter switchovers that occur between two datacenters and two Active Directory sites. Datacenter switchovers involve changing the IP addresses used by client access namespaces (for example, Microsoft Office Outlook Web App, SMTP, POP3, IMAP4, Autodiscover, Exchange Web Services, or RPC Client Access) from IP addresses in the primary datacenter to IP addresses in the secondary datacenter. As a result, the namespace used in the user's Outlook profile doesn't change, and Autodiscover continues to point clients to the same Client Access server array namespace.

The behavior of Outlook after a cross-site database failover is different from its behavior after a database failover in a single Active Directory site or after a datacenter switchover.

Example Behavior for Outlook Versions

The following examples illustrate the behavior of Outlook 2010, Office Outlook 2007, and Office Outlook 2003 after a cross-site database failover occurs. The topology used in each example is a four-member DAG extended to two Active Directory sites: Redmond and Portland. The user's mailbox is hosted on DB1, which is replicated to each of the servers. In each example, the active copy of DB1 fails over from MBX2 to MBX3.

Example topology demonstrating Outlook behavior after cross-site database failover


Each client is configured with CAS1 as its home server, making Redmond the Outlook profile site. Because the clients are located in Redmond, the RPCClientAccessServer property for DB1 is configured for CAS1, making Redmond the preferred database site. Because DB1 failed on MBX2 and has become active on MBX3, Portland is the mounted database site.
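The RPCClientAccessServer value that determines the preferred database site can be read from the database object. In this sketch, the fully qualified name cas1.contoso.com is a placeholder built from the contoso.com domain used later in this article; substitute the name of your own Client Access server or array.

```powershell
# Show which Client Access server or array DB1 is associated with.
Get-MailboxDatabase -Identity DB1 | Format-List Name,RpcClientAccessServer

# Point DB1 at a specific Client Access server (cas1.contoso.com is a hypothetical name).
Set-MailboxDatabase -Identity DB1 -RpcClientAccessServer cas1.contoso.com
```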

Example for Outlook 2010 and Outlook 2007

If a Client Access server is available in the Redmond site, Outlook 2010 and Outlook 2007 will continue to connect to the RPC Client Access array in the Redmond site. The Client Access server used by the client will communicate using MAPI RPC with the user's Mailbox server in the Portland site.

If there are no Client Access servers available in the Redmond site, then a datacenter switchover from Redmond to Portland must be performed in order to restore access to service and data. For detailed steps to perform a datacenter switchover, see Datacenter Switchovers.

Example for Outlook 2003

When Outlook 2003 attempts to connect to CAS1, it receives an ecWrongServer message in response. Unlike Outlook 2010 and Outlook 2007, Outlook 2003 doesn't include the Autodiscover feature, and it must use some other means to update the user's profile. MAPI profile redirection is the mechanism used by Outlook 2003. MAPI profile redirection requires that the original source server be online. If CAS1 is unavailable, and if all other Client Access servers in the array are also unavailable (or if the array contains only CAS1), Outlook 2003 can't perform MAPI redirection or connect to the user's mailbox database without manual intervention.

Outlook Behavior and Logic When Public Folders Are Used

Although public folder databases can be hosted on Mailbox servers that are members of a DAG, they don't use continuous replication. Instead, high availability for public folder databases is accomplished by deploying multiple public folder databases and configuring them to replicate with each other through public folder replication. We recommend that you configure more than one replica of each folder. The behavior for Outlook clients reconnecting to a public folder database after a mailbox database failover depends not only on the nature of the failure, but also on your public folder replication configuration settings and the health and currency of your public folder databases.

Non-Outlook Client Behavior and Logic

The behavior of clients and protocols other than Outlook and MAPI varies based on the application being used and the failure scenario. As with Outlook, the typical Exchange applications and clients (for example, Outlook Web App, Microsoft Exchange ActiveSync, POP3, IMAP4, and Exchange Web Services) generally behave the same for database failovers that occur within a single datacenter and single Active Directory site. Similarly, all these clients and protocols (including SMTP and Windows PowerShell) behave the same as Outlook after a datacenter switchover.

If a cross-site database failover occurs, the behavior varies among these clients and protocols. The following table lists the behavior for these clients.

Cross-site database failover behavior for typical Exchange clients

Client or protocol	Behavior
Outlook Web App	Manual redirection. In this scenario, the client namespace is changing from http://mailred.contoso.com to http://mailpdx.contoso.com. After the user enters logon credentials, the user is redirected to CAS2 in the Portland site through a manual redirection page explaining that the wrong URL was used and that the correct URL is https://mailpdx.contoso.com/owa.
Exchange ActiveSync	Proxy or redirection. In this scenario, the client behavior is determined by the implementation and version of the Exchange ActiveSync protocol on the client device.
POP3 and IMAP4	Proxy. This scenario always involves Client Access server to Client Access server proxying.
Exchange Web Services	Uses Autodiscover to determine new connection endpoint.
