Understanding Database Availability Groups
Topic Last Modified: 2010-05-05
A database availability group (DAG) is the base component of the high availability and site resilience framework built into Microsoft Exchange Server 2010. A DAG is a group of up to 16 Mailbox servers that host a set of databases and provide automatic database-level recovery from failures that affect individual servers or databases.
A DAG is a boundary for mailbox database replication, for database and server switchovers and failovers, and for an internal component called Active Manager. Active Manager, which runs on every server in a DAG, is the Exchange 2010 component that manages switchovers and failovers. For more information about Active Manager, see Understanding Active Manager.
Any server in a DAG can host a copy of a mailbox database from any other server in the DAG. When a server is added to a DAG, it works with the other servers in the DAG to provide automatic recovery from failures that affect mailbox databases, such as a disk failure or server failure.
DAGs leverage a feature of Exchange 2010 known as incremental deployment, which is the ability to deploy service and data availability for all Mailbox servers and databases after Exchange is installed. After you've deployed Exchange 2010, you can create a DAG, add Mailbox servers to the DAG, and then replicate mailbox databases between the DAG members.
A DAG is created by using the New-DatabaseAvailabilityGroup cmdlet. A DAG is initially created as an empty object in Active Directory. This directory object is used to store relevant information about the DAG, such as server membership information. When an administrator adds the first server to a DAG, a failover cluster is automatically created for the DAG. This failover cluster is used exclusively by the DAG, and the cluster must be dedicated to the DAG. Use of the cluster for any other purpose is not supported.
In addition to a failover cluster being created, the infrastructure that monitors the servers for network or server failures is initiated. The failover cluster heartbeat mechanism and cluster database are then used to track and manage information about the DAG that can change quickly, such as database mount status, replication status, and last mounted location.
During creation, the DAG is given a unique name, and either assigned one or more static IP addresses or configured to use Dynamic Host Configuration Protocol (DHCP). You can specify a single IP address or a comma-separated list of IP addresses by using the DatabaseAvailabilityGroupIPAddresses parameter.
Consider a DAG that will have three servers; two servers (EX1 and EX2) are on the same subnet (10.0.0.0) and the third server (EX3) is on a different subnet (192.168.0.0). The administrator runs the following commands:
New-DatabaseAvailabilityGroup -Name DAG1 -DatabaseAvailabilityGroupIPAddresses 10.0.0.5,192.168.0.5
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EX1
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EX2
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EX3
Note: Configuring the DatabaseAvailabilityGroupIPAddresses parameter with a value of 0.0.0.0 configures the DAG (cluster) to use DHCP for its IP addresses or IP address resources.
The cluster for DAG1 is created when EX1 is added to the DAG. During cluster creation, the Add-DatabaseAvailabilityGroupServer cmdlet retrieves the IP addresses configured for the DAG and ignores the ones that don't match any of the subnets found on EX1. In this example, the cluster for DAG1 is created with an IP address of 10.0.0.5, and 192.168.0.5 is ignored.
Then, EX2 is added. Again, the Add-DatabaseAvailabilityGroupServer cmdlet retrieves the IP addresses configured for the DAG. There are no changes to the cluster's IP addresses because EX2 is on the same subnet as EX1.
Then, EX3 is added. Again, the Add-DatabaseAvailabilityGroupServer cmdlet retrieves the IP addresses configured for the DAG. Because a subnet matching 192.168.0.5 is present on EX3, the 192.168.0.5 address is added as an IP address resource in the cluster group. In addition, an OR dependency for the Network Name resource for each IP address resource is automatically configured. The 192.168.0.5 address will be used by the cluster when the cluster group moves to EX3.
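The address selection described in this walkthrough can be sketched in a few lines of Python. This is an illustrative model only, not the cmdlet's implementation: the function names are hypothetical, and the /24 prefix length is an assumption, because the example gives only the network addresses 10.0.0.0 and 192.168.0.0.

```python
import ipaddress

# Addresses passed to -DatabaseAvailabilityGroupIPAddresses in the example.
DAG_IP_ADDRESSES = ["10.0.0.5", "192.168.0.5"]

# Subnets found on each server. The /24 prefix length is an assumption;
# the text gives only the network addresses.
SERVER_SUBNETS = {
    "EX1": ["10.0.0.0/24"],
    "EX2": ["10.0.0.0/24"],
    "EX3": ["192.168.0.0/24"],
}


def matching_addresses(server):
    """Return the DAG addresses that fall inside one of the server's subnets."""
    subnets = [ipaddress.ip_network(s) for s in SERVER_SUBNETS[server]]
    return [
        addr for addr in DAG_IP_ADDRESSES
        if any(ipaddress.ip_address(addr) in net for net in subnets)
    ]


def add_servers(servers):
    """Replay the joins in order and record the cluster's IP address resources."""
    cluster_addresses = []
    history = []
    for server in servers:
        for addr in matching_addresses(server):
            # Only addresses not already present become new resources.
            if addr not in cluster_addresses:
                cluster_addresses.append(addr)
        history.append((server, list(cluster_addresses)))
    return history


for server, addresses in add_servers(["EX1", "EX2", "EX3"]):
    print(server, addresses)
```

In this model the cluster holds only 10.0.0.5 after EX1 and EX2 are added, and gains 192.168.0.5 once EX3 joins, which mirrors the sequence described above.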
Windows Failover Clustering registers the IP addresses for the cluster in Domain Name System (DNS) when the Network Name resource is brought online. In addition, a cluster name object (CNO) is created in Active Directory. The name, IP addresses, and CNO for the cluster are used only internally by the system to secure the DAG and for internal communication purposes. Administrators and end users don't need to interface with or connect to the DAG name or IP address for any reason.
In addition to a name and one or more IP addresses, the DAG is also configured to use a witness server and a witness directory. The witness server and witness directory are either automatically specified by the system, or they can be manually specified by the administrator.
By default, a DAG is designed to use the built-in continuous replication feature to replicate mailbox databases between servers in the DAG. If you're using third-party data replication that supports the Third Party Replication API in Exchange 2010, you must create the DAG in third-party replication mode by using the New-DatabaseAvailabilityGroup cmdlet with the ThirdPartyReplication parameter. After this mode is enabled, it can't be disabled.
After the DAG has been created, Mailbox servers can be added to the DAG. When the first server is added to the DAG, a cluster is formed for use by the DAG. DAGs make limited use of Windows Failover Clustering technology, namely the cluster heartbeat, cluster networks, and the cluster database (for storing data that changes or can change quickly, such as database state changes from active to passive or vice versa, or from mounted to dismounted and vice versa). As each subsequent server is added to the DAG, it's joined to the underlying cluster (and the cluster's quorum model is automatically adjusted by the system, as needed), and the server is added to the DAG object in Active Directory.
After Mailbox servers have been added to a DAG, you can configure a variety of DAG properties, such as whether to use network encryption or network compression for database replication within the DAG. You can also configure DAG networks and create additional DAG networks, as needed.
After you've added members to a DAG and configured the DAG, the active mailbox databases on each server can be replicated to the other DAG members. After you've created mailbox database copies, you can monitor the health and status of the copies using a variety of built-in monitoring tools. In addition, you can perform database and server switchovers, as needed.
For more information about creating DAGs, managing DAG membership, configuring DAG properties, creating and monitoring mailbox database copies, and performing switchovers, see Managing High Availability and Site Resilience.
Underneath every DAG is a Windows failover cluster. Failover clusters use the concept of quorum. Quorum is not a new concept for Exchange 2010. Highly available mailbox servers in previous versions of Exchange also use failover clustering and its concept of quorum. Quorum is important for three main reasons: to ensure consistency, to act as a tie-breaker to avoid partitioning, and to ensure cluster responsiveness:
Ensuring consistency: A primary requirement for a Windows failover cluster is that each of the members always has a view of the cluster that is consistent with the other members. The cluster hive acts as the definitive repository for all configuration information relating to the cluster. In the event that the cluster hive cannot be loaded locally on a DAG member, the Cluster service does not start, because it is not able to guarantee that the member meets the requirement of having a view of the cluster that is consistent with the other members.
Acting as a tie-breaker: A quorum witness resource is used in DAGs with an even number of members to act as a tie-breaker vote. This enables the solution to avoid split-brain scenarios and to ensure that only one collection of the members in the DAG is considered official.
Ensuring responsiveness: To ensure responsiveness, the quorum model ensures that whenever the cluster is running, enough members of the distributed system are operational and communicative, and at least one replica of the cluster's current state can be guaranteed. This means that no additional time is required to bring members into communication or to determine whether a given replica is guaranteed.
DAGs with an even number of members will use the failover cluster's Node and File Share Majority quorum mode, which employs an external witness server that acts as a tie-breaker. In this quorum mode, each DAG member and the witness server get a vote. The replica is stored by default on the system disk of each member of the DAG, and is kept consistent across those disks. However, a copy is not stored on the witness server. The witness server keeps track of which member has the most updated replica, but does not have a replica itself. In this mode, a majority of the voters (the DAG members plus the witness server) must be operational and able to communicate with each other in order to maintain quorum. If a majority of the voters are not able to communicate with each other, the DAG's underlying cluster will lose quorum, and the DAG will require manual administrative intervention to become operational again.
DAGs with an odd number of members will use the failover cluster's Node Majority quorum mode. In this mode, each member gets a vote, and each member’s local system disk is used to store the cluster configuration (the replica). If the configuration of the DAG changes, that change is reflected across the different disks. The change is only considered to have been committed and made persistent if that change is made to the disks on half the members (rounding down) plus one. For example, in a five-member DAG, the change must be made on two plus one members, or three members total.
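The relationship between member count, quorum mode, and the majority threshold can be expressed as a small Python sketch. It restates only the rules given in the text (an even member count adds a witness vote; a majority is half the voters, rounding down, plus one). The function names are hypothetical and are not part of any Exchange or failover clustering API.

```python
def quorum_mode(member_count):
    """Pick the quorum mode from the parity of the DAG membership."""
    if member_count % 2 == 0:
        return "Node and File Share Majority"
    return "Node Majority"


def voter_count(member_count):
    """Even-membered DAGs gain one extra vote from the witness server."""
    return member_count + 1 if member_count % 2 == 0 else member_count


def majority(voters):
    """Half the voters (rounding down) plus one."""
    return voters // 2 + 1


# Print mode, voter count, and majority for a few DAG sizes.
for members in (2, 3, 4, 5, 16):
    voters = voter_count(members)
    print(members, quorum_mode(members), voters, majority(voters))
```

For the five-member DAG in the example, `majority(5)` evaluates to 3, matching the "two plus one" calculation above.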
Quorum requires a majority of voters to be able to communicate with each other. Consider a DAG that has four members. Because this DAG has an even number of members, an external witness server is used to provide a fifth, tie-breaking vote. To maintain a majority of voters (and therefore quorum), at least three voters must be able to communicate with each other at all times. This means that, at any given time, a maximum of two voters can be offline without disrupting service and data access. If three or more voters are offline, the DAG loses quorum, and service and data access are disrupted until an administrator resolves the problem.
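The four-member scenario can be checked with a short Python model. This is a sketch for illustration: the voter names (EX1 through EX4 and WITNESS) and the helper functions are hypothetical, and the model treats the witness as an ordinary voter, as the text describes.

```python
# Four DAG members plus the witness server: five voters in total.
VOTERS = ["EX1", "EX2", "EX3", "EX4", "WITNESS"]


def has_quorum(online, voters=VOTERS):
    """True when more than half of all voters can communicate with each other."""
    reachable = set(online) & set(voters)
    return len(reachable) > len(voters) // 2


def max_offline(voters=VOTERS):
    """Largest number of voters that can be offline while quorum is kept."""
    return len(voters) - (len(voters) // 2 + 1)


# Two voters offline: three remain, so quorum holds.
print(has_quorum(["EX1", "EX2", "WITNESS"]))
# Three voters offline: only two remain, so quorum is lost.
print(has_quorum(["EX1", "EX2"]))
print(max_offline())
```

With five voters, any three that can reach each other keep quorum, so the model tolerates at most two offline voters, regardless of whether the witness is among them.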
To illustrate how a DAG can provide high availability for your mailbox databases, consider the following example, which uses a DAG with five members. This DAG is illustrated in the following figure.
In the preceding figure, the green databases are active mailbox database copies and the blue databases are passive mailbox database copies. In this example, the database copies aren't mirrored across each server, but rather spread across multiple servers. This ensures that no two servers in the DAG have the same set of database copies, thereby providing the DAG with greater resilience to failures, including failures that occur while other components are down as a result of regular maintenance.
Consider the following scenario, using the preceding example DAG, which illustrates resilience to multiple database and server failures.
Initially, all databases and servers are healthy. An administrator needs to install some operating system updates on EX2. The administrator performs a server switchover, which activates the copy of DB4 on another Mailbox server. A server switchover is a task that an administrator performs to move all active mailbox database copies from their current server to one or more other Mailbox servers in the DAG in preparation for a scheduled outage for the current server. The administrator can perform a server switchover quickly by running the following command in the Exchange Management Shell.
Move-ActiveMailboxDatabase -Server EX2
In this example, there is only one active mailbox database on EX2 (DB4), so only one active mailbox database copy is moved. In this case, by omitting the ActivateOnServer parameter in the preceding command, the administrator chose to have the system select the best possible new active copy, and the system chose the copy on EX5, as shown in the following figure.
Database availability group with a server offline for maintenance
While the administrator is performing maintenance on EX2, EX3 experiences a catastrophic hardware failure and goes offline. Prior to going offline, EX3 hosted the active copy of DB2. To recover from the failure, the system automatically activates, within 30 seconds, the copy of DB2 that's hosted on EX1. This is illustrated in the following figure.
Database availability group with a server offline for maintenance and a failed server
After the scheduled maintenance has completed for EX2, the administrator brings the server back online. As soon as EX2 is up, the other members of the DAG are notified, and the copies of DB1, DB4, and DB5 that are hosted on EX2 are automatically resynchronized with the active copy of each database. This is illustrated in the following figure.
Database availability group with a restored server resynchronizing its database copies
After the failed hardware component in EX3 is replaced with a new component, EX3 is brought back online. As with EX2, after EX3 is up, the other members of the DAG are notified, and the copies of DB2, DB3, and DB4 that are hosted on EX3 are automatically resynchronized with the active copy of each database. This is illustrated in the following figure.
Database availability group with a repaired server resynchronizing its database copies
In addition to providing high availability within a datacenter, a DAG can also be extended to one or more other datacenters in a configuration that provides site resilience for one or multiple datacenters. In the preceding example figures, the DAG is located in a single datacenter and single Active Directory site. Incremental deployment can be used to extend this DAG to a second datacenter (and a second Active Directory site) by deploying a Mailbox server and the necessary supporting resources (namely, one or more Active Directory servers, and one or more Hub Transport and Client Access servers), and then adding the Mailbox server to the DAG, as illustrated below.
Database availability group extended across two Active Directory sites
In this example, a passive copy of each active database in the Redmond datacenter is configured on EX6 in the Dublin datacenter. However, there are many other examples of DAG configurations that provide site resilience. For example:
Instead of hosting only passive database copies, EX6 could host all active copies, or it could host a mixture of active and passive copies.
In addition to EX6, multiple DAG members could be deployed in the Dublin datacenter, thereby providing protection against additional failures, as well as additional capacity in the event the Redmond datacenter fails and the Dublin datacenter needs to support a much larger user population.
In the preceding example, a single DAG that is extended across multiple datacenters can provide site resilience for either or both datacenters. However, when using a single DAG to provide site resilience in an environment where each datacenter to which you extend the DAG has an active user population, you will have an inherent single point of failure in the WAN connection. This is due to the nature of quorum, which requires a majority of the voters to be active and able to communicate with one another.
In the example above, the majority of voters are located in the Redmond datacenter. If the Dublin datacenter hosted active mailbox databases and it had a local user population, a WAN outage would result in a messaging service outage for the Dublin users. This is because when WAN connectivity breaks, only the DAG members in the Redmond datacenter will retain quorum and therefore continue providing messaging service.
To eliminate the WAN as a single point of failure when you need to provide site resilience for multiple datacenters that each have an active user population, you should deploy multiple DAGs, where each DAG has a majority of voters in a separate datacenter. Thus, when a WAN outage occurs, replication will be blocked until connectivity is restored; however, users will have messaging service, as each DAG will continue to service its local user population.
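The effect of a WAN outage on quorum can be modeled with the following Python sketch, which compares the single extended DAG with a two-DAG design. Several values are assumptions made for illustration: the witness for the extended DAG is placed in Redmond, and the member counts of the two separate DAGs are hypothetical. The function name is also hypothetical.

```python
def sites_with_quorum(voters_by_site):
    """Return the datacenters that keep quorum for one DAG when the WAN
    link fails and each site can reach only its own voters."""
    total = sum(voters_by_site.values())
    needed = total // 2 + 1  # majority of all voters in the DAG
    return [site for site, count in voters_by_site.items() if count >= needed]


# Single extended DAG: EX1-EX5 in Redmond, EX6 in Dublin. Six members is an
# even number, so a witness votes too; its Redmond location is an assumption.
single_dag = {"Redmond": 5 + 1, "Dublin": 1}

# Two DAGs, each with a majority of voters in a different datacenter.
# These member counts are hypothetical.
dag_a = {"Redmond": 3, "Dublin": 2}
dag_b = {"Redmond": 2, "Dublin": 3}

print(sites_with_quorum(single_dag))
print(sites_with_quorum(dag_a))
print(sites_with_quorum(dag_b))
```

In this model the single DAG keeps quorum only in Redmond, so Dublin users lose service during the outage. With two DAGs, each DAG keeps quorum in the datacenter that holds its majority, so each local user population continues to be served.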