Understanding Active Manager
Microsoft Exchange Server 2010 includes a new component called Active Manager, which replaces the resource model and failover management features that integration with the Cluster service provided in previous versions of Exchange. Exchange no longer uses the cluster resource model for high availability. All Exchange cluster resources provided by exres.dll no longer exist, including the construct known as a clustered mailbox server. Exchange still uses a Windows failover cluster, but there are no cluster groups for Exchange, and there are no storage resources in the cluster. Thus, if you examine the cluster using cluster management tools, you’ll see only the core cluster resources (IP Address and Network Name, and if needed, quorum resource). Cluster nodes and networks will also exist, but those are managed by Exchange, not by the cluster or cluster tools.
Active Manager runs as a role on all Mailbox servers. On Mailbox servers that are not configured for high availability, there is a single Active Manager role: Standalone Active Manager. On servers that are members of a database availability group (DAG), there are two Active Manager roles: Primary Active Manager (PAM) and Standby Active Manager (SAM). PAM is the Active Manager in a DAG that decides which copies will be active and passive. PAM is responsible for getting topology change notifications and reacting to server failures. The DAG member that holds the PAM role is always the member that currently owns the cluster quorum resource (default cluster group). If the server that owns the cluster quorum resource fails, the PAM role automatically moves to a surviving server that takes ownership of the cluster quorum resource. In addition, if you need to take the server that hosts the cluster quorum resource offline for maintenance or an upgrade, you must first move the PAM to another server in the DAG. The PAM controls all movement of the active designations between a database's copies (only one copy can be active at any specified time, and that copy may be mounted or dismounted). The PAM also performs the functions of the SAM role on the local system (detecting local database and local Information Store failures).
The SAM provides information on which server hosts the active copy of a mailbox database to other components of Exchange that are running an Active Manager client component (for example, RPC Client Access service or Hub Transport server). The SAM detects failures of local databases and the local Information Store. It reacts to failures by asking the PAM to initiate a failover (if the database is replicated). A SAM doesn't determine the target of failover, nor does it update a database’s location state in the PAM. It will access the active database copy location state to answer queries for the active copy of the database that it receives.
Note
Exchange 2010 is not a clustered application. Instead, it uses the cluster library functions implemented in clusapi.dll for cluster, group, cluster network (heartbeating), node management, cluster registry, and a few control code functions. In addition, Active Manager stores current mailbox database information (for example, active and passive data, and mounted data) in the cluster database. Although the information is stored directly in the cluster database, it isn't accessed directly by any other components.
In Exchange 2010, the Microsoft Exchange Replication service periodically monitors the health of all mounted databases. It also monitors the Extensible Storage Engine (ESE) for any I/O errors or failures. When the service detects a failure, it notifies Active Manager. Active Manager then determines which database copy should be mounted and what is required to mount that database. In addition, it tracks the active copy of a mailbox database (based on the last mounted copy of the database) and provides the tracking results to the RPC Client Access component on the Client Access server to which the client is connected.
When a failure occurs that affects a replicated mailbox database, Active Manager takes several steps to recover from the failure by selecting the best possible copy of the failed database to activate. The general process occurs in the following order:
1. Active Manager detects the failure.
2. The PAM runs an internal algorithm called best copy selection (BCS).
3. A process called attempt copy last logs (ACLL) occurs, which tries to copy any missing log files from the server that hosted the active database copy prior to the failover.
4. Once the ACLL process has completed, the PAM issues a mount request to the Microsoft Exchange Information Store via remote procedure call (RPC). At this point, either:
   - The database mounts and is made available to clients; or
   - The database does not mount, and PAM performs steps 2-4 on the next best copy (if one is available).
When searching for the best possible copy, the PAM uses up to ten separate sets of criteria to determine the best copy to activate. After locating the best possible copy, ACLL runs. After the ACLL process has completed, if all missing log files were copied from the previous active copy, the database mounts without any data loss. This is known as a lossless failover. If the ACLL process is unsuccessful, the configured value for AutoDatabaseMountDial is consulted. For more information about AutoDatabaseMountDial, see Set-MailboxServer. If the number of lost logs is within the configured value for AutoDatabaseMountDial, the database is mounted. If the number of lost logs is outside the configured value for AutoDatabaseMountDial, the database isn't mounted until either missing log files are recovered or until an administrator explicitly mounts the database and accepts the larger data loss. If the database doesn't mount automatically, the PAM will select the next best copy (if one is available). There are at least three reasons why the initially selected database copy does not mount automatically:
- The number of lost log files is greater than the configured value for AutoDatabaseMountDial.
- The server on which the mount attempt was made is configured with a soft maximum for the active number of databases, and the maximum number of active database copies has been reached on the server.
- The database copy is suspended for activation.
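The three reasons above can be read as a simple gate in front of the mount request. The following Python sketch is illustrative only and is not how Exchange implements the check: the function name, its parameters, and the order in which the conditions are tested are assumptions. The BestAvailability threshold of 12 log files is stated later in this topic; the Lossless (0) and GoodAvailability (6) thresholds are the values documented for Set-MailboxServer.

```python
# Maximum number of lost log files tolerated for each AutoDatabaseMountDial
# setting (values as documented for Set-MailboxServer).
MOUNT_DIAL_MAX_LOST_LOGS = {
    "Lossless": 0,
    "GoodAvailability": 6,
    "BestAvailability": 12,
}


def auto_mount_decision(lost_logs, mount_dial, active_databases=0,
                        max_active_databases=None, activation_suspended=False):
    """Return (allowed, reason) for an automatic mount attempt.

    Hypothetical helper: the parameter names and the order of the checks
    are assumptions made for this sketch.
    """
    if activation_suspended:
        return False, "copy is suspended for activation"
    if max_active_databases is not None and active_databases >= max_active_databases:
        return False, "maximum number of active databases reached"
    if lost_logs > MOUNT_DIAL_MAX_LOST_LOGS[mount_dial]:
        return False, "lost logs exceed AutoDatabaseMountDial"
    return True, "mount allowed"


# A copy that is missing 8 logs mounts under BestAvailability,
# but not under GoodAvailability.
print(auto_mount_decision(8, "BestAvailability"))
print(auto_mount_decision(8, "GoodAvailability"))
```

In each refused case, the PAM would move on to the next best copy, if one is available.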
Active Manager begins the best copy selection process by creating a list of database copies that are potential candidates for activation. Any database copies that are unreachable or are administratively blocked from activation (by using the DatabaseCopyAutoActivationPolicy parameter of the Set-MailboxServer cmdlet) are ignored and not used during the selection process. Active Manager sorts the resulting list using the copy queue length as the primary key. The calculation is based on LastLogInspected (from the copy's point of view), so the list of potential copies is sorted by the highest value for LastLogInspected (which will be the copy with the lowest copy queue length). Then, Active Manager sorts the list a second time, using the value for ActivationPreference as a secondary key. The copy with the lowest ActivationPreference value has the higher priority on the list. Next, Active Manager attempts to locate a mailbox database copy on the list that has a status of Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing, or SeedingSource, and then evaluates the activation potential of each of the copies on the list by using an ordered set of ten criteria. Active Manager determines if any of the candidates for activation meet the first set of criteria:
- It has a content index with a status of Healthy.
- It has a copy queue length less than 10 log files.
- It has a replay queue length of less than 50 log files.
If none of the database copies meet the first set of criteria, Active Manager tries to locate a database copy that meets the second set of criteria:
- It has a content index with a status of Crawling.
- It has a copy queue length less than 10 log files.
- It has a replay queue length of less than 50 log files.
If none of the database copies meet the second set of criteria, Active Manager tries to locate a database copy that meets the third set of criteria:
- It has a content index with a status of Healthy.
- It has a replay queue length of less than 50 log files.
If none of the database copies meet the third set of criteria, Active Manager tries to locate a database copy that meets the fourth set of criteria:
- It has a content index with a status of Crawling.
- It has a replay queue length of less than 50 log files.
If none of the database copies meet the fourth set of criteria, Active Manager tries to locate a database copy that meets the fifth set of criteria:
- It has a replay queue length of less than 50 log files.
If none of the database copies meet the fifth set of criteria, Active Manager tries to locate a database copy that meets the sixth set of criteria:
- It has a content index with a status of Healthy.
- It has a copy queue length less than 10 log files.
If none of the database copies meet the sixth set of criteria, Active Manager tries to locate a database copy that meets the seventh set of criteria:
- It has a content index with a status of Crawling.
- It has a copy queue length that is less than 10 log files.
If none of the database copies meet the seventh set of criteria, Active Manager tries to locate a database copy that meets the eighth set of criteria:
- It has a content index with a status of Healthy.
If none of the database copies meet the eighth set of criteria, Active Manager tries to locate a database copy that meets the ninth set of criteria:
- It has a content index with a status of Crawling.
If none of the database copies meet the ninth set of criteria, Active Manager tries to activate any database copy with a status of Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing, or SeedingSource (the tenth set of criteria). If it can't find any database copies that meet the tenth set of criteria, it isn't able to automatically activate a database copy.
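The ten sets of criteria differ only in which of three checks they apply: the content index state, the copy queue length limit (less than 10), and the replay queue length limit (less than 50). The following Python sketch encodes them as a table to make the pattern easier to see. It is a simplified model of the description above, not Exchange code: the `Copy` class, the function names, and the sample copy are hypothetical, and the sketch assumes that every set also requires one of the four eligible database states.

```python
from dataclasses import dataclass

ELIGIBLE_STATES = {"Healthy", "DisconnectedAndHealthy",
                   "DisconnectedAndResynchronizing", "SeedingSource"}
MAX_COPY_QUEUE = 10    # copy queue length must be less than this
MAX_REPLAY_QUEUE = 50  # replay queue length must be less than this

# (required content index state or None, check copy queue?, check replay queue?)
CRITERIA_SETS = [
    ("Healthy", True, True),      # 1
    ("Crawling", True, True),     # 2
    ("Healthy", False, True),     # 3
    ("Crawling", False, True),    # 4
    (None, False, True),          # 5
    ("Healthy", True, False),     # 6
    ("Crawling", True, False),    # 7
    ("Healthy", False, False),    # 8
    ("Crawling", False, False),   # 9
    (None, False, False),         # 10: any copy in an eligible state
]


@dataclass
class Copy:
    """Hypothetical record of the values Active Manager evaluates."""
    name: str
    copy_queue_length: int
    replay_queue_length: int
    content_index: str
    state: str


def meets(copy, criteria):
    """Return True if the copy satisfies one set of criteria."""
    content_index, check_copy_queue, check_replay_queue = criteria
    if copy.state not in ELIGIBLE_STATES:
        return False
    if content_index is not None and copy.content_index != content_index:
        return False
    if check_copy_queue and copy.copy_queue_length >= MAX_COPY_QUEUE:
        return False
    if check_replay_queue and copy.replay_queue_length >= MAX_REPLAY_QUEUE:
        return False
    return True


def first_matching_set(copy):
    """Return the number (1-10) of the first set the copy meets, or None."""
    for number, criteria in enumerate(CRITERIA_SETS, start=1):
        if meets(copy, criteria):
            return number
    return None


# Hypothetical copy: crawling index, long copy queue, short replay queue.
sample = Copy("ServerX\\DBX", 100, 25, "Crawling", "Healthy")
print(first_matching_set(sample))  # 4
```

Laid out this way, sets one through five all enforce the replay queue limit, while sets six through ten drop it.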
Once one or more copies are located that meet one or more sets of criteria, the ACLL process runs to copy any log files from the original source to the potential new active copy. Once the ACLL process has completed, the PAM issues a mount request and either the database mounts and is made available to clients or the database does not mount and the PAM searches for the next best copy (if one is available).
The following section illustrates some examples of Active Manager's best copy selection and activation process.
In this example, there are four copies of mailbox database DB1. DB1 is currently active on Server1, which experiences a hardware failure. The following table shows the current status of the database copies of DB1 on Server2, Server3 and Server4.
Database Copy | Activation Preference | Copy Queue Length | Replay Queue Length | Content Index State | Database State | Activation Blocked |
---|---|---|---|---|---|---|
Server2\DB1 | 2 | 4 | 0 | Healthy | Healthy | No |
Server3\DB1 | 3 | 2 | 2 | Healthy | DisconnectedAndHealthy | No |
Server4\DB1 | 4 | 10 | 0 | Crawling | Healthy | No |
Sorting the available copies based on their copy queue lengths (using Activation Preference if necessary) results in the following ordered list:
- Server3\DB1
- Server2\DB1
- Server4\DB1
Out of this list, only two database copies meet the first set of criteria for activation:
- The copy on Server3, which has a database state of DisconnectedAndHealthy, a copy queue length less than 10, a replay queue length less than 50, and a healthy content index.
- The copy on Server2, which has a database state of Healthy, a copy queue length less than 10, a replay queue length less than 50, and a healthy content index.
Of these two, the copy on Server3 has the lowest copy queue length; therefore, Server3 is selected as the copy to attempt to activate since it has the least amount of missing data.
After the copy on Server3 is selected, the Microsoft Exchange Replication service on Server3 performs the ACLL process and attempts to copy any missing log files from the previous active server (in this case, Server1). When the ACLL process has completed, the PAM is notified of the results of the ACLL process. If all logs are successfully copied, then the database will be marked as the active copy and will be mounted with zero data loss. If one or more logs are missing, the value for the AutoDatabaseMountDial parameter is consulted. If the data loss is within the configured value, then the database will be marked as the active copy and will be mounted with data loss. The majority of any missing data would then be recovered from the transport dumpster.
If Active Manager does send a mount request to the Information Store and the mount operation is unsuccessful, Active Manager will go back to the above sorted list and attempt to activate the next best copy (in this case, Server2).
In this example, there are four copies of mailbox database DB2. DB2 is currently active on Server1, which experiences a hardware failure. The following table shows the current status of the database copies of DB2 on Server2, Server3 and Server4.
Database Copy | Activation Preference | Copy Queue Length | Replay Queue Length | Content Index State | Database State | Activation Blocked |
---|---|---|---|---|---|---|
Server2\DB2 | 2 | 2 | 0 | Healthy | Healthy | No |
Server3\DB2 | 3 | 2 | 2 | Healthy | DisconnectedAndHealthy | No |
Server4\DB2 | 4 | 10 | 0 | Crawling | Healthy | No |
Sorting the available copies based on their copy queue lengths (using Activation Preference if necessary) results in the following ordered list:
- Server2\DB2
- Server3\DB2
- Server4\DB2
Out of this list, only two database copies meet the first set of criteria for activation:
- The copy on Server2, which has a database state of Healthy, a copy queue length less than 10, a replay queue length less than 50, and a healthy content index.
- The copy on Server3, which has a database state of DisconnectedAndHealthy, a copy queue length less than 10, a replay queue length less than 50, and a healthy content index.
Of these two, the copy on Server2 has a copy queue length equal to the copy on Server3, but it also has a lower Activation Preference value; therefore, the copy on Server2 is on the top of the list and is selected as the copy to attempt to activate since it has the least amount of missing data and the lowest Activation Preference value.
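The ordering in the DB1 and DB2 examples above can be reproduced with a single sort on a two-part key: copy queue length first, then Activation Preference. The following Python sketch uses the values from the two tables; the tuple layout and the function name are assumptions made for illustration.

```python
# Each tuple is (database copy, activation preference, copy queue length),
# taken from the DB1 and DB2 tables above.
db1 = [("Server2\\DB1", 2, 4), ("Server3\\DB1", 3, 2), ("Server4\\DB1", 4, 10)]
db2 = [("Server2\\DB2", 2, 2), ("Server3\\DB2", 3, 2), ("Server4\\DB2", 4, 10)]


def sort_candidates(copies):
    """Order copies by copy queue length, then by activation preference."""
    ordered = sorted(copies, key=lambda c: (c[2], c[1]))
    return [name for name, _pref, _cql in ordered]


print(sort_candidates(db1))  # Server3 first: it has the shortest copy queue
print(sort_candidates(db2))  # Server2 first: the tie is broken by preference
```

For DB1 the copy queue length alone decides the order. For DB2 the copies on Server2 and Server3 tie at a copy queue length of 2, so the lower Activation Preference value puts Server2 first.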
In this example, there are four copies of mailbox database DB3. DB3 is currently active on Server1, which experiences a hardware failure. The following table shows the current status of the database copies of DB3 on Server2, Server3 and Server4.
Database Copy | Activation Preference | Copy Queue Length | Replay Queue Length | Content Index State | Database State | Activation Blocked |
---|---|---|---|---|---|---|
Server2\DB3 | 2 | 0 | 3 | Crawling | Healthy | No |
Server3\DB3 | 3 | 0 | 3 | Healthy | DisconnectedAndHealthy | No |
Server4\DB3 | 4 | 0 | 0 | Healthy | Healthy | No |
Sorting the available copies based on their copy queue lengths (using Activation Preference if necessary) results in the following ordered list:
- Server2\DB3
- Server3\DB3
- Server4\DB3
All three of the database copies hosted on the above servers meet the criteria for activation. Although Server2 has a lower Activation Preference value, its content index state is Crawling; as a result, when Active Manager checks the list against the first set of criteria (which includes a content index status of Healthy), the database copy on Server3 will be preferred, as its content index state is Healthy.
In this example, there are four copies of mailbox database DB4. DB4 is currently active on Server1, which experiences a failure that causes it to reboot. The following table shows the current status of the database copies of DB4 on Server2, Server3 and Server4. The AutoDatabaseMountDial for all Mailbox servers in the DAG is configured for BestAvailability (copy queue length less than or equal to 12 logs).
Database Copy | Activation Preference | Copy Queue Length | Replay Queue Length | Content Index State | Database State | Activation Blocked |
---|---|---|---|---|---|---|
Server2\DB4 | 2 | 0 | 4523 | Healthy | Healthy | No |
Server3\DB4 | 3 | 100 | 25 | Crawling | Healthy | No |
Server4\DB4 | 4 | 6 | 62 | Healthy | Healthy | No |
Sorting the available copies based on their copy queue lengths (using Activation Preference if necessary) results in the following ordered list:
- Server2\DB4
- Server4\DB4
- Server3\DB4
None of the database copies meet the first, second or third set of criteria, but the database copy on Server3 does meet the fourth set of criteria (it has a content index state of Crawling and a replay queue length less than 50). The database copy on Server3 has a copy queue length of 100, but because Server1 has not finished rebooting, the ACLL process is unable to copy these missing log files to Server3. The ACLL process tells the PAM that the amount of missing data is not within the configured value for the AutoDatabaseMountDial parameter, and this causes the PAM to select the next best available copy.
In the above scenario, the database copies on Server2 and Server4 match the sixth set of criteria (they have a healthy database and content index, and a copy queue length less than 10). As it is higher in the sorted list of available copies, the database copy on Server2 is tried next. The ACLL process runs on Server2, but Server1 is still not communicating on the network, and ACLL is unable to copy any logs. But because the copy queue length is within the configured value for the AutoDatabaseMountDial parameter, ACLL sends a success message to the PAM and the PAM issues a database mount request via RPC.
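The DB4 walkthrough above can be traced with a small simulation. The following Python sketch is a simplified model and not Exchange code. It assumes that the PAM tries copies in order of the first criteria set they meet, then in sorted-list order within a set, that ACLL recovers no logs because Server1 is unreachable, and that all copies are in an eligible database state (true for the DB4 table). All names in the code are hypothetical.

```python
from dataclasses import dataclass

MAX_LOST_LOGS = 12  # BestAvailability, as configured in this example

# (required content index state or None, copy queue < 10?, replay queue < 50?)
CRITERIA_SETS = [
    ("Healthy", True, True), ("Crawling", True, True),
    ("Healthy", False, True), ("Crawling", False, True), (None, False, True),
    ("Healthy", True, False), ("Crawling", True, False),
    ("Healthy", False, False), ("Crawling", False, False), (None, False, False),
]


@dataclass
class Copy:
    server: str
    activation_preference: int
    copy_queue_length: int
    replay_queue_length: int
    content_index: str


def meets(copy, criteria):
    """Return True if the copy satisfies one set of criteria."""
    content_index, check_cql, check_rql = criteria
    return ((content_index is None or copy.content_index == content_index)
            and (not check_cql or copy.copy_queue_length < 10)
            and (not check_rql or copy.replay_queue_length < 50))


def attempt_order(copies):
    """Sort by copy queue length and preference, then rank by criteria set."""
    ordered = sorted(copies, key=lambda c: (c.copy_queue_length,
                                            c.activation_preference))
    attempts = []
    for criteria in CRITERIA_SETS:
        for copy in ordered:
            if copy not in attempts and meets(copy, criteria):
                attempts.append(copy)
    return attempts


def fail_over(copies, logs_recovered_by_acll=0):
    """Return the server whose copy mounts, or None if none qualifies."""
    for copy in attempt_order(copies):
        lost_logs = max(copy.copy_queue_length - logs_recovered_by_acll, 0)
        if lost_logs <= MAX_LOST_LOGS:
            return copy.server  # the PAM issues the mount request here
    return None


db4 = [
    Copy("Server2", 2, 0, 4523, "Healthy"),
    Copy("Server3", 3, 100, 25, "Crawling"),
    Copy("Server4", 4, 6, 62, "Healthy"),
]
print([c.server for c in attempt_order(db4)])  # ['Server3', 'Server2', 'Server4']
print(fail_over(db4))                          # Server2
```

Server3 is tried first because it is the only copy that meets the fourth set of criteria, but its 100 missing logs exceed the mount dial. Server2 is tried next and mounts because it is missing no logs.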