New information has been added to this article since publication.
Refer to the Editor's Update below.

Communication & Collaboration

Achieve High Availability with Exchange Failover Clusters

Mark Godfrey

 

At a Glance:

  • Provisions for Exchange Server failure
  • Determining specific needs
  • Clustering Exchange Server
  • The backup process

Having your e-mail system go down is bad. Losing the data archived on your e-mail server is even worse. There's critical data in your users' e-mail, and such a loss can have serious consequences. To prevent the repercussions of outages and the loss of historical information, you can employ Exchange failover clusters, which allow you to recover your e-mail systems between datacenters quickly and efficiently.

A failover cluster is effectively a standby cluster of servers with at least as many nodes as the production clusters it protects. In the event of a disaster, the Exchange Virtual Servers from the failed datacenter are brought online at the backup datacenter, saving data that would otherwise be lost.

In this article I'll discuss a real-world case in which a Microsoft® Services team built an Exchange Virtual Server failover solution for a financial sector customer. But first let's look at some requirements for building an Exchange environment where data is safer.

General Requirements

First of all, you need to know that the key to an environment with an aggressive Recovery Time Objective (RTO), an environment in which you can recover server functionality quickly after an outage, is a standby site with the same configuration as the production site. The key to an environment with an aggressive Recovery Point Objective (RPO), one in which the data recovered is as up-to-date as possible, is data replication, so that you have relatively current copies of the database backups and access to current log files. There are two common failover cluster scenarios you could employ in building such an environment: the dedicated Exchange failover cluster and the many-to-one failover cluster.

In the dedicated approach, each production cluster has a dedicated failover cluster. When cluster A goes offline and is not recoverable, the Exchange Virtual Servers are brought online using the A' (A prime) cluster.

In a non-dedicated, or many-to-one, scenario (typically used when multiple datacenters share one backup datacenter), the standby cluster must have at least as many nodes as the production cluster with the most nodes. Note that only one production cluster can fail over to the standby cluster at a time. If a second datacenter goes offline, it cannot be restarted on the standby cluster until the first one is failed back or otherwise removed.
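
To make the sizing rule concrete, here is a minimal sketch (the cluster names and node counts are invented for illustration) that computes the node count a shared standby cluster needs and enforces the one-failover-at-a-time constraint:

    # Hypothetical sketch: sizing a shared (many-to-one) standby cluster.
    # Cluster names and node counts below are invented for illustration.
    production_clusters = {
        "DC1-EXCH-CL01": 4,   # nodes in each production cluster
        "DC2-EXCH-CL01": 3,
        "DC3-EXCH-CL01": 4,
    }

    # The standby cluster needs at least as many nodes as the largest
    # production cluster it might have to host.
    standby_nodes_required = max(production_clusters.values())
    print(f"Standby cluster needs at least {standby_nodes_required} nodes")

    # Only one production cluster can occupy the standby cluster at a time;
    # a second failure must wait until the first is failed back or removed.
    standby_hosting = None

    def fail_over(cluster_name):
        global standby_hosting
        if standby_hosting is not None:
            raise RuntimeError(f"Standby is already hosting {standby_hosting}")
        standby_hosting = cluster_name
        print(f"{cluster_name} brought online on the standby cluster")

    fail_over("DC1-EXCH-CL01")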

The Banking Example

The case involves a major financial organization that runs both banking and securities operations. The company's IT department was looking for a solution that would allow them to give their executives and traders access to e-mail within hours of a major datacenter outage, without implementing geoclustering. First, the Microsoft team interviewed the major business units and gathered their requirements for system availability and access to existing mailbox data. The needs were significantly greater than the IT department had anticipated. We divided the groups into three tiers, which are represented in Figure 1.

Figure 1 Business Units and User Groups

Tier    Group                   Accounts  RTO       RPO        Data (TB)
Tier 1  Traders and Executives  4,000     2 Hours   5 Minutes  0.6
Tier 2  Corporate Employees     30,000    4 Hours   5 Minutes  6.0
Tier 3  Retail Staff            16,000    12 Hours  1 Hour     0.8

The customer has two datacenters connected over a high-speed network that is used for a variety of data replication processes, and the customer needs the solution to be fully supported by Microsoft.

Initially, we considered using a dial-tone recovery model—giving users the ability to send and receive e-mail very quickly after a failure, but with no historical information available until it could later be restored using traditional methods. This idea was abandoned because most of the executives lived and breathed by their calendars, and the traders were very nervous about not having access to their existing e-mail. The other big drawback to the dial-tone solution was the bandwidth required to resynchronize the Offline Storage (OST) files when Cached Exchange Mode is enabled—which is the customer's default configuration. Although there are a variety of workaround procedures, none of them would allow us to meet the specified requirements.
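
A quick back-of-envelope calculation shows why the OST resynchronization traffic ruled this out. The user count and mailbox data volume come from Figure 1; the usable bandwidth figure is purely an illustrative assumption:

    # Rough estimate of OST resynchronization traffic after a dial-tone
    # recovery. User count and data volume are from Figure 1; the usable
    # bandwidth is an illustrative assumption, not a project measurement.
    tier1_users = 4_000
    tier1_data_gb = 600        # 0.6 TB of Tier 1 mailbox data (Figure 1)
    usable_mbit_s = 300        # assumed bandwidth available for the resync

    resync_hours = (tier1_data_gb * 8_000) / usable_mbit_s / 3_600
    print(f"Resynchronizing ~{tier1_data_gb} GB of OST data would take "
          f"roughly {resync_hours:.1f} hours, well past the two-hour RTO")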

The Microsoft team determined that the customer was a good candidate for combining the ability to fail over Exchange Virtual Servers and restore data using Volume Shadow Copy Service backups. The key concerns were the amount of data that needed to be replicated between datacenters to meet the two-hour RTO and a five-minute RPO, and the relative complexity of the process to restore the service under pressure.

Log files presented another challenge. With Volume Shadow Copy backups completing only once an hour, how could we ensure that the five-minute RPO was attainable? We needed to ensure the logs were available at the failover site from the time of the last backup up to five minutes before the disaster.
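
The arithmetic here is worth spelling out. Below is a minimal sketch (the intervals are the ones quoted in this article) of why log availability, not the backup schedule, is what bounds the data loss:

    # Minimal sketch of the recovery-point arithmetic. The intervals are
    # the ones described in this scenario; the logic is illustrative.
    backup_interval_min = 60      # Volume Shadow Copy backups run hourly
    log_replication_lag_min = 5   # logs must reach the failover site within 5 minutes
    rpo_target_min = 5

    # With only the hourly backup replicated, the worst case is losing
    # everything written since the previous backup.
    loss_backups_only = backup_interval_min

    # With the transaction logs also shipped to the failover site, the
    # worst case shrinks to however far behind the log copies are.
    loss_with_logs = log_replication_lag_min

    print(f"Backups only  : up to {loss_backups_only} minutes of mail lost")
    print(f"Backups + logs: up to {loss_with_logs} minutes of mail lost "
          f"({'meets' if loss_with_logs <= rpo_target_min else 'misses'} the RPO)")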

The first task was to split the accounts between datacenters and set up dedicated standby clusters. This ensured that only 50 percent of the users would be affected if either datacenter went offline and that the administrative effort to restore mail servers would be cut in half (see Figure 2).

Figure 2 Accounts and Servers Split between Datacenters


The second decision we made was based on the requirements generated through the interview process. Given the demanding RTO and RPO targets for 68 percent of the users (Tiers 1 and 2), an aggressive data replication model was required. Frequent backup copies needed to be replicated to ensure that mailbox content was available, and synchronous copying of the log files was required to meet the five-minute RPO. Figure 3 illustrates the configuration we designed.

Figure 3 Backup Configuration


To generate frequent database backup copies, we needed to be running the Volume Shadow Copy Service. This was especially important for the 4,000 executives and traders with the two-hour RTO service level agreement (SLA). Volume Shadow Copy backups are made each hour at each site, staggered by half an hour. The backup data then needs to be copied to the disaster recovery site within a half hour to ensure that there's enough time inside the two-hour window to bring the database back online and replay the logs generated since the backup.
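
Adding up the stages makes it clear how tight the two-hour window is. In the sketch below, only the half-hour backup copy window comes from the design described above; the other stage durations are illustrative assumptions:

    # Rough timeline check against the two-hour RTO. Only the half-hour
    # copy window is from the design above; other durations are assumptions.
    stages_min = {
        "declare disaster and decide to fail over": 15,
        "latest hourly backup copy (already at DR site)": 0,  # copied within 30 minutes of the backup
        "enumerate and bring cluster volumes online": 30,
        "restore databases from the replicated backup": 30,
        "replay logs generated since that backup": 20,
        "bring the Exchange Virtual Servers online": 10,
    }

    total_min = sum(stages_min.values())
    print(f"Estimated recovery time: {total_min} minutes "
          f"({'within' if total_min <= 120 else 'over'} the two-hour RTO)")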

Now that the conceptual design was understood and had been generally accepted, the team moved on to design the implementation. The first step was to identify the infrastructure required to support the solution—storage, network, servers, supporting services, and resources. The customer had identified a major standardization initiative and noted that, as always, cost control was important.

Storage and Network

Our top priority was to determine a storage solution that would allow us to create the frequent backups required to ensure quick recovery times at the backup site. Network Appliance was selected because of its capability to make fast "incremental" backups, which would minimize the replication traffic between datacenters.

Next we identified the network requirements between datacenters. The customer had recently begun an Internet Small Computer System Interface (iSCSI) standardization initiative, and requested that the solution be compliant. Note that while this solution was implemented using an iSCSI storage solution, fiber-based Storage Area Network (SAN) solutions are also frequently used for standby cluster installations.

With this information in hand, we decided that the cluster pairs (production and standby) should be on the same IP network subnet. This reduced complexity in that the virtual server, when brought up on the standby cluster, could use the same IP address that it used on the production cluster. Thus, clients' DNS caches don't have to be forcibly refreshed, nor do clients have to wait for the default DNS refresh. Note that each cluster pair must be on the same subnet, but when multiple cluster pairs exist, each pair can be on a different subnet. This is not a requirement for the virtual server failover scenario—but it sure saves some headaches for the administrators and help desk. (Although the network guys gave us some grief when we asked for spanned subnets!)
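
Because the virtual server comes back with the same IP address, a simple post-failover check is enough to confirm that existing name resolution still works. The sketch below is hypothetical; the server name is a placeholder, and port 135 (the RPC endpoint mapper that MAPI clients contact first) is just one convenient port to test:

    # Hypothetical post-failover spot check. The virtual server name is a
    # placeholder; port 135 is the RPC endpoint mapper that MAPI clients
    # contact first. Because the standby cluster reuses the production IP
    # address, the existing DNS record should still resolve to a live host.
    import socket

    EVS_NAME = "exch-evs01.contoso.local"   # placeholder Exchange Virtual Server name
    PORT = 135

    address = socket.gethostbyname(EVS_NAME)
    with socket.create_connection((address, PORT), timeout=5):
        print(f"{EVS_NAME} ({address}) answers on port {PORT} after failover")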

For the connection between the servers and the storage system, Microsoft Multipath I/O (MPIO) iSCSI was used with two connections per server. This is a dedicated network with no traffic other than Exchange-related I/O. For the interstorage connections, jumbo frame IP was enabled to allow for bulk data transmission. There is no requirement for the storage infrastructure to be on the same subnet; just make sure that each segment between the two has jumbo frames enabled. Figure 4 shows the network requirements.
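
A quick way to confirm that jumbo frames survive every hop between the storage systems is a large ping with the don't-fragment flag set. The sketch below simply wraps the standard Windows ping switches (-f for don't fragment, -l for payload size); the target address is a placeholder:

    # Hypothetical jumbo-frame spot check between storage endpoints, using
    # the standard Windows ping switches: -f (don't fragment), -l (payload
    # size in bytes), -n (number of echo requests). Target is a placeholder.
    import subprocess

    TARGET = "10.10.20.5"    # placeholder: far-side storage interface
    PAYLOAD = "8000"         # well above the standard 1,500-byte MTU

    result = subprocess.run(["ping", "-f", "-l", PAYLOAD, "-n", "2", TARGET],
                            capture_output=True, text=True)
    if result.returncode == 0:
        print("Large frames pass unfragmented; jumbo frames appear enabled end to end")
    else:
        print("Large don't-fragment ping failed; check the MTU on each segment")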

Figure 4 IP Network Requirements


Servers

Servers presented the usual sizing challenges, but the key problem in this environment was consistency. During disaster recovery, the last thing you need to worry about is that one server needs a different hardware or software patch than the others. So the rule for hardware was to make it the same and make the customer aware that a refresh must cover all servers in each cluster pair. In reality, especially when running a many-to-one solution, this is not always viable or practical—but when you can, stick as close to a standard model as possible.

Other Requirements

The biggest challenge with the Exchange configuration turned out to be the time required to enumerate the cluster volumes and bring them online. In our lab, it took approximately 10 minutes for each node to acquire its volumes, so a four-node cluster in an Active/Active/Active/Passive configuration, with three Exchange storage groups (SGs) per Exchange Virtual Server and thus nine storage groups in total (one per logical unit number, or LUN), took up to 90 minutes before it was available. For the two-hour RTO group, we went with a three-node cluster in an Active/Active/Passive configuration with two SGs per Exchange Virtual Server, placing user accounts on one node and calendar accounts and public folders (not widely used by this customer) on the second node. This allowed us to meet the requirement of full functionality for the two-hour RTO group.
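
Reading the 90-minute figure as roughly ten minutes per storage-group LUN, the sketch below shows the arithmetic that pushed us toward the smaller cluster for the two-hour RTO group (the per-LUN time is a lab observation, not a guaranteed constant):

    # Rough volume-enumeration arithmetic, reading the lab observation above
    # as roughly ten minutes per storage-group LUN. The per-LUN figure is an
    # approximation from our lab, not a guaranteed constant.
    MINUTES_PER_LUN = 10

    def enumeration_minutes(storage_groups):
        return storage_groups * MINUTES_PER_LUN   # one LUN per storage group

    four_node_aaap = enumeration_minutes(9)   # 3 EVS x 3 SGs
    three_node_aap = enumeration_minutes(4)   # 2 EVS x 2 SGs

    print(f"4-node AAAP cluster, 9 SGs: ~{four_node_aaap} minutes just to see its volumes")
    print(f"3-node AAP cluster, 4 SGs : ~{three_node_aap} minutes, leaving room in a 2-hour RTO")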

[Editor's Update - 10/19/2006: A major factor in the time it took for our clusters to acquire volumes and complete failover was the amount of time it took our storage solution to present the LUNs to the servers. However, with some configuration recommendations from Network Appliance we were able to tune the solution to meet the SLA requirements. Our lab environment consisted of a four-node (AAAP) cluster running Windows Server 2003 Enterprise Edition SP1 and Exchange Server 2003 SP2, connected to Network Appliance FAS980 storage. The delays associated with presenting LUNs to the servers were specific to this storage configuration and should not be generalized to dissimilar configurations.]

Along the way we learned a few things. Sufficient Active Directory®-related servers (global catalogs, DNS, and so on) must be available at each site to support the entire solution; the network infrastructure must be fully meshed (so clients can access both datacenters transparently); and there must be an SLA with the network group, because any prolonged network outage between the datacenters would severely hamper the ability to meet the five-minute RPO.

The biggest sticking point in a disaster recovery solution is data replication. You've got to ensure that you either have sufficient bandwidth in place or are willing to provision it. The following are some of the issues we encountered.

Data Replication after Online Defrag The size of the replication data required after the online defrag was surprisingly large. Some further investigation into how the online defrag works and what it actually does shed light on the dataset size; but beware, you're looking at 40 percent to 60 percent of the total database size.
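
To plan the nightly replication window, it helps to put numbers on that 40 to 60 percent range. A minimal sketch with a placeholder database size:

    # Illustrative estimate of the replication burst after the nightly
    # online defrag, using the 40-60 percent range noted above. The
    # database size is a placeholder, not a project figure.
    database_gb = 100
    low_pct, high_pct = 0.40, 0.60

    print(f"Expect roughly {database_gb * low_pct:.0f} to "
          f"{database_gb * high_pct:.0f} GB of changed data to replicate "
          "after the online defrag of a "
          f"{database_gb} GB database")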

Cluster Node Volume Enumeration The time required to enumerate the volumes surprised us, so make sure you build in time for this when determining the SLA. The general guidance is to use a single storage group and a single data store. We found this simplified the process in a failover scenario, but it was just not realistic in terms of the number of users we wanted to support per server.

Backup Validation Another issue was where to validate the backup data. If you run the ESEUtil /K command against the database backup on the production server, you can seriously impact performance. We got around this by copying the data to the standby site and then executing ESEUtil /K there.
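
A sketch of how that validation step might be scripted on the standby site follows. The eseutil.exe path and the database path are placeholders (the binary location assumes a default Exchange Server 2003 installation); only the ESEUtil /K checksum verification itself comes from the scenario above:

    # Hypothetical sketch: validate the replicated backup on the standby
    # site instead of the production server. Paths are placeholders; the
    # eseutil.exe location assumes a default Exchange Server 2003 install.
    import subprocess

    ESEUTIL = r"C:\Program Files\Exchsrvr\bin\eseutil.exe"
    DATABASE = r"E:\Replica\SG1\priv1.edb"   # replicated database copy at the standby site

    result = subprocess.run([ESEUTIL, "/K", DATABASE], capture_output=True, text=True)
    if result.returncode == 0:
        print("Checksum verification passed on the standby copy")
    else:
        print("ESEUtil /K reported a problem; do not trust this backup")
        print(result.stdout)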

Log File Replication The number of log files must be taken into consideration early in the process. Synchronous replication of large volumes of log files requires significant bandwidth. We needed a sustained throughput of 11 MB/sec to handle the log file traffic between datacenters at peak times in order to meet the five-minute RPO.
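
For perspective, that sustained rate converts to raw link bandwidth as follows (simple arithmetic on the 11 MB/sec figure quoted above):

    # Unit check on the sustained log-replication rate quoted above.
    sustained_mb_per_sec = 11
    megabits_per_sec = sustained_mb_per_sec * 8

    print(f"{sustained_mb_per_sec} MB/sec is about {megabits_per_sec} Mb/sec sustained, "
          "so a dedicated 100 Mb/sec link would already be near saturation at peak")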

Final Design

At the end of the day, the solution looked like the one shown in Figure 5. Three different blocks are used to support the recovery requirements of the three tiers.

Figure 5 The Final Configuration


Block 1 and Block 2 consist of three-node clusters of Exchange servers (Active/Active/Passive) for the Tier 1 users and four-node clusters of Exchange servers for the Tier 2 users.

Tier 1 consists of the 4,000 executives and traders and the calendar/resource accounts that require a two-hour RTO. Tier 2 has 30,000 employees with a four-hour RTO.

Both the Block 1 and the Block 2 three-node clusters host 2,000 mailboxes on one Exchange Virtual Server and 1,500 resource accounts on the other Exchange Virtual Server. The replication schedule for each of the LUNs associated with the Tier 1 environment is one copy per hour. Each Exchange Virtual Server of the four-node clusters hosts approximately 2,500 users, or 7,500 per cluster. To minimize the LUN enumeration time and avoid the need for volume mount points, three storage groups per Exchange Virtual Server are used. The replication schedule for the Tier 2 LUNs is one copy every four hours.

Block 3 consists of three-node clusters (two active nodes and one passive); each Exchange Virtual Server hosts approximately 4,000 users, split into two storage groups of 2,000 users each. The replication of the backups happens once per day, after the nightly online defrag. Synchronous log replication ensures that the RPO targets can be met for all the users, regardless of their RTO.

A lot of other technology is tied into this solution, including archiving, journaling, and all the associated data storage and replication—but that's for another article. We achieved our goal of building failover clusters for Exchange to provide a disaster recovery solution that meets aggressive RTO and RPO targets, is fully supported by Microsoft, and avoids the complexity of full geoclustering.

Mark Godfrey is a messaging consultant with Microsoft Consulting Services in Toronto, Canada. His primary focus is on highly available messaging systems including Exchange, LCS, and Active Directory. Contact Mark at Mgodfrey@microsoft.com.

© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.