Microsoft Exchange Server 2010: High availability strategies

The strategies Microsoft offers for creating highly available Microsoft Exchange mailboxes have evolved over the years.

Excerpted from “Exchange 2010 - A Practical Approach,” published by Red Gate Books (2009).

Jaap Wesselius

Ever since Exchange Server 5.5, Microsoft has offered Windows Clustering as an option for creating a highly available Exchange mailbox environment. A typical shared-storage cluster has two server nodes. Both run Exchange Server and both are connected to a shared storage solution.

In the early days, this shared storage was built on a shared SCSI bus. Later on, it typically used storage-area networks (SANs) with a Fibre Channel or iSCSI network connection. The important part was the shared storage where the Exchange Server databases were located.

Only one server node is the “owner” of this shared data. This node provides the client services and is known as the active node. The other node can’t access this data and is therefore the passive node. A private network between the two server nodes carries intra-cluster communications, such as a heartbeat signal. This lets both nodes determine the cluster state and confirm the other node is still alive.

Besides the two nodes, the cluster creates an “Exchange Virtual Server” as a cluster resource. This has nothing to do with virtual machines. It’s the resource to which Outlook clients connect in order to access their mailboxes. When the active node fails, the passive node takes over the Exchange Virtual Server, which then continues to run. Although users will notice a short downtime during the failover, it’s an otherwise seamless experience. No action is required from the user.
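The active/passive behavior described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not how Windows Clustering is implemented: the names Node, SharedStorageCluster, heartbeat and connect are hypothetical, and the heartbeat is reduced to a single check of an "alive" flag.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    alive: bool = True


class SharedStorageCluster:
    """Toy model of a two-node, shared-storage cluster (all names hypothetical)."""

    def __init__(self, node_a: Node, node_b: Node) -> None:
        self.nodes = [node_a, node_b]
        self.owner = node_a  # the active node owns the shared storage

    def heartbeat(self) -> None:
        """If the owner no longer answers, move the virtual server to the other node."""
        if self.owner.alive:
            return
        survivors = [n for n in self.nodes if n.alive]
        if not survivors:
            raise RuntimeError("no surviving node; the virtual server is offline")
        self.owner = survivors[0]

    def connect(self) -> str:
        """Clients address the virtual server, never a physical node."""
        return f"ExchangeVirtualServer hosted on {self.owner.name}"


cluster = SharedStorageCluster(Node("node1"), Node("node2"))
before = cluster.connect()
cluster.nodes[0].alive = False  # the active node fails
cluster.heartbeat()             # the passive node takes over
after = cluster.connect()
print(before)
print(after)
```

The client calls connect() the same way before and after the failover, which is why no user action is needed. Both nodes still depend on the same storage, so the sketch does nothing to protect against a damaged database.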

Although this solution offers redundancy, there’s still a single point of failure: the shared database of the Exchange server. In a typical environment, this database is stored on a SAN. By its very nature, a SAN is a highly available environment. When something does happen to the database itself, though, such as a logical failure, the database is unavailable to both nodes. This results in total unavailability.

Exchange database replication

With Exchange Server 2007, Microsoft offered a new solution for creating highly available Exchange environments: database replication. Database replication creates a copy of a database, resulting in database redundancy. This technology was available in three flavors:

  • Local Continuous Replication (LCR): This approach creates a copy of the database on the same server.
  • Cluster Continuous Replication (CCR): This creates a copy of the database on another node in a Windows failover cluster (there can only be two nodes in a CCR cluster).
  • Standby Continuous Replication (SCR): This came with Exchange Server 2007 SP1. It creates a copy of a database on any other Exchange Server (not necessarily in the cluster). This isn’t meant for high availability (HA); it’s more for disaster recovery.

This is how database replication works in a CCR clustered environment. Exchange Server 2007 is installed on a Windows Server 2003 or Windows Server 2008 failover cluster. There’s no shared storage in use within the cluster. Each node has its own storage. This can be either on a SAN (Fibre Channel or iSCSI) or direct-attached storage (DAS)—local physical disks.

The active node in the cluster services client requests, and Exchange Server uses the standard database technology with a database, log files and a checkpoint file. When Exchange Server is finished with a log file, it’s immediately sent to the cluster’s passive node. This can either be via a normal network connection or via a dedicated replication network.

The passive node receives the log file and checks it for errors. If it finds none, the data in the log file is replayed into the passive copy of the database. This is an asynchronous process, so the passive copy is always a couple of log files behind the active copy and some information is “missing” from it.
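The ship, inspect and replay cycle can be illustrated with a short Python sketch. It is a simplified model, not the Exchange replication service: LogFile, close_log and PassiveCopy are hypothetical names, and a CRC32 checksum stands in for whatever inspection Exchange actually performs on a received log file.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class LogFile:
    generation: int  # sequence number of the log file
    payload: bytes   # transactions recorded in the file
    checksum: int    # computed by the active node when the file is closed


def close_log(generation: int, payload: bytes) -> LogFile:
    """The active node finishes a log file and stamps it with a checksum."""
    return LogFile(generation, payload, zlib.crc32(payload))


@dataclass
class PassiveCopy:
    replayed: list = field(default_factory=list)

    def receive(self, log: LogFile) -> bool:
        """Inspect a shipped log file and replay it only if it is intact."""
        if zlib.crc32(log.payload) != log.checksum:
            return False  # damaged file: never replay it into the database copy
        self.replayed.append(log.generation)
        return True


# The active node has closed five log files; only three have been shipped so far.
active_logs = [close_log(n, f"transactions {n}".encode()) for n in range(1, 6)]
passive = PassiveCopy()
for log in active_logs[:3]:
    passive.receive(log)

# A file damaged in transit fails inspection and is not replayed.
damaged = LogFile(4, b"damaged in transit", active_logs[3].checksum)
accepted = passive.receive(damaged)

lag = len(active_logs) - len(passive.replayed)
print(accepted, passive.replayed, lag)
```

The value of lag is the number of log files the passive copy is behind. Because shipping is asynchronous, that number is rarely zero at the moment a failover happens.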

In this environment, all messages, even internal messages, are sent via a Hub Transport server. In a CCR environment, the Hub Transport server keeps track of these messages in what is called the “Transport Dumpster.” If the cluster fails over, the formerly passive node requests the information it’s missing, and the Hub Transport server resends it.
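A minimal Python sketch of the idea follows. It assumes a dumpster that simply remembers every message ID it has delivered; the HubTransport class and its deliver and resubmit methods are hypothetical, and a real dumpster would retain only a limited window of recent mail.

```python
class HubTransport:
    """Toy Hub Transport server that remembers what it has delivered."""

    def __init__(self) -> None:
        self.dumpster = []  # IDs of recently delivered messages, in order

    def deliver(self, message_id, active_db) -> None:
        """Deliver a message to the active database and keep a record of it."""
        self.dumpster.append(message_id)
        active_db.add(message_id)

    def resubmit(self, database) -> list:
        """After a failover, resend whatever the new active copy lacks."""
        missing = [m for m in self.dumpster if m not in database]
        database.update(missing)
        return missing


hub = HubTransport()
active_db = set()
for message_id in ("msg-1", "msg-2", "msg-3"):
    hub.deliver(message_id, active_db)

# Replication lagged: the passive copy only holds the first message.
passive_db = {"msg-1"}

# The active node fails; the passive copy asks for what it is missing.
resent = hub.resubmit(passive_db)
print(resent)
```

Messages the copy already holds are skipped, so a resubmission doesn't create duplicates in this model.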

This kind of replication works well and CCR is quite reliable, but there are several potential drawbacks:

  • An Exchange Server 2007 CCR environment runs on Windows Server 2003 or Windows Server 2008 clustering. For many, this adds too much complexity to the environment.
  • Windows Server 2003 clustering in a multi-subnet environment is nearly impossible, although this has improved (but still isn’t perfect) in Windows Server 2008 failover clustering.
  • Site resilience isn’t seamless.
  • CCR clustering is only possible in a two-node environment.
  • All three kinds of replication (LCR, CCR and SCR) are managed differently.

To overcome these issues, Microsoft dramatically improved the replication technology. It also reduced the administrative overhead. It achieved this by completely hiding the cluster components behind the implementation of Exchange Server 2010. The cluster components are still there, but the administration is done entirely with the Exchange Management Console (EMC) or the Exchange Management Shell (EMS).

DAG continuous replication

In Exchange Server 2010, Microsoft introduced the concept of a database availability group (DAG). This is a logical unit of Exchange Server 2010 Mailbox Servers. All Mailbox Servers within a DAG can replicate databases to each other. A single DAG can hold up to 16 Mailbox Servers and up to 16 copies of a database.

The idea of multiple database copies in one Exchange organization is called database mobility. One database exists on multiple servers, and every copy is 100 percent identical and therefore has the same GUID.

With a DAG in place, clients connect to an active database. This is the database where all data is stored first. New SMTP messages, either from outside or inside the organization, are written to this database before they reach any copy.

When the Exchange Server has finished with one of the database’s log files, it replicates the file to other servers. You choose which servers receive a copy of the database. The log file is inspected upon receipt and, if everything is all right, the information in the log file is replayed into the local copy of the database.

In Exchange Server 2010, all clients connect to the Client Access Server, including all Messaging Application Programming Interface, or MAPI, clients such as Microsoft Outlook. Supported Outlook clients in Exchange Server 2010 include Outlook 2003, Outlook 2007 and Outlook 2010.

So the Outlook client connects to the Client Access Server, which then connects to the mailbox in the active copy of the database. Unfortunately, this is only true for mailbox databases. When an Outlook client needs to access a public folder database, the client still accesses the mailbox server directly.

When the active copy of a database or its server fails, one of the passive copies of the database becomes active. You can configure the failover order during the database copy configuration process. The Client Access Server automatically notices the failover and starts using the new active database. Because the Outlook client is connected to the Client Access Server and not directly to the database, a database failover is fully transparent. Messages such as, “The connection to the server was lost,” and, “The connection to the server is restored,” simply don’t appear anymore.
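The following Python sketch models why the failover is invisible to Outlook. It rests on simplifying assumptions: the failover order is reduced to a single preference number per copy (lower is tried first), and the names DatabaseCopy, DatabaseAvailabilityGroup, ClientAccessServer, fail_over and open_mailbox are hypothetical. The selection logic in a real DAG takes more factors into account.

```python
from dataclasses import dataclass


@dataclass
class DatabaseCopy:
    server: str
    preference: int      # lower number is tried first (assumed convention)
    healthy: bool = True


class DatabaseAvailabilityGroup:
    """Toy DAG holding several copies of one mailbox database."""

    def __init__(self, copies) -> None:
        self.copies = sorted(copies, key=lambda c: c.preference)
        self.active = self.copies[0]

    def fail_over(self) -> None:
        """Mark the active copy as failed and activate the next healthy copy."""
        self.active.healthy = False
        candidates = [c for c in self.copies if c.healthy]
        if not candidates:
            raise RuntimeError("no healthy copy left to activate")
        self.active = candidates[0]


class ClientAccessServer:
    """Clients talk to this server; it looks up the active copy on each request."""

    def __init__(self, dag: DatabaseAvailabilityGroup) -> None:
        self.dag = dag

    def open_mailbox(self, user: str) -> str:
        return f"{user} -> CAS -> {self.dag.active.server}"


dag = DatabaseAvailabilityGroup([
    DatabaseCopy("mbx2", preference=2),
    DatabaseCopy("mbx1", preference=1),
    DatabaseCopy("mbx3", preference=3),
])
cas = ClientAccessServer(dag)

before = cas.open_mailbox("alice")
dag.fail_over()                     # the active copy on mbx1 fails
after = cas.open_mailbox("alice")   # same call, different back end
print(before)
print(after)
```

The client-facing call never changes. Only the server name behind the Client Access Server does, which is the reason the connection-lost messages no longer appear.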

When building a highly available mailbox server environment in a DAG, there’s no need to build a failover cluster in advance. You can add mailbox servers to the DAG on the fly. However, the DAG still relies on some failover clustering components to function properly. These are installed during the DAG configuration. You do all DAG and database copy management via the EMC or the EMS. You no longer have to use the Windows Cluster Manager.

The DAG with database copies is the only HA technology Exchange Server 2010 uses. Older technologies such as LCR, CCR and SCR are no longer available. The traditional single-copy cluster with shared storage is no longer supported, either.

Configuring a DAG is no longer limited to servers holding just the mailbox server role. It’s possible to create a two-server configuration with the Hub Transport, Client Access and Mailbox Server roles on both servers, and then create a DAG and configure database copies.

However, it isn’t an HA configuration for the Client Access or Hub Transport servers unless you’ve put load balancers in front of them. You can’t use the default Windows Network Load Balancing in combination with the failover clustering components. Nevertheless, this is a great improvement for smaller deployments of Exchange Server 2010 where HA is still required.

Jaap Wesselius is the founder of DM Consultants, a company with a strong focus on messaging and collaboration solutions. After working at Microsoft for eight years, Wesselius decided to commit more of his time to the Exchange community in the Netherlands, resulting in an Exchange Server MVP award in 2007. He’s also a regular contributor at the Dutch Unified Communications User Group and a regular author for Simple-Talk.

Learn more about “Exchange 2010 - A Practical Approach” at red-gate.com/our-company/about/book-store.