Understanding Availability and Site Resilience Solutions for Exchange Server 2003

Article
01/24/2008

Companies are often evaluated on their ability to maintain operations and provide services at all times. An interruption in services or operations can have a devastating impact on a company's reputation, and it can be very expensive.

Generally speaking, availability is accomplished through a balance of risk-reduction measures, such as site resilience, and recovery options, including backup facilities. The term site resilience refers to the ability to extend the logical topology of a particular infrastructure from its primary physical location to a secondary physical location.

Microsoft® Exchange Server 2003 offers out-of-the-box features that you can use in site-resilient infrastructures. In addition to these out-of-the-box features, third-party storage vendors offer solutions that enable the deployment of Exchange Server 2003 in a site-resilient infrastructure.

In this article, I examine various factors that affect site resilience solutions and provide best practice recommendations for implementing site resilience solutions. I don't discuss specific solutions from third-party storage vendors, but I do explain the three main approaches that are typically used by third-party storage vendors.

Understanding the Cost of Availability

Although they are sometimes used interchangeably, the terms availability and high availability can mean different things depending on the context in which they are used and the audience that is involved. In this article, availability refers to the ability of a component or service to perform its required function at a stated instant or over a stated period of time. High availability refers to the minimization or masking of failures by implementing fault tolerance and redundancy.

Because these terms are often misunderstood, it is relatively easy for customers to have inappropriate expectations about availability targets. It is also easy for customers to demand higher levels of availability than they are willing to pay for.
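
One way to make availability targets concrete is to translate a percentage into the downtime it allows over a year. The following is a minimal sketch, assuming availability is expressed as the percentage of time a service is up; the function name and the 365-day year are illustrative assumptions, not part of the original article.

```python
# Assumption: a year is treated as 365 days, ignoring leap years.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Illustrative targets only; real targets come from your requirements.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, a 99.9 percent target leaves roughly 525.6 minutes (about 8.76 hours) of downtime per year. Each additional "nine" cuts that budget by a factor of ten, which is why higher targets tend to cost disproportionately more.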

Cost implications include, but are not limited to, the following items:

Hardware
Software
Network infrastructure
Staffing
Training
Facilities
Serviceability: the contractual arrangements made with third-party service providers, or operational level agreements made with IT divisions inside your organization, to provide or maintain IT services or components.
Operational costs

The Microsoft Operations Framework (MOF) covers availability management and service continuity concepts, including cost considerations, in detail.

Enhancing Availability

You can enhance availability by implementing redundancy. The following features of the various Exchange Server 2003 server roles help you implement redundancy:

Mailbox servers: You can take advantage of the Cluster service in Microsoft Windows Server™ 2003 to provide high availability for mailbox servers.
Public folder servers: You can take advantage of built-in replication to provide high availability for public folder servers. Built-in replication works by maintaining a copy of the folders on one or more other servers. Public folder stores can also reside in Windows server clusters.
Bridgehead servers: With Exchange Server 2003 SMTP connectors and routing group connectors, multiple servers can be configured as source bridgeheads, thereby providing high availability for transport. In addition, SMTP servers can be part of a Network Load Balancing (NLB) cluster to provide SMTP server high availability.
Front-end servers: Exchange Server 2003 front-end servers can be part of an NLB cluster to provide availability for Microsoft Office Outlook® Web Access and other Internet-based Exchange clients.

Third-Party Storage Solutions

The Exchange Server 2003 high availability and site resilience solutions provided by third-party storage vendors typically use one or a combination of the following approaches:

Data replication
Geographically dispersed clusters
Standby clusters

From an application point of view, Exchange Server 2003 is not aware of such solutions. These solutions are built so that no configuration change is required for Exchange Server 2003, and the solution works transparently to Exchange Server.

Data Replication

The Deployment Guidelines for Exchange Server Multi-Site Data Replication explains in detail the concepts behind replication technologies for Exchange Server 2003. However, as you explore availability and site resilience solutions for Exchange Server 2003, it is important to understand some fundamental concepts behind replication solutions:

Where does replication occur? Replication can occur at the host level or at the storage system level.
How does replication occur?
- Host-based replication uses software to intercept the I/O and manage the replication process. In most cases, host-based replication uses a filter driver.
- Storage-based replication occurs at the storage system level.
How does replication work?
- Synchronous replication implies that the data is written to both the primary storage and secondary storage before the host receives a write complete response.
- Asynchronous replication implies that the host receives the write complete response from the primary storage after the data is written to it, and then replication occurs in the background.
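
The difference between the two modes can be sketched with a toy model. This is only an illustration of when the host receives its acknowledgment, not how any vendor implements replication; the class and method names are hypothetical.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a primary volume with one remote replica (hypothetical)."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary = []        # blocks written at the primary site
        self.secondary = []      # blocks written at the secondary site
        self.pending = deque()   # blocks acknowledged but not yet replicated

    def write(self, block: str) -> str:
        """Write one block and return the acknowledgment sent to the host."""
        self.primary.append(block)
        if self.synchronous:
            # Synchronous: the remote copy is written before the host is told.
            self.secondary.append(block)
        else:
            # Asynchronous: acknowledge now, replicate later in the background.
            self.pending.append(block)
        return "write complete"

    def replicate_pending(self) -> None:
        """Drain queued writes to the secondary in their original order."""
        while self.pending:
            self.secondary.append(self.pending.popleft())

    def unreplicated_blocks(self) -> int:
        """Blocks that would be missing if the primary site failed right now."""
        return len(self.primary) - len(self.secondary)


if __name__ == "__main__":
    volume = ReplicatedVolume(synchronous=False)
    volume.write("log-1")
    volume.write("log-2")
    print("Unreplicated before background pass:", volume.unreplicated_blocks())
    volume.replicate_pending()
    print("Unreplicated after background pass:", volume.unreplicated_blocks())
```

In this model, a synchronous volume never has unreplicated blocks at the moment of acknowledgment, while an asynchronous volume can. The queue is drained in first-in, first-out order to reflect the need to preserve write order on the secondary storage.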

Table 1 shows the advantages and disadvantages of synchronous replication and asynchronous replication.

Table 1 Advantages and disadvantages of synchronous and asynchronous replication

Replication type	Advantages	Disadvantages
Synchronous replication	Synchronous replication typically ensures that no data is lost because the data is written to both the primary storage and secondary storage before a write complete response is received. There is no difference between the primary storage data and secondary storage data. The replicated Exchange Server data is fully supported by Microsoft.	There are performance and scalability constraints. Distance affects write latency, especially log write latency. High latency requires a reduction of the I/O write demand to maintain an acceptable user experience. Therefore the number of mailboxes per server must be reduced. Synchronous replication is relatively expensive.
Asynchronous replication	There is no significant performance and scalability impact because the host does not have to wait for the data to be written to the remote storage before receiving a write complete response. Asynchronous replication is relatively inexpensive.	The replicated Exchange Server data is not supported by Microsoft. Data on the secondary storage may not always be current. Incorrect write order preservation may cause corruption of the Exchange Server data. The third-party storage vendor must make sure that the data is written correctly.
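
To see why distance constrains synchronous replication, consider a rough back-of-the-envelope model. It assumes that log writes are issued one at a time and that each synchronous write waits for the local write plus one network round trip to the secondary storage. The function name and the latency figures are illustrative assumptions, not measurements or vendor guidance.

```python
def max_serialized_writes_per_second(
    local_write_ms: float, round_trip_ms: float = 0.0
) -> float:
    """Upper bound on back-to-back writes when each waits for the previous one.

    Assumption: one write costs the local write latency plus, for
    synchronous replication, one round trip to the secondary storage.
    """
    total_ms = local_write_ms + round_trip_ms
    if total_ms <= 0:
        raise ValueError("total latency must be greater than zero")
    return 1000.0 / total_ms


if __name__ == "__main__":
    # Illustrative figures: 1 ms local write, 4 ms round trip between sites.
    local_only = max_serialized_writes_per_second(1.0)
    with_replication = max_serialized_writes_per_second(1.0, 4.0)
    print(f"Local only: {local_only:.0f} writes per second")
    print(f"Synchronous replication: {with_replication:.0f} writes per second")
```

With these example figures, the ceiling drops from 1,000 to 200 serialized writes per second. Real systems batch and overlap I/O, so the actual impact differs, but the model shows why added latency can force a reduction in the number of mailboxes per server.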

Solutions for Geographically Dispersed Clusters

The solutions for geographically dispersed clusters are also third-party storage vendor solutions that rely on Windows Clustering technology.

When you consider using geographically dispersed clusters for Exchange Server 2003, note the following points:

Geographically dispersed cluster solutions are based on Windows Clustering technology. Geographically dispersed cluster solutions are built by using a combination of hardware and software from third-party storage vendors. You can find detailed information in the Windows Server Catalog at Cluster Solutions, Geographically Dispersed Cluster Solution.
Geographically dispersed clusters can be shared quorum clusters, with a minimum of two nodes, or majority node set (MNS) type clusters, with a minimum of three nodes.
Geographically dispersed clusters rely on data replication, which is part of the solution.
Each node of the cluster accesses a replica of the shared storage.
Geographically dispersed cluster solutions require expansion of the subnet that is used by the public network interface of the cluster to a secondary physical location.
Geographically dispersed cluster solutions require expansion of the cluster heartbeat subnet to a secondary physical location. In all cases, network latency must not exceed 500 ms.
Specifically for Exchange Server, a synchronous replication solution is required for the replicated Exchange Server data to be supported by Microsoft.
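
The three-node minimum for MNS clusters follows from majority arithmetic. The sketch below assumes the general majority node set rule that a cluster keeps running only while more than half of its configured nodes are available; that rule is background knowledge rather than something this article states, and the function names are hypothetical.

```python
def nodes_needed_for_majority(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if total_nodes < 1:
        raise ValueError("total_nodes must be at least 1")
    return total_nodes // 2 + 1


def tolerated_node_failures(total_nodes: int) -> int:
    """Number of nodes that can fail while a majority remains available."""
    return total_nodes - nodes_needed_for_majority(total_nodes)


if __name__ == "__main__":
    for nodes in (2, 3, 4, 5):
        print(
            f"{nodes} nodes: majority is {nodes_needed_for_majority(nodes)}, "
            f"tolerates {tolerated_node_failures(nodes)} failure(s)"
        )
```

Under this rule, a two-node MNS cluster cannot survive the loss of either node, whereas three nodes can tolerate one failure. When you place nodes across two sites, it is worth checking which site holds the majority, because losing that site leaves the remaining nodes without quorum.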

Now let’s look at some of the advantages and disadvantages of using geographically dispersed clusters for Exchange Server 2003.

Advantages of Geographically Dispersed Clusters

The following are some of the advantages of using geographically dispersed clusters for Exchange Server 2003:

Geographically dispersed cluster solutions provide site resilience for mailbox clusters.
Geographically dispersed cluster solutions are fully transparent to users and require minimal manual intervention to fail over.
With synchronous replication, there is no data loss.
There is minimal impact to users during failover.
Windows Hardware Quality Labs (WHQL) qualified solutions are supported by Microsoft.

Disadvantages of Geographically Dispersed Clusters

The following are some of the disadvantages of using geographically dispersed clusters for Exchange Server 2003:

Geographically dispersed cluster solutions are relatively expensive to implement.
Geographically dispersed cluster solutions are relatively complex to implement.
Geographically dispersed cluster solutions require higher levels of operational maturity and processes.
Performance and scalability are affected by the use of synchronous replication.
The number of mailboxes per server is reduced in comparison to a standalone Exchange server.

Standby Clusters

Standby clusters can be used in site resilience solutions for Exchange Server 2003 mailbox clusters. A standby cluster for Exchange Server 2003 is a Windows server cluster with the following characteristics:

A standby cluster is identical to the production Exchange cluster in terms of hardware and software configuration, including versions of Windows Server and Exchange Server, and software updates.
Exchange Server program files are installed on the standby cluster, but the standby cluster is not yet configured with any Exchange Virtual Servers.
The standby cluster can be used only when all Exchange Virtual Servers on the production cluster are offline.

A standby cluster can provide messaging dial tone capabilities or, when used with data replication solutions, full data at the secondary location. Messaging dial tone capability is a strategy that enables users to send and receive messages by using temporary, empty mailboxes while data recovery efforts occur.

The use of standby clusters for Exchange Server 2003 is described in detail in the Exchange Server 2003 Disaster Recovery Operations Guide.

Best Practices for Site Resilience Solutions

When you plan site resilience solutions for Exchange Server 2003, consider the following best practices:

Determine your availability requirements.
Understand what the alternative site infrastructure will look like.
- Will the location have empty accommodations that contain power, environmental controls, networking equipment, and telecommunications infrastructure so that an organization can install its own computer equipment in a disaster recovery situation? Disaster recovery refers to the process of recovering user data and configuration data from a backup source in order to restore service availability.
- Will the location have computer equipment and infrastructure that are appropriate for and ready to recover service?
- Will the location have dedicated computer equipment that duplicates the company’s critical business systems and that is ready to take over immediately with minimal or no loss of data?
Are other site-resilient services already in place in your infrastructure?
Do you have service-level agreements in place? A service-level agreement is a written agreement between a service provider and its customer that documents agreed-on availability levels for a service. Customers frequently don’t have well-defined service-level agreements for their current environment, and many customers have unclear availability requirements for site resilience.
Try to engage the third-party storage vendor throughout the process.
Plan for a proof of concept, and make sure that the solutions you want to implement are tested and validated with Microsoft and with the third-party storage vendor.
Plan for site failover simulations in production.
Determine how long the solution should sustain the alternative site mode.
Make sure that all dependency requirements for site resilience are well understood. These dependency requirements can include the following:
- Network
- Client connectivity and redirection
- Name resolution
- Active Directory® directory service
- Transport connectivity
- Operations readiness
