High Availability Strategies

01/23/2017

Microsoft Exchange Server 2007 will reach end of support on April 11, 2017. To stay supported, you will need to upgrade. For more information, see Resources to help you upgrade your Office 2007 servers and clients.

Applies to: Exchange Server 2007, Exchange Server 2007 SP1, Exchange Server 2007 SP2, Exchange Server 2007 SP3

This topic provides a broad overview of high availability in Microsoft Exchange Server 2007. This topic also introduces the recommended decision process for selecting the availability solution that is appropriate for your organization.

The terms availability and high availability can have very different meanings depending on the context in which they are used and the audience involved. They can be used to describe a variety of business goals and technical requirements, from hardware-only availability targets to mission-critical targets of the messaging service as a whole.

Generally, it is relatively easy for organizations to have inappropriate expectations regarding availability targets. It is also easy for organizations to demand higher levels of availability than they are actually willing to pay for before the cost implications are understood.

The cost implications of most availability solutions include, but are not limited to, the following:

Hardware
Software
Network infrastructure
Training
Serviceability
Operational costs

Serviceability refers to the contractual arrangements made with third-party service providers or operational-level agreements made with information technology divisions inside your organization, to provide or maintain information technology services or components.

Availability

Availability refers to a level of service provided by applications, services, or systems. Highly available systems have minimal downtime, whether planned or unplanned. Availability is frequently expressed as the percentage of time that a service or system is available, for example, 99.9 percent for a service that is unavailable for approximately 8.76 hours per year.

To improve availability, you must implement fault tolerance mechanisms to mask or minimize the impact of failures of the components and dependencies of the service. Fault tolerance is achieved by adding redundancy to components that are single points of failure.

When planning for Microsoft Exchange availability, consider all components that are part of the messaging infrastructure. Some components could also be other services that have subcomponents. The messaging service availability is determined by the availability of each component that is part of the infrastructure.
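As a rough illustration of how component availability determines service availability, the following sketch multiplies the availability of components that must all be working at the same time. It assumes that component failures are independent, which is a simplification, and the component names and figures are hypothetical rather than measured values.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a service that needs every component to be up.

    Assumes failures are independent, so the individual
    availabilities (as fractions between 0 and 1) multiply.
    """
    return prod(component_availabilities)


# Hypothetical figures for illustration only.
components = {
    "mailbox server": 0.999,
    "directory service": 0.999,
    "network": 0.9995,
    "storage": 0.9999,
}

end_to_end = serial_availability(components.values())
print(f"End-to-end availability: {end_to_end:.4%}")
```

Under these assumptions the end-to-end figure (about 99.74 percent) is lower than the availability of any single component, which is why every component in the messaging infrastructure needs to be considered.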

Defining Availability Requirements

The availability of a service is a complex issue that spans many disciplines. Many different approaches can be taken to deliver the required levels of availability, and each has its own cost implications.

However, availability requirements can frequently be expressed in relatively simplistic terms by the customer and without a full understanding of the implications. This situation can lead to misunderstandings between the customer and the information technology organization, inappropriate levels of investment, and ultimately to customer dissatisfaction through inappropriate expectations being set.

One expressed requirement for 99.5 percent availability can be different from another requirement for 99.5 percent availability. One requirement may discuss the availability of the hardware platform alone, and the other requirement may discuss the availability of the complete end-to-end service. Even the definition of complete end-to-end service availability can vary greatly. It is important to understand exactly how any availability requirements are to be measured. For example:

If all hardware and software on the primary server are functioning correctly and user connections are ready to be accepted by the application, is the solution considered 100 percent available?
If there are 100 users but 25 percent of them cannot connect because of a local network failure, is the solution still considered 100 percent available?
If only one user out of the 100 users can connect and process work, is the solution considered only 1 percent available?
If all 100 users can connect but the service is degraded with only two out of three customer transactions being available, or performance is poor, how does this affect availability measurements?
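The questions above can be made concrete with a small sketch that scores the same observations under two different definitions of availability. The `Interval` type, both scoring functions, and the observation data are hypothetical and only illustrate how the chosen definition changes the result.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    minutes: int          # length of the observed interval
    server_up: bool       # hardware and software functioning on the server
    users_total: int      # users who need the service
    users_connected: int  # users who could actually connect and work


def server_based_availability(intervals):
    """Fraction of time the server was able to accept connections."""
    total = sum(i.minutes for i in intervals)
    up = sum(i.minutes for i in intervals if i.server_up)
    return up / total


def user_weighted_availability(intervals):
    """Fraction of user-minutes in which users could actually work."""
    total = sum(i.minutes * i.users_total for i in intervals)
    served = sum(i.minutes * i.users_connected for i in intervals if i.server_up)
    return served / total


# Hypothetical observations: a normal period, a local network failure
# that cuts off 25 of 100 users, and a server outage.
observations = [
    Interval(minutes=900, server_up=True, users_total=100, users_connected=100),
    Interval(minutes=60, server_up=True, users_total=100, users_connected=75),
    Interval(minutes=40, server_up=False, users_total=100, users_connected=0),
]

print(f"Server-based:  {server_based_availability(observations):.1%}")
print(f"User-weighted: {user_weighted_availability(observations):.1%}")
```

The same period scores 96.0 percent by the server-based definition and 94.5 percent by the user-weighted one, so the measurement method needs to be agreed on before a target is set.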

The period over which availability is to be measured can also have a significant effect on the definition of availability. A requirement for 99.9 percent availability over a one-year period allows approximately 8.76 hours of downtime. A requirement for 99.9 percent availability over a rolling four-week window allows only about 40 minutes of downtime in each period.
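The arithmetic behind these downtime budgets can be sketched as follows. The function name is illustrative, and the yearly figure assumes a 365-day year.

```python
def allowed_downtime_minutes(availability_percent: float, window_hours: float) -> float:
    """Return the downtime budget, in minutes, for a measurement window."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * window_hours * 60.0


HOURS_PER_YEAR = 365 * 24          # 8,760 hours, ignoring leap years
HOURS_PER_FOUR_WEEKS = 4 * 7 * 24  # 672 hours

yearly = allowed_downtime_minutes(99.9, HOURS_PER_YEAR)
four_weeks = allowed_downtime_minutes(99.9, HOURS_PER_FOUR_WEEKS)

print(f"99.9% over one year:   {yearly / 60:.2f} hours")
print(f"99.9% over four weeks: {four_weeks:.1f} minutes")
```

The same percentage therefore permits one outage of several hours under a yearly measurement, but only a single outage of roughly 40 minutes under a four-week window.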

It is also necessary to identify and negotiate periods of downtime for planned maintenance activity, service pack updates, and software updates. The amount of planned downtime that can be tolerated has a significant effect on the definition of availability requirements.

The release to manufacturing (RTM) version of Microsoft Exchange Server 2007 includes new features that can reduce costs and increase uptime:

Local continuous replication: Local continuous replication (LCR) is a single-server solution that uses built-in technology to create and maintain a copy of a storage group on a second set of disks that are connected to the same server as the production storage group. LCR provides asynchronous log shipping, log replay, and a quick manual switch to a copy of the data. For more information about LCR, see Local Continuous Replication.
Cluster continuous replication: Cluster continuous replication (CCR) combines the replication and replay features in Exchange 2007 with failover features in Microsoft Cluster services. CCR is a solution that can be deployed with no single point of failure in a single datacenter or between two datacenters. For more information about CCR, see Cluster Continuous Replication. CCR provides several advantages over clustering in previous versions of Exchange Server and single copy clusters in Exchange 2007. For details about these advantages, see Advantages of Cluster Continuous Replication over Single Copy Clusters.
Single copy clusters: Single copy clusters (SCC), known as shared storage clusters in previous versions of Exchange Server, are present in Exchange 2007, with some significant changes and improvements. For more information about SCC, see Single Copy Clusters.

Microsoft Exchange Server 2007 Service Pack 1 (SP1) adds an additional feature designed to provide site resilience:

Standby continuous replication: Standby continuous replication (SCR) is a new feature introduced in Exchange 2007 SP1. As its name implies, SCR is designed for scenarios that use or enable the use of standby recovery servers. SCR extends the existing continuous replication features and enables new data availability scenarios for Exchange 2007 Mailbox servers. SCR uses the same log shipping and replay technology used by LCR and CCR to provide added deployment options and configurations. SCR can be used to replicate data from stand-alone Mailbox servers and clustered mailbox servers. For more information about SCR, see Standby Continuous Replication.

These features provide enhanced recovery opportunities that meet a variety of availability requirements. The following table lists various availability requirements and provides a comparison of Exchange 2007 solutions with disaster recovery solutions that were present in Exchange Server 2003. For more information about high availability configurations for Exchange 2007, see High Availability Deployments.

High availability solution comparison based on availability requirements

Availability requirement	Exchange 2003 solution	Exchange 2007 RTM solution	Exchange 2007 SP1 solution
Long term archival	Daily full backups. Restore backups to identically rebuilt server.	Weekly full backups and daily incremental backups. Restore backups to any server.	Weekly full backups and daily incremental backups. Restore backups to any server.
Respond to user errors	Seven day dumpster default. Over seven days, restore backups to identically rebuilt server.	Fourteen day dumpster default. Over fourteen days, restore backups to any server.	Fourteen day dumpster default. Over fourteen days, restore backups to any server.
Resiliency against failures: disk, hardware, shared storage	Restore backups to identically rebuilt server.	Continuous replication. No restore required. Stand-alone failure or CCR dual failure: Dial tone at alternate location or database portability.	Continuous replication. No restore required. Stand-alone failure or CCR dual failure: Dial tone at alternate location or database portability.
Resiliency against site-wide disaster	Restore backups to identically rebuilt server.	Continuous replication to second site. No restore required. Stand-alone failure or CCR dual failure: Dial tone at alternate location or database portability.	Standby continuous replication to second site. No restore required. Database portability or standby server activation.

Selecting the Appropriate Availability Solution

Several configurations can be used to improve the availability of an Exchange 2007 deployment. A significant step toward selecting the right availability solution is to analyze a selected set of options and determine which solution best matches your business goals and availability requirements. One way to do this is to build a table with a section for each type of failure. In each section of the table, use a row to identify each solution that provides a recovery strategy that is consistent with your availability requirements for the failure. Document the significant factors of the solution in the columns. Typical factors are:

Time to recovery
Data impact of recovery
Associated hardware and software costs
Associated resource costs
Probability of event
Implications on the business
Complexity risks
Third-party solutions
Pros
Cons

After completing these tables, select several solutions for cost analysis. For each selected solution, you should develop an estimated cost per mailbox, which can also be represented in a table. In the costs table, make sure to provide a row that characterizes the quality of the solution against the business goals. Make sure you evaluate several choices. Select at least one solution that satisfies the requirements, but that diverges from the typical solution your organization would choose.
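A cost table of this kind can be sketched as follows. The `CandidateSolution` type, the option names, the three-year planning period, and every cost figure are hypothetical placeholders; real figures would come from your own analysis.

```python
from dataclasses import dataclass


@dataclass
class CandidateSolution:
    name: str
    one_time_cost: float          # hardware, software, and deployment
    annual_operating_cost: float  # operations, training, serviceability
    mailboxes: int
    fit_to_business_goals: str    # the "quality" row of the costs table

    def cost_per_mailbox(self, years: int) -> float:
        """Estimated total cost over the planning period, per mailbox."""
        total = self.one_time_cost + self.annual_operating_cost * years
        return total / self.mailboxes


# Hypothetical options; the second deliberately diverges from the
# solution the organization would typically choose.
candidates = [
    CandidateSolution("Option A (typical choice)", 120_000, 30_000, 4_000,
                      "meets all goals"),
    CandidateSolution("Option B (alternative approach)", 90_000, 25_000, 4_000,
                      "meets all goals, longer double-failure recovery"),
]

PLANNING_YEARS = 3
for c in sorted(candidates, key=lambda c: c.cost_per_mailbox(PLANNING_YEARS)):
    print(f"{c.name}: {c.cost_per_mailbox(PLANNING_YEARS):.2f} per mailbox; "
          f"{c.fit_to_business_goals}")
```

Keeping the quality assessment next to the cost makes it harder to choose the cheapest option without noticing what it gives up.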

Finally, review the business goals, availability requirements, possible solutions, and cost analysis to select your solution. During this process, consider the following keys to making the right decision:

Have a clear set of prioritized business goals: Prioritization is important because different goals are likely to be in conflict.
Challenge the historical truths that might no longer apply: Make use of the full potential of Exchange 2007 during the design and evaluation stage. Experience has shown that the most cost-effective solutions can require taking new approaches to backups, storage, and operations.
Examine single points of failure in your messaging system: A single copy of mailbox data stored on a single storage area network (SAN) means that data is not completely protected against corruption and failure. There are several ways that corruption or failure can occur to that single copy of the data, independent of the amount of redundancy provided by the SAN. With SCC, a SAN failure can cause the service to experience hours of data loss and days of disruption. CCR is an availability solution that may lose some data when a server failure occurs, but it also maintains two copies of the data. CCR mitigates most of the data loss when a server failure occurs using a Hub Transport server feature called the transport dumpster. As a result, mailbox data is preserved for most hard failure circumstances.
Explore the range of storage options that are available for each solution: CCR provides organizations with the option to use a wider range of storage solutions, such as direct attached storage. CCR does not require a SAN fabric, which reduces associated complexity and costs. Direct attached storage, whether SAN or a low-priced storage solution, is easier to deploy and operate.
Consider that CCR and LCR allow you to change your backup strategy from the typical daily full backup to a less frequent full backup and daily incremental strategy: CCR and LCR can also support a shorter service level agreement (SLA) for recovery from the first failure. The SLA for recovery from a double failure (both copies failed or corrupted) may need to be lengthened over your current restore SLA. Changes like this can dramatically reduce your total cost of ownership (TCO) because backup costs are typically a major component of TCO. In addition, switching to a disk-based backup strategy can also reduce your backup costs.
Investigate the use of continuous replication technology in Exchange 2007 to create a solution: CCR makes third-party replication technology unnecessary. Currently, CCR supports two-node clusters, with each node maintaining one copy of the data. A site-resilient solution based on this technology has several benefits:
- It ensures that mailbox data in the backup datacenter is available to clients.
- Continuous replication moves less data than most third-party solutions.
- It requires less integration to create a site-resilient solution.
Create tables that identify the recovery behavior and costs associated with the various options: For the costs table, make sure it contains a few options that challenge your existing practices. Use the facts in the tables you create to design a solution that:
- Provides the best solution to the business requirements.
- Satisfies the cost requirements.
- Represents a level of deployment and operations complexity that your organization can support.

Base Products and Components

The deployment of products and components should be based on their capability to meet stringent availability and reliability requirements. Consider these requirements as the cornerstone of the availability design. The additional investment required to achieve even higher levels of availability will be wasted and availability levels unmet if these base products and components are unreliable and prone to failure.

Service Management Processes

Effective service management processes contribute to higher levels of availability. Processes such as availability management, incident management, problem management, and change management play an important role in the overall management of the messaging service.

Systems Management

Systems management should provide the monitoring, diagnostic, and automated error recovery to enable fast detection and resolution of potential and actual failure.

Special Solutions with Full Redundancy

Approaching continuous availability, close to 100 percent, requires expensive solutions that incorporate full redundancy. Redundancy is the technique of improving availability by using duplicate components. To meet stringent availability requirements, these components need to work autonomously in parallel.
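The effect of duplicate components working in parallel can be sketched with a simple calculation. It assumes that the copies fail independently and that the service stays up as long as at least one copy is working; the 99 percent starting figure is hypothetical.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of a group of redundant components.

    Assumes independent failures and that one working copy is
    enough: the group is down only when every copy is down.
    """
    return 1.0 - (1.0 - component_availability) ** copies


# Hypothetical component that is available 99 percent of the time.
for copies in (1, 2, 3):
    combined = parallel_availability(0.99, copies)
    print(f"{copies} copy/copies: {combined:.6f}")
```

Each added copy reduces the remaining unavailability by a constant factor, but it also adds a full component's worth of cost, which is why the last fractions of a percent are the most expensive to obtain. Failures that affect all copies at once, such as a site-wide disaster, are not covered by this simple model.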

Establishing Availability Goals and SLA Requirements

Achieving high levels of availability begins with the deployment of good quality products and components. However, these products and components alone are unlikely to deliver the sustained levels of availability required. You should consider availability goals in the design process at the earliest possible stage of the development. This approach avoids the potential for increased costs related to rework, unplanned upgrades necessary to meet the required availability, unplanned tools to monitor the infrastructure, unplanned expenditures to eliminate single points of failure in the infrastructure, maintainability, and serviceability.

One of the first steps toward achieving high availability is to review the SLA that you have established for your organization. After you establish an SLA, you can determine the Exchange 2007 deployment and server configurations that are best suited for that agreement.

The following are the key considerations for high availability as they relate to disaster recovery:

Allowed downtime: Consider the maximum allowed downtime that is acceptable for your organization based on your organization's definition of Exchange service availability. Depending on your organization's definition of downtime, you may be able to meet your organization's SLA by using a messaging dial-tone recovery strategy. A messaging dial-tone recovery strategy involves providing your users with a temporary mailbox so that they can send and receive e-mail messages immediately after a disaster. This strategy quickly restores e-mail service in advance of recovering historical mailbox data. Typically, recovery will be completed by eventually merging historical and temporary mailbox data.
Allowed recovery time: Consider the maximum time allowed for each type of disaster recovery operation. For example, you should specify the approximate period of time it takes to recover a mailbox, a single database, or an entire server that is running Exchange 2007.
Data loss tolerance: Consider the tolerance your organization has for either the temporary or permanent loss of Exchange data. For example, your organization may be able to tolerate the temporary loss of mailbox data since the previous backup for a period of 24 hours, as long as users can send and receive messages within a 4-hour time period. In other cases, you may want to consider a stricter requirement, such as requiring that all Exchange data up to the point of failure be restored within 4 hours.

After considering the impact of downtime on your organization and deciding on a level of uptime that you want to achieve in your messaging environment, you are ready to establish an SLA. SLA requirements determine how components, such as storage, clustering, and backup and recovery, factor into your organization.

When assessing SLAs, you should start by identifying the hours of regular operation and the expectations regarding planned downtime. You should then determine your company's expectations regarding availability, performance, and recoverability, including message delivery time, percentage of server uptime, amount of storage required, and time to recover an Exchange database.

Additionally, you should estimate the cost of unplanned downtime so that you can build the appropriate amount of fault tolerance into your messaging system.

Features in Exchange 2007 and Windows Server 2003 may affect how you design your organization to meet SLAs. For example, LCR, CCR, SCCs, the Volume Shadow Copy Service (VSS), recovery storage groups, database portability, and dial-tone portability features could allow you to challenge the limits that were previously imposed by your SLAs.

The following table lists some of the categories and specific elements that you may want to include in your SLAs.

Categories and elements in a typical enterprise-level SLA

SLA categories	Examples of SLA elements
Hours of operation	Hours that the messaging service is available to users; hours reserved for planned downtime (maintenance); amount of advance notice for network changes or other changes that may affect users
Service availability	Percentage of time Exchange services are running; percentage of time mailbox stores are mounted; percentage of time that domain controller services are running
System performance	Number of internal users that the messaging system concurrently supports; number of remotely connected users that the messaging system concurrently supports; number of messaging transactions that are supported per unit of time; acceptable level of performance, such as latency experienced by users
Disaster recovery	Time allowed for recovery of each failure type, such as individual database failure, mailbox server failure, domain controller failure, and site failure; time it takes to provide a backup mail system so that users can send and receive e-mail messages without accessing historical data (called Messaging Dial Tone); amount of time it takes to recover data to the point of failure
Help desk and support	Specific methods that users can use to contact the Help desk; Help desk response time for various classes of problems; Help desk procedures for issue escalation
Other	Amount of storage required per user; number of users who require special features, such as remote access to the messaging system

Including a variety of performance measures in your SLAs helps make sure that you are meeting the specific performance requirements of your users. For example, if there is high latency or low available bandwidth between clients and mailbox servers, users would view the performance level differently from system administrators. Specifically, users would consider the performance level to be poor, although system administrators who look only at the servers would consider the performance to be acceptable. Therefore, make sure that you monitor measures that reflect the user experience, such as client latency, in addition to server-side measures such as disk input/output (I/O) latency.

Note

For each SLA element, you must also determine the specific performance benchmarks that you will use to measure performance together with availability objectives. Additionally, you must determine how frequently you will provide statistics to information technology management and other management.

Establishing Service Level Agreements with Your Vendors

Many businesses that place importance on high availability solutions use the services of third-party vendors to achieve their high availability goals. In these cases, achieving a highly available messaging system requires services from external hardware and software vendors. Unresponsive vendors and poorly trained vendor staff can reduce the availability of the messaging system.

Make sure that you negotiate an SLA with each of your major vendors. Establishing SLAs with your vendors helps guarantee that your messaging system performs to specifications, supports required growth, and is available to a specific standard. The absence of an SLA can significantly increase the length of time that the messaging system is unavailable.

Important

Make sure that your staff knows about the terms of each SLA. For example, many hardware vendor SLAs contain clauses that allow only support personnel from the vendor or certified staff members of your organization to open the server casing. Failure to comply can result in a violation of the SLA and potential nullification of any vendor warranties or liabilities.

In addition to establishing an SLA with your major vendors, you should also periodically test escalation procedures by conducting support-request drills. To confirm that you have the most recent contact information, make sure that you also test pagers and telephone trees.

Considerations for Availability

We recommend that you consider the following issues to determine availability requirements:

Understand the vulnerability to failure of the proposed infrastructure design. You should make sure that there are no single points of failure. A single point of failure is any component within the messaging infrastructure that has no redundancy capability and can affect the user when it fails. The proposed technical design for the solution should cover the full end-to-end configuration.
Consider the minimum levels of availability required by the business for the messaging service, and the minimum reliability, maintainability, and serviceability levels for each component of the messaging infrastructure.
Consider the ability to test or simulate new components to make sure that they match the specified requirements. The testing regime that you establish should confirm that new components within the design can deliver the expected availability. Testing should also be performed when components are serviced. Seriously consider using simulation tools to generate the expected user demand for the new information technology service, to make sure that components continue to operate under volume and stress conditions.

Considerations for High Availability

A highly available messaging solution requires that you invest in and deploy a monitoring solution, service management processes, systems management tools, and redundancy. For high availability deployments, it is important that no single point of failure exists within the end-to-end configuration. The design for high availability must eliminate single points of failure and provide alternative components so that a component failure causes minimal disruption to the business operation. The design must also eliminate or minimize the effects on the business of planned downtime, which is normally required to accommodate maintenance activity such as the implementation of changes to the infrastructure. When designing for recovery, define rapid recovery and service reinstatement as a key goal.

In developing a deployment plan for your messaging solution, you must identify the goals of the solution. This is particularly important as you design the availability characteristics of the solution. Often, a business's goals result in contradictions. For example, your availability goals might include 100 percent availability while also requiring the latest security upgrades to be applied within a week of their availability. Costs are often another factor that creates challenges for the deployment plans. Following a planning methodology that identifies all the business requirements and evaluates the available options against those requirements is the best approach to identifying the right solution for your business.

Successfully achieving high availability requires a continuous focus on the operational practices of your organization. All causes of outages need to be understood. For outages that could be prevented by process changes, the appropriate process changes need to be initiated.

Another key factor in maximizing availability is proactive monitoring of the Exchange environment. By proactively monitoring, problem areas within the system can be identified before they produce failures and outages. In addition, monitoring can alert the operations staff of problems that are not automatically recovered by the system. In such situations, a timely response can shorten the duration of the outage, thus increasing availability.

Exchange 2007 places dependencies on infrastructure within a datacenter. As a result, the availability of Exchange is bounded by the availability delivered by its dependencies. Organizations are encouraged to establish SLAs for each dependency. The SLA must specify the availability of the provided service and the recovery time when a failure occurs. For example, the Active Directory directory service is a key dependency for Exchange. If the availability of Active Directory is lower than the Exchange availability goals, it is likely Exchange will not reach its goals.

The availability of Exchange 2007 is dependent on the availability of other services within the information technology infrastructure. Services like Active Directory and networking must be functioning for Exchange to be functional. The availability of these services directly affects Exchange availability. Therefore, you should ensure that Exchange availability requirements are not higher than the availability requirements for its dependencies. The typical list of dependencies is:

Active Directory
Domain Name System
TCP/IP network
Storage subsystem
Backup services
Monitoring services
Datacenter infrastructure (power and air conditioning)
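One simple way to apply this guidance is to compare the Exchange availability target with the SLA of each dependency and flag any dependency that promises less. The target and the SLA figures below are hypothetical.

```python
def dependencies_below_target(target_percent, dependency_slas):
    """Return the dependencies whose SLA is lower than the target."""
    return sorted(
        name for name, sla in dependency_slas.items() if sla < target_percent
    )


# Hypothetical availability target and dependency SLAs, in percent.
exchange_target = 99.9
dependency_slas = {
    "Active Directory": 99.95,
    "Domain Name System": 99.95,
    "TCP/IP network": 99.9,
    "Storage subsystem": 99.99,
    "Backup services": 99.5,
    "Monitoring services": 99.9,
    "Datacenter infrastructure": 99.99,
}

at_risk = dependencies_below_target(exchange_target, dependency_slas)
print("Dependencies below the Exchange target:", at_risk)
```

Any dependency that appears in the result needs either a stronger SLA or a lower Exchange target.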

After you have established your business goals and your SLAs for the Exchange dependencies, we recommend that you develop an initial list of availability requirements for messaging services. This list should include each general class of failure and the expected recovery time objective (RTO). For data-related failures, this list should include an indication of the failure's impact on the data. This can be specified by indicating a recovery point objective (RPO). An RPO expresses the data impact as a point in time: data written up to that point is expected to be available after recovery. Failures that should be considered include:

Single mail item lost
Single mailbox lost
Database lost or corrupted
Disk failure
Disk volume failure or corruption
Storage unit failure
Server failure
Network connectivity lost
Datacenter failure

In many organizations, the established availability requirements vary based on the type of user. For example, some users might use the messaging system to track deliveries or sales, while others might use it for non-critical messages. The RTO and RPO for users who rely on the messaging system for critical processes need to be as short as possible, whereas those who use the messaging system for non-critical processes can tolerate a longer RTO and RPO.
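A requirements list of this kind can be recorded in a form that is easy to check candidate solutions against. In the following sketch, the `Objective` type, the failure classes selected, and all of the time values are hypothetical examples rather than recommendations.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objective:
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of lost data


# Hypothetical requirements for a few failure classes.
requirements = {
    "Database lost or corrupted": Objective(timedelta(hours=4), timedelta(minutes=15)),
    "Server failure": Objective(timedelta(hours=1), timedelta(minutes=15)),
    "Datacenter failure": Objective(timedelta(hours=24), timedelta(hours=1)),
}

# Hypothetical recovery estimates for one candidate solution.
candidate_estimates = {
    "Database lost or corrupted": Objective(timedelta(hours=2), timedelta(minutes=5)),
    "Server failure": Objective(timedelta(minutes=30), timedelta(minutes=5)),
    "Datacenter failure": Objective(timedelta(hours=48), timedelta(hours=1)),
}


def unmet_requirements(required, estimated):
    """Return failure classes where the estimate misses the RTO or RPO."""
    unmet = []
    for failure, objective in required.items():
        estimate = estimated.get(failure)
        if (estimate is None
                or estimate.rto > objective.rto
                or estimate.rpo > objective.rpo):
            unmet.append(failure)
    return unmet


gaps = unmet_requirements(requirements, candidate_estimates)
print("Requirements not met:", gaps)
```

Where requirements vary by type of user, a separate set of objectives can be kept for each user class and checked in the same way.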

For More Information

For more information about site resilience for Exchange 2007, see Site Resilience Configurations.