Establishing Availability Goals and SLA Requirements

Topic Last Modified: 2007-07-16

Achieving high levels of availability begins with the deployment of good quality products and components. However, these products and components alone are unlikely to deliver the sustained levels of availability required. You should consider availability goals at the earliest possible stage of the design process. This approach avoids the increased costs of rework, such as unplanned upgrades needed to meet the required availability, unplanned tools to monitor the infrastructure, and unplanned expenditures to eliminate single points of failure or to improve maintainability and serviceability.

One of the first steps toward achieving high availability is to review the service level agreement (SLA) that you have established for your organization. After you establish an SLA, you can determine the Microsoft Exchange Server 2007 deployment and server configurations that are best suited for that agreement.

The following are the key considerations for high availability as they relate to disaster recovery:

  • Allowed downtime   Consider the maximum allowed downtime that is acceptable for your organization based on your organization's definition of Exchange service availability. Depending on your organization's definition of downtime, you may be able to meet your organization's SLA by using a messaging dial tone recovery strategy. A messaging dial tone recovery strategy involves providing your users with a temporary mailbox so that they can send and receive e-mail messages immediately after a disaster. This strategy quickly restores e-mail service in advance of recovering historical mailbox data. Typically, recovery will be completed by eventually merging historical and temporary mailbox data.
  • Allowed recovery time   Consider the maximum time allowed for each type of disaster recovery operation. For example, you should specify the approximate period of time it takes to recover a mailbox, a single database, or an entire server that is running Exchange 2007.
  • Data loss tolerance   Consider the tolerance your organization has for either the temporary or permanent loss of Exchange data. For example, your organization may be able to tolerate losing access for up to 24 hours to mailbox data created since the previous backup, as long as users can send and receive messages again within 4 hours. In other cases, you may want to consider a stricter requirement, such as requiring that all Exchange data up to the point of failure be restored within 4 hours.

After considering the impact of downtime on your organization and deciding on a level of uptime that you want to achieve in your messaging environment, you are ready to establish an SLA. SLA requirements determine how components such as storage, clustering, and backup and recovery factor into your organization.

When assessing SLAs, you should start by identifying the hours of regular operation and the expectations regarding planned downtime. You should then determine your company's expectations regarding availability, performance, and recoverability, including message delivery time, percentage of server uptime, amount of storage required, and time to recover an Exchange database.
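As a rough illustration of how hours of operation, planned downtime, and an uptime percentage interact, the following Python sketch converts an availability target into a budget for unplanned downtime. The function name, the 30-day period, and the 4-hour maintenance window are assumptions chosen for the example, not values taken from Exchange 2007 guidance.

```python
def allowed_downtime_minutes(availability_pct: float, service_hours: float) -> float:
    """Return the unplanned-downtime budget, in minutes, for one measurement period.

    availability_pct: target uptime as a percentage (for example, 99.9).
    service_hours:    hours in the period during which the service is expected
                      to be available (planned maintenance already excluded).
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return service_hours * 60.0 * (1.0 - availability_pct / 100.0)


# Hypothetical 30-day month with 4 hours reserved for planned maintenance.
service_hours = 30 * 24 - 4

for target in (99.0, 99.9, 99.99):
    budget = allowed_downtime_minutes(target, service_hours)
    print(f"{target}% over {service_hours} h allows {budget:.1f} min of unplanned downtime")
```

Whether planned maintenance counts against the availability figure is itself an SLA decision. This sketch assumes that it does not.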

Additionally, you should identify the estimated cost of unplanned downtime so that you can build the appropriate amount of fault tolerance into your messaging system.
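One simple way to reason about this trade-off is to compare the downtime cost that a higher availability level would avoid with the cost of the fault tolerance needed to reach it. The Python sketch below assumes a flat hourly cost of downtime and a service that is expected to run all year. All figures and function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumes the service is expected to run year-round


def expected_downtime_cost(availability_pct: float, cost_per_hour: float,
                           hours: float = HOURS_PER_YEAR) -> float:
    """Estimate the yearly cost of unplanned downtime at a given availability level."""
    downtime_hours = hours * (1.0 - availability_pct / 100.0)
    return downtime_hours * cost_per_hour


def downtime_cost_avoided(current_pct: float, target_pct: float,
                          cost_per_hour: float) -> float:
    """Estimate the yearly downtime cost avoided by moving from current_pct to target_pct."""
    return (expected_downtime_cost(current_pct, cost_per_hour)
            - expected_downtime_cost(target_pct, cost_per_hour))


# Hypothetical inputs: downtime costs 5,000 per hour, and the additional
# fault tolerance would cost 250,000 per year.
avoided = downtime_cost_avoided(99.0, 99.9, cost_per_hour=5_000.0)
annual_investment = 250_000.0

print(f"Estimated downtime cost avoided per year: {avoided:,.0f}")
print("Investment justified on cost alone:", avoided > annual_investment)
```

A real estimate would also account for costs that do not scale linearly with outage length, such as contractual penalties or lost customer confidence.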

Features in Exchange 2007 and Microsoft Windows Server 2003 may affect how you design your organization to meet SLAs. For example, local continuous replication (LCR), cluster continuous replication (CCR), single copy clusters (SCC), the Volume Shadow Copy Service (VSS), recovery storage groups, database portability, and dial tone portability features could allow you to challenge the limits that were previously imposed by your SLAs.

Table 1 lists some of the categories and specific elements that you may want to include in your SLAs.

Table 1   Categories and elements in a typical enterprise-level SLA

SLA categories (each followed by examples of SLA elements)

Hours of operation

  • Hours that the messaging service is available to users
  • Hours reserved for planned downtime (maintenance)
  • Amount of advance notice for network changes or other changes that may affect users

Service availability

  • Percentage of time Exchange services are running
  • Percentage of time mailbox stores are mounted
  • Percentage of time that domain controller services are running

System performance

  • Number of internal users that the messaging system supports concurrently
  • Number of remotely connected users that the messaging system supports concurrently
  • Number of messaging transactions that are supported per unit of time
  • Acceptable level of performance, such as latency experienced by users

Disaster recovery

  • Time allowed for recovery of each failure type, such as individual database failure, mailbox server failure, domain controller failure, and site failure
  • Time it takes to provide a backup mail system so that users can send and receive e-mail messages without accessing historical data (called Messaging Dial Tone)
  • Amount of time it takes to recover data to the point of failure

Help desk and support

  • Specific methods that users can use to contact the Help desk
  • Help desk response time for various classes of problems
  • Help desk procedures for issue escalation

Other

  • Amount of storage required per user
  • Number of users who require special features, such as remote access to the messaging system

Including a variety of performance measures in your SLAs helps make sure that you are meeting the specific performance requirements of your users. For example, if there is high latency or low available bandwidth between clients and mailbox servers, users and system administrators would view the performance level differently: users would consider it poor, whereas system administrators, who typically measure performance at the server, would consider it acceptable. Therefore, make sure that you monitor performance as users experience it in addition to server-side measures such as disk I/O latency levels.

Note

For each SLA element, you must also determine the specific performance benchmarks that you will use to measure performance together with availability objectives. Additionally, you must determine how frequently you will provide statistics to information technology management and other management.

Establishing Service Level Agreements with Your Vendors

Many businesses that place importance on high availability solutions use the services of third-party vendors to achieve their high availability goals. In these cases, achieving a highly available messaging system requires services from outside hardware and software vendors. Unresponsive vendors and poorly trained vendor staff can reduce the availability of the messaging system.

Make sure that you negotiate an SLA with each of your major vendors. Establishing SLAs with your vendors helps guarantee that your messaging system performs to specifications, supports required growth, and is available to a specific standard. The absence of an SLA can significantly increase the length of time that the messaging system is unavailable.

Important

Make sure that your staff knows about the terms of each SLA. For example, many hardware vendor SLAs contain clauses that allow only support personnel from the vendor or certified staff members of your organization to open the server casing. Failure to comply can result in a violation of the SLA and potential nullification of any vendor warranties or liabilities.

In addition to establishing an SLA with your major vendors, you should also periodically test escalation procedures by conducting support-request drills. To confirm that you have the most recent contact information, make sure that you also test pagers and telephone trees.

Considerations for Availability

We recommend that you consider the following issues to determine availability requirements:

  • Understand the vulnerability to failure of the proposed infrastructure design. You should make sure that there are no single points of failure. A single point of failure is any component within the messaging infrastructure that has no redundancy capability and can affect the user when it fails. The proposed technical design for the solution should cover the full end-to-end configuration. For high availability deployments, it is important that no single point of failure exists within this end-to-end configuration.
  • Consider the minimum levels of availability required by the business for the messaging service, and the minimum reliability, maintainability, and serviceability levels for each component of the messaging infrastructure.
  • Consider the ability to test or simulate new components to make sure that they match the specified requirements. To assess whether new components within the design can meet the stated requirements, establish a testing regime that verifies that the expected availability can be delivered. Testing should also be performed when components are serviced. Seriously consider using simulation tools that generate the expected user demand for the new information technology service, so that you can confirm that components continue to operate under volume and stress conditions.

Considerations for High Availability

A highly available messaging solution requires that you invest in and deploy a monitoring solution, service management processes, systems management tools, and redundancy. In developing a deployment plan for your messaging solution, you must identify the goals of the solution. This is particularly important as you design the availability characteristics of the solution. Often, a business’s goals contradict one another. For example, your availability goals might include 100 percent availability while also requiring the latest security upgrades to be applied within a week of their release, even though applying upgrades usually requires some downtime. Cost is often another factor that creates challenges for deployment plans. Following a planning methodology that identifies all the business requirements and evaluates the available options against those requirements is the best approach to identifying the right solution for your business.

Achieving high availability requires a continuous focus on the operational practices of your organization. All causes of outages need to be understood. For outages that could be prevented by process changes, the appropriate process changes need to be initiated.

Another key factor in maximizing availability is proactive monitoring of the environment. Through proactive monitoring, you can identify problem areas within the system before they produce failures and outages. In addition, monitoring can alert the operations staff to problems from which the system does not recover automatically. In such situations, a timely response can shorten the duration of the outage and thereby increase availability.

Exchange 2007 places dependencies on infrastructure within a datacenter. As a result, the availability of Exchange is bounded by the availability delivered by its dependencies. Organizations are encouraged to establish SLAs for each dependency. The SLA must specify the availability of the provided service and the recovery time when a failure does occur. For example, Active Directory is a key dependency for Exchange. If the availability of Active Directory is lower than the Exchange availability goals then it is likely Exchange will not reach its goals. For more information about SLAs, see Establishing Service Level Agreement Requirements.

Services such as Active Directory and networking must be functioning for Exchange to function, so the availability of these services directly affects Exchange availability. You should therefore make sure that Exchange availability requirements are not higher than the availability requirements for its dependencies. Typical dependencies include:

  • Active Directory
  • Domain Name System
  • TCP/IP network
  • Storage sub-system
  • Backup services
  • Monitoring services
  • Datacenter infrastructure (power and air conditioning)
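To see why Exchange availability is bounded by its dependencies, consider a simplified model in which Exchange is available only when every dependency is available and the dependencies fail independently of one another. Under those assumptions, the combined availability is the product of the individual availabilities. The Python sketch below uses hypothetical figures, and the assumption of independence rarely holds exactly in practice.

```python
from math import prod


def composite_availability(availabilities):
    """Combined availability of services that must all be up at the same time.

    Assumes failures are independent; each value is a fraction between 0 and 1.
    """
    return prod(availabilities)


# Hypothetical availability figures for a subset of the dependencies listed above.
dependencies = {
    "Active Directory": 0.9995,
    "Domain Name System": 0.9995,
    "TCP/IP network": 0.999,
    "Storage sub-system": 0.9999,
    "Datacenter infrastructure": 0.9999,
}

overall = composite_availability(dependencies.values())
weakest = min(dependencies, key=dependencies.get)

print(f"Upper bound on Exchange availability: {overall:.4%}")
print(f"Weakest dependency: {weakest} ({dependencies[weakest]:.2%})")
```

In this model the combined figure is never higher than that of the weakest dependency, which is why an Exchange availability goal above that of any one dependency is unlikely to be met.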

After you have established your business goals and your SLAs for the Exchange dependencies, we recommend that you develop an initial list of availability requirements for messaging services. This list should include each general class of failure and the expected Recovery Time Objective (RTO). For data-related failures, this list should also indicate the failure’s impact on the data by specifying a Recovery Point Objective (RPO). An RPO specifies the point in time to which data must be recoverable, and therefore how much recent data may be unavailable after recovery. Failures that should be considered include:

  • Single mail item lost
  • Single mailbox lost
  • Database lost or corrupted
  • Disk failure
  • Disk volume failure or corruption
  • Storage unit failure
  • Server failure
  • Network connectivity lost
  • Datacenter failure
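A requirements list of this kind can be kept as simple structured data and checked against the results of recovery drills. The following Python sketch is one possible representation. The class name, the field names, and every RTO and RPO value are hypothetical placeholders, not recommendations.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    failure_class: str
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of data that may be lost


def meets_objective(objective: RecoveryObjective,
                    recovery_time: timedelta,
                    data_loss_window: timedelta) -> bool:
    """Return True if a measured recovery stays within both the RTO and the RPO."""
    return recovery_time <= objective.rto and data_loss_window <= objective.rpo


# Hypothetical objectives for two of the failure classes listed above.
objectives = [
    RecoveryObjective("Single mailbox lost", rto=timedelta(hours=8), rpo=timedelta(hours=24)),
    RecoveryObjective("Database lost or corrupted", rto=timedelta(hours=4), rpo=timedelta(0)),
]

# Hypothetical result of a recovery drill for a lost database.
drill_ok = meets_objective(objectives[1],
                           recovery_time=timedelta(hours=3),
                           data_loss_window=timedelta(minutes=15))
print("Database drill met its objectives:", drill_ok)
```

In this example the drill restores service within the RTO but loses 15 minutes of data against an RPO of zero, so it does not meet its objectives.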

In many organizations, the established availability requirements vary based on the type of user. For example, some users might use the messaging system to track deliveries or sales, while others might use it for non-critical messages. The RTO and RPO for users who rely on the messaging system for critical processes need to be as short as possible, whereas users who rely on it for non-critical processes can tolerate a longer RTO and RPO.

Base Products and Components

The deployment of products and components should be based on their capability to meet stringent availability and reliability requirements. Consider these requirements as the cornerstone of the availability design. If these base products and components are unreliable and prone to failure, the additional investment required to achieve even higher levels of availability will be wasted and the availability levels will go unmet.

Service Management Processes

Effective service management processes contribute to higher levels of availability. Processes such as availability management, incident management, problem management, and change management play an important role in the overall management of the messaging service.

Systems Management

Systems management should provide the monitoring, diagnostics, and automated error recovery needed to enable fast detection and resolution of potential and actual failures.

High Availability Design

The design for high availability must consider the elimination of single points of failure and the provision of alternative components to minimize disruption to the business operation if a component fails. The design must also eliminate or minimize the effects on the business operation of the planned downtime that is normally required for maintenance activity, such as the implementation of changes to the infrastructure. Recovery criteria should define rapid recovery and service reinstatement as a key goal of the design-for-recovery phase.

Special Solutions with Full Redundancy

Approaching continuous availability, near 100 percent, requires expensive solutions that incorporate full redundancy. Redundancy is the technique of improving availability by using duplicate components. For stringent availability requirements to be met, these components need to be working autonomously in parallel.
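The effect of duplicate components can be illustrated with a simplified model in which the service stays up as long as at least one copy is working and copies fail independently. Under those assumptions, the combined availability is 1 minus the probability that all copies are down at once. The Python sketch below uses a hypothetical component availability of 99 percent. Shared causes of failure, such as a common power supply, would make real results worse than this model suggests.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` identical components working in parallel.

    Assumes independent failures and that one working copy is enough.
    component_availability is a fraction between 0 and 1.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - component_availability) ** copies


# Hypothetical component that is available 99 percent of the time.
for copies in (1, 2, 3):
    combined = redundant_availability(0.99, copies)
    print(f"{copies} copy/copies -> {combined:.6%}")
```

Each additional copy moves the figure closer to 100 percent without reaching it, which is one reason the cost of availability rises steeply as the target approaches 100 percent.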