Understanding Availability, Reliability, and Scalability

Although this guide focuses mainly on achieving availability, it is important to understand how reliability and scalability factor into planning and implementing a highly available Exchange 2003 messaging system.

Defining Availability

In the IT community, the metric used to measure availability is the percentage of time that a system is capable of serving its intended function. As it relates to messaging systems, availability is the percentage of time that the messaging service is up and running. The following formula is used to calculate availability levels:

Percentage of availability = (total elapsed time - sum of downtime)/total elapsed time

Availability is typically measured in "nines." For example, a solution with an availability level of "three nines" is capable of supporting its intended function 99.9 percent of the time—equivalent to an annual downtime of 8.76 hours on a 24x7x365 (24 hours a day/seven days a week/365 days a year) basis. The following table lists common availability levels that many organizations attempt to achieve.
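As a quick sketch, the availability formula and the downtime implied by a given number of "nines" can be expressed as follows (the helper names are illustrative, not part of any Exchange tool):

```python
def availability(total_elapsed_hours, downtime_hours):
    # Percentage of availability =
    #   (total elapsed time - sum of downtime) / total elapsed time
    return 100.0 * (total_elapsed_hours - downtime_hours) / total_elapsed_hours

def annual_downtime_hours(availability_pct, hours_per_year=8760):
    # Downtime implied by an availability level over a 24x7x365 year
    # (8,760 hours).
    return hours_per_year * (1 - availability_pct / 100.0)

# "Three nines" on a 24x7x365 basis:
print(round(annual_downtime_hours(99.9), 2))  # 8.76 hours per year
```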

Availability percentages and yearly downtime

Availability percentage    24-hour day               8-hour day
90%                        876 hours (36.5 days)     291.2 hours (12.13 days)
95%                        438 hours (18.25 days)    145.6 hours (6.07 days)
99%                        87.6 hours (3.65 days)    29.12 hours (1.21 days)
99.9%                      8.76 hours                2.91 hours
99.99%                     52.56 minutes             17.47 minutes
99.999% ("five nines")     5.256 minutes             1.747 minutes
99.9999%                   31.536 seconds            10.483 seconds

Unfortunately, measuring availability is not as simple as selecting one of the availability percentages shown in the preceding table. You must first decide what metric you want to use to qualify downtime. For example, one organization may consider downtime to occur when one database is not mounted. Another organization may consider downtime to occur only when more than half of its users are affected by an outage.

In addition, availability requirements must be determined in the context of the service and the organization that uses the service. For example, availability requirements for servers that host non-critical public folder data can be set lower than for servers that contain mission-critical mailbox databases.

For information about the metrics you can use to measure availability, and for information about establishing availability requirements based on the context of the service and your organizational requirements, see Setting Availability Goals.

Defining Reliability

Reliability measures are generally used to calculate the probability of failure for a single solution component. One measure used to define a component or system's reliability is mean time between failures (MTBF). MTBF is the average time interval, usually expressed in thousands or tens of thousands of hours (sometimes called power-on hours or POH), that elapses before a component fails and requires service. MTBF is calculated by using the following equation:

MTBF = (total elapsed time - sum of downtime)/number of failures

A related measurement is mean time to repair (MTTR). MTTR is the average time interval (usually expressed in hours) that it takes to repair a failed component. The reliability of all solution components—for example, server hardware, operating system, application software, and networking—can affect a solution's availability.
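The MTBF and MTTR definitions above can be sketched against a hypothetical outage log (the data and variable names here are illustrative only):

```python
# Hypothetical outage log: (start_hour, end_hour) measured in
# power-on hours since the component was placed in service.
outages = [(1000, 1002), (5000, 5003), (9000, 9001)]
total_elapsed = 10000  # total power-on hours observed

# Sum of downtime across all failures (6 hours here).
downtime = sum(end - start for start, end in outages)

# MTBF = (total elapsed time - sum of downtime) / number of failures
mtbf = (total_elapsed - downtime) / len(outages)

# MTTR = average time to repair a failed component
mttr = downtime / len(outages)

print(round(mtbf, 2))  # 3331.33 hours between failures
print(mttr)            # 2.0 hours to repair
```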

A system is more reliable if it is fault tolerant. Fault tolerance is the ability of a system to continue functioning when part of the system fails. Fault tolerance is achieved by designing the system with a high degree of hardware redundancy. If any single component fails, the redundant component takes its place with no appreciable downtime. For more information about fault tolerant components, see Making Your Exchange 2003 Organization Fault Tolerant.

Defining Scalability

In Exchange deployments, scalability is the measure of how well a service or application can grow to meet increasing performance demands. When applied to Exchange clustering, scalability is the ability to incrementally add computers to an existing cluster when the overall load of the cluster exceeds the cluster's ability to provide adequate performance. To meet the increasing performance demands of your messaging infrastructure, there are two scalability strategies you can implement: scaling up or scaling out.

Scaling up

Scaling up involves adding system resources (such as processors, memory, disks, and network adapters) to your existing hardware, or replacing existing hardware with hardware that has greater system resources. Scaling up is appropriate when you want to improve client response time, such as in an Exchange front-end server Network Load Balancing (NLB) configuration. For example, if the current hardware is not providing adequate performance for your users, you can consider adding RAM or central processing units (CPUs) to the servers in your NLB cluster to meet the demand.

Windows Server 2003 supports single or multiple CPUs that conform to the symmetric multiprocessing (SMP) standard. Using SMP, the operating system can run threads on any available processor, which makes it possible for applications to use multiple processors when additional processing power is required to increase a system's capabilities.

Scaling out

Scaling out involves adding servers to meet demands. In a back-end server cluster, this means adding nodes to the cluster. In a front-end NLB scenario, it means adding computers to your set of Exchange 2003 front-end protocol servers. Like scaling up, scaling out can also improve client response time.

For information about scalability in regard to server clustering solutions, see "Performance and Scalability Considerations" in Planning Considerations for Clustering.

For detailed information about selecting hardware and tuning Exchange 2003 for performance and scalability, see the Exchange Server 2003 Performance and Scalability Guide.