Understanding Downtime

 

Downtime can significantly impact the availability of your messaging system. It is important that you familiarize yourself with the various causes of downtime and how they affect your messaging system.

Planned and Unplanned Downtime

Unplanned downtime is downtime that occurs as a result of a failure (for example, a hardware failure or a system failure caused by improper server configuration). Because administrators do not know when unplanned downtime could occur, users are not notified of outages in advance. In contrast, planned downtime is downtime that occurs when an administrator shuts down the system at a scheduled time. Because planned downtime is scheduled, administrators can plan for it to occur at a time that least affects productivity.

To reduce or eliminate planned downtime, you can implement server clustering. While you perform maintenance on a primary node, server clustering maintains messaging availability for your organization by temporarily failing over Exchange services to a standby computer in the Exchange cluster. For more information about clustering, see Planning for Exchange Clustering.

The following table lists common causes of downtime and specific examples for each cause.

Causes of downtime and examples of each cause

Causes of downtime | Examples
------------------ | --------
Planned administrative downtime | Upgrades for hardware components, firmware, drivers, operating system, and software applications.
Component failures | Faulty server components, such as memory chips, fans, system boards, and power supplies. Faulty storage subsystem components, such as failed disk drives and disk controllers. Faulty network components, such as routers and network cabling.
Software defects or failures | Drive stops responding, operating system stops responding or reboots, viruses, or file corruption.
Operator error or malicious users | Accidental or intentional file deletion, unskilled operation, or experimentation.
System outages or maintenance | Software or systems requiring reboot, or system board failure.
Local disaster | Fires, storms, and other localized disasters.
Regional disaster | Earthquakes, hurricanes, floods, and other regional disasters.

Failure Types

An integral aspect of implementing a highly available messaging system is ensuring that no single point of failure can render a server or network unavailable. Before you deploy your Exchange 2003 messaging system, you must familiarize yourself with the following failure types that may occur and plan accordingly.

Note

For detailed information about how to minimize the impact of the following failure types, see Making Your Exchange 2003 Organization Fault Tolerant.

Storage Failures

Two common storage failures are hard disk failures and storage controller failures. There are several methods you can use to protect against individual storage failures. One method is to use redundant array of independent disks (RAID) to provide redundancy of the data on your storage subsystem. Another method is to use storage vendors who provide advanced storage solutions, such as Storage Area Network (SAN) solutions. These advanced storage solutions should include features that allow you to replace damaged storage devices and individual storage controller components without losing access to the data. For more information about RAID and SAN technologies, see Planning a Reliable Back-End Storage Solution.
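To illustrate how RAID can provide redundancy, the following minimal sketch shows the XOR parity idea used by parity-based RAID levels (such as RAID 5). It is a simplified illustration only: the block contents and the xor_blocks helper are hypothetical, and real arrays perform this work in the storage controller or operating system, with striping across disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks (one per disk) and a parity block on a fourth disk.
data_disks = [b"MAIL", b"BOX1", b"LOGS"]
parity = xor_blocks(data_disks)

# Simulate losing the second disk: XOR the surviving blocks with the parity
# block to rebuild the missing data.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_disks[1])
```

Because XOR is its own inverse, any single missing block can be rebuilt from the remaining blocks and the parity. Losing two blocks at once cannot be recovered with a single parity block.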

Network Failures

Common network failures include failed routers, switches, hubs, and cables. To help protect against such failures, there are various fault tolerant components you can use in your network infrastructure. Fault tolerant components also help provide highly available connectivity to network resources. As you consider methods for protecting your network, be sure to consider all network types (such as client access and management networks). For information about network hardware, see "Server-Class Network Hardware" in Component-Level Fault Tolerant Measures.

Component Failures

Common server component failures include failed network interface cards (NICs), memory (RAM), and processors. As a best practice, you should keep spare hardware available for each of the key server components (for example, NICs, RAM, and processors). In addition, many enterprise-level server platforms provide redundant hardware components, such as redundant power supplies and fans. Hardware vendors build computers with redundant, hot-swappable components, such as Peripheral Component Interconnect (PCI) cards and memory. These components allow you to replace damaged hardware without removing the computer from service.
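The effect of a redundant component on availability can be estimated with simple probability. The sketch below assumes that component failures are independent, which real hardware only approximates, and every availability figure in it is a hypothetical value chosen for illustration.

```python
def series_availability(availabilities):
    """Availability of a chain in which every component must be working."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(a, copies):
    """Availability of `copies` identical components when any one suffices."""
    return 1.0 - (1.0 - a) ** copies


# Hypothetical figures: one power supply at 99% and the rest of the server at 99.9%.
power_supply = 0.99
rest_of_server = 0.999

single = series_availability([power_supply, rest_of_server])
dual = series_availability([redundant_availability(power_supply, 2), rest_of_server])

print(f"single power supply: {single:.4%}")
print(f"redundant power supplies: {dual:.4%}")
```

Under these assumptions, duplicating the weakest component removes it as the dominant single point of failure, and the remaining components then limit overall availability.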

For information about using redundant components and spare hardware components, see Component-Level Fault Tolerant Measures.

Computer Failures

You must promptly address application failures or any other problem that affects a computer's performance. To minimize the impact of a computer failure, there are two solutions you can include in your disaster recovery plan: a standby server solution and a server clustering solution.

In a standby server solution, you keep one or more preconfigured computers readily available. If a primary server fails, a standby server replaces it. For information about using standby servers, see "Spare Components and Standby Servers" in Component-Level Fault Tolerant Measures.

With server clustering, your applications and services are available to your users even if one cluster node fails. This is possible either by failing over the application or service (transferring client requests from one node to another) or by having multiple instances of the same application available for client requests.
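The failover behavior described above can be sketched as a small simulation. This is not the Windows Cluster service API; the ClusterNode and Cluster classes and the node names are hypothetical and exist only to show client requests moving from a failed node to a surviving one.

```python
class ClusterNode:
    """A hypothetical cluster node that is either healthy or not."""

    def __init__(self, name):
        self.name = name
        self.healthy = True


class Cluster:
    """Routes each request to the first healthy node, in preferred order."""

    def __init__(self, nodes):
        self.nodes = nodes

    def active_node(self):
        for node in self.nodes:
            if node.healthy:
                return node
        raise RuntimeError("no healthy node available")

    def handle(self, request):
        node = self.active_node()
        return f"{node.name} handled {request}"


primary, standby = ClusterNode("node-a"), ClusterNode("node-b")
cluster = Cluster([primary, standby])

before = cluster.handle("send-mail")
primary.healthy = False  # simulate a failure, or taking the node down for maintenance
after = cluster.handle("send-mail")

print(before)
print(after)
```

The same mechanism covers both unplanned failures and planned maintenance: in either case the node is marked unavailable and requests go to the next healthy node.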

Note

Server clustering can also help you maintain a high level of availability if one or more computers must be temporarily removed from service for routine maintenance or upgrades.

For information about Network Load Balancing (NLB) and server clustering, see "Fault Tolerant Infrastructure Measures" in System-Level Fault Tolerant Measures.

Site Failures

In extreme cases, an entire site can fail due to power loss, natural disaster, or other unusual occurrences. To protect against such failures, many businesses are deploying mission-critical solutions across geographically dispersed sites. These solutions often involve duplicating a messaging system's hardware, applications, and data to one or more geographically remote sites. If one site fails, the other sites continue to provide service (either through automatic failover or through disaster recovery procedures performed at the remote site), until the failed site is repaired. For more information, see "Using Multiple Physical Sites" in System-Level Fault Tolerant Measures.

Costs of Downtime

Calculating some of the costs you experience as a result of downtime is relatively easy. For example, you can easily calculate the replacement cost of damaged hardware. However, the resulting costs from losses in areas such as productivity and revenue are more difficult to calculate.

The following table lists the costs that are involved when calculating the impact of downtime.

Costs of downtime

Category | Cost involved
-------- | -------------
Productivity | Number of employees affected by loss of messaging functionality and other IT assets; Number of administrators needed to manage a site increases with frequency of downtime
Revenue | Direct losses; Compensatory payments; Lost future revenues; Billing losses; Investment losses
Financial performance | Revenue recognition; Cash flow; Lost discounts (A/P); Payment guarantees; Credit rating; Stock price
Damaged reputation | Customers; Suppliers; Financial markets; Banks; Business partners
Other expenses | Temporary employees; Equipment rental; Overtime costs; Extra shipping costs; Travel expenses
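As a rough illustration of how the more easily quantified categories can be combined, the following sketch estimates the cost of a single outage. The formula, the parameter names, and all figures are assumptions made for this example. It covers only productivity, direct revenue, and other expenses, and it does not attempt to price harder-to-measure items such as damaged reputation.

```python
def downtime_cost(hours_down, employees_affected, hourly_labor_cost,
                  productivity_loss, hourly_revenue, other_expenses=0.0):
    """Estimate the cost of one outage from a few measurable inputs.

    productivity_loss is the fraction (0.0 to 1.0) of each affected
    employee's work that is lost while messaging is unavailable.
    """
    productivity = (hours_down * employees_affected
                    * hourly_labor_cost * productivity_loss)
    revenue = hours_down * hourly_revenue
    return {
        "productivity": productivity,
        "revenue": revenue,
        "other": other_expenses,
        "total": productivity + revenue + other_expenses,
    }


# Hypothetical outage: 4 hours, 200 employees at 40 per hour who lose half
# of their productivity, 5,000 per hour in direct revenue, and 3,000 in
# overtime and equipment rental.
estimate = downtime_cost(4, 200, 40.0, 0.5, 5000.0, other_expenses=3000.0)
for category, amount in estimate.items():
    print(f"{category}: {amount:,.2f}")
```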

Impact of Downtime

Availability becomes increasingly important as businesses continue to increase their reliance on information technology. As a result, the availability of mission-critical information systems is often tied directly to business performance or revenue. Based on the role of your messaging service (for example, how critical the service is to your organization), downtime can produce negative consequences such as customer dissatisfaction, loss of productivity, or an inability to meet regulatory requirements.

However, not all downtime is equally costly; unplanned downtime causes the greatest expense. Outside of a messaging service's core service hours, the amount of downtime (and the corresponding overall availability level) may have little to no impact on your business. In contrast, if a system fails during core service hours, the financial impact can be significant. Because unplanned downtime is rarely predictable and can occur at any time, you should evaluate the cost of unplanned downtime during core service hours.
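The difference between overall availability and availability during core service hours can be shown with a short calculation. The sketch below uses a hypothetical month and a hypothetical two-hour outage, and it assumes the worst case in which the entire outage falls within core service hours.

```python
def availability_percent(scheduled_hours, downtime_hours):
    """Percentage of the scheduled hours during which the service was up."""
    if scheduled_hours <= 0:
        raise ValueError("scheduled_hours must be positive")
    return 100.0 * (scheduled_hours - downtime_hours) / scheduled_hours


# Hypothetical month: 30 days, with core service hours of 10 hours per day
# on 22 working days.
total_hours = 30 * 24
core_hours = 22 * 10
outage_hours = 2.0

overall = availability_percent(total_hours, outage_hours)
core = availability_percent(core_hours, outage_hours)

print(f"overall availability: {overall:.2f}%")
print(f"core-hours availability: {core:.2f}%")
```

The same outage lowers the core-hours figure more than the overall figure, which is why the overall availability level alone can understate the business impact.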

Because downtime affects businesses differently, it is important that you select the proper response for your organization. The following table lists different impact levels (based on severity), including the impact each level has on your organization.

Downtime impact levels and corresponding effect on business

Impact level | Description | Business impact
------------ | ----------- | ---------------
Impact level 1 | Minor impact on business results. | Low: Minimal availability requirement.
Impact level 2 | Disrupts the normal business processes. Minimal loss of revenue, low recovery cost. | Low: Prevention of business loss improves return on investment and profitability.
Impact level 3 | Substantial revenue is lost; some is recoverable. | Medium: Prevention of business loss improves return on investment and profitability.
Impact level 4 | Significant impact on core business activities. Affects medium-term results. | High: Prevention of lost revenue improves business results. Business risk outweighs the cost of the solution.
Impact level 5 | Strong impact on core business activities. Affects medium-term results. Company's survival may be at risk. | High: Business risk outweighs the cost of the solution.
Impact level 6 | Very strong impact on core business activities. Immediate threat to the company's survival. | Extreme: Management of the business risk is essential. Cost of the solution is secondary.