Understanding Downtime
Downtime can significantly impact the availability of your messaging system. It is important that you familiarize yourself with the various causes of downtime and how they affect your messaging system.
Unplanned downtime is downtime that occurs as a result of a failure (for example, a hardware failure or a system failure caused by improper server configuration). Because administrators do not know when unplanned downtime could occur, users are not notified of outages in advance. In contrast, planned downtime is downtime that occurs when an administrator shuts down the system at a scheduled time. Because planned downtime is scheduled, administrators can plan for it to occur at a time that least affects productivity.
To reduce or eliminate planned downtime, you can implement server clustering. Server clustering provides continuous messaging availability for your organization even while you perform maintenance on a primary node, because Exchange services temporarily fail over to a standby computer in the Exchange cluster. For more information about clustering, see Planning for Exchange Clustering.
The following table lists common causes of downtime and specific examples for each cause.
Causes of downtime | Examples |
---|---|
Planned administrative downtime | Upgrades for hardware components, firmware, drivers, operating system, and software applications. |
Component failures | Faulty server components, such as memory chips, fans, system boards, and power supplies. Faulty storage subsystem components, such as failed disk drives and disk controllers. Faulty network components, such as routers and network cabling. |
Software defects or failures | Drive stops responding, operating system stops responding or reboots, viruses, or file corruption. |
Operator error or malicious users | Accidental or intentional file deletion, unskilled operation, or experimentation. |
System outages or maintenance | Software or systems requiring reboot, or system board failure. |
Local disaster | Fires, storms, and other localized disasters. |
Regional disaster | Earthquakes, hurricanes, floods, and other regional disasters. |
An integral aspect of implementing a highly available messaging system is ensuring that no single point of failure can render a server or network unavailable. Before you deploy your Exchange 2003 messaging system, familiarize yourself with the following failure types and plan accordingly.
Note
For detailed information about how to minimize the impact of the following failure types, see Making Your Exchange 2003 Organization Fault Tolerant.
Two common storage failures that can occur are hard disk failures and storage controller failures. There are several methods you can use to protect against individual storage failures. One method is to use redundant array of independent disks (RAID) to provide redundancy of the data on your storage subsystem. Another method is to use storage vendors who provide advanced storage solutions, such as Storage Area Network (SAN) solutions. These advanced storage solutions should include features that allow you to exchange damaged storage devices and individual storage controller components without losing access to the data. For more information about RAID and SAN technologies, see Planning a Reliable Back-End Storage Solution.
Common network failures include failed routers, switches, hubs, and cables. To help protect against such failures, there are various fault tolerant components you can use in your network infrastructure. Fault tolerant components also help provide highly available connectivity to network resources. As you consider methods for protecting your network, be sure to consider all network types (such as client access and management networks). For information about network hardware, see "Server-Class Network Hardware" in Component-Level Fault Tolerant Measures.
Common server component failures include failed network interface cards (NICs), memory (RAM), and processors. As a best practice, you should keep spare hardware available for each of the key server components (for example, NICs, RAM, and processors). In addition, many enterprise-level server platforms provide redundant hardware components, such as redundant power supplies and fans. Hardware vendors build computers with redundant, hot-swappable components, such as Peripheral Component Interconnect (PCI) cards and memory. These components allow you to replace damaged hardware without removing the computer from service.
For information about using redundant components and spare hardware components, see Component-Level Fault Tolerant Measures.
You must promptly address application failures or any other problem that affects a computer's performance. To minimize the impact of a computer failure, there are two solutions you can include in your disaster recovery plan: a standby server solution and a server clustering solution.
In a standby server solution, you keep one or more preconfigured computers readily available. If a primary server fails, a standby server replaces it. For information about using standby servers, see "Spare Components and Standby Servers" in Component-Level Fault Tolerant Measures.
With server clustering, your applications and services are available to your users even if one cluster node fails. This is possible either by failing over the application or service (transferring client requests from one node to another) or by having multiple instances of the same application available for client requests.
Note
Server clustering can also help you maintain a high level of availability if one or more computers must be temporarily removed from service for routine maintenance or upgrades.
For information about Network Load Balancing (NLB) and server clustering, see "Fault Tolerant Infrastructure Measures" in System-Level Fault Tolerant Measures.
In extreme cases, an entire site can fail due to power loss, natural disaster, or other unusual occurrences. To protect against such failures, many businesses are deploying mission-critical solutions across geographically dispersed sites. These solutions often involve duplicating a messaging system's hardware, applications, and data to one or more geographically remote sites. If one site fails, the other sites continue to provide service (either through automatic failover or through disaster recovery procedures performed at the remote site), until the failed site is repaired. For more information, see "Using Multiple Physical Sites" in System-Level Fault Tolerant Measures.
Calculating some of the costs you experience as a result of downtime is relatively easy. For example, you can easily calculate the replacement cost of damaged hardware. However, the resulting costs from losses in areas such as productivity and revenue are more difficult to calculate.
The following table lists the costs that are involved when calculating the impact of downtime.
Category | Cost involved |
---|---|
Productivity | Number of employees affected by loss of messaging functionality and other IT assets; number of administrators needed to manage a site increases with frequency of downtime |
Revenue | Direct losses; compensatory payments; lost future revenues; billing losses; investment losses |
Financial performance | Revenue recognition; cash flow; lost discounts (A/P); payment guarantees; credit rating; stock price |
Damaged reputation | Customers; suppliers; financial markets; banks; business partners |
Other expenses | Temporary employees; equipment rental; overtime costs; extra shipping costs; travel expenses |
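As a rough illustration of how some of the categories in the preceding table might be combined into a single estimate, the following Python sketch adds up productivity, revenue, and other expenses for one outage. The class name, the function name, the `productivity_loss` fraction, and all of the figures are hypothetical assumptions chosen for the example; they do not come from Exchange or from any measured data. Harder-to-quantify categories, such as damaged reputation, are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class OutageEstimate:
    """Hypothetical inputs for one outage; every field is an assumption."""
    hours_down: float                # length of the outage, in hours
    employees_affected: int          # employees who lost messaging functionality
    hourly_cost_per_employee: float  # assumed loaded hourly cost per employee
    productivity_loss: float         # assumed fraction of work lost (0.0 to 1.0)
    revenue_lost_per_hour: float     # assumed direct revenue lost per hour
    other_expenses: float            # overtime, equipment rental, shipping, travel


def estimate_downtime_cost(outage: OutageEstimate) -> dict:
    """Return a per-category cost breakdown and a total for the outage."""
    productivity = (
        outage.hours_down
        * outage.employees_affected
        * outage.hourly_cost_per_employee
        * outage.productivity_loss
    )
    revenue = outage.hours_down * outage.revenue_lost_per_hour
    total = productivity + revenue + outage.other_expenses
    return {
        "productivity": productivity,
        "revenue": revenue,
        "other_expenses": outage.other_expenses,
        "total": total,
    }


# Example with made-up numbers: a two-hour outage affecting 500 employees.
example = OutageEstimate(
    hours_down=2.0,
    employees_affected=500,
    hourly_cost_per_employee=40.0,
    productivity_loss=0.25,
    revenue_lost_per_hour=1000.0,
    other_expenses=1500.0,
)
breakdown = estimate_downtime_cost(example)
```

With these example inputs, the productivity component is 10,000, the revenue component is 2,000, and the total is 13,500 in whatever currency the inputs use. A real estimate would replace each assumed figure with values from your own organization.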
Availability becomes increasingly important as businesses continue to increase their reliance on information technology. As a result, the availability of mission-critical information systems is often tied directly to business performance or revenue. Based on the role of your messaging service (for example, how critical the service is to your organization), downtime can produce negative consequences such as customer dissatisfaction, loss of productivity, or an inability to meet regulatory requirements.
However, not all downtime is equally costly; the greatest expense is caused by unplanned downtime. Outside of a messaging service's core service hours, the amount of downtime (and the corresponding overall availability level) may have little to no impact on your business. If a system fails during core service hours, however, the financial impact can be significant. Because unplanned downtime is rarely predictable and can occur at any time, you should evaluate the cost of unplanned downtime during core service hours.
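One way to make the relationship between downtime and availability level concrete is to express availability as the percentage of core service hours during which the system was actually up. The following Python sketch uses that common convention; the function name and the sample schedule (10 core hours a day, 5 days a week, 52 weeks a year) are assumptions made for illustration and are not taken from this guide.

```python
def availability_percent(scheduled_hours: float, downtime_hours: float) -> float:
    """Availability as a percentage of the scheduled (core) service hours.

    Assumed convention: 100 * (scheduled - downtime) / scheduled.
    """
    if scheduled_hours <= 0:
        raise ValueError("scheduled_hours must be positive")
    if downtime_hours < 0 or downtime_hours > scheduled_hours:
        raise ValueError("downtime_hours must be between 0 and scheduled_hours")
    return 100.0 * (scheduled_hours - downtime_hours) / scheduled_hours


# Hypothetical schedule: 10 core hours a day, 5 days a week, 52 weeks a year.
core_hours_per_year = 10 * 5 * 52  # 2600 hours

# Hypothetical outage total: 13 hours of unplanned downtime in core hours.
yearly_availability = availability_percent(core_hours_per_year, 13)
```

In this example, 13 hours of downtime against 2,600 core service hours works out to 99.5 percent availability. Measuring against core service hours rather than all hours in the year keeps the figure focused on the downtime that actually affects the business.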
Because downtime affects businesses differently, it is important that you select the proper response for your organization. The following table lists different impact levels (based on severity), including the impact each level has on your organization.
Impact level | Description | Business impact |
---|---|---|
Impact level 1 | Minor impact on business results. | Low: Minimal availability requirement. |
Impact level 2 | Disrupts the normal business processes. Minimal loss of revenue, low recovery cost. | Low: Prevention of business loss improves return on investment and profitability. |
Impact level 3 | Substantial revenue is lost; some is recoverable. | Medium: Prevention of business loss improves return on investment and profitability. |
Impact level 4 | Significant impact on core business activities. Affects medium-term results. | High: Prevention of lost revenue improves business results. Business risk outweighs the cost of the solution. |
Impact level 5 | Strong impact on core business activities. Affects medium-term results. Company's survival may be at risk. | High: Business risk outweighs the cost of the solution. |
Impact level 6 | Very strong impact on core business activities. Immediate threat to the company's survival. | Extreme: Management of the business risk is essential. Cost of the solution is secondary. |