Identifying and Analyzing Barriers to Achieving High Availability

 

A barrier to high availability is any issue that can limit a messaging system's availability. Although you cannot protect your messaging environment from every barrier, you should be familiar with the most common barriers and the risks associated with each.

Identifying Barriers

Barriers to achieving high availability include the following:

  • Environmental issues   Problems with the messaging system environment can reduce availability. Environmental issues include inadequate cabling, power outages, communication line failures, fires, and other disasters.

  • Hardware issues   Problems with any hardware used by the messaging system can reduce availability. Hardware issues include power supply failures, inadequate processors, memory failures, inadequate disk space, disk failures, network card failures, and incompatible hardware.

  • Communication and connectivity issues   Problems with the network can prevent users from connecting to the messaging system. Communication and connectivity issues include network cable failures, inadequate bandwidth, router or switch failure, Domain Name System (DNS) configuration errors, and authentication issues.

  • Software issues   Software failures and upgrades can reduce availability. Software failure issues include downtime caused by memory leaks, database corruption, viruses, and denial of service attacks. Software upgrade issues include downtime caused by application software upgrades and service pack installations.

  • Service issues   Services that you obtain from outside your business can exacerbate a failure and prolong downtime. Service issues include poorly trained staff, slow response time, and out-of-date contact information.

  • Process issues   The lack of proper processes can cause unnecessary downtime and increase the length of downtime caused by a hardware or software failure. Process issues include inadequate or nonexistent operational processes, inadequate or nonexistent recovery plans, inadequate or nonexistent recovery drills, and deploying changes without testing.

  • Application design issues   Poor application design can reduce the perceived availability of a messaging system.

  • Staffing issues   Insufficient, untrained, or unqualified staff can cause unnecessary downtime and lengthen the time to restore availability. Staffing issues include insufficient training materials, inadequate training budget, insufficient time for training, and inadequate communication skills.

Analyzing Barriers

After identifying high availability barriers, estimate the impact of each barrier and decide which barriers are cost-effective to overcome.

To determine an appropriate high availability solution, you must analyze how each barrier (and the corresponding risks) affects availability. Specifically, consider the following for each barrier:

  • The estimated time the system will be unavailable if a failure occurs

  • The probability that the barrier will occur and cause downtime

  • The estimated cost to overcome the barrier compared to the estimated cost of downtime
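
The three estimates above can be combined into a rough yearly comparison. The following is a minimal sketch, not a method prescribed by this guide: the `Barrier` type, the function names, and all figures are hypothetical, the probability is expressed as expected failures per year, and the mitigation is assumed to eliminate the downtime entirely.

```python
from dataclasses import dataclass


@dataclass
class Barrier:
    """Hypothetical record of the three estimates for one barrier."""
    name: str
    downtime_hours_per_failure: float  # estimated time unavailable if a failure occurs
    failures_per_year: float           # probability, expressed as expected failures per year
    mitigation_cost_per_year: float    # estimated yearly cost to overcome the barrier


def expected_downtime_cost(barrier: Barrier, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime if the barrier is left in place."""
    return (barrier.failures_per_year
            * barrier.downtime_hours_per_failure
            * cost_per_hour)


def is_cost_effective(barrier: Barrier, cost_per_hour: float) -> bool:
    """True if overcoming the barrier costs less than the downtime it prevents.

    Assumes the mitigation removes the downtime completely.
    """
    return barrier.mitigation_cost_per_year < expected_downtime_cost(barrier, cost_per_hour)


# Illustrative figures only: one failure every four years, 8 hours per failure.
disk = Barrier("disk failure", 8.0, 0.25, 3000.0)
print(expected_downtime_cost(disk, 2000.0))  # 0.25 * 8 * 2000 = 4000.0
print(is_cost_effective(disk, 2000.0))       # True, because 3000 < 4000
```

In practice, each input is an estimate with its own uncertainty, so the result is best treated as a way to rank barriers rather than as a precise figure.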

To illustrate how you can analyze a barrier's effect on availability, consider a hardware-related risk—specifically, the risk associated with the failure of a hard disk that contains the database files and transaction log files for 25 percent of your users. In this example, you should:

  1. Estimate the amount of time that messaging services will be unavailable to your users. The following are examples of two storage strategies that have different recovery time estimates:

    Note

    The amount of time it takes to recover the failed disk depends on the experience and training of the IT personnel who are working to solve the issue.

    • If the failed hard disk is part of a fault-tolerant redundant array of independent disks (RAID) disk array, you do not need to restore the system from backup. If the RAID array is made up of hot-swappable disks, you can replace the failed disk without shutting down the system. However, if the RAID array does not include hot-swappable disks, the amount of downtime equals the time it takes to shut down the necessary servers and then replace the failed disk. To minimize the impact, you could perform these steps during non-business hours.

    • If the failed hard disk is not part of a RAID disk array, and if it has been backed up to tape or disk, you can replace the hardware, and then restore the Exchange database (or databases) to the primary server from backup. The amount of downtime equals the time it takes to replace the hardware, restore from backup, and resubmit the transactions that occurred after the disk failure (if these transactions are available). The amount of time depends on your backup media hardware and your Exchange 2003 server hardware.

  2. Estimate the probability that this barrier will occur. In this example, the probability is affected by the reliability and age of the hardware.

  3. Estimate the cost to overcome this barrier. The cost to prevent downtime depends on the solution you select. In addition, the cost to overcome this barrier may include additional IT personnel. To overcome this barrier, consider the following options:

    • If you decide to implement RAID (either software RAID or hardware RAID), the cost to overcome the barrier is measured by the cost of the new hardware, as well as the expense of training and maintenance. These costs vary widely depending on the hardware class you select. The costs also depend on whether you use a third-party vendor to manage the system or train your own personnel. This solution significantly reduces downtime, but costs more to implement.

    • If you decide to replace the hardware and restore databases from backup, the cost to overcome the barrier is measured by the time it takes to restore the data from backup, plus the time it takes to resubmit the transactions that occurred after the disk failure. This solution results in more downtime, but costs less to implement. For information about calculating the cost of downtime, see "Costs of Downtime" in Understanding Downtime.

    Note

    When evaluating the cost to overcome a barrier, remember that a solution for one barrier may also remove additional barriers. For example, keeping a redundant copy of your messaging databases on a secondary server can overcome many barriers.
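
The trade-off between the two options in this example can be sketched numerically. The following is an illustrative sketch under stated assumptions, not a sizing method from this guide: all function names and figures are hypothetical, the hot-swappable RAID option is assumed to cause no downtime, and the restore time is approximated as database size divided by restore throughput.

```python
def restore_downtime_hours(replace_hours: float, db_size_gb: float,
                           restore_gb_per_hour: float, replay_hours: float) -> float:
    """Downtime for the backup option: replace hardware, restore, resubmit transactions."""
    return replace_hours + db_size_gb / restore_gb_per_hour + replay_hours


def yearly_cost(solution_cost: float, failures_per_year: float,
                downtime_hours: float, cost_per_hour: float) -> float:
    """Yearly solution cost plus the expected yearly cost of the remaining downtime."""
    return solution_cost + failures_per_year * downtime_hours * cost_per_hour


def cheapest_option(cost_per_hour: float) -> str:
    """Return the option with the lower total yearly cost (hypothetical figures)."""
    failures_per_year = 0.25  # assumed: one disk failure every four years
    backup_downtime = restore_downtime_hours(1.0, 100.0, 25.0, 1.0)  # 6.0 hours
    options = {
        # Assumed: hot-swappable disks, so a failure causes no downtime.
        "hot-swappable RAID": yearly_cost(5000.0, failures_per_year, 0.0, cost_per_hour),
        "restore from backup": yearly_cost(500.0, failures_per_year,
                                           backup_downtime, cost_per_hour),
    }
    return min(options, key=options.get)


print(cheapest_option(2000.0))  # restore from backup (3500 versus 5000)
print(cheapest_option(4000.0))  # hot-swappable RAID (5000 versus 6500)
```

With these figures, the cheaper option changes as the hourly cost of downtime rises, which is why the cost of downtime must be estimated before a solution is selected.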

Determining and Evaluating High Availability Solutions

The high availability solutions discussed in this guide include recommendations regarding redundant components, redundant servers, and database backups and restorations. Each of these recommendations is integral to achieving a highly available messaging system.

The remaining sections discuss issues related to these solutions. After reading this guide, see the documentation available at the Exchange Server TechCenter for help with deploying and maintaining a highly available messaging system.