Component-Level Fault Tolerant Measures


Topic Last Modified: 2005-05-20

This section provides component-level considerations and strategies for increasing the fault tolerance of your Exchange 2003 organization. Specifically, component-level refers to the individual server hardware, storage hardware, and networking hardware in your organization's infrastructure. An effective hardware strategy can improve the overall availability of a system. These strategies can range from adopting common sense practices to using expensive fault tolerant equipment.

The hardware in your Exchange 2003 organization includes server hardware, storage hardware, and network hardware. When adopting a hardware strategy, consider the following:

  • Make sure that your hardware is redundant.

  • Make sure that you implement server-class hardware.

  • Make sure that you select standardized hardware.

  • Make sure that you have spare hardware available.

The following sections discuss each of these considerations in detail. Overall, when selected and deployed correctly, your hardware can help meet the requirements of your SLAs.

For more information about fault tolerant hardware strategies and highly available system designs, see the Microsoft Solutions Framework Web site.

Hardware redundancy refers to using two or more hardware components to perform identical tasks, so that one component can take over if another fails. To minimize single points of failure in your Exchange 2003 organization, it is important that you use redundant server, network, and storage hardware. By incorporating duplicate hardware configurations, a data I/O path or one of a server's physical hardware components can fail without affecting the operations of the server.
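To make the effect of redundancy concrete, the following Python sketch applies the standard availability arithmetic for components in parallel (redundant) and in series (every component required). It assumes that components fail independently and that a single working copy is enough to keep the service running. The 99 percent figure and all function names are hypothetical illustrations, not values from any vendor.

```python
from typing import Sequence

HOURS_PER_YEAR = 24 * 365


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components.

    Assumes independent failures and that any one working copy
    keeps the service running.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group fails only if every copy fails at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


def series_availability(availabilities: Sequence[float]) -> float:
    """Availability of a chain in which every component must work."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical example: one power supply that is available 99% of the time,
# compared with a redundant pair of the same power supply.
single = 0.99
pair = parallel_availability(single, copies=2)
print(f"single: {downtime_hours_per_year(single):.2f} hours/year of downtime")
print(f"pair:   {downtime_hours_per_year(pair):.2f} hours/year of downtime")
```

Under these assumptions, duplicating a component that is 99 percent available yields 99.99 percent availability for the pair. Real components rarely fail fully independently (for example, two power supplies on the same circuit), so treat the result as an upper bound.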

The hardware you use to minimize single points of failure depends on which components you want to make redundant. Many hardware vendors offer products that build redundancy into their server or storage solution hardware. Some of these vendors also offer complete storage solutions, including advanced backup and restore hardware designed for use with Exchange 2003.

Server-class hardware is hardware that provides a higher degree of reliability than hardware designed for workstations. When selecting hardware for your Exchange 2003 servers, storage subsystems, and network, make sure that you select server-class components.

Traditionally, servers that include server-class hardware also include special hardware or software monitoring features. However, if the hardware you purchase does not include monitoring features, make sure that you consider a monitoring solution as part of your design and deployment plan. For more information about how monitoring is important to maintaining a fault tolerant organization, see "Implementing a Monitoring Strategy" in System-Level Fault Tolerant Measures.

Server-class server hardware includes the following:

  • Redundant power supplies   If the primary power supply fails, redundant server and disk array uninterruptible power supply (UPS) units and battery backups provide a secondary power supply. Essentially, a UPS and battery backup provide protection against power surges and short power losses that can damage your servers and the data they contain.

  • Redundant fans   If a cooling fan stops functioning, redundant fans ensure that there is sufficient cooling inside the server. Servers without redundant fans may automatically shut down if a fan fails.

    If a server room exceeds a specific temperature, redundant fans may not be enough to keep the hardware operating correctly. For information about temperature and other safeguard considerations, see "Safeguarding the Physical Environment of Your Servers" in System-Level Fault Tolerant Measures.

  • Redundant memory   If a memory bank fails, redundant memory ensures that memory remains available. For example, copying the physical memory (known as memory mirroring) provides fault tolerance through memory replication. Memory-mirroring techniques include having two sets of RAM in one computer, each a mirror of the other, or mirroring the entire System State, which includes RAM, CPU, adapter, and bus states. Memory mirroring must be developed and implemented in conjunction with the original equipment manufacturer (OEM).

  • ECC memory   Error Correction Code (ECC) memory detects and corrects single-bit errors. If a double-bit error occurs, ECC memory detects the error and takes the affected memory offline.

  • Redundant network interface cards   If a network interface card (NIC) or a network connection fails, redundant NICs ensure that your servers will maintain network connectivity.

  • Power-on monitoring components   When the server is initially turned on, the server detects startup failure conditions, such as abnormal temperature conditions or a failed fan.

  • Prefailure monitoring components   While the server is running, prefailure conditions are monitored. If a component, such as a power supply, hard disk, fan, or memory, is beginning to fail, an administrator is notified before the failure actually occurs.

    For example, a failure detected by ECC memory is corrected by the ECC memory or routed to the redundant memory, preventing a server failure. An administrator is immediately notified to rectify the memory problem.

  • Power failure hardware monitoring components   When a power failure occurs, system shutdown software, working in conjunction with a UPS, shuts down the server if necessary.

  • A redundant storage subsystem provides protection against the failure of a single disk drive or controller. You should consider implementing the following redundant components:

    • Redundant hardware on your back-end servers for connecting to the external array

    • Redundant paths to the disk array

    • Redundant storage controllers

    In addition, use RAID to implement redundancy of the logical unit numbers (LUNs). For more information about implementing fault tolerance for your back-end storage solution, see "Implementing a Reliable Back-End Storage Solution" in System-Level Fault Tolerant Measures.
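The ECC memory item in the list above relies on a code that can correct one flipped bit and detect two. The following Python sketch illustrates that idea with a toy extended Hamming code that protects 4 data bits with 4 check bits. It is an illustration only: real ECC memory modules use wider codes implemented in hardware, and the function names here are hypothetical. Taking memory offline after an uncorrectable error is a platform behavior that this sketch does not model.

```python
# Positions 1, 2, and 4 hold Hamming parity bits, position 0 holds an
# overall parity bit, and the remaining positions hold the data bits.
DATA_POSITIONS = (3, 5, 6, 7)


def encode(data_bits):
    """Encode 4 data bits into an 8-bit single-error-correcting,
    double-error-detecting (SECDED) codeword."""
    if len(data_bits) != 4 or any(bit not in (0, 1) for bit in data_bits):
        raise ValueError("expected exactly four bits, each 0 or 1")
    code = [0] * 8
    for position, bit in zip(DATA_POSITIONS, data_bits):
        code[position] = bit
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of the whole word even
    return code


def decode(code):
    """Return (data_bits, status), where status is 'ok', 'corrected',
    or 'uncorrectable'. data_bits is None when uncorrectable."""
    word = list(code)
    syndrome = 0
    for position in range(1, 8):
        if word[position]:
            syndrome ^= position
    overall_parity = sum(word) % 2

    if syndrome == 0 and overall_parity == 0:
        status = "ok"
    elif overall_parity == 1:
        # Exactly one bit flipped. The syndrome is its position; a
        # syndrome of 0 means the overall parity bit itself flipped.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    return [word[position] for position in DATA_POSITIONS], status


codeword = encode([1, 0, 1, 1])

single_error = list(codeword)
single_error[5] ^= 1
print(decode(single_error))

double_error = list(single_error)
double_error[2] ^= 1
print(decode(double_error))
```

The first call recovers the original data and reports a correction. The second call reports an uncorrectable error, which is the point at which server-class hardware would typically notify an administrator.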

Server-class network hardware includes the following:

  • Redundant hubs, switches, network adapters, and wiring   For information about how to implement this redundant hardware in your network, consult the vendors who provide these components.

  • Redundant routers   Routers do not fail frequently. However, if they do, entire server organizations can shut down. Therefore, having redundant routing capability is critical. For information about how to protect against router failure, consult your router vendor.

For the servers on which you must maintain the highest degree of availability, use fixed Internet Protocol (IP) addresses and do not use Dynamic Host Configuration Protocol (DHCP). Doing so prevents an outage caused by the failure of the DHCP server. Using fixed addresses can also improve address resolution by DNS servers that do not handle the dynamic address assignment provided by DHCP.

To make sure that your hardware is fully compatible with Windows operating systems, select hardware from the Windows Server Catalog.

When selecting your hardware from the Windows Server Catalog, adopt one hardware standard and apply it as consistently as possible. Specifically, select one type of computer and, for each computer you purchase, use the same components (for example, the same network cards, disk controllers, and graphics cards). The only parameters you should modify are the amount of memory, the number of CPUs, and the hard disk configurations.

Standardizing hardware has the following advantages:

  • When testing driver updates or application-software updates, only one test is needed before deploying to all your computers.

  • Fewer spare parts are required to maintain an adequate set of replacement hardware.

  • Support personnel require less training because it is easier for them to become familiar with a limited set of hardware components.

When planning your hardware budget, consider including spare hardware components, spare servers, and even hot standby servers. (Hot refers to servers that are powered on and ready to replace a specific type of server in your organization.) Having these spare hardware components and servers accessible can significantly increase your ability to replace damaged hardware and recover from hardware failures.

Be sure to include spare components in your hardware budget, and keep these components on-site and readily available. One advantage to using standardized hardware is the reduced number of spare components that must be kept on-site. For example, if all your hard drives are the same type and from the same manufacturer, you do not need to stock as many spare drives.

The number of spare components you should have available correlates to the maximum downtime your organization can tolerate. Another concern is the market availability of replacement components. Some components, such as memory and CPUs, are fairly easy to locate and purchase at any time. Other components, such as hard drives, are often discontinued and may be difficult to locate after a short time. For these components, you should plan to buy spares when you buy the original hardware. Also, when considering solutions from hardware vendors, you should use service companies or vendors who promptly replace damaged components or complete servers.
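One way to turn this guidance into a rough number is to estimate how many failures are likely to occur while you wait for a replacement order to arrive. The following Python sketch does that by modeling failures as a Poisson process, which is an assumption made for illustration and not a method prescribed by this guide. The annual failure rate, lead time, and confidence level are hypothetical inputs that you would replace with figures from your own vendors and SLAs.

```python
import math


def spares_needed(units, annual_failure_rate, lead_time_days, confidence=0.95):
    """Smallest number of spares s such that the probability of at most
    s failures during the replenishment lead time is >= confidence.

    Assumes failures are independent and follow a Poisson process.
    """
    if units < 0 or annual_failure_rate < 0 or lead_time_days < 0:
        raise ValueError("inputs must not be negative")
    if not 0.0 < confidence <= 0.9999:
        raise ValueError("confidence must be in the range (0, 0.9999]")

    # Expected number of failures while waiting for replacements.
    expected = units * annual_failure_rate * lead_time_days / 365.0
    if expected > 700:
        raise ValueError("expected failures too large for this simple sketch")

    term = math.exp(-expected)  # probability of exactly 0 failures
    cumulative = term
    spares = 0
    while cumulative < confidence:
        spares += 1
        term *= expected / spares  # probability of exactly `spares` failures
        cumulative += term
    return spares


# Hypothetical example: 100 identical drives, a 3% annual failure rate,
# and a 30-day lead time for replacements.
print(spares_needed(units=100, annual_failure_rate=0.03, lead_time_days=30))
```

A shorter tolerable downtime corresponds to a higher confidence level, which in turn increases the number of spares the sketch recommends. For components that are likely to be discontinued, the lead time is effectively unbounded, which is why buying spares with the original hardware is the safer choice.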

Consider the possibility of maintaining a standby server, possibly even a hot standby server to which data is replicated automatically. If the costs of downtime are high and clustering is not a viable option, you can use standby servers to decrease recovery times. Using standby servers can also be important if server failure results in high costs, such as lost profits from server downtime or penalties from an SLA violation.

A standby server can quickly replace a failed server or, in some cases, act as a source of spare parts. Also, if a server experiences a catastrophic failure that does not involve the hard drives, it may be possible to move the drives from the failed server to a functional server (possibly in combination with restoring data from backup media).

In a clustered environment, this data transfer is done automatically.

One advantage to using standby servers to recover from an outage is that the failed server is available for careful diagnosis. Diagnosing the cause of a failure is important in preventing repeated failures.

Standby servers should be certified and, like production servers, should run 24 hours a day, 7 days a week.