Planning Fault Tolerance and Avoidance

By Charlie Russel and Sharon Crawford

Chapter 7 from Microsoft Windows 2000 Server Administrator's Companion, published by Microsoft Press

Microsoft Windows 2000 Server, and especially Advanced Server with its clustering support, provides an excellent environment in which to build a truly fault-tolerant system. Of course, avoiding faults in the first place is even better than handling them once they've happened, but the realistic system administrator knows that a problem will occur sooner or later, and he or she plans for it. Chapter 33 covers disaster planning in depth, so refer to that chapter for information on how to prepare for major problems and how to build a full disaster recovery plan to resolve them quickly.

This chapter focuses primarily on the hardware and software tools that will allow you to build a highly available and fault-tolerant Windows 2000 environment. Remember, however, that no matter what hardware and software you deploy, building and deploying for high availability and fault tolerance requires time and discipline. You'll need to make informed decisions about your real requirements as well as determine the resources available to meet those requirements. When planning for a highly available and fault-tolerant deployment, you should consider all points of failure and work to eliminate any single point of failure. Redundant power supplies, dual disk controllers, multiple network interface cards (multihoming), and fault-tolerant disk arrays (RAID) are all strategies that you can and should employ.


Mean Time to Failure and Mean Time to Recover

Two metrics are most commonly used to measure fault tolerance and avoidance: mean time to failure (MTTF), the mean time until a device fails, and mean time to recover (MTTR), the mean time it takes to recover once a failure has occurred. Keep in mind that even with a nonzero failure rate, an MTTR at or near zero can be indistinguishable from a system that never fails at all. The fraction of time a system is down is generally estimated as MTTR/MTTF, and because it can be prohibitively expensive to increase MTTF beyond a certain point, you should spend both time and resources on managing and reducing the MTTR for your most likely and most costly points of failure.
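As a rough illustration of this trade-off (the figures below are hypothetical, not vendor data), steady-state availability can be estimated as MTTF / (MTTF + MTTR):

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A disk with a 5-year MTTF that takes 24 hours to replace and rebuild:
mttf = 5 * 365 * 24  # 43,800 hours
print(f"{availability(mttf, 24):.5f}")  # ~0.99945: roughly 4.8 hours of downtime a year

# The same disk with a hot spare that rebuilds in 2 hours: the failure rate
# hasn't changed, but the downtime shrinks by an order of magnitude.
print(f"{availability(mttf, 2):.6f}")
```

Note that both lines assume the same MTTF; only the recovery time differs, which is why driving MTTR down is usually the cheaper lever.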

Most modern electronic components have a distinctive "bathtub" curve that represents their failure characteristics, as shown in Figure 35-1. During the early life of the component (referred to as the burn-in phase), it's more likely to fail; once this initial phase is over, a component's overall failure rate remains quite low until it reaches the end of its useful life, when the failure rate increases again.


Figure 35-1: The normal statistical failure rates for mechanical and electronic components: a characteristic "bathtub" curve.

The typical commodity hard disk of 10 years ago had an MTTF on the order of three years. Today, a typical MTTF for a commodity hard disk is more likely to be 35 to 50 years! But at least part of that difference comes from counting only the normal-aging section of the curve and taking externally caused failures out of the equation. So a hard disk that fails because of an improperly filtered power spike doesn't count against the disk's MTTF. This may be nice for the disk manufacturer's statistics, but it doesn't do much for the system administrator whose system has crashed because of a disk failure. Hence the importance of looking at the total picture and carefully evaluating all the factors and failure points on your system. Only by looking at the whole system, including the recovery procedures and methodology, can you build a truly fault-tolerant system.

Protecting the Power Supply

The single biggest failure point for any network is its power supply. If you don't have power, you can't run your computers. Seems pretty obvious, and most of us slap an uninterruptible power supply (UPS) on the order when we're buying a new server or at least make sure that the current UPS can handle the extra load. However, this barely scratches the surface of what you can do to protect your network from power problems. You need to protect your network from four basic types of power problems:

  • Local power supply failure: Failure of the internal power supply on a server, router, or other network component

  • Voltage variations: Spikes, surges, sags, and longer-term brownouts

  • Short-term power outages: External power failures lasting from fractions of a second to several minutes

  • Long-term power outages: External power failures lasting from several minutes to several hours or even days

Each type of power problem poses somewhat different risks to your network and will likely require somewhat different protection mechanisms. The possible threat that each one poses to your environment varies depending on the area in which you live, the quality of power available to you, and the potential loss to your business if your computers are down.

Local Power Supply Failure

The weakest links in all networks are the mechanical moving parts, and the mechanical moving parts most likely to fail are the internal power supplies on the servers and network components. All the power conditioning, uninterruptible power supplies, and external generators in the world won't help much if your server's power supply fails. Most higher end servers these days either have a redundant power supply or have the option of including one. Take the option! The extra cost associated with adding a redundant power supply to a server or critical piece of network hardware is usually far less than the cost of downtime should the power supply fail.

If your server or another piece of network hardware doesn't have the option of a redundant power supply, order a spare power supply for it when you order the original hardware. Keep all of your spares in a central, well-known, and identified location, and clearly label the machine or machines the power supply is for.

Finally, practice replacing the power supplies of your critical hardware. Include clear, well-illustrated, detailed instructions on how to replace the power supplies of your critical hardware as part of your disaster recovery standard operating procedures. If you can change the power supply in a very short time, the cost of having it fail diminishes significantly. If you have to wait for your original equipment supplier to get a replacement to you, even if you're on a four-hour response service contract, the cost can be a lot higher than the cost of keeping a spare around.

Voltage Variations

Even in areas with exceptionally clean power that is always available, the power that is supplied to your network will inevitably fluctuate. Minor, short-term variations merely stress your electronic components, but major variations can literally fry them. You should never, ever simply plug your computer into an ordinary wall socket without providing some sort of protection against voltage variations. The following sections describe the types of variations and the best way to protect yourself from them.

Spikes

Spikes are large but short-lived increases in voltage. They can occur because of external factors, such as lightning striking a power line, or because of internal factors, such as a large motor starting. The most common causes of severe voltage spikes, however, are external and outside your control. And the effects can be devastating. A nearby lightning strike can easily cause a spike of 1000 volts or more to be sent into equipment designed to run on 110 to 120 volts. Few, if any, electronic components are designed to withstand large voltage spikes of several thousand volts, and almost all will suffer damage if they're not protected from them.

Protection from spikes comes in many forms, from the $19.95 power strip with built-in surge protection that you can buy at your local hardware store to complicated arrays of transformers and specialized sacrificial transistors that are designed to die so that others may live. Unfortunately, those $19.95 power strips just aren't good enough. They are better than nothing, but barely. They have a limited ability to withstand really large spikes.

More specialized (and more expensive, of course) surge protectors that are specifically designed to protect computer networks are available from various companies. They differ in their ability to protect against really large spikes and in their cost. There's a fairly direct correlation between the cost of these products and their rated capacity and speed of action within any company's range of products, but the cost for a given level of protection can differ significantly from company to company. As always, if the price sounds too good to be true, it is.

In general, these surge protectors are designed to work by sensing a large increase in voltage and creating an electrical path for that excessive voltage that doesn't allow it to get through to your server. In the most severe spikes, the surge protectors should destroy themselves before allowing the voltage to get through to your server. The effectiveness of these stand-alone surge protectors depends on the speed of response to a large voltage increase and the mechanism of failure when their capacity is exceeded.

Many of the newer UPSs also provide protection from spikes. They have built-in surge protectors, plus isolation circuitry that tends to buffer the effects of spikes. The effectiveness of the spike protection in a UPS is not directly related to its cost, however—the overall cost of the UPS is more a factor of its effectiveness as an alternative power source. Your responsibility is to read the fine print and understand the limitations of the surge protection a given UPS offers. Also remember that just as with simple surge protectors, large voltage spikes can cause the surge protection to self-destruct rather than allow the voltage through to your server. That's the good news; the bad news is that instead of having to replace just a surge protector, you're likely to have to repair or replace the UPS.

Finally, one other spike protection mechanism can be helpful—the constant voltage transformer. You're not likely to see one unless you're in a large industrial setting, but they are often considered to be a sufficient replacement for other forms of surge protection. Unfortunately, they're not really optimal for surge protection. They will filter some excess voltage, but a large spike is likely to find its way through. However, in combination with either a fully protected UPS or a good stand-alone surge protector, a constant voltage transformer can be quite effective. And they provide additional protection against other forms of voltage variation that surge protectors alone can't begin to manage.

Surges

Voltage surges and voltage spikes are often discussed interchangeably, but we'd like to make a distinction here. For our purposes, a surge lasts longer than most spikes and isn't nearly as large. Most surges last a few hundred milliseconds and are rarely over 1000 volts. They can be caused by many of the same factors that cause voltage spikes.

Providing protection against surges is somewhat easier than protecting against large spikes. Most of the protection mechanisms just discussed will also adequately handle surges. In addition, most constant voltage transformers are sufficient to handle surges and may even handle them better if the surge is so prolonged that it might threaten to overheat and burn out a simple surge protector.

Sags

Voltage sags are short-term reductions in the voltage delivered. They aren't complete voltage failures or power outages and are shorter than a full-scale brownout. Voltage sags can drop the voltage well below 100 volts on a 110- to 120-volt normal line and will cause most servers to reboot if protection isn't provided.

Stand-alone surge protectors provide no defense against sags. You need a UPS or a very good constant voltage transformer to prevent damage from a voltage sag. Severe sags can overcome the rating of all but the best constant voltage transformers, so you generally shouldn't use constant voltage transformers as the sole protection against sags. A UPS, with its battery power supply, is an essential part of your protection from problems caused by voltage sag.

Brownouts

A brownout is a planned, deliberate reduction in voltage from your electric utility company. Brownouts most often occur in the heat of the summer and are designed to protect the utility company from overloading. They are not designed to protect the consumer, however.

In general, a brownout will reduce the available voltage by 5 to 20 percent from the normal value. A constant voltage transformer or a UPS provides excellent protection against brownouts, within limits. Prolonged brownouts may exceed your UPS's ability to maintain a charge at the same time that it is providing power at the correct voltage to your equipment. Monitor the health of your UPS carefully during a brownout, especially because the risk of a complete power outage will increase if the power company's voltage reduction strategy proves insufficient.

The best protection against extended brownouts is a constant voltage transformer of sufficient rating to fully support your critical network devices and servers. This transformer will take the reduced voltage provided by your power company and increase it to the rated output voltage. A good constant voltage transformer can handle most brownouts for an extended time without problems, but you should still supplement the constant voltage transformer with a quality UPS and surge protection between the transformer and the server or network device. This extra protection is especially important while the power company is attempting to restore power to full voltage because during this period you run a higher risk of experiencing power and voltage fluctuations.

Short-Term Power Outages

Short-term power outages are those that last from a few milliseconds to a few minutes. They can be caused by either internal or external events, but you can rarely plan for them even if they are internal. A server that is unprotected from a short-term power outage will, at the minimum, reboot or, at the worst, fail catastrophically.

You can best protect against a short-term power outage by using a UPS in combination with high-quality spike protection. Be aware that many momentary interruptions of power are accompanied by large spikes when the power is restored. Further, short-term power outages often come in rapid succession, causing additional stress to electronic components.

Long-Term Power Outages

Long-term power outages, lasting from many minutes to several days, are usually accompanied by other, even more serious problems. Long-term power outages can be caused by storms, earthquakes, fires, and the incompetence of electric power utilities, among other causes. As such, long-term power outages should be part of an overall disaster recovery plan. (See Chapter 33 for more on disaster planning.)

Protection against long-term power outages really becomes a decision about how long you will want or need to function if all power is out. If you need to function long enough to be able to gracefully shut down your network, a simple UPS or a collection of them will be sufficient, assuming that you've sized the UPS correctly. However, if you need to be sure that you can maintain the full functionality of your Windows 2000 network during an extended power outage, you're going to need a combination of one or more UPSs and an auxiliary generator.

If your situation requires an auxiliary generator to supplement your UPSs, you should carefully plan your power strategy to ensure that you provide power to all of the equipment that the network will require in the event of a long-term power outage. You should regularly test the effectiveness of your disaster recovery plans and make sure that all key personnel know how to start the auxiliary generator manually in the event it doesn't start automatically. Finally, you should have a regular preventive maintenance program in place that tests the generator and ensures that it is ready and functioning when you need it.

Disk Arrays

The most common hardware malfunction is probably a hard disk failure. Even though hard disks have become more reliable over time, they are still subject to failure, especially during their first month or so of use. They are also subject to both catastrophic and degenerative failures caused by power problems. Fortunately, disk arrays have become the norm for most servers, and good fault-tolerant RAID is available both as software in Windows 2000 Server and in dedicated RAID hardware supported by Windows 2000.

The choice of software or hardware RAID, and the particulars of how you configure your RAID system, can significantly affect the cost of your servers. To make an informed choice for your environment and needs, you must understand the trade-offs and the differences in fault tolerance, speed, configurability, and so on.

Hardware vs. Software

RAID can be implemented at the hardware level, using RAID controllers, or at the software level, either by the operating system or by a third-party add-on. Windows 2000 supports both hardware RAID and its own software RAID.

Hardware RAID implementations require specialized controllers and cost much more than an equal level of software RAID. But for that extra price, you get faster, more flexible, and more fault-tolerant RAID. When compared to the software RAID provided in Windows 2000 Server, a good hardware RAID controller supports more levels of RAID, on-the-fly reconfiguration of the arrays, hot-swap and hot-spare drives (discussed later in this chapter), and dedicated caching of both reads and writes.

The Windows 2000 Server software RAID requires that you convert your disks to dynamic disks. The disks will no longer be available to other operating systems, although this really shouldn't be a problem in a production environment. However, you should consider carefully whether you want to convert your boot disk to a dynamic disk. Dynamic disks can be more difficult to access if a problem occurs, and the Windows 2000 setup and installation program provides only limited support for them. For maximum fault tolerance, we recommend using hardware mirroring on your boot drive; if you do use software mirroring, make sure that you create the required fault-tolerant boot floppy disk and test it thoroughly before you need it. (See Chapter 33.)

RAID Levels for Fault Tolerance

Except for level 0, RAID is a mechanism for storing sufficient information on a group of hard disks such that even if one hard disk in the group fails, no information is lost. Some RAID arrangements go even further, providing protection in the event of multiple hard disk failures. The more common levels of RAID and their appropriateness in a fault-tolerant environment are shown in Table 35-1.

Table 35-1 RAID levels and their fault tolerance

Level 0 (disks: N; speed: +++; fault tolerance: ---)
Striping alone. Not fault-tolerant, but provides the fastest read and write performance.

Level 1 (disks: 2N; speed: +; fault tolerance: ++)
Mirror or duplex. Slightly faster reads than a single disk, but no gain during write operations. Failure of any single disk causes no loss of data and only a minimal performance hit.

Level 3 (disks: N+1; speed: ++; fault tolerance: +)
Byte-level parity. Data is striped across multiple drives at the byte level, with the parity information written to a single dedicated drive. Reads are much faster than with a single disk, but writes are slightly slower than with a single disk because parity information must be generated and written to that one drive. Failure of any single disk causes no loss of data but can cause a significant loss of performance.

Level 4 (disks: N+1; speed: ++; fault tolerance: +)
Block-level parity with a dedicated parity disk. Similar to RAID-3 except that data is striped at the block level.

Level 5 (disks: N+1; speed: +; fault tolerance: ++)
Interleaved block-level parity. Parity information is distributed across all drives. Reads are much faster than with a single disk, but writes are significantly slower. Failure of any single disk causes no loss of data but will result in a major reduction in performance.

Level 0+1, also known as level 10 (disks: 2N; speed: +++; fault tolerance: ++)
Striped mirrored disks. Data is striped across multiple mirrored disks. Failure of any one disk causes no data loss and no loss of speed. Failure of a second disk could result in data loss. Faster than a single disk for both reads and writes.

Other (disks: varies; speed: +++; fault tolerance: +++)
Array of RAID arrays. Different hardware vendors have different proprietary names for this RAID concept. Excellent read and write performance. Failure of any one drive results in no loss of performance and continued redundancy.

* N refers to the number of hard disks required to hold the original copy of the data. The plus and minus symbols show relative improvement or deterioration compared to a system using no RAID; the scale peaks at three symbols.
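The disk counts in Table 35-1 can be sketched as a small calculator (the level names and the two-drive example below are illustrative; real arrays have per-vendor minimums and constraints):

```python
def raw_disks_required(level: str, data_disks: int) -> int:
    """Raw drives needed to hold data that would fit on `data_disks` drives."""
    n = data_disks
    if level == "0":
        return n          # striping only: no redundancy, no extra drives
    if level in ("1", "0+1"):
        return 2 * n      # every data drive has a mirror copy
    if level in ("3", "4", "5"):
        return n + 1      # one drive's worth of parity, dedicated or distributed
    raise ValueError(f"unknown RAID level: {level}")

for level in ("0", "1", "3", "5", "0+1"):
    print(f"RAID-{level}: net 2 drives -> {raw_disks_required(level, 2)} raw drives")
```

This also shows where the footnote's "at least 33 percent" figure for RAID-1 comes from: 2N versus N+1 is smallest, 4 drives versus 3, when N is 2.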

When choosing the RAID level to use for a given application or server, consider the following factors:

  • Intended use: Will this application be primarily read-intensive, such as file serving, or will it be predominately write-intensive, such as a transactional database?

  • Fault tolerance: How critical is this data, and how much can you afford to lose?

  • Availability: Does this server or application need to be available at all times, or can you afford to reboot it or otherwise take it offline for brief periods?

  • Performance: Is this application or server heavily used, with large amounts of data being transferred to and from it, or is it less I/O intensive?

  • Cost: Are you on a tight budget for this server or application, or is the cost of data loss or unavailability the primary driving factor?

You need to evaluate each of these factors when you decide which type of RAID to use for a server or portion of a server. No one answer fits all cases, but the final answer will require you to carefully weigh each of these factors and balance them against your situation and your needs. The following sections take a closer look at each factor and how it weighs in the overall decision-making process.

Intended Use

The intended use, and the kind of disk access associated with that use, plays an important role in determining the best RAID level for your application. Think about how write-intensive the application is and whether the manner in which the application uses the data is more sequential or random. Is your application a three-square-meals-a-day kind of application, with relatively large chunks of data being read or written at a time, or is it more of a grazer or nibbler, reading and writing little bits of data from all sorts of different places?

If your application is relatively write-intensive, you'll want to avoid software RAID if possible and avoid RAID-5 if other considerations don't force you to it. With RAID-5, any application that requires greater than 50 percent writes to reads is likely to be at least somewhat slower if not much slower than it would be on a single disk. You can mitigate this to some extent by using more but smaller drives in your array and by using a hardware controller with a large cache to off-load the parity processing as much as possible. RAID-1, in either a mirror or duplex configuration, provides a high degree of fault tolerance with no significant penalty during write operations—a good choice for the Windows 2000 system disk.

If your application is primarily read-intensive, and the data is stored and referenced sequentially, RAID-3 or RAID-4 may be a good choice. Because the data is striped across many drives, you have parallel access to it, improving your throughput. And since the parity information is stored on a single drive, rather than dispersed across the array, sequential read operations don't have to skip over the parity information and are therefore faster. However, write operations will be substantially slower, and the single parity drive can become an I/O bottleneck.

If your application is primarily read-intensive and not necessarily sequential, RAID-5 is an obvious choice. It provides a good balance of speed and fault tolerance, and the cost is substantially less than RAID-1. Disk accesses are evenly distributed across multiple drives, and no one drive has the potential to be an I/O bottleneck. However, writes will require calculation of the parity information and the extra write of that parity, slowing write operations down significantly.
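RAID-3, -4, and -5 all rest on the same underlying idea: the parity block is the XOR of the data blocks in a stripe, so XOR-ing the surviving blocks regenerates whatever is missing. A toy sketch of that idea follows (byte strings stand in for disk blocks; no real controller works at this level, and real arrays operate on whole sectors in firmware):

```python
from functools import reduce

def parity(blocks):
    """XOR the blocks together byte by byte to form a parity block."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*blocks))

def rebuild(surviving_blocks):
    """XOR of the survivors (data plus parity) reconstructs the missing block."""
    return parity(surviving_blocks)

data = [b"AAAA", b"BBBB", b"CCCC"]  # one stripe across three data disks
p = parity(data)                    # what the parity drive (or stripe) holds

# Disk 2 fails; rebuild its block from the other data blocks plus parity.
restored = rebuild([data[0], data[2], p])
assert restored == data[1]
```

The rebuild cost is also visible here: recovering one block requires reading every other block in the stripe, which is why a degraded or rebuilding array performs so poorly.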

If your application provides other mechanisms for data recovery or uses large amounts of temporary storage that doesn't require fault tolerance, a simple RAID-0, with no fault tolerance but fast reads and writes, is a possibility.

Fault Tolerance

Carefully examine the fault tolerance of each of the possible RAID choices for your intended use. All RAID levels except RAID-0 provide some degree of fault tolerance, but the effect of a failure and the ability to recover from subsequent failures can be different.

If a drive in a RAID-1 mirror or duplex array fails, a full, complete, exact copy of the data remains. Access to your data or application is unimpeded, and performance degradation is minimal, although you will lose the benefit gained on read operations of being able to read from either disk. Until the failed disk is replaced, however, you will have no fault tolerance on the remaining disk.

In a RAID-3 or RAID-4 array, if one of the data disks fails, a significant performance degradation will occur since the missing data needs to be reconstructed from the parity information. Also, you'll have no fault tolerance until the failed disk is replaced. If it is the parity disk that fails, you'll have no fault tolerance until it is replaced, but also no performance degradation.

In a RAID-5 array, the loss of any disk will result in a significant performance degradation, and your fault tolerance will be gone until you replace the failed disk. Once you replace the disk, you won't return to fault tolerance until the entire array has a chance to rebuild itself, and performance will be seriously degraded during the rebuild process.

RAID systems that are arrays of arrays can provide for multiple failure tolerance. These arrays provide for multiple levels of redundancy and are appropriate for mission-critical applications that must be able to withstand the failure of more than one drive in an array.

Real World Multiple Disk Controllers Provide Increased Fault Tolerance

Spending the money for a hardware RAID system will increase your overall fault tolerance, but it can still leave a single point of failure in your disk subsystem: the disk controller itself. While failures of the disk controller are certainly less common, they do happen. Many hardware RAID systems are based on a single multiple-channel controller—certainly a better choice than those based on a single-channel controller, but an even better solution is a RAID system based on multiple identical controllers. In these systems, the failure of a single disk controller is not catastrophic but simply an annoyance. In RAID-1 this technique is known as duplexing, but it is also common with many of the proprietary arrays of arrays that are available from server vendors and in the third-party market.

Availability

All levels of RAID, except RAID-0, provide higher availability than a single drive. However, if availability is expanded to also include the overall performance level during failure mode, some RAID levels provide definite advantages over others. Specifically, RAID-1, mirroring/duplexing, provides enhanced availability when compared to RAID levels 3, 4, and 5 during failure mode. There is no performance degradation when compared to a single disk if one half of a mirror fails, while a RAID-5 array will have substantially compromised performance until the failed disk is replaced and the array is rebuilt.

In addition, RAID systems that are based on an array of arrays can provide higher availability than RAID levels 1 through 5. Running on multiple controllers, these arrays are able to tolerate the failure of more than one disk and the failure of one of the controllers, providing protection against the single point of failure inherent in any single-controller arrangement. RAID-1 that uses duplexed disks running on different controllers—as opposed to RAID-1 that uses mirroring on the same controller—also provides this additional protection and improved availability.

Hot-swap drives and hot-spare drives (discussed later in this chapter) can further improve availability in critical environments, especially hot-spare drives. By providing for automatic failover and rebuilding, they can reduce your exposure to catastrophic failure and provide for maximum availability.

Performance

The relative performance of each RAID level depends on the intended use. The best compromise for many situations is arguably RAID-5, but you should be suspicious of that compromise if your application is fairly write-intensive. Especially for relational database data and index files where the database is moderately or highly write-intensive, the performance hit of using RAID-5 can be substantial. A better alternative is to use RAID-0+1 (also known as RAID-10 from some vendors).

Whatever level of RAID you choose for your particular application, it will benefit from using more small disks rather than a few large disks. The more drives contributing to the stripe of the array, the greater the benefit of parallel reading and writing you'll be able to realize—and your array's overall speed will improve.

Cost

The difference in cost between RAID configurations is primarily the cost of drives, potentially including the cost of additional array enclosures because more drives are required for a particular level of RAID. RAID-1, whether duplexing or mirroring, is the most expensive of the conventional RAID levels: for the same net storage space, it requires at least 33 percent more raw disk space than the parity-based levels (2N drives versus N+1).

Another consideration is that RAID levels that include mirroring/duplexing must use drives in pairs. Therefore, it's more difficult (and more expensive) to add on to an array if you need additional space on the array. A net 18-GB RAID-0+1 array, comprising four 9-GB drives, requires four more 9-GB drives to double in size, a somewhat daunting prospect if your array cabinet has bays for only six drives, for example. A net 18-GB RAID-5 array, however, can be doubled in size simply by adding two more 9-GB drives, for a total of five drives.
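The expansion arithmetic above can be checked with a quick sketch (`net_capacity` is a hypothetical helper using the chapter's example drive sizes, not a Windows 2000 tool, and it assumes identical drives):

```python
def net_capacity(level: str, drives: int, size_gb: int) -> int:
    """Usable capacity of an array of identical drives, for two RAID levels."""
    if level == "0+1":
        return (drives // 2) * size_gb  # half the drives hold mirror copies
    if level == "5":
        return (drives - 1) * size_gb   # one drive's worth of space holds parity
    raise ValueError(f"unsupported level: {level}")

print(net_capacity("0+1", 4, 9))  # 18: four 9-GB drives net 18 GB
print(net_capacity("0+1", 8, 9))  # 36: doubling takes four more drives
print(net_capacity("5", 3, 9))    # 18: the same net space from only three drives
print(net_capacity("5", 5, 9))    # 36: doubling takes only two more drives
```

The asymmetry is the point: mirrored levels grow in pairs, while a parity array grows one data drive at a time.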

Hot-Swap and Hot-Spare Disk Systems

Hardware RAID systems can provide for both hot-swap and hot-spare capabilities. A hot-swap disk system allows failed hard disks to be removed and a replacement disk inserted into the array without powering down the system or rebooting the server. When the new drive is inserted, it is automatically recognized and either will be automatically configured into the array or can be manually configured into it. Additionally, many hot-swap RAID systems allow you to add hard disks into empty slots dynamically, automatically or manually increasing the size of the RAID volume on the fly without a reboot.

A hot-spare RAID configuration uses an additional, preconfigured disk or disks to automatically replace a failed disk. These systems usually don't support hot-swapping, so the failed disk can't be removed until the system can be powered down, but full fault tolerance is maintained by having the hot spare available.

Distributed File System

The Distributed file system (Dfs) is primarily a method of simplifying the view that users have of the available storage on a network—but it is also, when configured appropriately, a highly fault-tolerant storage mechanism. By configuring your Dfs root on a Windows 2000 domain controller, you can create a fault-tolerant, replicated, distributed file system that will give you great flexibility while presenting your user community with a cohesive and easy-to-navigate network file system.

When you create a fault-tolerant Dfs root on a domain controller and replicate it and the links below it across multiple servers, you create a highly fault-tolerant file system that has the added benefit of distributing the load evenly across the replicated shares, giving you a substantial scalability improvement as well. See Chapter 16 for more on setting up your Dfs and ensuring that replication works correctly.
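The core idea, one logical path backed by several replicated shares, can be sketched as follows (illustrative only; the server and share names are invented, and real Dfs referral logic is handled by the Dfs service, not client code like this):

```python
# Sketch of how a fault-tolerant Dfs root maps one logical path to
# several replicated shares. All names below are hypothetical.
import random

DFS_LINKS = {
    r"\\company\dfs\projects": [r"\\server1\projects", r"\\server2\projects"],
}

def resolve(dfs_path, is_up):
    """Return one available replica for a Dfs path, or None if all are
    down. Choosing at random also spreads client load across replicas."""
    replicas = [share for share in DFS_LINKS[dfs_path] if is_up(share)]
    return random.choice(replicas) if replicas else None

# With \\server1 down, clients are transparently referred to \\server2:
target = resolve(r"\\company\dfs\projects",
                 is_up=lambda share: "server1" not in share)
```

The same mechanism that provides fault tolerance here also provides the load distribution the text mentions: with all replicas up, clients are spread across them.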

Clustering

Windows 2000 Advanced Server supports two different kinds of clustering, either of which can greatly improve your fault tolerance:

  • For many TCP/IP-based applications, the Network Load Balancing service provides a simple, "shared nothing," fault-tolerant application server.

  • Server clusters provide a highly available fault-tolerant environment that can run applications, provide network services, and distribute loads.

Network Load Balancing

The Network Load Balancing service (called Windows Load Balancing Service in Microsoft Windows NT 4) allows TCP/IP-based applications to be spread dynamically across up to 32 servers. If a particular server fails, the load and connections to that server are dynamically balanced to the remaining servers, providing a highly fault-tolerant environment without the need for specialized, shared hardware. Individual servers within the cluster can have different hardware and capabilities, and the overall job of load balancing and failover happens automatically, with each server in the cluster running its own Windows 2000 copy of Wlbs.exe, the Network Load Balancing service.
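The distribution idea can be sketched in miniature (this is a simplified illustration, not the actual Wlbs.exe algorithm; host names and the hash choice are invented for the example):

```python
# Sketch of the idea behind Network Load Balancing: every host applies
# the same deterministic rule to each incoming connection, so exactly one
# live host accepts it; when a host fails, its share of the client space
# is re-divided among the survivors.
import zlib

def owner(client_ip, live_hosts):
    """Pick the cluster host that handles this client, by hashing the
    client address over the current list of live hosts."""
    return live_hosts[zlib.crc32(client_ip.encode()) % len(live_hosts)]

hosts = ["node1", "node2", "node3"]
before = owner("10.0.0.42", hosts)

# The owning node fails; the survivors re-divide the client space:
survivors = [h for h in hosts if h != before]
after = owner("10.0.0.42", survivors)
```

Because every host computes the same answer independently, no shared hardware or central dispatcher is needed, which is what "shared nothing" refers to above.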

Server Clusters

Server clusters, unlike network load balancing, depend on a shared resource between nodes of the cluster. This resource, which in the initial shipment of Windows 2000 Advanced Server must be a shared disk resource, is generally a shared SCSI or Fibre Channel–attached disk array. Each server in the cluster is connected to the shared resource, and the common database that manages the clustering is stored on this shared disk resource.

Nodes in the cluster generally have identical hardware and identical capabilities, although it is technically possible to create a server cluster with dissimilar nodes. In the initial release of Windows 2000 Advanced Server, only two-node clusters are supported for server clusters, although this restriction and the restriction on the type of shared resource are likely to change with later releases.

Server clusters provide a highly fault-tolerant and configurable environment for mission-critical services and applications. Applications don't need to be specially written to take advantage of the fault tolerance of a server cluster, although an application written to be cluster aware can take advantage of additional controls and features in a failover and failback scenario.
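A toy model of two-node failover illustrates the behavior described above (purely a sketch; the real cluster service arbitrates ownership through the shared quorum disk, and the node names are invented):

```python
# Toy model of two-node server-cluster failover. When the node that owns
# a resource group stops responding, the surviving node takes ownership
# and the application restarts there.

class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes        # e.g. ["nodeA", "nodeB"]
        self.owner = nodes[0]     # node currently running the resource group

    def heartbeat_lost(self, failed_node):
        """Fail the resource group over to a surviving node if the
        owning node has gone silent."""
        if failed_node == self.owner:
            survivors = [n for n in self.nodes if n != failed_node]
            self.owner = survivors[0]

cluster = Cluster(["nodeA", "nodeB"])
cluster.heartbeat_lost("nodeA")   # nodeB now owns the resource group
```

Failback, mentioned above, is simply the reverse transition once the failed node rejoins the cluster.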

Summary

Building a highly available and fault-tolerant system requires you to carefully evaluate both your requirements and your resources to eliminate single points of failure within the system. You should evaluate each of the hardware subsystems within the overall system for fault tolerance, and ensure that recovery procedures are clearly understood and practiced, to reduce recovery time in the event of a failure. Uninterruptible power supplies, RAID systems, distributed file systems, and clustering are all methods for improving fault tolerance. In the next chapter, we discuss the registry: what it is, how it's structured, and how to back it up and restore it.

About The Authors

Charlie Russel and Sharon Crawford are coauthors of numerous books on operating systems. Their titles include Running Microsoft Windows NT Server 4, UNIX and Linux Answers, NT and UNIX Intranet Secrets, and Upgrading to Windows 98.

Charlie Russel has years of system administration experience with a specialty in combined Windows NT and UNIX networks. In addition to his books with Ms. Crawford, he has also written ABCs of Windows NT Workstation 4.0 and SCO OpenServer and Windows Networking.

Sharon Crawford is a former editor now engaged in writing full time. She is the author of Windows 98: No Experience Required and ABCs of Windows 98, and the coauthor of Windows 2000 Professional for Dummies (with Andy Rathbone). Ms. Crawford also writes a regular column on Windows 2000 and Windows 98 for the online bookseller Fatbrain.com (at https://www.fatbrain.com/hottechnologies.html).

Copyright © 2000, Charlie Russel and Sharon Crawford

We at Microsoft Corporation hope that the information in this work is valuable to you. Your use of the information contained in this work, however, is at your sole risk. All information in this work is provided "as is," without any warranty, whether express or implied, of its accuracy, completeness, fitness for a particular purpose, title or non-infringement, and none of the third-party products or information mentioned in the work are authored, recommended, supported or guaranteed by Microsoft Corporation. Microsoft Corporation shall not be liable for any damages you may sustain by using this information, whether direct, indirect, special, incidental or consequential, even if it has been advised of the possibility of such damages. All prices for products mentioned in this document are subject to change without notice.

