

Chapter 2 - Planning Your Cluster Environment

Archived content. No warranty is made as to technical accuracy. Content may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist.

This chapter covers the options you must consider as you plan how to set up MSCS to work most efficiently in your organization.

The following planning topics are discussed:

  • Risk auditing 

  • Cluster models 

  • Capacity planning 

  • Domain models 

  • Disk fault tolerance 

  • Optimization 

For general information on risk auditing and Windows NT Server, see the Windows NT Server Resource Kit Networking Guide (Microsoft Press®).

Auditing Your Risk

When you audit network risk, you identify the possible failures that can interrupt access to network resources. A single point of failure is any component in your environment that would block data or applications if it failed. Single points of failure can be hardware, software, or external dependencies, such as power supplied by a utility company and dedicated wide area network (WAN) lines.

In general, you provide maximum reliability when you:

  • Minimize the number of single points of failure in your environment. 

  • Provide mechanisms that maintain service when a failure occurs. 

Addressing Risks with MSCS and Windows NT Server

With MSCS, you can use clustering technology and new administrative procedures to provide increased reliability. However, MSCS is not designed to protect all components of your workflow in all circumstances. For example, MSCS is not an alternative to backing up data—MSCS protects only availability of data, not the data itself.

Windows NT Server has built-in features that protect certain computer and network processes during failure. These features include mirroring and redundant array of inexpensive disks (RAID Level 5, striping with parity). When planning your MSCS environment, look for places where these features can help you in ways that MSCS cannot.

Table 2.1 lists common points of failure in a Windows NT Server environment and describes whether the point of failure can be protected, either by MSCS or by other means.

Table 2.1 Protecting Single Points of Failure 

  Failure point                                        | MSCS solution | Other solutions
  Network hub                                          | (none)        | Redundant networks
  Utility-company power                                | (none)        | Uninterruptible power supply (UPS)
  Server connection                                    | Failover      | (none)
  Disk                                                 | (none)        | Hardware RAID, to protect the data on the disk
  Other server hardware, such as CPU or memory         | Failover      | (none)
  Server software, such as the operating system or specific applications | Failover | (none)
  WAN links, such as routers and dedicated lines       | (none)        | Redundant links over the WAN, to provide secondary access to remote connections
  Dial-up connection                                   | (none)        | Multiple modems
  Client computer within your organization             | (none)        | Configuring multiple clients for the same level of access (if one client fails, you still have access through other clients)

To further increase the availability of network resources and prevent the loss of data:

  • Consider having replacement disks and controllers available at your site. (Always make sure that any spare parts you keep on hand exactly match the original parts, including network and SCSI components.) For example, the cost of two spare SCSI controllers can be a small fraction of the cost of having hundreds of clients unable to use data. 

  • Consider providing UPS protection for individual computers and the network itself, including hubs, bridges, and routers. Computers running Windows NT Server support UPS. Many UPS solutions provide power for 5 to 20 minutes, which is long enough for the operating system to do an orderly shutdown when power fails. 

For more information about the built-in features of Windows NT Server, see the Windows NT Server Version 4.0 Concepts and Planning and the Windows NT Server Version 4.0 Networking Supplement. To access the books online on a computer running Windows NT Server, click Start, point to Programs, and click Books Online.

Determining Applications to Use with MSCS

Many—but not all—applications can be adapted to work with MSCS. Of those that can, not all need to be set up as MSCS failover resources in every environment. This section offers guidelines for making these decisions.

Two criteria determine whether an application can adapt to MSCS failover mechanisms:

  • The application must use TCP/IP. 

    Client/server applications must use TCP/IP (or DCOM, Named Pipes, or RPC over TCP/IP) for their network communications in order to work with MSCS. Any application that uses only NetBEUI or IPX protocols cannot take advantage of the MSCS failover feature. 

  • The application must be able to use remote storage for its data (that is, you must be able to specify where the application data is stored). 

    Any application you use with MSCS must be able to store its data in a configurable location, that is, on the disks attached to shared SCSI buses. Some applications that cannot store their data in a configurable location can still be configured to fail over. However, in such cases access to the application data is lost at failover because the data is available only on the disk of the failed node. 
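The second criterion can be made concrete with a short sketch. The class below is purely illustrative (MSCS defines no such application API); the point is that the data location is a constructor parameter, so an administrator can point it at a disk on the shared SCSI bus rather than a hard-coded node-local path.

```python
import os

class OrderService:
    """Hypothetical application whose data location is configurable, which
    is what allows its data to live on a shared-SCSI-bus disk and remain
    reachable after failover."""

    def __init__(self, data_root):
        # data_root could name a local disk (data unreachable at failover)
        # or a shared-bus disk (reattached by the surviving node).
        self.data_root = data_root
        os.makedirs(data_root, exist_ok=True)

    def save_order(self, order_id, payload):
        with open(os.path.join(self.data_root, order_id + ".txt"), "w") as f:
            f.write(payload)

    def load_order(self, order_id):
        with open(os.path.join(self.data_root, order_id + ".txt")) as f:
            return f.read()
```

An application that instead hard-codes a local path can still be configured to fail over, but, as noted above, its data stays behind on the failed node.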

Applications that can be failed over can be further divided into two groups: those that support the MSCS API and those that do not. Applications that support the MSCS API are defined as MSCS-aware. These applications can register with the Cluster Service to receive status and notification information, and can use the MSCS API to administer MSCS clusters.

Applications that do not support the MSCS API are defined as MSCS-unaware. If MSCS-unaware applications meet the TCP/IP and remote-storage criteria, they can still be used with MSCS and often can be configured to fail over.

In either case, applications that keep significant state information in memory are not the best applications for clustering because information that is not stored on disk is lost at failover.

Determining Failover Policies for Groups

You assign the failover policies for each group of resources in your MSCS environment. These policies determine exactly how a group behaves when failover occurs. You can choose which policies are most appropriate for each resource group you set up.

Failover policies for groups include three settings:

  • Failover timing 

    You can set a group for immediate failover when a resource fails, or you can instruct MSCS to try to restart the group a number of times before failover occurs. If it is possible that the resource failure can be overcome by restarting all resources within the group, then set MSCS to restart the group. 

  • Failback timing 

    You can set a group to fail back to its preferred node as soon as MSCS detects that the failed node has been restored, or you can instruct MSCS to wait until a specified hour of the day, such as after peak business hours. 

  • Preferred node 

    You can set a group so that it always runs on a designated node whenever that node is available. This is useful if one of the nodes is better equipped to host the group. 
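The three settings above can be summarized as a small decision sketch. The field and function names are invented for illustration and are not the actual MSCS property names.

```python
from typing import Optional, Tuple

class GroupFailoverPolicy:
    """Illustrative container for the three per-group settings."""
    def __init__(self, restart_attempts: int,
                 failback_window: Optional[Tuple[int, int]],
                 preferred_node: Optional[str]):
        self.restart_attempts = restart_attempts  # local restarts before failover
        self.failback_window = failback_window    # (start_hour, end_hour); None = fail back immediately
        self.preferred_node = preferred_node      # node that should host the group when available

def should_fail_over(policy: GroupFailoverPolicy, failed_restarts: int) -> bool:
    # Fail over only after the configured restart attempts are exhausted.
    return failed_restarts >= policy.restart_attempts

def may_fail_back(policy: GroupFailoverPolicy, hour: int) -> bool:
    # Immediate failback, or failback only inside the off-peak window.
    if policy.failback_window is None:
        return True
    start, end = policy.failback_window
    return start <= hour < end
```

For example, a group configured with three restart attempts and a (22, 23) failback window is restarted locally three times before moving to the other node, and returns to its preferred node only between 22:00 and 23:00.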

Choosing a Cluster Model

MSCS clusters can be categorized in five configuration models. This section describes each model and gives examples of the types of applications that are suited to each.

Model A: High-Availability Solution with Static Load Balancing

This model provides high availability, acceptable performance when only one node is online, and high performance when both nodes are online. The model also allows maximum utilization of your hardware resources.

In this model, each of the two nodes makes its own set of resources available to the network in the form of virtual servers, which can be detected and accessed by clients. The capacity for each node is chosen so that the resources on each node run at optimum performance, but so that either node can temporarily take on the burden of running the resources from the other when failover occurs. Depending on the resource and server capacity specifications, all client services remain available during failover, but performance can suffer.


Figure 2.1 Configuration for high availability with static load balancing 

For example, you can use this model for a cluster dedicated to file-sharing and print-spooling services. Two file and print shares are established as separate groups, one on each node. If one node fails, the other node temporarily takes on the file-sharing and print-spooling services for both shares. The failover policy for the group that is temporarily relocated is set to prefer its original node. When the failed node is restored, the relocated group returns to the control of its preferred node, and operations resume at normal performance. Services are available to clients throughout the process with only a minor interruption.

  • Availability: High 

  • Suggested failover policies: Assign a preferred server to each group. 

  • Suggested failback parameters: Allow failback for all groups. 

Two Business Scenarios

Using Model A can solve two problems that typically occur in a large computing environment. The first problem occurs when a single server is running multiple large applications, causing a degradation in performance. To solve the problem, a second server is clustered with the first, and the applications are split across the servers.

The second problem involves related applications running on separate servers. The problem of availability arises when the two servers are not connected. By placing them in a cluster, the client is ensured greater availability of both applications.

Case One: Improving Performance

Suppose your corporate intranet relies on a server that runs two large database applications. Both of these databases are critical to hundreds of users, who repeatedly connect to this server throughout the day.

Problem: During peak connect times, the server cannot keep up with the demand, and performance degrades.

Solution: To alleviate the problem, you attach another server to your overloaded server, form a cluster, and balance the load. You now have two servers, each running one of the database applications. If one server ever fails, you are put back in your original situation of degraded performance, but only temporarily. When the failed server is restored, the application it was running fails back, and operations resume.


Figure 2.2 Case 1: Adding a server to improve performance 

Case Two: Ensuring Availability

Suppose your retail business relies on two separate servers, one that provides external Internet Web services and another that provides a database application for inventory and ordering information.

Problem: Both of these services are essential to your business. Without Web access, customers cannot browse your online catalog. Without access to the database application, customers cannot place orders, and employees cannot access inventory or shipping information.


Figure 2.3 Case 2, the problem: two unclustered servers that depend on each other 

Solution: To ensure the availability of all services, you join the computers into a cluster. This solution is similar to the one in the previous scenario, but you use it to take advantage of clustering differently.

You create a cluster that contains two groups, one on each node. One group contains all of the resources needed to run the Web-service applications, such as IP addresses, and the other group contains all of the resources for the database application, including the database itself.


Figure 2.4 Case 2, the solution: clustering two servers to improve availability 

In the failover policies of each group, specify that both groups can run on either node, thereby assuring their availability if one node should fail.

Model B: "Hot Spare" Solution with Maximum Availability

This model provides the maximum availability and performance for your resources, but requires an investment in hardware that is not in use most of the time.

One node, called a primary node, supports all clients, while its companion node is idle. The companion node is a dedicated "hot spare," ready to be used whenever a failover occurs. If the primary node fails, the "hot spare" node immediately picks up all operations and continues to service clients at a rate of performance that is close or equal to that of the primary node. (Exact performance depends on the capacity of the "hot spare" node.)


Figure 2.5 "Hot spare" configuration 

This model is best suited for your organization's most important applications and resources. For example, if your organization relies on sales over the World Wide Web, you can use this model to provide "hot spare" nodes for all servers dedicated to supporting Web access, such as those servers running Internet Information Server (IIS). The expense of doubling your hardware in this area is justified by the protection of clients' access to your organization. If one of your Web servers fails, a second server is fully configured to take over its operations.

  • Availability: Very high 

  • Suggested failover policies: This depends on capacity. If your budget allows for a "hot spare" server with identical capacity to its primary node, then you do not need to set a preferred server for any of the groups. If one node has greater capacity than the other, setting the group failover policies to prefer the larger server keeps performance as high as possible. 

  • Suggested failback parameters: This, too, depends on capacity. If the "hot spare" node has identical capacity to the primary node, prevent failback for all groups. If the "hot spare" node has less capacity than the primary node, set the policy for immediate failback or for failback at a specified off-peak hour. 
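The suggested policies for this model reduce to a simple rule keyed on relative capacity. A sketch, with names invented for illustration:

```python
def hot_spare_policies(primary_capacity: int, spare_capacity: int) -> dict:
    """Encode the Model B suggestions: with an identically sized spare there
    is no reason to prefer either node or to fail back; with a smaller
    spare, prefer the larger node and fail back to it."""
    if spare_capacity >= primary_capacity:
        return {"preferred_node": None, "failback": "prevent"}
    return {"preferred_node": "primary", "failback": "immediate or off-peak"}
```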

Model C: Partial MSCS Solution

This model demonstrates the flexibility of using applications that cannot fail over on the same servers from which MSCS groups are set to fail over.

One of the steps in planning resource groups is identifying applications that you will not configure to fail over. Those applications can reside on servers that form MSCS clusters, but they must store their data on local disks, not on disks on the shared SCSI bus. Because these applications are not subject to failover, they have only normal availability. If the availability of these applications is important, you must find other methods of providing it.


Figure 2.6 Configuration with mixed failover policies 

Figure 2.6 shows group failover. The remaining applications also serve clients from one of the servers, but because they are not configured for failover, no failover policies are established for them. For example, you might use an MSCS node to run a mail server that has not been designed to use MSCS failover ability, or for an accounting application that is used so infrequently that availability is not important.

When node failure occurs, the applications that are not configured with failover policies are unavailable unless they have built-in failover mechanisms of their own. They remain unavailable until the node on which they run is restored; you must either restart them manually or set Windows NT to automatically start them when the system software starts. The applications you configured with failover policies fail over as usual, according to those policies.

  • Availability: High for MSCS applications, normal for others 

  • Suggested failover policies: Variable 

  • Suggested failback parameters: Variable 

Model D: Virtual-Server-Only Solution (No Failover)

This model shows how you can use the virtual server concept with applications on an MSCS node to which no companion node is attached.

This cluster model makes no use of the MSCS failover capabilities. It is merely a way of organizing the resources on a server for administrative convenience and for the convenience of your clients. The main advantage is that both administrators and clients can readily see descriptively named virtual servers on the network rather than navigating a list of actual servers to find the shares they need.

Other advantages of this model include:

  • MSCS automatically restarts the various groups of applications and their dependent resources after the server is restored (following a failure). This is useful for applications that benefit from automatic restart but do not have their own mechanisms for accomplishing it. 

  • The single node can be clustered with a second node at a future time, and the resource groups are already in place. After you configure failover policies for the groups, the virtual servers are ready to operate. 


Figure 2.7 Virtual server only, with no second node or failover policies 

For example, you can use this model to locate all of your organization's file and print resources on a single machine, establishing separate groups for each department. When clients from one department need to connect to the appropriate file or print share, they can find the share as easily as they would find an actual computer.

  • Availability: Normal 

  • Suggested failover policies: Not applicable 

  • Suggested failback parameters: Not applicable 

Model E: Hybrid Solution

The final model is a hybrid of the others. Using this model, you can incorporate advantages of the previous models and combine them in one cluster. As long as you have provided sufficient capacity, many types of failover scenarios can coexist on the same two nodes. All failover activity occurs as normal, according to the policies you set up.


Figure 2.8 Hybrid configuration 

Figure 2.8 shows an example of static load balancing for two database shares, allowing somewhat reduced performance when both shares are on a single node. For administrative convenience, the two file-and-print shares in the cluster (which do not require failover ability) are grouped logically by department and configured as virtual servers. Finally, an application that cannot fail over resides on one of the nodes and operates normally, without any failover protection.

  • Availability: High or very high for the MSCS resources set to fail over 

  • Suggested failover policy: Variable 

Capacity Planning

After you assess your clustering needs, you are ready to determine how many server computers you need and with what specifications (RAM, hard-disk storage, and CPU power). This is capacity planning.

Quantifying Your Groups

There are six steps you can take to organize your applications and other resources into groups:

  • List all your server-based applications. 

  • Sort the list of applications. 

  • List all non-application resources. 

  • List all dependencies for each resource. 

  • Make preliminary grouping decisions. 

  • Make final grouping assignments. 

List All Your Server-Based Applications

Most groups contain one or more applications. Make a list of all applications in your environment, regardless of whether or not you plan to use them with MSCS. Your total capacity needs are determined by the sum of the total number of groups (virtual servers) that will run in your environment and the total amount of other software you will run independently of groups.

Sort the List of Applications

Determine which of your applications can use the MSCS failover feature. These can include any applications that qualify for failover by the two criteria described in the previous section: TCP/IP and ability to store data remotely (that is, you must be able to specify where the data is stored).

Also list applications that will reside on MSCS nodes but which will not use the failover feature because it is inconvenient, unnecessary, or impossible to configure the applications for failover. Although you do not set failover policies for these applications or arrange them in groups, they still use a portion of the server capacity.

Licensing Considerations

Before clustering an application, review the application license, or check with the application vendor. Each application vendor sets its own licensing policies for applications running on MSCS clusters.

The current Microsoft server-application licensing policy applies to MSCS clusters: A Microsoft server application must be separately licensed for each server on which it is installed.

The Microsoft client licensing policy applies to MSCS cluster nodes. If you use "per-seat" Client Access Licenses for a server application, then those licenses apply when a client is accessing the application on either server in the cluster. If you use "per-server" (or "concurrent use") Client Access Licenses for the application, then each cluster node should have a sufficient number of per-server Client Access Licenses for the expected peak load of the application on that node. (Note that "per-server" Client Access Licenses do not fail over from one node in the cluster to the other.)
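Because per-server Client Access Licenses do not fail over, each node must be licensed for the peak concurrent clients of every group it might be hosting at once, including groups failed over from its partner. A back-of-the-envelope sketch with invented numbers:

```python
def per_server_cals_needed(peak_clients: dict, groups_node_may_host: list) -> int:
    """Sum the expected peak concurrent clients of every group this node
    could host simultaneously, including groups failed over from the
    other node (per-server CALs do not move between nodes)."""
    return sum(peak_clients[g] for g in groups_node_may_host)

# Hypothetical two-node cluster: after a failover, either node may host
# both groups, so each node needs licenses for the combined peak.
peaks = {"FileShare": 120, "Database": 80}
cals_per_node = per_server_cals_needed(peaks, ["FileShare", "Database"])
```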

List All Non-Application Resources

Using Table 2.1 (earlier in this chapter), determine which hardware, connections, and operating-system software MSCS can protect in your network environment.

For example, MSCS can fail over print spoolers, to protect client access to printing services. Another example is a file-server resource, which you can set to fail over, maintaining client access to files. In both cases, capacity is affected, including the RAM required to service the clients and the disk space required to store the data.

List All Dependencies for Each Resource

When you create this list, include all resources that support the core resources. For example, if a Web-server application fails over, the Web addresses and disks on the shared SCSI buses containing the files for that application must also fail over if the Web server is to function. All these resources must be located in the same group. This ensures that MSCS keeps interdependent resources together at all times.

Two guidelines for quantifying resources are:

  • A resource and its dependencies must be together in a single group. 

  • A resource cannot span groups. 

    For example, if several applications depend on a particular resource, you must include all of those applications with that resource in a single group.
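Both guidelines can be checked mechanically. The sketch below (data structures invented for illustration) flags a resource whose dependency lives in another group, and a resource assigned to more than one group:

```python
def validate_groups(groups: dict, dependencies: dict) -> list:
    """Check the two rules: every resource's dependencies must live in the
    same group, and no resource may appear in more than one group.
    `groups` maps group name -> set of resources; `dependencies` maps
    resource -> set of resources it directly depends on."""
    errors = []
    seen = {}
    for name, members in groups.items():
        for resource in members:
            if resource in seen:
                errors.append(f"{resource} spans groups {seen[resource]} and {name}")
            seen[resource] = name
            for dep in dependencies.get(resource, set()):
                if dep not in members:
                    errors.append(f"{dep} (needed by {resource}) is outside group {name}")
    return errors
```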

Make Preliminary Grouping Decisions

Another factor in the way you assign groups is administrative convenience. For example, you might put several applications into one group because viewing those particular applications as a single entity makes it easier to administer the network.

A common use of this technique is to combine file-sharing resources and print-spooling resources in a single group. All dependencies for those applications must also be in the group. You can give this group a unique name for the part of your organization it serves, such as "Accounting File&Print." Whenever you need to intervene with the file- and print-sharing activities for that department, you would look for this group in Cluster Administrator.

Another common practice is to put applications that depend on a particular resource into a single group. For example, suppose a Web-server application provides access to Web pages, and that those Web pages provide result sets that clients access by querying an SQL-database application, through the use of Hypertext Markup Language (HTML) forms. By putting the Web server and the SQL database in the same group, the data for both core applications can reside on a specific disk volume. Because both applications exist within the same group, you can also create an IP address and network name specifically for this resource group.

Make Final Grouping Assignments

After you list the resources that you want to group together, assign a different name to each group, and create a dependency tree. A dependency tree is useful for visualizing the dependency relationships between resources.

To create a dependency tree, first write down all the resources in a particular group. Then draw arrows from each resource to each resource on which the resource directly depends.

A direct dependency between resource A and resource B means that there are no intermediary resources between the two resources. An indirect dependency occurs when a transitive relationship exists between resources. If resource A depends on resource B and resource B depends on resource C, there is an indirect dependency between resource A and resource C. However, resource A is not directly dependent on resource C.

Figure 2.9 shows the resources in a final grouping assignment in a dependency tree.


Figure 2.9 Final grouping assignments, including dependencies 

In Figure 2.9, the File Share resource depends on the Network Name resource, which in turn depends on the IP Address resource. However, the File Share resource does not directly depend on the IP Address resource. In this example, both the Network Name resource and the IIS Virtual Root resource depend on the IP Address resource. However, there is no dependency between the Network Name resource and the IIS Virtual Root resource. In fact, these relationships can be viewed as two separate dependency trees: One tree includes the IIS Virtual Root resource and the IP Address resource, and the other tree contains the File Share resource, Network Name resource, and IP Address resource.
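The direct/indirect distinction can be made concrete with a small traversal over the Figure 2.9 relationships (a sketch; the mapping-of-lists structure is simply one convenient encoding of the arrows in a dependency tree):

```python
def depends_on(direct: dict, a: str, b: str) -> bool:
    """True if resource `a` depends on `b`, directly or transitively."""
    stack = list(direct.get(a, ()))
    seen = set()
    while stack:
        r = stack.pop()
        if r == b:
            return True
        if r not in seen:
            seen.add(r)
            stack.extend(direct.get(r, ()))
    return False

# The Figure 2.9 arrows: each resource maps to its direct dependencies.
direct = {
    "File Share": ["Network Name"],
    "Network Name": ["IP Address"],
    "IIS Virtual Root": ["IP Address"],
}
```

Here `depends_on(direct, "File Share", "IP Address")` is true through the Network Name resource, even though "IP Address" never appears in the File Share resource's direct list.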

For information on using Cluster Administrator to set resource dependencies, see "Setting Resource Dependencies" in Chapter 4, "Managing MSCS."

Determining Server-Capacity Requirements

After you choose a cluster model, decide how to group your resources, and determine the failover policies each resource requires, you are ready to determine the hardware capacity required for each server in the cluster. The following sections explain the criteria for choosing computers for use as cluster nodes.

  • Hard-disk-storage requirements 

    Each node in a cluster must have enough hard-disk capacity to store permanent copies of all applications and other resources required to run all groups. Calculate this for each node as if all of these resources in the cluster were running on that node, even if some or all of those groups run on the other node most of the time. Plan these disk-space allowances so that either node can efficiently run all resources during failover. 

  • CPU requirements 

    Failover can strain the CPU processing capacity of an MSCS server when the server takes control of the resources from a failed node. Without proper planning, the CPU of a surviving node can be pushed beyond its practical capacity during failover, slowing response time for users. Plan your CPU capacity on each node so that it can accommodate new resources without unreasonably affecting responsiveness. 

  • RAM requirements 

    When planning your capacity, make sure that each node in your cluster has enough RAM to run all applications that may run on either node. Also, make sure your Windows NT paging files are set appropriately for each node's physical memory. 
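The three requirements above amount to summing the demands of every group in the cluster as if all groups ran on one node, plus that node's local overhead. A sketch with invented figures:

```python
def node_capacity_needed(groups: list, local_overhead: dict) -> dict:
    """Size each node as if every group in the cluster were running on it.
    Each group is a dict of resource demands; `local_overhead` covers the
    operating system and any non-clustered software on the node."""
    totals = dict(local_overhead)
    for group in groups:
        for key, amount in group.items():
            totals[key] = totals.get(key, 0) + amount
    return totals

# Hypothetical demands for a two-group cluster.
groups = [
    {"ram_mb": 256, "disk_gb": 8, "cpu_pct": 30},   # database group
    {"ram_mb": 128, "disk_gb": 4, "cpu_pct": 20},   # file-and-print group
]
needed = node_capacity_needed(groups, {"ram_mb": 64, "cpu_pct": 10})
```

With these numbers, each node would be sized for 448 MB of RAM, 12 GB of disk, and 60 percent sustained CPU, so that either node can carry the full cluster during failover.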

Choosing a Domain Model

The following Windows NT Server, Enterprise Edition configurations are supported within a cluster:

  • Member servers 

    Each server is configured as a Windows NT member server (although each node must still be a member of the same domain). 

  • BDC-BDC 

    Each server is a backup domain controller (BDC) in an existing domain. 

  • PDC-BDC 

    Both servers are configured within a self-contained domain, with one server acting as a primary domain controller (PDC), and the other as a BDC. 

If you install MSCS on member servers or on two BDCs in an existing domain, you preserve your existing domain model. If you create a new domain and install MSCS on a PDC and a BDC, you must establish domain trusts with your existing domains so users can access the MSCS servers. These can be either two-way trusts or one-way trusts with your existing domains, depending on your domain model.

If you install MSCS on a PDC-BDC pair, install the first node on the PDC. This allows the service account to be accessed and modified without complicating domain security. The second node of the cluster should be the BDC. If the PDC fails, the BDC functions in its stead. However, clustering is not involved in this procedure. The roles of PDC and BDC cannot be made a failover resource. MSCS is concerned only with the services provided by the domain controllers, not the domain roles themselves.

If you plan to install MSCS on either PDC-BDC or BDC-BDC node pairs, review the hardware choices you made in capacity planning. You should account for the additional overhead that is incurred by the domain controller services. In large networks running on Windows NT, substantial resources can be required by domain controllers for performing directory-replication and server-authentication for clients. For this reason, many applications, such as Microsoft SQL Server and Microsoft Message Queue Server, recommend that you not install the application on domain controllers (DCs) for best performance. However, if you have a very small network in which account information rarely changes and in which users do not log on and off frequently, you can use DCs as cluster nodes.

The simplest and most efficient server configuration installs all MSCS nodes as DCs in one domain that contains only MSCS nodes. This configuration provides the following benefits:

  • Because all groups and accounts have a domain scope, access-control problems with local users and local groups are eliminated. 

    Similarly, on non-domain controllers, you should not specify local accounts and groups when setting access permissions for files and folders that reside on any disk on the shared SCSI bus. This is not a problem on DCs because the local accounts and groups have security context on all DCs in the domain. 

  • Minimal resources are required to support logon requests and replication because a) cluster nodes typically do not log on and off frequently, and b) user accounts rarely change, so little replication is required. 

  • Because you are adding only one domain to your existing domain structure, minimal trust-relationship changes are required. 

For more information on Windows NT domain models and establishing trusts, see Chapter 1 in Windows NT Server Version 4.0 Concepts and Planning. To access this book online on a computer running Windows NT Server, click Start, point to Programs, and click Books Online.

Planning for Fault-Tolerant Disks

Many groups include disk resources for disks on shared SCSI buses. In some cases, these are simple physical disks, but in other cases they are complex disk subsystems containing multiple disks. Almost all resource groups depend on the disks on the shared SCSI buses. An unrecoverable failure of a disk resource results in certain failure of all groups that depend on that resource.

For these reasons, you might decide to use special methods to protect your disks and disk subsystems from failures. One common solution is the use of a hardware-based redundant array of inexpensive disks (RAID) solution. RAID support ensures the high availability of data contained on disk sets in your clusters. Some of these hardware-based solutions are considered fault tolerant, which means that data will not be lost if a member of the disk set fails.

Although Windows NT Server includes support for software fault-tolerant disk sets, this option is not supported within MSCS.

For more information on managing disks, see Chapter 7 in Windows NT Server Version 4.0 Concepts and Planning. To access the book online on a computer running Windows NT Server, click Start, point to Programs, and click Books Online.

For information on recovering from disk failures, see "Recovering Disk-Subsystem Failures" in Chapter 4, "Managing MSCS."

Note The hardware configurations mentioned in this book are generic. For specific information about supported hardware, see the MSCS Hardware Compatibility List. For specific information about MSCS original-equipment-manufacturer (OEM) configurations that have been verified by Microsoft, check with your hardware vendor.

Hardware RAID

The MSCS hardware-software bundles validated by Microsoft use many different hardware RAID configurations. Because many hardware RAID solutions provide power, bus, and cable redundancy within a single cabinet, and track the state of each component in the hardware RAID firmware, they provide data availability with multiple redundancy, protecting against multiple points of failure.

Hardware RAID solutions also use an onboard processor and cache to provide outstanding performance.

Windows NT (and therefore MSCS) can use these disks as standard disk resources.

Software RAID solutions are not supported for MSCS clusters.

Error Recovery

With transaction logging and recovery, Windows NT file system (NTFS) ensures that the volume structure will not be corrupted, so all files remain accessible after a system failure. Windows NT also provides two kinds of disk-error recovery:

  • Dynamic data recovery using sector-sparing 

    This is available only on SCSI disks that are configured as part of a fault-tolerant volume. Sector sparing works on fault-tolerant volumes because a copy of the data on the sector with the error can be regenerated. 

  • NTFS disk cluster-remapping

    NTFS file systems use the recovery technique called cluster-remapping. When Windows NT returns a bad-sector error to the NTFS file system, NTFS dynamically replaces the disk cluster containing the bad sector and allocates a new disk cluster for the data. If the error occurs during a read, NTFS returns a read error to the calling program, and the data is lost (unless protected by RAID fault tolerance). When the error occurs during a write, NTFS writes the data to the new disk cluster, and no data is lost. NTFS puts the address of the disk cluster containing the bad sector in its Bad Sector file so that the bad sector is not reused.
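
    The remapping behavior described above can be sketched as a simple model. The following Python sketch is purely illustrative; every name in it is invented, and none corresponds to an actual NTFS structure or API.

    ```python
    # Toy model of NTFS cluster remapping; names are invented for illustration.

    def write_cluster(cluster, data, free_clusters, bad_clusters, volume):
        """Write path: on a bad-sector error, retire the cluster (the real file
        system records it in the Bad Sector file), allocate a fresh cluster,
        and complete the write -- no data is lost."""
        if cluster in bad_clusters:          # simulated bad-sector error
            free_clusters.discard(cluster)
            cluster = free_clusters.pop()    # allocate a replacement disk cluster
        volume[cluster] = data
        return cluster                       # caller records the new location

    def read_cluster(cluster, bad_clusters, volume):
        """Read path: a bad sector surfaces as a read error to the caller, and
        the data is lost unless RAID fault tolerance can regenerate it."""
        if cluster in bad_clusters:
            raise IOError("bad sector encountered on read")
        return volume[cluster]

    free, bad, vol = {11, 12}, {10}, {}
    where = write_cluster(10, b"payload", free, bad, vol)
    print(where, vol[where])    # a cluster other than 10, with the data intact
    ```

    Note the asymmetry the sketch captures: writes survive a bad sector because the data is still in hand and can simply land elsewhere, but reads cannot recover data that only ever existed on the failed sector.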

Even with transaction logging and recovery, sector-sparing, and disk cluster-remapping, user data can be lost due to hardware failure if you do not use a fault-tolerant-disk solution.

Replacing Drives on the Shared SCSI Bus

MSCS relies on Windows NT Server to handle all normal operations on disk sets. For some configurations, a special issue arises concerning the signature that exists on every volume within a disk set.

Windows NT uses a unique number (called a disk signature) to store and retrieve information about the disk. Windows NT Disk Administrator writes a disk signature to each physical disk the first time you run Disk Administrator after installing the disk (it does this when it does not find a disk signature on the disk). When prompted, you should always click Yes; if you do not, Windows NT cannot access the disk.

MSCS relies on these disk signatures to track disk resources. When a drive on the shared SCSI bus fails and is replaced, you must again run the Windows NT Server Disk Administrator to initialize the disk signature for use with MSCS. When you run Disk Administrator, it automatically writes new disk signatures to each new disk it detects, displaying a dialog box before writing them.
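
    On MBR-partitioned disks, the disk signature is a 4-byte value stored at byte offset 0x1B8 of the first sector. The following Python sketch, purely for illustration, shows how the value could be extracted from a copy of that sector; it is not an administrative tool, and Disk Administrator remains the supported way to manage signatures.

    ```python
    import struct

    DISK_SIGNATURE_OFFSET = 0x1B8  # byte offset of the 4-byte signature in the MBR

    def read_disk_signature(mbr):
        """Return the disk signature from the raw bytes of a disk's first sector."""
        if len(mbr) < DISK_SIGNATURE_OFFSET + 4:
            raise ValueError("need at least the first 444 bytes of the sector")
        # On-disk x86 structures are stored little-endian.
        return struct.unpack_from("<I", mbr, DISK_SIGNATURE_OFFSET)[0]

    # Fabricated 512-byte boot sector carrying the signature 0x12345678:
    sector = bytearray(512)
    sector[DISK_SIGNATURE_OFFSET:DISK_SIGNATURE_OFFSET + 4] = (0x12345678).to_bytes(4, "little")
    print(hex(read_disk_signature(bytes(sector))))  # 0x12345678
    ```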

For more information on replacing drives on the shared SCSI bus, see "Adding or Replacing Drives on the Shared SCSI Bus" in Chapter 4, "Managing MSCS."

Optimizing Your Clusters

Because Windows NT Server uses an adaptable architecture, it is largely self-tuning; that is, it is able to allocate resources as needed to meet changing usage requirements. However, there are some hardware and configuration choices you can make to improve cluster performance. These choices are explained in the following sections:

  • Focusing on the right hardware resources 

  • Optimizing servers for different roles

  • Disk optimization through planning 

  • Optimizing the paging file size and location 

  • Using additional TCP/IP services 

  • Tuning services 

Focusing on the Right Hardware Resources

The goal in tuning MSCS and hosted applications is to determine which hardware resource will experience the greatest demand, and then to adjust the configuration to relieve that demand and maximize total throughput. A system should be structured so that its resources are used efficiently.

For example, if the primary role of the cluster is to provide high availability of file and print services, high disk use will be incurred due to the large number of files being opened and closed. File and print services also cause a heavy load on network adapters because of the large amount of data that is being transferred. It is important to make sure that your network adapter can handle the load. In this scenario, memory typically does not carry a heavy load (although memory usage can be heavy if a large amount of system memory is allocated to file-system cache). Processor utilization is also typically low in this environment. In such cases, memory and processor utilization usually do not need the optimizing that other components need.

In contrast, a server-application environment (such as one running Microsoft SQL Server, Microsoft Exchange Server, Microsoft Systems Management Server, and Microsoft SNA Server) is much more processor- and memory-bound than a typical file-and-print-server environment because much more processing takes place at the server. In these cases, it is best to implement high-end multiprocessor computers. The disk and network tend to be less heavily utilized because less data is sent over the wire and to the disk. MSCS itself uses little of the system resources, either for intracluster communications or for the operation of the cluster itself.

Optimizing Servers for Different Roles

You can optimize Windows NT Server for file-and-print-server performance or application-server performance using the Properties dialog box for the Server service (which is part of the Network option in Control Panel; on the Services tab, click Server, and then click Properties). You can maximize the server for file and print services or for application throughput (the default), or you can select a balance of the two. After you specify these settings, Windows NT Server makes other necessary registry modifications that affect performance in keeping with the balance you selected. It is recommended that you do not select Maximize Throughput for File Sharing unless the server will provide only file-and-print services. Selecting this option leaves very little memory available for any other applications.
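
    For reference, the dialog-box selection is stored in the registry as the Size value entry of the Server service parameters (1 = minimize memory used, 2 = balance, 3 = maximize throughput); the dialog box remains the supported way to change it. Shown here in .reg form for illustration only (always back up the registry before editing it):

    ```
    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters]
    "Size"=dword:00000002
    ```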

Optimizing the Paging File Size and Location

The location and size of the paging files can greatly affect performance. Paging files cannot be located on the shared SCSI bus, but if multiple unshared drives are available on the individual servers, placing the paging file on a separate, fast, low-use drive can boost performance. The size setting of the paging file is also critical to performance. A paging file that must constantly expand and shrink requires additional processor and I/O overhead. As a general rule, if active applications such as Microsoft SQL Server or Microsoft Exchange Server are present, set the size of your paging file to two to two-and-a-half times the amount of installed physical memory.
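
    The sizing rule above is simple arithmetic; as a sketch, using a hypothetical memory figure:

    ```python
    def pagefile_range_mb(physical_mb):
        """Recommended paging-file size range, in MB, using the rule of thumb
        above (two to two-and-a-half times physical memory) for nodes running
        active server applications."""
        return 2 * physical_mb, int(2.5 * physical_mb)

    # For a node with 256 MB of installed physical memory:
    low, high = pagefile_range_mb(256)
    print(low, high)  # 512 640
    ```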

Using Additional TCP/IP Services

The WINS, DHCP, and DNS services use additional system resources, which you must consider before implementing any of these services on a cluster server. In particular, WINS can require substantial overhead during peak logon times in large networks. In such cases, make sure that your system has the necessary capacity to perform adequately.

Important These services are not clustered; the node itself acts as the server for each service. If any of these services are to run on a node of an MSCS cluster, make sure to implement backup servers for each of these services to provide redundancy, just as you would normally do on your network.

Tuning Services

You can tune or even stop two Windows NT services to further optimize performance: NetLogon and Browser. Of course, stopping these services has a limiting effect on the functionality of the server.

Tuning the NetLogon Service

The NetLogon service provides users with a single access point to a PDC and all BDCs. It also replicates changes made to the directory database on the PDC to all domain controllers. This service plays an important role in Windows NT networking but can negatively affect performance on cluster nodes that primarily serve applications.

On a domain controller, the Windows NT NetLogon service is responsible for:

  • Synchronizing the directory-services database between the PDC and BDCs 

  • Validating user-account logons 

  • Supporting trust relationships between domains 

  • Providing for membership of computers in a domain 

For a cluster server that also acts as a domain controller, it is possible to tune the NetLogon service to suit your requirements. Several registry parameters associated with the NetLogon service control how it synchronizes with other domain controllers. If necessary, you can even schedule the service, pausing it during busy periods and resuming it during off-peak periods. This can be accomplished with the Windows NT job scheduler (the AT command), using jobs that pause and continue the service.
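
    For example, the following AT commands pause the NetLogon service at the start of each weekday and resume it after peak hours so that synchronization can catch up overnight. The times and weekday abbreviations here are illustrative; check at /? on your server for the exact syntax.

    ```
    at 08:00 /every:M,T,W,Th,F "net pause netlogon"
    at 18:00 /every:M,T,W,Th,F "net continue netlogon"
    ```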

For more information on the NetLogon service, see Chapter 2 of the Windows NT Server 4.0 Resource Kit Networking Guide.

Tuning the Browser Service

Users often need to know which domains and computers are accessible from their local computers. Viewing all the available network resources is called browsing. The Windows NT Browser service maintains a list (called the browse list) of all available domains and servers. The browse list can be viewed using Windows NT Explorer and is provided by a browser in the local computer's domain.

By default, every computer running Windows NT Server participates in this service. You can optimize cluster performance by preventing the cluster nodes from participating in the browser service. You can make this configuration change only by setting Windows NT registry parameters. This action does not prevent the server from browsing the network itself, nor does it prevent clients from browsing its resources.
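
    The parameter involved is the MaintainServerList value entry under the Browser service key. Shown here in .reg form for illustration only (always back up the registry before editing it):

    ```
    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Browser\Parameters]
    "MaintainServerList"="No"
    ```

    Setting MaintainServerList to No prevents the node from becoming a browser; the change takes effect after the Browser service (or the computer) is restarted.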

For more information on the Browser service, see Chapter 3 of the Windows NT Server 4.0 Resource Kit Networking Guide.
