Windows 2000 Performance Tuning

This white paper provides information on how to tune the Microsoft® Windows® 2000 operating system to achieve optimal performance. It also explains how to test the performance capabilities of Windows 2000; presents data, generated using various IBM Netfinity servers and industry benchmarks, that shows the performance capabilities of Windows 2000 when running in an optimized environment; and, finally, shows how to use the integrated performance monitoring tools in Windows 2000 to eliminate potential performance bottlenecks.

On This Page

About IBM Netfinity
Introduction
Client Performance
File Server Performance
Web Server Performance
Networking Performance
Performance Bottleneck Analysis
Conclusion
Appendix 1: Benchmark Configurations and Checklists
Appendix 2: Test Lab Configuration
Appendix 3: Offline Files
Appendix 4: Tuning for Gigabit Adapters
Appendix 5: Lists of Figures and Tables

About IBM Netfinity


The Netfinity 5000 is an affordable, versatile, two-way Pentium processor server. Offering powerful performance at an entry-level price, this server is designed for small-to-medium business application and file serving. Available in tower or rack models, it features an integrated Advanced System Management processor and management software for superior control.

The Netfinity 7000 M10 is a four-way Pentium III Xeon processor server. This system is powerful, versatile, and fast enough to handle the demands of large enterprises. With four-way symmetric multiprocessing (SMP) and subsystems balanced to coax every bit of power out of Intel's fastest chips, the Netfinity 7000 M10 easily handles extreme demands, such as server consolidation, clustering, e-business intelligence or enterprise resource planning.

Introduction

This paper is intended above all to serve as a guide to configuring and tuning the Microsoft® Windows® 2000 operating system to achieve optimal performance in the following environments:

  • Client

  • Networking

  • File server

  • Web server

A brief description of what will be covered in this document follows.

Performance Benchmarks

This paper uses industry benchmarks and capacity planning tools to demonstrate the performance capabilities of Windows 2000. Most benchmarks represent a specific workload or Information Technology (IT) environment. Although good benchmarks are designed to simulate real customer workloads, it is impossible for any single benchmark to simulate every customer environment. However, benchmark results have long been used to describe and compare the performance capabilities of competing systems. Benchmark results are one of many criteria used by IT professionals when determining which platform best fits their needs.

The various tools used for this paper, such as WebLoad and WebBench, can also be used to test custom workloads. Because these benchmarks can be customized to adapt to a specific customer environment and traffic pattern, these tools allow customers to simulate and test the performance and capacity of their environments more accurately.

Furthermore, this paper provides you with detailed information on how to optimize Windows 2000 for certain workloads and benchmarks.

Tuning for Performance

For the most part, Windows 2000 is a self-tuning operating system. This means that in most cases, Windows 2000 automatically adapts to perform optimally right out of the box depending on the environment in which it's running—assuming that the hardware is properly configured. For instance, when you deploy Windows 2000 as a Web server, other services that are also present but not used are put into a state where they occupy very few system resources such as CPU and memory. However, as with any operating system, performance depends on many outside factors such as hardware, device drivers, applications, workload, the network, and so forth. In addition, there are certain practices and tuning guidelines that can be followed to optimize the performance of Windows 2000 in certain environments. These parameters will be discussed in detail throughout this paper.

Hardware

The selection of hardware is critical to ensuring maximum performance. If a system contains a component that has not been optimized for the operating system, performance is sure to suffer. For instance, selecting a video card with a poorly written video driver can result in poor performance and/or a poor benchmark score on client computers. The same is true for other critical components such as network adapters (sometimes called network interface cards or NICs) and Redundant Array of Inexpensive Disks (RAID) controllers on a server. Each section of this paper provides a generic hardware configuration for each deployment scenario. We strongly recommend that you check with your hardware supplier to ensure that the key components used in the system (video, disk subsystem, RAID controller, network adapters, and so forth) have been optimized for Windows 2000.

Performance Results

Each section of this paper includes a number of performance results presented in chart form. Two different hardware configurations have been used to produce these results:

  • Minimum recommended configuration (small business): Single CPU, 256 megabytes (MB) of RAM, Fast Ethernet network

  • Enterprise configuration: Four CPUs, 2 gigabytes (GB) of RAM, Gigabit Ethernet network

To provide you with a baseline, we have done testing using both the Windows NT® 4.0 operating system and Windows 2000.

Performance Bottlenecks

In order to fully understand any benchmark result, it is crucial to understand the impediments to better system performance: bottlenecks. Specific sections of this paper are devoted to identifying and eliminating common bottlenecks on Windows 2000–based systems.

Client Performance

Introduction

Client systems are used for a wide variety of purposes, including application development, computer-aided design/computer-aided manufacturing (CAD/CAM), scientific simulation modeling, gaming, and for running office productivity applications (spreadsheets, word processors, e-mail clients, and browsers).

Client performance is heavily dependent on the intended purpose of the system. For instance, a client system used primarily to surf the Web and read e-mail has different requirements than a system used to design automobiles with a CAD/CAM software package. Thus it's important to select the appropriate configuration for the task.

Client Benchmarks

There are a number of benchmarks that attempt to simulate desktop operations in a manner similar to the operations of real users—Sysmark98 and Winstone99 are two such benchmarks. Both of these benchmarks provide insight into the performance of a given hardware configuration and, to a lesser extent, the performance of the underlying operating system. Both Sysmark98 and Winstone99 make use of real applications and both use application functions that real-world users would use. The use of real applications makes these benchmarks more realistic than benchmarks that simply time application programming interfaces (APIs) or call functions within tight loops. However, there are some important caveats to bear in mind with these benchmarks:

  • They drive the applications at superhuman speeds. Without keystroke delays or think time, these benchmarks can execute thousands of operations in a given period of time, whereas a normal user may execute only a handful. As a result, the periodic functions of an operating system, such as those that write dirty pages to disk every second or two, or those that trim process working sets once per second, end up reacting to the system's state instead of proactively working to keep the system well balanced.

  • They neglect to use the network. By leaving the network out, these benchmarks ignore one of the most common causes of delays and fluctuations in interactive responsiveness. While this makes the benchmarks more convenient for testers to set up and run, it reduces their realism.

    Interactive responsiveness is the time a user waits for an operation to complete or an application to respond. For many operations that take more than a second to complete, the disk or network wait time dominates the total time and the CPU is busy for just a small percentage of the total time.

  • They measure how long large sets of operations take to complete instead of timing individual operations. When only the total time for a large number of operations is measured, system components that have little impact on how a real user experiences the computer can have a significant impact on the benchmark score. Video driver functionality is an example. Each video operation performed with one video card may be slightly slower than the same operation with another card, making the aggregate time spent on video operations longer. However, as a percentage of the time in which a user would perceive slowness, the time spent in the video driver may be insignificant. That is, no user can perceive the difference between an operation that takes 0.05 seconds to complete and one that takes 0.058 seconds, and yet if a million of these operations are timed, the insignificant becomes significant in the overall time and total benchmark score.
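The third caveat is easy to quantify. A minimal sketch, using the 0.05-second and 0.058-second figures from the example above (the one-million operation count is an illustrative assumption):

```python
# Per-operation times from the example above: imperceptible to a user.
fast_op = 0.050   # seconds per operation with one video card
slow_op = 0.058   # seconds per operation with another

per_op_gap = slow_op - fast_op            # 8 ms: no user notices this
aggregate_gap = per_op_gap * 1_000_000    # summed over a million timed operations

print(f"per-operation difference: {per_op_gap * 1000:.0f} ms")
print(f"aggregate difference: {aggregate_gap:.0f} seconds added to the benchmark run")
```

An 8-millisecond gap that no user would ever notice accumulates into thousands of seconds of benchmark time once a million operations are summed.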

Minimum Configuration

  • Pentium 133 MHz CPU

  • 32 MB of RAM

There are a number of other benchmarks that are outside the scope of this paper, such as graphic and engineering design benchmarks like SPECapc for Pro/Engineer and gaming benchmarks like Quake II and Unreal. For more information on some of the engineering design benchmarks, see https://www.spec.org/gpc.

Client Hardware Recommendations

The recommendations in the Minimum Configuration and Recommended Configuration lists are intended for a typical business desktop user running e-mail, Web browsers, and office productivity software. It is always best to consult the documentation from the application vendor for the recommended hardware configuration for the application.

General Client Tuning Recommendations

  • New installation. When installing Windows 2000, it is recommended that you install it on a newly formatted partition using the appropriate file system: file allocation table (FAT), FAT32, or the Windows NT file system (NTFS). When installing Windows 2000 on a newly formatted partition, the Windows 2000 installation process automatically optimizes the file locations so defragmenting the hard disk may not be required.

    Recommended Configuration

    • Pentium 133 MHz CPU or better

    • 64 MB of RAM

  • Upgrade installation. When upgrading to Windows 2000, it is critical that you defragment the disk. The upgrade process often results in files being spread across the entire partition. By defragmenting the disk, the number of disk input/output (I/O) operations required is often reduced, thus improving performance.

  • File system conversion. Again, it is recommended that you choose the file system that will be used with the system during installation. However, if you decide to convert the file system after installation (for example, from FAT to NTFS) it is recommended that you defragment the hard disk to achieve optimal system performance.

  • System page file. If your configuration has two or more hard disk drives, we recommend that you move the system page file to a different drive than the one where the Windows 2000 operating system is installed.

Client Benchmark Results

Client Benchmark Checklist

  • Select appropriate hardware.

  • Install Windows 2000 on a newly formatted partition.

  • When installing a new version of Windows 2000 on a newly formatted partition you do NOT need to defragment the hard disk.

  • If you are upgrading to Windows 2000, you must defragment the hard disk before testing.

  • If you change the file system (for example, from FAT to NTFS), you must defragment the hard disk before testing.

Figures 1, 2, and 3 below show how various configurations of Windows 2000 Professional compare to prior versions of the Windows operating systems, as measured (respectively) using the Business Winstone99 benchmark from ZD Labs, the high-end Winstone99 also from ZD Labs, and in terms of operating system startup time.


Figure 1: On configurations with 64 MB, Windows 2000 Professional is significantly faster than Windows 98 and comparable to Windows NT Workstation 4.0.


Figure 2: Windows 2000 Professional is significantly faster than Windows NT Workstation 4.0 on high-end workloads. The test system was a Gateway PII 500 MHz with an ATI Rage Pro AGP video card and an Adaptec AHA-2940U2/U2W PCI SCSI controller.


Figure 3: Graph showing the time it takes from starting up (cold boot, standby, and hibernate) to logging into a domain. In standby and hibernate, Windows 2000 also maintains application state.

Performance Bottlenecks

The CPU remains a prominent performance bottleneck for CAD/CAM applications or scientific modeling; however, for the typical user, the processing power of today's CPU is becoming less and less of a performance bottleneck. In fact, in typical circumstances, the CPU is often more than 90 percent idle while the computer is in use.

For a typical configuration running common applications such as word processors, e-mail programs and Web browsers, interactive responsiveness is affected more by disk I/O and network operations than by the speed of the processor.

Similarly, the interactive responsiveness of a typical hardware configuration connected to a network, even a 10–megabit per second (Mbps) corporate network, is often affected more by the speed with which network requests are satisfied than by the processing power of today's CPUs. A typical processor in computers sold today routinely executes 100 million instructions or more per second, but often it takes many seconds to download files or Web pages on fast network connections. Thus, the user is more often waiting on the network than on the CPU.

In operations where the network isn't involved, such as running a word processor from a local hard disk, the disk I/O usually has the greatest impact on performance. It takes a typical disk around 10 milliseconds to complete a random I/O request, and today's disks peak at about 100 random disk I/O operations per second. The first start of many of today's business productivity applications, such as Microsoft Word, often requires 300 or more random disk I/O operations, and as much as three seconds to complete. The three seconds are almost entirely due to disk I/O requests. Subsequent starts of the application are nearly instantaneous because the application is resident in memory—therefore, the CPU doesn't have to wait on time-consuming disk I/O operations.
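The disk arithmetic above can be checked directly; a small sketch using the 10-millisecond random I/O figure and the 300-I/O application start cited in the text:

```python
seek_time_s = 0.010            # ~10 ms per random disk I/O, as stated above
peak_iops = 1 / seek_time_s    # peak random I/O operations per second

first_start_ios = 300          # random I/Os for a first application start
first_start_secs = first_start_ios * seek_time_s

print(f"peak random I/O rate: {peak_iops:.0f} IOPS")
print(f"disk wait for first start: ~{first_start_secs:.0f} seconds")
```

Three hundred random I/Os at roughly 10 milliseconds each accounts for the approximately three seconds of first-start time, independent of CPU speed.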

File Server Performance

Introduction

File sharing has been a core service in the Windows NT Server line of operating systems (now Windows 2000 Server) since it was first introduced. Through the various releases, Microsoft has incorporated many enhancements in the operating system to provide higher levels of file server performance. This section focuses on setting up and configuring Windows 2000 for optimal file server performance, measuring file server performance, and common file server performance bottlenecks.

File Server Benchmarks

Benchmarks such as NetBench, among others, simulate a file server workload and provide an estimate of how a particular hardware and software configuration will perform under load.

Recommended Test Environment

  • 60 client systems for NetBench

  • At least 30 clients each with a 400 MHz Pentium II CPU, the remainder with at least a 200 MHz Pentium II CPU; faster servers will require faster clients

  • 64 MB of RAM in each client

  • A 100Base-TX network adapter in each client

  • A 500 MB disk minimum in each client

  • A full-duplex, switched network

Small Business/Departmental File Server Configuration

  • CPU: 500 MHz Pentium III

  • RAM: 256 MB

  • Network: 2 x 100Base-TX

  • Disk: RAID 5 controller with at least 4 SCSI disks

Enterprise File Server Configuration

  • CPU: 4 x 500 MHz Pentium III Xeon

  • RAM: 1 GB

  • Network: 2 Gigabit Ethernet

  • Disk: RAID 5 controller with at least 10 SCSI disks

NetBench increases the load on a file server by adding client systems to the test. The number of client systems participating in a test mix does not represent the number of simultaneous users that a file server can support. This is because each client system represents a greater load than a real user would generate. Benchmarks are designed this way so they can use as few client systems as possible while still generating a load on a file server that is large enough to make full use of hardware resources.

The major shortcoming with the hyper-fast client benchmarking approach is that it does not let Windows 2000 self-adjust its performance or do house cleaning, such as by writing dirty pages to disk the way it would in normal use. Furthermore, NetBench represents a specific file server workload, which attempts to simulate how a user would access a file server (by executing a series of read, write, change, delete, and other operations). As a result, the numbers reported using NetBench as well as other benchmarks may be different than what you will experience when you actually deploy a file server.

File Server Configurations

For this paper, we tested two file server configurations: a small business/departmental configuration and an enterprise configuration. In order to simplify our testing procedures and to eliminate the disk as a bottleneck, we used the same RAID disk subsystem for both configurations. The Server Configurations section in Appendix 2: Test Lab Configuration gives the detailed configurations of the servers we used.

File Server Hardware Selection and Performance Tuning Concepts

The hardware components you choose to put into a file server can make a tremendous difference in the level of performance it attains. However, it is cost-effective to add only those components that will improve performance in your specific environment. For example, a higher speed CPU will boost performance on a server whose performance is hindered by the limitations of its CPU but will do very little for a system whose performance is limited by its disks. We'll talk more about how to find out what is limiting the performance of your server and how to eliminate such bottlenecks in the File Server Performance Bottlenecks section below.

Throughput is the most commonly used performance metric for file servers. It is the total number of data bits that the clients send to the server and that the server returns to the clients per unit time, measured in megabits per second (Mbps).

Average response time is the average time a file server takes to complete all of the different file system and I/O operations a client system requests.
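As a sketch, both metrics can be computed from raw measurements; the byte count and latencies below are made-up sample values, not benchmark data:

```python
def throughput_mbps(total_bytes: int, seconds: float) -> float:
    """Total bits moved between clients and server per second, in Mbps."""
    return (total_bytes * 8) / seconds / 1_000_000

def average_response_ms(latencies_ms: list) -> float:
    """Average time the server took to complete each client request."""
    return sum(latencies_ms) / len(latencies_ms)

# Hypothetical 60-second measurement interval.
print(throughput_mbps(total_bytes=750_000_000, seconds=60.0))  # 100.0 Mbps
print(average_response_ms([2.0, 3.0, 7.0]))                    # 4.0 ms
```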

In a file server, there are four key hardware components that you can tune to achieve optimal system performance:

  • CPU configuration. Changing the CPU speed, the CPU cache size, and/or the number of CPUs in a file server will have an effect on performance. Which option you choose depends on how much the CPU limits performance and how your server can be expanded.

  • Memory. The amount of memory in a file server can have a dramatic effect on performance. The more RAM a server has, the more aggressively it will be able to cache frequently requested files; however, in order for file servers to take advantage of system memory, a good disk subsystem is also required, especially during periods of heavy write operations (saves).

  • Disk subsystem. The configuration of the disk subsystem also has a major effect on file server performance and will most likely be the first bottleneck a file server will hit—which is why we recommend using a hardware RAID configuration with multiple disks rather than a single disk. As a rule of thumb, the more disks in a RAID, the faster the disk I/O. Also, more cache in a RAID controller will improve its performance. A RAID 0 configuration will be faster than a RAID 5 configuration with the same number of disks; however, RAID 5 provides data reliability.

  • Network subsystem. The network subsystem is a very common bottleneck for file servers. The type of network, the number of network adapters/segments, and the specific network adapters used will have a significant effect on file server performance. The total bandwidth of all of the network adapters in a file server limits the maximum number of data bits that can be sent to or received from all of the clients. For example, two 100Base-TX network adapters, each with an effective maximum bandwidth of about 90 Mbps, would limit file server throughput to 180 Mbps. On systems where the CPU is the bottleneck and there is still bandwidth available from the network subsystem, you can improve throughput by using more network adapters or by using smart network adapters that make use of the advanced TCP/IP offloading features in Windows 2000. For optimal performance, the network adapters in a file server should be in full-duplex mode and connected to a switch.
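The network ceiling described in the last bullet is simple to estimate; this sketch reproduces the two-adapter example from the text:

```python
def nic_throughput_ceiling_mbps(adapters: int, effective_mbps: float) -> float:
    """Aggregate throughput ceiling imposed by the network adapters."""
    return adapters * effective_mbps

# Two 100Base-TX adapters at ~90 Mbps effective each, as in the example above.
print(nic_throughput_ceiling_mbps(2, 90.0))  # 180.0
```

No amount of CPU, memory, or disk tuning can push file server throughput past this ceiling; only adding adapters or moving to a faster network raises it.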

Resource Partitioning

Another way to optimize the performance of your file server is by using resource partitioning. Resource partitioning means configuring the hardware, software, I/O requests, and data so that the load on a file server is divided as evenly as possible among the available resources. This simple optimization method can yield significant performance improvements. Specifically, you can distribute file server load among PCI buses, network adapters, CPUs, and data partitions in the following ways:

  • PCI buses. Distribute the network adapters and RAID controller across multiple buses, if your server has them, so that the total I/O bandwidth for each PCI bus is as equal as possible. In general, it is a good idea to put a RAID controller on a bus that does not have a network adapter. For the best results, we recommend using a server with a 64-bit, 66 MHz PCI bus.

  • Network adapters. Balance the client load among the network adapters. This is commonly accomplished by segmenting the network among the available network adapters.

  • CPUs. Windows 2000 includes general algorithms for balancing the load among CPUs in a multiprocessor system. However, you can improve performance in some cases by binding interrupts for selected controllers to specific CPUs.

  • Data partitions. Distribute data files among multiple RAID partitions, logical drives, or volumes so that the NTFS logging/volume resources are available among multiple partitions. This also has the added benefit of allowing an administrator to back up each partition separately, thereby keeping the rest of the file server available.

General File Server Tuning Recommendations

Table 1 below provides recommendations on tuning your hardware for optimal file server performance with Windows 2000.

Interrupt-Affinity Filter

IntFiltr allows a user to change the CPU affinity of the interrupts in a multiprocessor system. Using this utility, you can direct the interrupts from any device to a specific processor or to a set of processors, instead of the default behavior of sending interrupts to all of the CPUs in a system. IntFiltr lets each device have its own interrupt-affinity setting. This tool can be downloaded from Microsoft at: https://download.microsoft.com/download/win2000platform/utility/1.001/nt5/en-us/intfiltr.exe

Table 1 Recommended tuning for file server hardware.

Processors (CPUs)

Small business configuration:

· Using CPUs with a larger L2 cache has a positive impact on performance.

Enterprise configuration:

· Using CPUs with a larger L2 cache has a positive impact on performance.
· Use the interrupt-affinity filter to assign interrupts from each network adapter to different CPUs.

Network (both configurations)

· Set network adapter receive buffers for optimal performance.
· Balance client I/O among network adapters.
· Use network adapters that support the TCP checksum offloading features in Windows 2000.

Disk (both configurations)

· Put the RAID controller in the fastest bus slot available, ideally a bus without network adapters or other controllers that generate many interrupts.
· Set cache memory as large as possible.
· Set the cache write policy to write back.
· Set the cache read policy to read ahead.
· Set the stripe size to 8 KB/16 KB for RAID 5 and to 64 KB/128 KB (or the maximum supported) for RAID 0; the optimal size may vary by controller.
· Use multiple logical NTFS partitions. These can be set up using the RAID configuration software or the Logical Disk Manager in Windows 2000.
· Use a 16 KB allocation unit size when formatting the NTFS volumes (format <drive>: /fs:ntfs /A:16K).
· Increase the NTFS log file size to 64 MB for large volumes (chkdsk /L:65536).

Data (both configurations)

· Distribute access to data on the server across multiple network segments if applicable.
· Distribute your data as evenly as possible across multiple data volumes.

Provided that you use hardware that has been optimized for Windows 2000, and you follow the hardware guidelines presented above in Table 1, Windows 2000 will perform well as a file server right out of the box, as Figure 4 below demonstrates. Therefore, you may find that these tuning recommendations for a Windows 2000–based file server may offer only a modest improvement in performance. Keep in mind that the specific hardware in your server may have different tuning requirements.


Figure 4: Chart based on NetBench results showing the difference between running Windows 2000 in a tuned environment with Offline Files (CSC) enabled and directly out of the box without CSC enabled. Hardware was selected and configured based on the guidelines in this document. A higher value demonstrates better file server throughput.

Offline Files, sometimes referred to as client-side caching (CSC), is a new feature of Windows 2000 Professional that can also provide performance benefits in file server environments, as Figure 5 below shows. Offline Files are useful to clients that aren't always connected to the network on which their files are stored. Users who work with local copies of their files can also store them on a network file share for redundancy. Offline Files also provide some performance benefits when a client is connected to a network: when a user opens a file from a network share that has been marked for offline use, the file is in effect read locally rather than over the network. See Appendix 3: Offline Files for details on configuring Offline Files.


Figure 5: Chart, based on results generated using NetBench, demonstrating the performance benefit of enabling Offline Files, also known as client-side caching (CSC). A higher value demonstrates better file server throughput.

Table 11 in Appendix 1: Benchmark Configurations and Checklists provides the detailed steps you should take when testing Windows 2000 with NetBench. Note that other benchmarks, which represent different workloads, may have different tuning requirements.

File Server Performance Bottlenecks

There are many things that you can do to improve the performance of a Windows 2000–based file server. However, before you can improve its performance, you must first understand the limiting factor(s) or bottleneck(s). The Performance Bottleneck Analysis section has more details on common tools and practices for finding performance bottlenecks on Windows 2000. This section discusses some common file server bottlenecks.

  • Network bandwidth. This is one of the common file server bottlenecks. Using the performance monitoring tools in Windows 2000, you can determine the level of network traffic on the server. If this level is close to the maximum your network can support (for example, 100 Mbps on a 100BaseT network), and CPU resources are still available, then this is an indication that the network could be a performance bottleneck. If this is the case, adding additional network adapters or upgrading to a network with higher bandwidth can increase file server throughput. If the server has multiple network adapters, be sure to partition the load so that it is balanced among all network adapters in the system.

  • Disk performance. File servers are heavily dependent on the disk subsystem. This is especially true for file servers that experience a large number of write operations. By using a RAID controller with a larger cache size, faster disks, and more disks, you can increase the rate at which the server processes I/O requests. For example, a server with a single disk may only be capable of handling 100 I/Os per second. A server with a RAID controller with a good cache and several disks can easily handle more than a thousand I/Os per second.

  • CPU performance. When the CPU is saturated before memory, disk, or network bottlenecks appear, adding additional CPUs to the server can improve performance. Adding CPUs with larger L2 caches can also improve performance. Figure 6 below shows the impact that additional CPUs can have on performance. The lower line in the chart represents a single-processor server, while the upper line represents a four-processor server. For both configurations, the CPU is the bottleneck. Once a server's CPU resources are 100 percent saturated and no additional CPUs can be added, upgrading to network adapters that support the advanced offloading features in the Windows 2000 TCP/IP stack can still provide additional throughput by freeing up CPU resources.
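A rough decision procedure for the three bottlenecks above can be sketched as follows; the thresholds and the sample counter values are illustrative assumptions, not Windows 2000 defaults:

```python
def likely_bottleneck(cpu_util: float, net_util: float, disk_queue: float) -> str:
    """Classify the limiting resource from Performance-Monitor-style counters.

    cpu_util and net_util are fractions of capacity in use (0.0-1.0);
    disk_queue is the average disk queue length. The thresholds are
    rules of thumb for illustration only.
    """
    if net_util > 0.9 and cpu_util < 0.9:
        return "network"   # NICs saturated while CPU headroom remains
    if disk_queue > 2.0:
        return "disk"      # a sustained queue suggests the disks can't keep up
    if cpu_util > 0.9:
        return "cpu"
    return "none apparent"

# Sample reading: network nearly saturated, CPU and disk comfortable.
print(likely_bottleneck(cpu_util=0.55, net_util=0.95, disk_queue=0.4))  # network
```

The point of the sketch is the order of the checks: rule out the network and disk subsystems before concluding that the CPU is the limit, since adding CPUs to a network-bound or disk-bound server buys nothing.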


Figure 6: Chart demonstrating the SMP file server scalability of Windows 2000 Server as measured by NetBench. A higher value demonstrates better file server throughput.

File Server Performance Results

Figures 7 through 10 below provide a comparison of the file server performance of Windows 2000 Server and Windows NT Server 4.0 in terms of throughput and response time. The results described in the figures have been generated using the NetBench 6.0 file server benchmark, and are representative only of the systems on which they have been tested. Different configurations can yield dramatically different results. For instance, the enterprise class configuration used Pentium III Xeon 500 MHz CPUs with a 512 KB L2 cache. Using the same CPUs with a 2 MB L2 cache would have produced significantly higher throughput. Or, on a similar note, using Pentium II 300 MHz CPUs would have yielded lower throughput. Changing other critical components such as the disk subsystem, network subsystem, or the amount of memory would have similarly affected throughput.


Figure 7: Chart showing the performance improvements of Windows 2000 Server on a typical small business file server configuration as measured by NetBench. A higher value demonstrates better file server throughput.


Figure 8: Chart showing the performance improvements of Windows 2000 Server on a typical enterprise file server configuration as measured by NetBench. A higher value demonstrates better file server throughput.


Figure 9: Chart showing the improved responsiveness of Windows 2000 compared to Windows NT Server 4.0 on a typical small business configuration. A lower response time denotes better performance.


Figure 10: Chart demonstrating the improved responsiveness of Windows 2000 compared to Windows NT Server 4.0 on a typical enterprise-level configuration. A lower response time denotes better performance.

Web Server Performance

Introduction

Web servers not only respond to user requests for static text and pictures but also serve as application platforms that interact dynamically with users. Perhaps the most important example of the use of dynamic content on the Web today is in e-commerce. In most cases, tuning a Web server for optimal performance requires balancing the need to send static data files quickly with the need to process dynamic content efficiently.

While this section provides some general tuning recommendations, it should be noted that every Web site differs in terms of content, file sizes, design (dynamic content in particular), request distribution, and many other factors. Because of these differences, some Web sites may respond well to these recommendations, while others may not. For this reason, we recommend that you thoroughly test your Web site using a load generation tool such as the WebLoad tool we used to obtain results for this section.

Web Server Benchmarks

The results in this paper were generated using two different performance tools to test Web server performance: the WebBench 3.0 benchmark from Ziff-Davis Benchmark Operations and WebLoad 3.5 from RadView Software, Inc. WebBench comes with several pre-defined tests that use a 60 MB file set to simulate different workloads including a static workload, an e-commerce workload, and dynamic workloads that use Internet Server API (ISAPI) and Common Gateway Interface (CGI) programs. All of these workloads model information that Ziff-Davis obtained from real Web sites. We used the default configurations for the WebBench tests except for the static tests of the enterprise-class server, for which we increased the number of threads on the clients from one to three in order to push the server to its peak performance.

Request rate (also called throughput) expressed in requests/second is the most commonly used Web server performance metric. It is the total of all client requests (such as HTTP GET requests) that a Web server responds to divided by the time period in seconds during which the requests are made.

Average response time is the average time a Web server takes to respond to client requests.

Error responses may occur during benchmarking. Make sure that the number of error responses is very small compared to the total number of requests made. We consider a test invalid when the error rate is greater than 2 percent. Usually a high error rate indicates a configuration problem or a very overloaded Web server.
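These three metrics are simple to derive from the raw counts a benchmark reports. The following sketch (with illustrative numbers, not measured results) shows the arithmetic, including the 2 percent validity check:

```python
# Compute common Web server benchmark metrics from raw counts.
# All numbers below are illustrative, not measured results.

def request_rate(total_requests, duration_seconds):
    """Throughput in requests/second."""
    return total_requests / duration_seconds

def error_rate(error_responses, total_requests):
    """Fraction of requests that received an error response."""
    return error_responses / total_requests

total_requests = 120_000      # all client requests during the run
duration = 600                # benchmark run length in seconds
errors = 350                  # error responses observed

rate = request_rate(total_requests, duration)
err = error_rate(errors, total_requests)

print(f"request rate: {rate:.1f} req/s")
print(f"error rate: {err:.2%}")
# A run is considered invalid when the error rate exceeds 2 percent.
print("valid run" if err <= 0.02 else "invalid run")
```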

WebLoad is a script-driven tool that is useful for capacity planning as well as benchmarking. Because it does not come with standard benchmark tests, we created the following four tests, each of which makes 80 percent static requests and 20 percent dynamic requests that each return 6,250 bytes of text (the average number of bytes returned by a static request):

  • A 60 MB file set with an ISAPI program; this is similar to a WebBench test except that the amount of data the ISAPI program returns is larger

  • A 240 MB file set with an ISAPI program

  • A 240 MB file set with a CGI program

  • A 240 MB file set with an Active Server Pages (ASP) program

For the performance results in this paper, we were careful to have the size of the returned dynamic content match the average size of a static request. This is because for any Web server benchmark, the measured request rate will vary inversely with the average number of bytes returned. That is, as the average response size decreases, the Web server request rate increases (up to a point). The converse is true as well. By making the dynamic response size equal to the static response size, the WebLoad test results can be used to make a fair comparison of the efficiency of using the various programming interfaces to generate dynamic content.

Both WebBench and WebLoad increase the load on a Web server by making requests from an increasing number of client systems or, optionally, by running multiple client threads on each system as a benchmark progresses. In our tests, the number of client systems or the total number of client threads making Web server requests does not represent the number of simultaneous users that a Web server can support because each request is made without including the time it would take for a real user to look at the returned data (what's known as think time). Thus, as with other benchmarks that do not pause to simulate a user reading the returned data, the performance you measure with WebBench and WebLoad may be different than what you will experience when you actually deploy a Web server.

Appendix 2: Test Lab Configuration shows the environment we used for the Web server tests reported in this paper. The benchmark configurations we used and detailed benchmark checklists can be found in Appendix 1: Benchmark Configurations and Checklists.

Small Business/Departmental Web Server Configuration

  • CPU: 500 MHz Pentium III

  • RAM: 256 MB

  • Network: 2 x 100Base-TX

  • Disk: At least 3 SCSI disks, one for Windows 2000, one for Web site files, and one for log files; if RAM is too small to hold all Web site files, either increase RAM or use at least a 4 disk RAID

Enterprise Web Server Configuration

  • CPU: 4 x 500 MHz Pentium III Xeon

  • RAM: 1 GB

  • Network: 2 Gigabit Ethernet

  • Disk: At least 3 SCSI disks, one for Windows 2000, one for Web site files, and one for log files; if RAM is too small to hold all Web site files, either increase RAM or use at least a 4 disk RAID

Web Server Configurations

For this paper, we tested two Web server configurations: a small business/departmental configuration (which we will simply refer to as small business) and an enterprise-class configuration (see the recommended configurations above). In order to simplify our testing procedures and to eliminate the disk as a possible bottleneck, we used the same RAID disk subsystem for both configurations. The Server Configurations section in Appendix 2: Test Lab Configuration gives the detailed configurations of the servers we used.

Web Server Hardware Selection and Performance Tuning Concepts

Web server hardware components affect performance significantly. By looking at where performance bottlenecks are, you can select the hardware that will increase your server's performance most effectively. For example, a higher speed CPU will boost performance on a Web server that has a CPU bottleneck, but it will do very little for a system whose performance is limited by the amount of RAM in it. The Performance Bottleneck Analysis section below provides some guidance on finding and fixing bottlenecks.

Just as in the case of a file server, you can tune four key hardware components to achieve optimal system performance in a Web server:

  • CPU configuration. Changing the CPU speed, the CPU cache size, and/or the number of CPUs in a Web server can improve performance significantly. Using a faster CPU or one with a larger cache will always improve the performance of a Web server with a CPU bottleneck. For example, on Web servers that respond to a significant number of dynamic requests and use encryption, increasing the number of CPUs, CPU speed, or CPU cache size can be a very effective way to increase performance. However, adding CPU resources to a CPU-limited Web server sometimes will not improve performance. If you do not see much performance improvement for a highly dynamic site when adding CPU resources, the problem may be with the design of the dynamic content or Web application. For static workloads, the CPU is unlikely to be the bottleneck; the usual culprit is the network.

  • Memory. Web server performance is very sensitive to the amount of memory in a server. For example, Windows 2000 is able to cache highly demanded files in physical memory. By caching static files in memory, the server is able to process requests more efficiently since disk I/O is eliminated (except for logging). For the best performance, a Web server should have enough memory to hold all static files. If this is not possible, the disk subsystem becomes more critical.

  • Disk subsystem. The disk subsystem has very little effect on Web server performance for static workloads that can be cached in memory. For Web servers that have less memory or that use disk I/O to generate dynamic content, a RAID subsystem with at least four disks can improve performance substantially over a configuration that uses single disks connected to a SCSI controller.

  • Network subsystem. The total bandwidth available to the Web server spread across all of the network adapters in the server sets the limit for the number of bits that a server can send or receive. For Web servers on the Internet, this is probably the most common bottleneck. Correspondingly, the network bandwidth limits the request rate a server can handle. In order to determine the performance capabilities of a Web server, you must make sure that there is enough network bandwidth so that the tests reach the server's peak performance. For instance, if a powerful server only has 100 megabits of bandwidth, this will most likely be the factor that prevents the server from performing better. If the CPU is the bottleneck and there is still some network bandwidth available, using a network adapter that supports the offloading capabilities in Windows 2000 can free up CPU cycles to process more requests.
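As a back-of-the-envelope check on the bandwidth ceiling described above, the best-case request rate a link can sustain is simply its bandwidth divided by the average response size. The sketch below ignores TCP/IP and HTTP header overhead, so real servers will measure somewhat less:

```python
# Upper bound on request rate imposed by network bandwidth alone.
# Ignores TCP/IP and HTTP header overhead, so real servers will see less.

def max_request_rate(bandwidth_bits_per_sec, avg_response_bytes):
    bytes_per_sec = bandwidth_bits_per_sec / 8
    return bytes_per_sec / avg_response_bytes

# A 100 Mbps link serving the 6,250-byte average response used in our tests:
print(max_request_rate(100e6, 6250))   # 2000.0 requests/second at best

# The same workload on Gigabit Ethernet:
print(max_request_rate(1e9, 6250))     # 20000.0 requests/second at best
```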

Resource Partitioning

Again, much as in the case of file servers, there are a number of components in Web servers whose performance you can enhance by means of resource partitioning (see the corresponding discussion of resource partitioning in the File Server Performance section):

  • PCI buses. Distribute the I/O among the different PCI buses in servers with multiple PCI buses. That is, network adapters and the RAID controller should be distributed as evenly as possible among the multiple buses, so that the total bandwidth for each PCI bus is as equal as possible. In general, it is a good idea to put a RAID controller on a bus that does not have a network adapter.

  • Network adapters. On servers with multiple network adapters, balance the client load among them. This is commonly accomplished by segmenting the network across the available network adapters.

Figure 11 below is a diagram of the test lab configuration we used for the testing related to this paper. It shows that we partitioned the client test systems into two networks. We were careful to assign our clients identifiers that would create a balanced load on both networks.


Figure 11: Test lab setup with two networks. The even and odd numbered clients are on different networks that are managed by means of the switch.

  • CPUs. Windows 2000 includes general algorithms for balancing the load among CPUs in a multiprocessor system. However, you can improve performance in some cases by binding interrupts for selected controllers to specific CPUs. Microsoft provides a utility to bind the interrupts from network adapters to specific CPUs. This utility assigns the interrupt work associated with a network adapter to one or more CPUs.

  • Data partitions. Distribute Web site files on different disks, and on disks that are not used for Web server log files, so that the I/O bandwidth for each disk is balanced.

General Web Server Configuration and Tuning Recommendations

Table 2 below shows general Web-server tuning recommendations. You can find more detailed information about how to do the tuning in Appendix 1: Benchmark Configurations and Checklists. As always, the specific hardware in your server may have other performance tuning requirements.

Table 2 Recommended tuning for Web server hardware.

CPU

  Small Business Web Server:
  · Using CPUs with larger L2 cache has a positive impact on performance.

  Enterprise Web Server:
  · Using CPUs with larger L2 cache has a positive impact on performance.
  · Use the interrupt-affinity filter to assign interrupts from each network adapter to two CPUs.

Network (both configurations)

  · Set network adapter receive buffers for optimal performance (see Appendix 4: Tuning for Gigabit Adapters for values by type of gigabit network adapter).
  · Balance client I/O among network adapters.
  · Enable TCP checksum offloading support if network adapters support it.

RAID/Disk (both configurations)

  · Put the RAID controller in the fastest bus slot available, ideally a bus without network adapters or other controllers that generate many interrupts.
  · Set cache memory: as large as possible.
  · Set cache write policy: write back.
  · Set cache read policy: read ahead.
  · Set stripe size: 8 KB/16 KB for RAID 5 and 64 KB/128 KB or the maximum supported for RAID 0; the optimal size may vary by controller.
  · Use multiple logical NTFS partitions. This can be set up using the RAID configuration software or logical disk manager in Windows 2000.
  · Use 16 KB allocation size for formatting the NTFS volumes (format <drive>: /fs:ntfs /A:16K).
  · Increase the NTFS log file size to 64 MB for large volumes (chkdsk /L:65536).
  · If using separate SCSI disks, format them with an NTFS file system with a 16 KB allocation size.

Windows 2000 (both configurations)

  · Set as an application server.
  · Set Internet Information Services (IIS) performance for 100,000+ hits per day.
  · Tune the registry (see below).
  · Remove script and script/execute permission on directories containing only static data. Keep these permissions on directories with ISAPI, ASP, and CGI programs.
  · Set Application Protection to Low (IIS Process) on directories with ISAPI, ASP, and CGI programs.
  · Turn off the Index This Resource property for the Web site.
  · Put the log file on a disk other than the one with the Web site files.

Important Web Server Registry Parameters

There are three Web server registry settings in

HKLM\System\CurrentControlSet\Services\InetInfo\Parameters

that can have a significant impact on Web server performance:

  • MemCacheSize is the size of the virtual memory that Internet Information Services (IIS) uses to cache static files. The larger this value, the more space IIS has to cache content for faster delivery. The IIS process manages this cache. Unless manually set, this is dynamically adjusted by IIS to 50 percent of the available physical memory.

    MemCacheSize

    Set to at least the size of the popular file-set, especially if physical memory is available.

    • Too small: Can result in high CPU usage, because popular files might not get cached.

    • Too large: Can cause excessive paging.

  • ObjectCacheTTL is the length of time a file remains in the IIS cache before being removed. This is measured from the time when the file was last requested. Each time a file is accessed, the time is reset. Data items that have not been accessed for the number of seconds defined by the ObjectCacheTTL setting are removed from the cache.

    ObjectCacheTTL

    The default is 60 seconds. Increase if physical memory is available.

    • Too small: Results in excessive CPU usage with files being removed from cache only to be added again.

    • Too large: Stale files keep other files from being cached.

  • MaxCachedFileSize is the size of the largest file that IIS will cache. To keep large files from filling up the IIS cache, IIS can be set to not cache large files. It is set to 256 KB by default.

    MaxCachedFileSize

    The default setting of 256 KB is usually adequate. Increase if physical memory is available.

    • Too small: Large files are not cached, even though space may be available.

    • Too large: Large files crowd out small files in the IIS cache.

The three settings are not completely independent. A large MemCacheSize is good, if you have the physical memory to hold it. It is recommended that the IIS cache size (MemCacheSize) be set to at least the size of the popular file-set—that is, it should be set to accommodate the files that are commonly requested. Since this setting affects the size of the virtual memory space, you may have to increase the page file size to accommodate it (a practical page file size limit is 1.5 GB).

Once IIS has reached the limit defined by MemCacheSize, it will not cache any more files until some files are removed. This is determined by the length of time a file can be in cache without being requested (the number of seconds defined by ObjectCacheTTL). Note that the IIS cache is not limited by physical memory, but a very large file cache can cause excessive paging, thus adversely affecting performance.
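The eviction behavior described above (a file stays cached until it has gone ObjectCacheTTL seconds without being requested) can be illustrated with a toy model; this is only a sketch of the policy, not how IIS is actually implemented:

```python
import time

# Toy model of TTL-based cache eviction: a file stays cached until it has
# gone `ttl` seconds without being requested. Each access resets the clock.
# This sketches the ObjectCacheTTL policy only; it is not IIS's implementation.

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}          # path -> (content, last_access_time)

    def get(self, path, now=None):
        now = time.time() if now is None else now
        entry = self.entries.get(path)
        if entry is None:
            return None
        content, last_access = entry
        if now - last_access > self.ttl:
            del self.entries[path]           # idle too long: evict, miss
            return None
        self.entries[path] = (content, now)  # reset the access clock
        return content

    def put(self, path, content, now=None):
        now = time.time() if now is None else now
        self.entries[path] = (content, now)

cache = TTLCache(ttl_seconds=60)
cache.put("/index.html", b"<html>...</html>", now=0)
print(cache.get("/index.html", now=30) is not None)   # True: still fresh
print(cache.get("/index.html", now=200) is not None)  # False: idle > 60 s
```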

The popular file set will consist of files of various sizes. Some of these may be quite large. If your overall file set has a significant number of large files, and the amount of physical memory isn't large enough to cache your popular file set, you should consider not caching the larger files. By setting MaxCachedFileSize to keep the larger files out of the IIS cache, you will free up space for smaller popular files.

If you understand the access frequency of commonly accessed files, you can more accurately determine the value of the time each file should live in the cache (determined by ObjectCacheTTL). If ObjectCacheTTL is set to cache files for too long, the server won't be able to remove files from it in a timely manner. If the value of the ObjectCacheTTL setting is too small, the server will use more CPU resources to add and remove these files from the cache.
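Taken together, the three settings live under the registry key given above. As a rough sketch, they could be applied with a .reg file like the following; the DWORD values here are purely illustrative (pick values based on your own popular file set and available memory), and the IIS services typically need to be restarted before the changes take effect:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\InetInfo\Parameters]
; IIS file cache size -- illustrative value only (hex 80 = 128;
; confirm the units for your IIS version)
"MemCacheSize"=dword:00000080
; Seconds a file may sit unrequested before eviction (hex 78 = 120)
"ObjectCacheTTL"=dword:00000078
; Largest file IIS will cache, in bytes (hex 80000 = 512 KB)
"MaxCachedFileSize"=dword:00080000
```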

Tuning the registry in these ways will not, however, improve performance in every environment. For example, as the graph below in Figure 12 shows, these settings have little impact on the performance of Windows 2000 Server when tested using the static WebBench workload.


Figure 12: Comparison of tuned and out-of-box (untuned) Web server request rate for the standard WebBench static-content-only test. Test was performed on a typical enterprise system configured following the hardware guidelines in this document. Higher values represent better Web server performance.

In other environments, the tuning recommendations outlined above can have a profound impact on the performance of a Web server. For instance, when running dynamic pages or Web applications, significant performance can be gained by running the application in the Web server process. However, this should only be done for applications that have been tested and deemed reliable. Figure 13 below shows the difference between running an ISAPI application in-process (the tuned configuration) and running one out-of-process (the default configuration).

In Process vs. Out of Process

In order to provide the highest levels of reliability, Web applications (ASP & ISAPI) running on Windows 2000 run in a separate process outside of the Web server process. While this is an option in Windows NT Server 4.0, it is not the default behavior as it is with Windows 2000 Server. As Figure 13 below demonstrates, running applications in a separate process has a significant impact on performance. To achieve maximum performance, we recommend that you run trusted applications in-process. However, this decreases the reliability of the Web server. For heavy-duty requests, the overhead of running out-of-process is only a small fraction of the request execution time, so you will find little benefit in moving to in-process.


Figure 13: Comparison of tuned and out-of-box (default) Web server request rate for the standard WebBench test using 80 percent static content, 20 percent dynamic content generated using an ISAPI program. The tuned system is running in-process while the out-of-box system is running out-of-process (see the In Process vs. Out of Process discussion above). Higher values represent better Web server performance.

Web Server Performance Bottlenecks

As you eliminate bottlenecks, you will see Web server performance improve. Benchmarks allow you to find bottlenecks with a focused effort in a controlled environment and to verify that you eliminated them. Benchmarks also let you see how much performance scales as you increase server resources, which will give you a good sense of how well you can expect your server to perform.

The following are common Web server bottlenecks:

  • Memory. When there is not enough memory to hold all of the static Web site files, a Web server will have a memory bottleneck. The only way to eliminate this bottleneck is to add more memory. If the server memory cannot be expanded, then memory will be the factor preventing the system from performing better. In this case, using a RAID controller with as many disks as possible (the more the better) will provide somewhat better performance. Clustering two or more servers together using the Network Load Balancing service in Windows 2000 can also help compensate for this problem.

  • Network bandwidth. The network becomes the bottleneck when the server has saturated the available network bandwidth and there are still other server resources available (memory, CPU, etc). To remedy this, try one of the following:

    1. Partition the network load so that it is balanced among all network adapters in the server. This is most easily done using virtual local area networks (VLANs) in a switch connected to the server.

    2. Use smart network adapters that take advantage of the advanced offloading features in Windows 2000.

    3. If there is more bandwidth available on the network, add more network adapters to make better use of the available bandwidth.

    4. Change to a higher speed network—for example, upgrade to gigabit networking from 100Base-TX.

  • Disk performance. The way to deal with a Web server whose performance is limited by its disks depends on whether it uses a RAID or not:

    1. If the server uses a RAID, add more disk drives, using faster drives if possible, and increase the memory cache size on the controller.

    2. If the server does not use a RAID, partition the data files among more disk drives, using multiple SCSI controllers if possible, and switch to higher speed disk drives.

  • CPU performance. When CPU usage reaches 100 percent before a memory, disk, or network bottleneck appears, you have several options:

    1. Switch to CPUs with a larger L2 cache.

    2. Switch to faster CPUs.

    3. Add CPUs. Figures 14 and 15 on the following page show the impact adding more CPUs has on performance. The two lower lines in each chart give the performance for single-processor systems while the two upper lines represent four-processor servers. In all cases, the CPUs were 100 percent saturated, so they became the ultimate bottleneck in the servers. Once this happens, after removing all other bottlenecks, there is nothing more that can be done to improve the performance of a single system.

    4. After you have done everything to eliminate a CPU bottleneck, you can still improve performance by clustering the Web server with another using the Network Load Balancing feature of Windows 2000.


Figure 14: Chart, based on WebBench results, showing how SMP enterprise systems compare to single-processor, small business systems in terms of their static request rates. Higher values represent better Web server performance.


Figure 15: Chart showing ASP performance scaling for SMP systems as compared to single-processor systems, as measured by WebLoad using a 240 MB file set with 80 percent static requests and 20 percent dynamic content requests that return 6,250 bytes, the same as the average static response size. Higher values represent better Web server performance.

Web Server Performance Results

We used both WebBench 3.0 and WebLoad 3.5 to generate the performance results you see in Figures 16, 17, and 18 below. These results are representative only of the systems on which the tests were run. Other systems and configurations can yield very different results. For example, an enterprise-class configuration that uses 500 MHz Pentium III Xeon CPUs with a 2 MB L2 cache instead of the 512 KB L2 cache that was used for these tests would yield significantly greater throughput. Of course, a server that uses 300 MHz Pentium II CPUs would produce substantially lower throughput. Changes to other system components such as the disk subsystem, the network subsystem, or the amount of memory can have similar effects on performance.


Figure 16: Chart showing the Web server performance improvements of Windows 2000 as compared to Windows NT Server 4.0. This test uses the standard WebBench mix of 80 percent static, 20 percent dynamic content. Higher values represent better Web server performance.


Figure 17: Enterprise Web server CGI, ASP, and ISAPI peak dynamic request rates for WebLoad using a 240 MB file set with 80 percent static requests and 20 percent dynamic content requests that return 6,250 bytes, the same as the average static response size. Higher values represent better Web server performance.


Figure 18: Small business Web server CGI, ASP, and ISAPI peak dynamic request rates for WebLoad using a 240 MB file set with 80 percent static requests and 20 percent dynamic content requests that return 6,250 bytes, the same as the average static response size. Higher values represent better Web server performance.

Networking Performance

Introduction

Just about every server in use today is somehow connected to a network. How a server makes use of the network is entirely dependent on the tasks that it performs. Web servers, for instance, use HTTP over TCP/IP to communicate with Web browsers, while application servers may use their own proprietary protocols over a network protocol such as TCP/IP to exchange data with clients. Since TCP/IP is rapidly becoming the de facto standard for networking, this section will focus on TCP/IP and related services.

Figure 19: How the file system uses the networking layer to communicate over a network.


There are many different benchmarks that can be used to measure network performance. For instance, benchmarks that are commonly used to measure Web server performance—like SPECWeb 96—can also be used to measure network performance. Although SPECWeb incurs some overhead for processing HTTP or Web server requests, its overhead in processing static content is relatively light. Therefore, SPECWeb 96 is a good measure of both network and Web server performance. The Test Transmission Control Protocol (TTCP), on the other hand, isolates the networking stack and can be used to measure the throughput efficiency of the network. Although not as realistic as SPECWeb, it is still a good tool for measuring the efficiency of the network subsystem.

In a server environment, the goal of the networking stack is to send as much data as possible at acceptable response times over the network while consuming the fewest system resources (CPU, memory, and so forth). Therefore, the metric most commonly used to assess network performance is throughput, which is usually measured in megabits per second (Mbps) or gigabits per second (Gbps).

More than a good TCP/IP stack is required for good performance. The CPU resources and network adapters and drivers have a huge impact on network throughput. For instance, new instructions added to the Intel Pentium III Xeon CPU improve the efficiency of processing the TCP checksum compared to the Pentium II Xeon processor. Advanced network adapters are capable of offloading some TCP operations, such as the checksum calculation, to the network adapter. By using advanced CPUs such as the Pentium III Xeon and a network adapter that supports the TCP offloading features in Windows 2000, you will experience better network throughput and performance, while consuming fewer system resources.

Windows 2000 & TTCP

There are different versions of the Test TCP (TTCP). Some are designed to work across multiple operating systems, while others are designed to take advantage of advanced features of a specific operating system. When testing the network performance of Windows 2000 using TTCP, it is critical that you use the Windows 2000 version of TTCP (NTttcp), which can be found in the Windows 2000 Driver Development Kit (DDK). The Windows 2000 version of TTCP takes advantage of some of the advanced networking features found only in Windows 2000.

NTttcp Checklist

  • Make sure you use the NT version of TTCP. This is available with the Windows 2000 Driver Development Kit (DDK).

  • Select network adapters that take advantage of the offloading feature in Windows 2000.

Testing Environment

Since other benchmarks that can be used to test network throughput, such as SPECWeb and NetBench, are discussed in other sections, this section will focus primarily on setting up and testing network performance using Windows 2000 TTCP. TTCP is the benchmark used by most networking vendors to test the performance of networking hardware and drivers. When running TTCP on Windows 2000, it is critical that you use the NTttcp version.

There are two types of environments that can be tested with TTCP:

  • Back-to-back. This tests the network throughput performance of two machines over a dedicated link. The hardware requirements for this test are two simple machines capable of driving the dedicated network between them.

  • One-to-many. This test shows how fast a server can send data to multiple clients (or receivers) across many links. Depending on the type of network—100 Mbps, gigabit, or asynchronous transfer mode (ATM)—a server capable of keeping up with the network is required. For instance, more processing power is required to drive a gigabit network than a 100 Mbps network. In addition, we recommend that you run your tests on a switched network.

Another thing to keep in mind when deploying and testing gigabit networks is the Maximum Transmission Unit (MTU) size. In general the greater the MTU size, the better the throughput.
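The benefit of a larger MTU is easy to quantify: the number of packets, and therefore the per-packet processing work, falls in proportion to the payload carried per frame. A quick sketch (ignoring TCP/IP header overhead):

```python
import math

# Packets required to move a given amount of data at a given MTU.
# Ignores TCP/IP header overhead, so real packet counts are slightly higher.

def packets_needed(transfer_bytes, mtu_bytes):
    return math.ceil(transfer_bytes / mtu_bytes)

transfer = 9_000_000  # a 9 MB transfer
print(packets_needed(transfer, 1500))  # 6000 packets with standard frames
print(packets_needed(transfer, 9000))  # 1000 packets with jumbo frames
```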

Hardware Considerations

  • CPU. The CPU resources and network adapter and driver have a huge impact on network throughput. Make sure that there are enough CPU resources to drive the network that you are testing or deploying. Also, processors with larger L2 cache will result in better network performance.

    Servers with Multiple PCI Buses

    It's not always best to distribute multiple network adapters across different PCI buses in servers with multiple PCI buses. When testing with multiple network adapters in such a server, try the different adapter placements to find the configuration that performs best.

  • Memory. Although memory doesn't have a tremendous impact on network throughput when using NTttcp, how the memory boards are populated does. To achieve maximum performance, all memory banks should be filled (which optimizes memory interleaving).

  • PCI bus. The standard 32-bit 33 MHz PCI bus is only capable of sustaining a theoretical maximum of approximately 1 Gbps. However, due to additional PCI bus transaction overhead, this limit is never achieved in real-world systems. Although a 32-bit 33 MHz PCI bus is capable of driving 100 Mbps networks, to achieve maximum throughput over gigabit networks, multiple PCI buses and/or multiple adapters may be required. That is, to drive a gigabit network, it is recommended that a server with at least one 64-bit 66 MHz PCI bus be used. To achieve even better network throughput, it is recommended that you use a system with multiple PCI buses.

  • Network adapters. To achieve maximum network throughput with Windows 2000, it is recommended that you select a network adapter that supports the advanced offloading features in the Windows 2000 TCP/IP stack. The advantage of using an adapter that supports Windows 2000 checksum offloading is illustrated in Figure 20 below. Gigabit adapters from vendors such as Alteon, Intel, and SysKonnect support these advanced features.

    TCP Checksum Offloading

    Advanced networking adapters, especially Gigabit Ethernet adapters, support the new TCP offloading features in Windows 2000. By offloading tasks such as the checksum to the network adapter, the server has more CPU resources available for performing other operations. Therefore, to achieve optimal network throughput on Windows 2000, we recommend that you use adapters that support Windows 2000 TCP offloading.

Figure 20: Chart demonstrating the advantage of using a network adapter that supports TCP checksum offloading in Windows 2000. NTttcp was used to test and report the network throughput and CPU usage of a gigabit network adapter. Higher throughput values represent better network performance, while lower CPU usage represents better efficiency.

General Tuning Recommendations

This section provides some general tuning recommendations that, when used properly, can provide better network throughput.

  • MTU size. A larger MTU, which can be configured depending on your network adapter, requires fewer packets to transfer the same amount of data. Therefore, the system does less work to send and receive the data. On Windows 2000, the MTU can be set to a standard frame of 1.5 KB or a jumbo frame of 9 KB (see Figure 21 below for a comparison).

    Figure 21: Chart showing the impact MTU size has on network throughput. NTttcp was used to measure the network throughput of gigabit network adapters (NICs) in different configurations. Higher values represent better network performance.
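    The packet-count savings from jumbo frames can be estimated with a little arithmetic (an illustrative Python sketch that ignores per-packet header overhead):

```python
# How many packets it takes to move the same payload at each MTU
# (header overhead ignored; this is a rough comparison).
def packets_needed(payload_bytes, mtu_bytes):
    return -(-payload_bytes // mtu_bytes)  # ceiling division

one_gb = 10**9
print(packets_needed(one_gb, 1500))  # 666667 with standard frames
print(packets_needed(one_gb, 9000))  # 111112 with jumbo frames, ~6x fewer
```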

  • TCP window size. This parameter determines the maximum TCP receive window size. The receive window specifies the number of bytes that a sender can transmit without receiving an acknowledgment. In general, larger receive windows improve performance over high-delay, high-bandwidth networks. However, if the window size is too large for an unreliable network, excessive retransmissions will result. For the greatest efficiency, the receive window should be an even multiple of the TCP Maximum Segment Size (MSS).

    Registry Setting TcpWindowSize

    Tcpip\Parameters\TcpWindowSize

    REG_DWORD – Number of Bytes

    0xFFFF
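    One common way to size the receive window is to start from the bandwidth-delay product and round up to a whole multiple of the MSS. The sketch below (Python, illustrative; the link speed and round-trip time are example assumptions, not measurements from this paper) shows the arithmetic:

```python
# Rough sizing for the TcpWindowSize registry value: start from the
# bandwidth-delay product, then round up to an even multiple of the MSS
# (1460 bytes with standard Ethernet frames). Figures are illustrative.
def tcp_window_bytes(bandwidth_bps, rtt_seconds, mss=1460):
    bdp = int(bandwidth_bps / 8 * rtt_seconds)  # bytes in flight per RTT
    return -(-bdp // mss) * mss                 # round up to a whole MSS

# Gigabit link with a 2 ms round-trip time:
print(tcp_window_bytes(10**9, 0.002))  # 251120 bytes (172 x 1460)
```

    Note that windows larger than 0xFFFF (65,535 bytes) also require TCP window scaling to be enabled (the Tcp1323Opts registry setting).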

  • Network adapter settings. Each adapter has different settings that can affect network throughput. See Appendix 4: Tuning for Gigabit Adapters for specific settings for different network adapters.

Potential Performance Bottlenecks

The most common bottlenecks that impede the performance of a network are the CPU and the available network bandwidth. Assuming that there is plenty of network bandwidth available, network throughput will increase as more network adapters and CPUs are added (as Figure 21 above demonstrates).

  • Number of network adapters. In the back-to-back test, the processor on the receiver is usually the bottleneck, so adding adapters may not help. However, in the one-to-many test, adding network adapters to the server can significantly increase performance if CPU resources are available. This resembles a production environment in which the network, not the server CPU, is the bottleneck; in that case, adding network adapters can improve network throughput.

  • CPU resources. Processing network packets at gigabit speeds can be CPU resource–intensive. Each packet that a server sends or receives requires system resources to process. When transferring data at gigabit speeds, the system must process approximately 85,000 TCP/IP packets per second.
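The packet-rate figure above follows from the link speed and frame size (a rough Python sketch; with full-sized standard frames the exact quotient is closer to 83,000):

```python
# Source of the "approximately 85,000 packets per second" figure: a
# saturated 1 Gbps link carries roughly this many 1,500-byte packets.
def packets_per_second(link_bps, mtu_bytes=1500):
    return link_bps // (mtu_bytes * 8)

print(packets_per_second(10**9))  # 83333
```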

Performance Bottleneck Analysis

Introduction

The process of finding performance bottlenecks is an iterative one: whenever you eliminate one bottleneck, another will potentially arise. For example, suppose you are pedaling as fast as you can on a bike in low gear, reaching your peak speed of 5 miles per hour (MPH). You change gears so that you can go faster. Eventually you'll reach your maximum pedaling strength and a maximum speed of, for instance, 10 MPH. Although you get more performance (speed) out of the bike when you eliminate one bottleneck (gear ratio), you soon run into another (pedaling strength). This is analogous to what can happen on a server under a severe load. You may add more disks to the server, resulting in higher throughput, only to find that the network is now saturated, producing a new bottleneck. In that case you would need to add more network adapters or upgrade your networking infrastructure. At some point, you will reach the ultimate performance bottleneck.

Assuming that the application is capable of making full use of the hardware resources available to it, the ultimate performance bottleneck is one that you cannot remove because of a resource limitation in your hardware. For example, if you have a CPU bottleneck on a server, and additional CPUs or faster CPUs cannot be added, then the CPUs would constitute the ultimate performance bottleneck for the server. In general, if you have tuned your server optimally, and have reached an ultimate performance bottleneck, this is the best performance that the system is capable of delivering.

Windows 2000 provides a built-in tool called System Monitor (also known as the performance monitor) that helps you look at detailed aspects of your server's performance. This section discusses the use of System Monitor in conjunction with benchmarks to provide an overview of the techniques you can use to measure server performance, to analyze those measurements, and to eliminate performance bottlenecks.

System Monitor Concepts

System Monitor provides both real-time display and file-logging of hardware usage and system-service activity on local or remote computers. You can use System Monitor to plot graphs and histograms and to view reports of current or archived data collected by counter logs. This section will show graphs from System Monitor log files made from tests done for this document.

To use System Monitor, you need to specify the following:

  • Type of data. To select the data to be collected, you specify performance objects (a logical collection of counters that is associated with a resource or service that can be monitored), performance counters (a value that expresses a specific aspect of performance for an object), and object instances (one or more instances of a performance object; for example, a four-processor system would have five CPU object instances—one for each CPU and one for the overall total). Some objects provide data on system resources (such as memory); others provide data on the operation of applications (for example, system services or applications running on your computer).
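    Counter selections are written as paths that combine these three pieces. The small sketch below is our own illustration (the computer name WEBSRV1 and the regular expression are ours, not part of System Monitor):

```python
import re

# System Monitor counter paths take the form
# \\Computer\Object(Instance)\Counter; this pulls the pieces apart.
COUNTER_PATH = re.compile(
    r"(?:\\\\(?P<computer>[^\\]+))?"   # optional \\Computer
    r"\\(?P<object>[^\\(]+)"           # performance object
    r"(?:\((?P<instance>[^)]*)\))?"    # optional (Instance)
    r"\\(?P<counter>.+)"               # counter name
)

m = COUNTER_PATH.match(r"\\WEBSRV1\Processor(_Total)\% Processor Time")
print(m.group("object"), m.group("instance"), m.group("counter"))
# Processor _Total % Processor Time
```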

  • Source of data. System Monitor can collect data from a local computer or from other computers on the network (you must be logged on as an Administrator to take advantage of this function). In addition, you can include real-time data or data collected previously using counter logs.

  • Sampling parameters. System Monitor supports manual, on-demand sampling or automatic sampling based on a time interval you specify. When viewing logged data, you can also choose starting and stopping times so that you can view data spanning a specific time range.

Configuring System Monitor for Benchmarks

To use System Monitor for bottleneck analysis, we recommend creating log files with the performance counters specific to the benchmarks and the potential bottlenecks. If you want to minimize the impact on the server's performance caused by logging performance counters to a file, the log file should be on a disk connected to a controller that is not the one to which the data disks are connected. For the same reason, we suggest logging only the performance counters indicated below for each potential bottleneck and using the logging interval default of 15 seconds between samples.

Table 3 below shows the minimum set of performance counters to log in order to find potential bottlenecks in any type of server. (The format for performance counters is <object>\ <performance counter>.)

Table 3 Performance counters that serve to detect basic system bottlenecks for any type of server.

Potential Bottleneck

Performance Counter

Definition

Memory

Memory\ Available Bytes

The amount of physical memory in bytes available to processes running on the computer. This counter displays the last observed value only; it is not an average.

 

Memory\ Pages/sec

The number of pages read from or written to disk to resolve hard page faults. (Hard page faults occur when a process requires code or data that is not in its working set or elsewhere in physical memory, and must be retrieved from disk). This counter serves as a primary indicator of the kinds of faults that cause system-wide delays.

 

Cache\ Data Maps/sec

The frequency with which a file system such as NTFS maps a page of a file into the file system cache in order to read the page.

Network
(Capture for each network adapter instance. You must install the Network Monitor Driver in order to collect performance data using the Network Segment object counters.)

Network Interface\ Bytes Total/sec

The rate at which bytes are sent and received on the interface, including framing characters.

 

Network Interface\ Bytes Sent/sec

The rate at which bytes are sent on the interface, including framing characters.

 

Network Interface\ Bytes Received/sec

The rate at which bytes are received on the interface, including framing characters.

 

Network Segment\ % Network Utilization

Percentage of network bandwidth in use on a network segment.

Disk

PhysicalDisk\ % Disk Time

The percentage of elapsed time that the selected disk drive is busy servicing read or write requests.

 

PhysicalDisk\ % Idle Time

Reports the percentage of time during the sample interval that the disk was idle.

 

PhysicalDisk\ Disk Reads/sec

The rate of read operations on the disk.

 

PhysicalDisk\ Disk Writes/sec

The rate of write operations on the disk.

 

PhysicalDisk\ Avg. Disk Queue Length

The average number of both read and write requests that were queued for the selected disk during the sample interval.

CPU

Processor\ Interrupts/sec

The average number of hardware interrupts the processor is receiving and servicing in each second. It does not include deferred procedure calls (DPCs), which are counted separately. This value is an indirect indicator of the activity of devices that generate interrupts, such as the system clock, the mouse, disk drivers, data communication lines, network interface cards, and other peripheral devices. These devices normally interrupt the processor when they have completed a task or require attention. Normal thread execution is suspended during interrupts. Most system clocks interrupt the processor every 10 milliseconds, creating a background of interrupt activity. This counter displays the difference between the values observed in the last two samples, divided by the duration of the sample interval.

 

Processor\ % Processor Time
(Use the _Total instance to track performance of all processors in a multiprocessor system.)

The percentage of time that the processor is executing a non-idle thread. This counter was designed as a primary indicator of processor activity. It is calculated by measuring the time that the processor spends executing the thread of the idle process in each sample interval, and subtracting that value from 100 percent. (Each processor has an idle thread, which consumes cycles when no other threads are ready to run.) It can be viewed as the percentage of the sample interval spent doing useful work.

 

System\ % Privileged Time
(Use the _Total instance to track performance of all processors in a multiprocessor system.)

The percentage of non-idle processor time spent in privileged mode. (Privileged mode is a processing mode designed for operating system components and hardware-manipulating drivers. It allows direct access to hardware and all memory. The alternative, user mode, is a restricted processing mode designed for applications, environment subsystems, and integral subsystems. The operating system switches application threads to privileged mode to access operating system services.) % Privileged Time includes time servicing interrupts and DPCs. A high rate of privileged time might be attributable to a large number of interrupts generated by a failing device. This counter displays the average busy time as a percentage of the sample time.

 

System\ % User Time
(Use the _Total instance to track performance of all processors in a multiprocessor system.)

The percentage of non-idle processor time spent in user mode. (User mode is a restricted processing mode designed for applications, environment subsystems, and integral subsystems. The alternative, privileged mode, is designed for operating system components and allows direct access to hardware and all memory. The operating system switches application threads to privileged mode to access operating system services.) This counter displays the average busy time as a percentage of the sample time.

 

System\ Processor Queue Length

The number of threads in the processor queue. There is a single queue for processor time even on computers with multiple processors. Unlike the disk counters, this counter counts ready threads only, not threads that are running. A sustained processor queue of greater than two threads generally indicates processor congestion. This counter displays the last observed value only; it is not an average.

 

System\ System Calls/sec

The combined rate of calls to Windows 2000 system service routines by all processes running on the computer. These routines perform all of the basic scheduling and synchronization of activities on the computer, and provide access to non-graphic devices, memory management, and name space management. This counter displays the difference between the values observed in the last two samples, divided by the duration of the sample interval.

System

Context Switches/sec

The combined rate at which all processors on the computer are switched from one thread to another. Context switches occur when a running thread voluntarily relinquishes the processor, is preempted by a higher priority ready thread, or switches between user-mode and privileged (kernel) mode to use an Executive or subsystem service. It is the sum of Thread: Context Switches/sec for all threads running on all processors in the computer and is measured in numbers of switches. There are context switch counters on the System and Thread objects. This counter displays the difference between the values observed in the last two samples, divided by the duration of the sample interval.

Web Server Performance Counters

For Web servers, we recommend that you log the following performance counters shown in Table 4 below in addition to the general performance counters.

Table 4 Performance counters that serve to detect Web server bottlenecks.

Performance Object

Performance Counter

Definition

Process (inetinfo)

% Processor Time

The percentage of elapsed time that all of the threads of this process used the processor to execute instructions. An instruction is the basic unit of execution in a computer; a thread is the object that executes instructions; and a process is the object created when a program is run. Code executed to handle some hardware interrupts and trap conditions are included in this count. On multi-processor machines the maximum value of the counter is the number of processors multiplied by 100 percent.

 

% Privileged Time

The percentage of elapsed time that the threads of the process have spent executing code in privileged mode. When a Windows 2000 system service is called, the service will often run in privileged mode to gain access to system-private data. Such data is protected from access by threads executing in user mode. Calls to the system can be explicit or implicit, such as page faults or interrupts. Unlike some early operating systems, Windows 2000 uses process boundaries for subsystem protection in addition to the traditional protection of user and privileged modes. These subsystem processes provide additional protection. Therefore, some work done by Windows 2000 on behalf of your application might appear in other subsystem processes in addition to the privileged time in your process.

 

% User Time

The percentage of elapsed time that the threads of the process have spent executing code in user mode. Applications, environment subsystems, and integral subsystems execute in user mode. Executing code in user mode cannot damage the integrity of the Windows NT Executive, Kernel, and device drivers. Unlike some early operating systems, Windows 2000 uses process boundaries for subsystem protection in addition to the traditional protection of user and privileged modes. These subsystem processes provide additional protection. Therefore, some work done by Windows 2000 on behalf of your application might appear in other subsystem processes in addition to the privileged time in your process.

 

Thread Count

The number of threads currently active in the process. An instruction is the basic unit of execution in a processor, and a thread is the object that executes instructions. Every running process has at least one thread.

 

Page File Bytes

The current number of bytes the process has used in the paging file(s). Paging files are used to store pages of memory used by the process that are not contained in other files. Paging files are shared by all processes, and lack of space in paging files can prevent other processes from allocating memory.

WebService

Bytes Total/sec

The sum of Bytes Sent/sec and Bytes Received/sec. This is the total rate of bytes transferred by the Web service.

 

Get Requests/sec

The rate at which HTTP requests using the GET method are made. Get requests are generally used for basic file retrievals or image maps, though they can be used with forms.

 

CGI Requests/sec
(only if you use CGI programs)

The rate of CGI requests that are simultaneously being processed by the Web service.

 

ISAPI Extension Requests/sec
(only if you use ISAPI extensions)

The rate of ISAPI Extension requests that are simultaneously being processed by the Web service.

Active Server Pages
(only if you use ASP programs)

Requests/sec

The number of requests executed per second.

Internet Information Services Global

Current Files Cached

Current number of files whose content is in the cache for World Wide Web (WWW) and File Transfer Protocol (FTP) services.

 

File Cache Hits %

The ratio of file cache hits to total cache requests. A file cache hit is a successful lookup in the system's file cache.

File Server Performance Counters

For file servers, we recommend that you log the additional performance counters shown in Table 5 below.

Table 5 Performance counters that serve to detect file server bottlenecks.

Performance Object

Performance Counter

Definition

Server

Bytes Transmitted/sec

The number of bytes the server has sent on the network. Indicates how busy the server is.

 

Bytes Received/sec

The number of bytes the server has received from the network. Indicates how busy the server is.

System

File Control Operations/sec

The combined rate of file system operations that are neither read operations nor write operations, such as file system control requests and requests for information about device characteristics or status. This is the inverse of System: File Data Operations/sec and is measured in number of operations per second. This counter displays the difference between the values observed in the last two samples, divided by the duration of the sample interval.

Server Work Queues

Queue Length

The current length of the server work queue for the CPU. A sustained queue length greater than four might indicate processor congestion. This is an instantaneous count, not an average over time.

Memory

Cache Bytes

The number of bytes currently being used by the file system cache. The file system cache is an area of physical memory that stores recently used pages of data for applications. Windows 2000 continually adjusts the size of the cache, making it as large as it can while still preserving the minimum required number of available bytes for processes. This counter displays the last observed value only; it is not an average.

 

Page Faults/sec

The overall rate at which faulted pages are handled by the processor. It is measured in numbers of pages faulted per second. A page fault occurs when a process requires code or data that is not in its working set (its space in physical memory). This counter includes both hard faults (those that require disk access) and soft faults (where the faulted page is found elsewhere in physical memory). Most processors can handle large numbers of soft faults without consequence. However, hard faults can cause significant delays. This counter displays the difference between the values observed in the last two samples, divided by the duration of the sample interval.

 

Transition Faults/sec

Transition Faults/sec is the number of page faults resolved by recovering pages that were on the modified page list, on the standby list, or being written to disk at the time of the page fault. The pages were recovered without additional disk activity. Transition faults are counted in numbers of faults, without regard for the number of pages faulted in each operation. This counter displays the difference between the values observed in the last two samples, divided by the duration of the sample interval.

How to Find Performance Bottlenecks

The process for finding performance bottlenecks can be distilled into five basic steps:

  1. Put a repeatable load on a server.

  2. Measure how system resources perform.

  3. Analyze those measurements.

  4. Modify server hardware and software to eliminate the bottlenecks.

  5. Repeat the process until you have reached the ultimate performance bottleneck.

To find bottlenecks, you can use an actual load on your server, capacity planning tools, or a benchmark that simulates the load on your server. Benchmarks and capacity planning tools allow you to find bottlenecks with a focused effort in a controlled environment and to verify that you eliminated them. They also let you see how much performance scales as you increase server resources, which will give you a good sense of the highest performance you can expect from your server after you deploy it. If you use actual loads for server tuning, you will likely find that:

  • Loads may not be repeatable because they depend on which users were doing what at a particular time of day.

  • Some tuning changes will disrupt server use and upset users.

Strategies for Testing and Tuning Servers

To make the task of testing and tuning a server more efficient and orderly, we suggest looking for bottlenecks in the following sequence:

  1. CPU bottlenecks

  2. Memory bottlenecks

  3. Disk bottlenecks

  4. Network bottlenecks

We also recommend the following strategies in conducting your server testing and tuning:

  • Make one change at a time, such as adding hardware resources or changing an operating system setting. In some cases, a problem that appears to relate to a single component may be the result of bottlenecks involving several components. For this reason, it is important to address problems individually. Making multiple changes simultaneously may make it impossible to assess the impact of each individual change.

  • Repeat monitoring after every change. This is important for understanding the effect of the change and to determine whether additional changes are required. Proceed methodically, making one change to the identified resource at a time and then testing the effects of the changes on performance. Because tuning changes can affect other resources, it's important to keep records of the changes you make and to re-monitor after you make a change.

  • In addition to monitoring, review event logs, because some performance problems generate output you can display in the Event Viewer.

  • Because capturing or displaying performance counters will increase use of system resources, thereby reducing the measured benchmark performance, we suggest that you first run a benchmark to get a performance curve without gathering performance counters and then run the same benchmark while running System Monitor. You can usually shorten this second benchmark run by collecting performance counters only during the parts of the test around the peak performance point. Be sure to get data for a few points below the peak as well as a point or two after it.

CPU Bottleneck Analysis and Removal

Removing CPU Bottlenecks

  • Switch to CPUs with a larger L2 cache.

  • Switch to faster CPUs.

  • For multiprocessor systems, add more CPUs.

  • On multiprocessor computers, manage the processor affinity with respect to process threads and interrupts.

  • After you have done everything to eliminate a CPU bottleneck, you can still improve performance by clustering the server with another.

Figure 22 below demonstrates how adding additional hardware resources can affect Web server performance. Similar results can be seen for other workloads such as those generated by file servers, networking, or applications.

Figure 22: Web server static request performance for small business- and enterprise-class Web servers.

Figure 23 below shows a System Monitor graph taken from a log file made during a WebBench static content test of a small business-class system. The vertical lines occur at the end of each WebBench test mix as the benchmark ends one mix and prepares to start the next. The CPU usage percentage, shown in red, peaks at 98.5 percent and stays level for the last eight test mixes as more clients are added (the plateaus indicate the test mixes). At the same time, you can see that there are still over 100 MB of memory available (the maroon line). The two network interfaces each reach a little over 8.6 MB/second of I/O. This is less than 70 Mbps, indicating that the networks are reasonably loaded but not the cause of the performance bottleneck. We have left the disks off this System Monitor graph for clarity and because there is very little disk activity due to the data files being cached in memory.
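As a quick sanity check on numbers like these, System Monitor's per-second byte counters convert to line-rate megabits by multiplying by eight (a trivial Python sketch):

```python
# System Monitor reports network I/O in bytes/sec; line rates are quoted
# in bits/sec, so multiply by 8 to compare against link capacity.
def mb_per_sec_to_mbps(mb_per_sec):
    return mb_per_sec * 8

print(mb_per_sec_to_mbps(8.6))  # 68.8 -- under 70 Mbps per interface
```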

Figure 23: System Monitor graph showing a CPU bottleneck for a one-processor system. Notice that the top line (in pink here), which represents the CPU usage, is at 100 percent.

Table 6 below shows the thresholds of the essential performance counters that indicate a CPU bottleneck. It also gives some suggestions for eliminating CPU bottlenecks.

Table 6 CPU performance counter thresholds that indicate a bottleneck.

Resource

Object\ Counter

Suggested Threshold

Comments

Processor

Processor\ % Processor Time

95%

Upgrade to a processor with a larger L2 cache, a faster processor, or install an additional processor.

Processor

Processor\ Interrupts/sec

Depends on processor.

A dramatic increase in this counter value without a corresponding increase in system activity indicates a hardware problem. Identify the network adapter causing the interrupts. Use the affinity tool to balance interrupts in a multiprocessor system.

Processor

Processor\ % Interrupt Time

Depends on processor.

An indirect indicator of the activity of disk drivers, network adapters, and other devices that generate interrupts.

Server

Server Work Queues\ Queue Length

4

Tracks the current length of the server work queue for the computer. If the value reaches this threshold, there may be a processor bottleneck. This is an instantaneous counter; observe its value over several intervals.

Multiple Processors

System\ Processor Queue Length

2

This is an instantaneous counter; observe its value over several intervals. A queue of two or more items indicates a bottleneck. If more than a few program processes are contending for most of the processor's time, installing a faster processor or one with a larger L2 cache will improve throughput. An additional processor can help if you are running multithreaded processes, but be aware that scaling to additional processors may have limited benefits.

Managing Processor Affinity on Multiprocessor Systems

If you want to assign a particular process or program to a single processor to improve its performance at the expense of other processes, right-click the process in Task Manager and then click Set Affinity. This option is available only on multiprocessor systems.

Controlling processor affinity can improve performance by reducing the number of processor cache-flushes as threads move from one processor to another. This might be a good option for dedicated file servers. However, be aware that dedicating a program to a particular processor may not allow other program threads to migrate to the least-busy processor.

You may also want to control processor affinity for interrupts generated by disk or network adapters. The IntFiltr tool, discussed earlier in this paper, enables you to manage interrupts in this way.

Memory Bottleneck Analysis and Removal

Removing Memory Bottlenecks

  • Increase physical memory above the minimum required.

  • Create multiple paging files.

  • Determine the correct size for the paging file.

  • Ensure that memory settings are properly configured.

Memory can become a bottleneck when there is not enough physical RAM to cache files and data such as HTML files on a Web server or data on a database server. Figure 24 below shows the performance of the same enterprise-class Web server configured first with 128 MB, then 256 MB, and then 2 GB of memory. Notice that as physical RAM is increased, memory becomes less of a performance bottleneck. In this case, the memory is eliminated as a performance bottleneck by adding enough physical RAM to cache all of the static HTML files.

Figure 24: Memory limits Web server performance for these ASP tests.

Figure 25 below shows a System Monitor graph taken from a log file made during a WebLoad ASP test of an enterprise-class system with 128 MB of memory. The vertical lines occur at the end of each WebLoad test mix as the benchmark ends one mix and prepares to start the next. CPU usage and network throughput climb throughout the test even though the number of GET requests/second flattens out quickly. At the same time, you can see that the % Disk Read Time for the G: drive reaches 100 percent very quickly and stays there throughout the test. This could indicate that the disk subsystem is a performance bottleneck. However, by looking at the Memory\ Pages/Sec performance counter, you can see that there is not enough memory to hold the entire set of Web site files in memory and the system is forced to page to or from disk at over 1500 pages/second. Notice that the Pages/Sec curve matches the GET requests/second curve, indicating that the two are tied together.

Figure 25: Memory bottleneck in an enterprise-class server with 128 MB memory when accessing a 240 MB Web site with 20 percent of the access through ASP programs.

Figure 26 below is a System Monitor graph from the same system with 256 MB of memory. It shows that memory is still a bottleneck, but much less so than in the 128 MB memory configuration depicted above. More of the site's content is cached in memory, so more requests are served from memory. In addition, notice that Get Requests/Second has also gone up.

Figure 26: Memory bottleneck on an enterprise-class Web server with 256 MB of memory. This is based on a run of the same test as shown in Figure 25.

Table 7 below lists the performance counters you should monitor in order to find any memory-related bottlenecks.

Table 7 Performance counters to watch for memory-related bottlenecks.

Resource | Object\ Counter | Suggested Threshold | Comments
Memory | Memory\ Available Bytes | Less than 4 MB | Research memory usage and add memory if needed.
Memory | Memory\ Pages/sec | 20 | Research paging activity.
Server | Server\ Pool Paged Peak | Amount of physical RAM | This value is an indicator of the maximum paging file size and the amount of physical memory.
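The thresholds in Table 7 can be turned into a simple automated check against sampled counter values. The following is a minimal Python sketch, not part of the original tooling; it assumes the counter samples have already been exported from a System Monitor log, and the sample numbers are hypothetical.

```python
# Sketch: apply the Table 7 thresholds to sampled counter values.  The
# counter names follow System Monitor's Object\ Counter convention; the
# sample numbers below are hypothetical.

MEMORY_THRESHOLDS = {
    # counter: (threshold, direction that indicates trouble)
    "Memory\\Available Bytes": (4 * 1024 * 1024, "below"),
    "Memory\\Pages/sec": (20, "above"),
}

def memory_bottleneck_suspected(samples):
    """Return counters whose average sample crosses its threshold.

    samples: dict mapping counter name -> list of numeric samples
    (for example, exported from a System Monitor log).
    """
    flagged = []
    for counter, (threshold, direction) in MEMORY_THRESHOLDS.items():
        values = samples.get(counter, [])
        if not values:
            continue
        average = sum(values) / len(values)
        if (direction == "below" and average < threshold) or \
           (direction == "above" and average > threshold):
            flagged.append(counter)
    return flagged

# A server paging heavily, like the 128 MB run in Figure 25:
samples = {
    "Memory\\Available Bytes": [3_000_000, 2_500_000, 2_800_000],
    "Memory\\Pages/sec": [1500, 1600, 1450],
}
print(memory_bottleneck_suspected(samples))
```

Both counters are flagged for this run, matching the diagnosis in the text: low available memory combined with heavy paging points at RAM, not the disk, as the real bottleneck.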

Disk Bottleneck Analysis and Removal

Removing Disk Bottlenecks

  • If the server uses a RAID, add more disk drives, using faster drives if possible, and increase the memory cache size on the controller.

  • If the server does not use a RAID, switch to higher-speed disk drives.

  • Use Disk Defragmenter to consolidate fragmented files and optimize file placement on the disk.

Disk-usage statistics help you balance the workload of network servers. System Monitor provides physical disk counters for troubleshooting, capacity planning, and for measuring activity on a physical volume. When testing disk performance, log performance data to another disk or computer so that it does not interfere with the disk you are testing.

To create a disk bottleneck, we configured the enterprise-class server with a three-disk RAID. Figure 27 below shows the file server performance we obtained for the three-disk RAID and 10-disk RAID configurations. Clearly, the disk contention was significantly reduced when we increased the RAID to 10 disks.


Figure 27: Disk bottleneck shows up in an enterprise-class file server with fewer disks in a RAID (CSC = Offline Files).

Figure 28 (below), a chart from the System Monitor, shows that the % Disk Time for the RAID that contained the shared files was pegged at 100 percent (except for the downward spikes between NetBench test mixes). You can see from the graph that the CPU and the network are not being used to full capacity; therefore, they are not bottlenecks.


Figure 28: Disk bottleneck in an enterprise-class file server

Determining Workload Balance

To balance loads on network servers, you need to know how busy the server disk drives are. Use the Physical Disk\ % Disk Time counter, which indicates the percentage of time a drive is active. If % Disk Time is high (over 90 percent), check the Physical Disk\ Current Disk Queue Length counter to see how many system requests are waiting for disk access. The number of waiting I/O requests should be sustained at no more than one and a half to two times the number of spindles making up the physical disk.

Most disks have one spindle, although RAID devices usually have more. A hardware RAID device appears as one physical disk in System Monitor; RAID devices created through software appear as multiple drives (instances). You can either monitor the Physical Disk counters for each physical drive (other than RAID), or you can use the _Total instance to monitor data for all the computer's drives.

Use the values of the Current Disk Queue Length and % Disk Time counters to detect bottlenecks with the disk subsystem. If Current Disk Queue Length and % Disk Time values are consistently high, consider upgrading the disk drive or moving some files to an additional disk or server. Table 8 below lists the performance counters you can monitor in order to detect a disk bottleneck.

Table 8 Performance counters to monitor for a disk bottleneck.

Resource | Object\ Counter | Suggested Threshold | Comments
Disk | PhysicalDisk\ % Disk Time | 90% | Add more disk drives and partition the files among all of the drives.
Disk | PhysicalDisk\ Disk Reads/sec, PhysicalDisk\ Disk Writes/sec | Depends on manufacturer's specifications | Check the specified transfer rate for your disks to verify that this rate doesn't exceed the specifications. In general, Ultra Wide SCSI disks can handle 50 I/O operations per second.
Disk | PhysicalDisk\ Current Disk Queue Length | Number of spindles plus 2 | This is an instantaneous counter; observe its value over several intervals. For an average over time, use PhysicalDisk\ Avg. Disk Queue Length.
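The queue-length rule of thumb described above can be stated as a one-line check. The Python sketch below is only an illustration of that rule; the spindle counts and queue samples are hypothetical.

```python
# Sketch: the queue-length rule of thumb -- sustained outstanding I/O
# should stay at roughly 1.5 to 2 requests per spindle.  Spindle counts
# and queue samples below are hypothetical.

def disk_queue_ok(queue_samples, spindles, factor=2.0):
    """True if the average Current Disk Queue Length stays within
    `factor` outstanding requests per spindle."""
    average = sum(queue_samples) / len(queue_samples)
    return average <= factor * spindles

# A 10-disk hardware RAID appears as a single physical disk in System
# Monitor, so the same samples are judged against the spindle count:
print(disk_queue_ok([12, 15, 14, 13], spindles=10))  # -> True (13.5 <= 20)
print(disk_queue_ok([12, 15, 14, 13], spindles=1))   # -> False (13.5 > 2)
```

The same measured queue length is acceptable for a ten-spindle RAID but signals a bottleneck on a single disk, which is why knowing the spindle count behind each "physical disk" instance matters.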

Network Bottleneck Analysis and Removal

We created a network bottleneck in an enterprise-class Web server by replacing its two Gigabit Ethernet adapters with a single 100Base-TX network adapter. Figure 29 below compares the performance of the enterprise-class server with and without this network bottleneck.


Figure 29: The effect a network bottleneck can have on Web server performance.

Figure 30 (below) is a snapshot of key performance counters taken during the testing of the enterprise-class system constrained by a network bottleneck. The bottleneck was created by limiting the available bandwidth to 100 Mbps. The Total Bytes/Second exceeds the 100 Mbps bandwidth of the one network adapter and CPU usage does not change as the load on the Web server is increased, thus indicating a bottleneck.


Figure 30: Network bottleneck on an enterprise-class Web server with a single 100Base-TX network adapter.

Monitoring Overall Network Traffic

The performance counters that you can use to monitor your system for network bottlenecks are listed below in Table 9.

If network traffic exceeds local area network (LAN) capacity, performance typically suffers across the network. To prevent this situation, it is important to monitor network-wide traffic levels, particularly on larger networks with bridges and routers, using the Network Segment object. When monitoring network traffic, the three Network Segment object counters are of special interest. To analyze the statistics for your network segment, install Network Monitor.

Network monitoring typically consists of observing server resource usage and measuring overall network traffic. With System Monitor you can handle both of these activities, although for in-depth traffic analysis, you should use Network Monitor.

Table 9 Performance counters to monitor for network bottlenecks.

Resource | Object\ Counter | Suggested Threshold | Comments
Network | Network Segment\ % Net Utilization | Depends on type of network | For full-duplex, switched Ethernet networks, for example, 80 percent can indicate a bottleneck.
Processor | Processor\ Interrupts/sec | Depends on processor | A dramatic increase in this counter value without a corresponding increase in system activity indicates a hardware problem. Identify the network adapter causing the interrupts. Use the affinity tool to balance interrupts in a multiprocessor system.
Server | Server\ Work Item Shortages | 3 | If the value reaches this threshold, consider tuning InitWorkItems or MaxWorkItems in the registry (under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer).
Server | Server\ Bytes Total/sec |  | If the sum of Bytes Total/sec is roughly equal to the maximum transfer rates of your network, you may need to segment the network.
Network Segment | Network Segment\ Broadcast frames received/second | Depends on network | Can be used to establish a baseline if monitored over time. Large variations from the baseline can be investigated to determine the cause of the problem. Because each computer processes every broadcast, high broadcast levels mean lower performance.
Network Segment | Network Segment\ % Network utilization | Depends on network | Indicates how close the network is to full capacity. The threshold depends on your network infrastructure and topology. If the value of the counter is above 30 to 40 percent, collisions can cause problems.
Network Segment | Network Segment\ Total frames received/second | Depends on network | Indicates when bridges and routers might be flooded.

To concentrate on network-related resource usage, add the counters that correspond to the various layers of your network configuration. Abnormal network counter values often indicate problems with a server's memory, processor, or disks. For that reason, the best approach to monitoring a server is to watch network counters in conjunction with Processor\ % Processor Time, PhysicalDisk\ % Disk Time, and Memory\ Pages/sec.
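Comparing Server\ Bytes Total/sec against raw link capacity, as Table 9 suggests, is simple arithmetic. The Python sketch below is only an illustration; the byte rates are hypothetical, and the 100 Mbps figure matches the single 100Base-TX adapter case discussed above.

```python
# Sketch: compare a measured Server\ Bytes Total/sec value against the
# raw capacity of the network link.  The byte rates below are
# hypothetical; 100 Mbps matches the single 100Base-TX adapter case.

LINK_CAPACITY_BITS = 100_000_000  # 100Base-TX, one adapter

def utilization(bytes_per_sec, capacity_bits=LINK_CAPACITY_BITS):
    """Fraction of link capacity consumed by the given byte rate."""
    return (bytes_per_sec * 8) / capacity_bits

# 12.5 MB/s is the theoretical ceiling of a 100 Mbps link:
print(utilization(12_500_000))            # -> 1.0
print(round(utilization(11_900_000), 2))  # close to saturation
```

When a measured byte rate pushes this fraction toward 1.0 while CPU usage stays flat under increasing load, the network, not the processor, is the limiting resource, which is exactly the pattern shown in Figure 30.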

Removing Network Bottlenecks

  • Partition the network load so that it is balanced among all network adapters in the server. This is most easily done using VLANs in a switch connected to the server.

  • Use smart network adapters to take advantage of the advanced offloading features in Windows 2000.

  • Add more network adapters to increase the available bandwidth, although doing so will increase the interrupt load on the CPU if you use "dumb" network adapters.

  • Change to a higher-speed network, for example, upgrade to gigabit networking from 100Base-TX.

  • Unbind infrequently used network adapters.

For example, if a dramatic increase in Pages/sec is accompanied by a decrease in Bytes Total/sec handled by a server, the computer is probably running short of physical memory for network operations. Most network resources, including network adapters and protocol software, use non-paged memory. If a computer is paging excessively, it could be because most of its physical memory has been allocated to network activities, leaving a small amount of memory for processes that use paged memory. To verify this situation, check the computer's system event log for entries indicating that it has run out of paged or non-paged memory.

Understanding the bottlenecks and potential bottlenecks of an environment allows you to get the most out of your server. This knowledge can also keep you from investing in areas that will not improve system performance. For example, buying a server with more processing power will not increase system performance when the network is the bottleneck.

Conclusion

Getting the best performance out of your system is a science that has been practiced for many years. One of the main goals of Windows 2000, like Windows NT versions before it, has been to provide high performance on the right hardware straight out of the box. This guide provides the basics on getting the best performance out of Windows 2000 in the outlined scenarios; however, it shouldn't be considered the final word. Other resources, available from Microsoft and others, can provide more details and advanced planning guidelines for getting the most out of your Windows 2000 environment. The Windows 2000 Resource Kit, for example, provides some excellent tools and guidelines for planning your Windows 2000 environment.

For More Information

For the latest information on Windows 2000, check out Microsoft TechNet or our Web site at https://www.microsoft.com/windows2000.

Appendix 1: Benchmark Configurations and Checklists

General Guidelines for Benchmark Configuration, Server Tuning and Record Keeping

When you run different types of benchmarks, it is important to start with a known initial state for both the server and client systems. Otherwise, the benchmark results you obtain may be inaccurate and may not be reproducible. These guidelines will help you run accurate benchmarks with reproducible results:

  1. Apply all registry, service, and configuration tuning applicable to the benchmark you plan to run. The subtlety here is that the tuning you apply for one benchmark may not be the best for another. All of the tuning recommendations in this paper assume that the server is starting off in the default installation state.

  2. For file server benchmarks and any others that write data to files, reformat the data partitions and copy back or recreate the test data. This will make sure that the file system is in the same state each time you run a test.

  3. Reboot the server.

  4. Verify that any tuning enhancement that could change after a reboot is in the correct state.

  5. Reboot the client systems so they will be in the same initial state each time you run a test. Be especially careful to clean out the Offline Files cache before running any benchmark, even if you are going to test with the Offline Files feature turned on.

  6. Run the benchmark.

  7. Save the benchmark results in files with names that indicate the name of the server, the operating system, the benchmark, the server configuration, tuning information, and so forth. For example, a file named N7k-win2k-wb-ecomm-1024bkey-2GB-std-tunes would indicate a test of a Netfinity 7000 system running Windows 2000 using WebBench's e-commerce test with a 1024-bit key used in the certificate, and that the server had 2 GB of RAM and used the standard tuning configuration you specified for the server. As you build up a history of tests, the names you pick for the result files will save a great deal of time in figuring out which results you want to look at.

  8. Keep a notebook with the results of each test, including the name of the file you saved the results in, server configuration information, and any other notes that will help you know the conditions of each test and your test lab.
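The naming convention in step 7 is easy to apply consistently with a small helper. The Python sketch below is merely illustrative; the field values are the example given in the text.

```python
# Sketch: compose result-file names following the step 7 convention
# (server-os-benchmark-details...).  The field values used here are the
# example from the text.

def result_name(server, os_name, benchmark, *details):
    """Join the naming fields with hyphens."""
    return "-".join((server, os_name, benchmark) + details)

name = result_name("N7k", "win2k", "wb", "ecomm", "1024bkey", "2GB", "std-tunes")
print(name)  # -> N7k-win2k-wb-ecomm-1024bkey-2GB-std-tunes
```

Generating names this way, rather than typing them by hand, keeps the field order fixed across a long test history, which is what makes the result files easy to find later.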

Configuring Client Systems for Benchmarks

The number of client systems you need for a benchmark depends on their speed and amount of memory, the speed of the server under test, and the benchmark you are using. For example, the standard WebBench, NetBench, and ServerBench test suites use 60 client systems running a single thread or process each to generate the load on the server under test. However, for the WebBench static tests on Windows 2000 using the enterprise-class server, we needed to increase the number of request threads per client to three in order to put enough load on the server to make it reach its peak performance. If we had used faster client systems, we would have been able to push the server to peak performance with fewer clients, each running a single thread.

Appendix 2: Test Lab Configuration describes the client configurations we used for the tests reported in this paper. As long as you use enough client systems to make a Web server reach its maximum performance, it does not matter how fast your client systems are.

Web Server Benchmark Configuration Checklist

The checklist in Table 10 below provides general guidelines to help you prepare a Web server for optimal performance.

Table 10 Web server benchmark configuration checklist.

Step

Actions

Install Windows 2000 on a new partition.

· Make sure that you do a stand-alone server installation—that is, do not install Windows 2000 as a domain controller.
· Be sure to format the partition with the NTFS file system. This should be a separate partition from your data partitions.

Configure the RAID/disk subsystem.

If the server uses a RAID:
· Put the RAID controller in the fastest bus slot available, ideally a bus without network adapters or other controllers that generate many interrupts.
· Set cache memory: as large as possible.
· Set cache write policy: write back.
· Set cache read policy: read ahead.
· Set stripe size: 8 KB/16 KB for RAID 5 and 64 KB/128 KB or the maximum supported for RAID 0; the optimal size may vary by controller.
· Use multiple logical NTFS partitions. This can be set up using the RAID configuration software or logical disk manager in Windows 2000.
· Use 16 KB allocation size for formatting the NTFS volumes (format <drive>: /fs:ntfs /A:16K).
· Increase the NTFS log file size to 64 MB for large volumes (chkdsk /L:65536). By default, this value is set dynamically based on the size of the volume.
· If using separate SCSI disks, format them with an NTFS file system with a 16 KB allocation size and increase the NTFS log file size to 64 MB for large volumes (chkdsk /L:65536).

Configure the network subsystem.

· Set network adapter receive buffers for optimal performance (see Appendix 4: Tuning for Gigabit Adapters for values by type of network adapter).
· Balance client I/O among network adapters.
· Enable TCP checksum offloading support if network adapters support it.

Install the Web server benchmark data on the server.

· If the data is larger than the available memory in the server, put the data on a RAID for best performance. If the server has only individual SCSI disks, spread the data across separate disks on separate controllers, if available, to balance the load. Be sure not to use the disk with the Windows 2000 paging file or the Web server log file.

Configure Windows 2000.

· Set the Application Response to Optimize Performance for Applications on the server. Do this by right-clicking the My Computer icon on the desktop and selecting Properties. Then select the Advanced tab and click the Performance Options button.
· Set the File and Printer Sharing for Microsoft Networks to optimize performance for applications. Do this by right-clicking My Network Places on the desktop, then right-clicking a network connection. Select File and Printer Sharing for Microsoft Networks, then click the Properties button.
· Set Internet Information Services (IIS) performance for 100,000+ hits per day.
· Remove script and script/execute permission on directories containing only static data. Keep these permissions on directories with ISAPI, ASP, and CGI programs.
· Set Application Protection to Low (IIS Process) on directories with ISAPI, ASP, and CGI programs. Note: This may decrease the reliability of your Web server. An errant ISAPI or ASP component can bring down the entire Web server, which it could not do in the default configuration.
· Turn off Index This Resource property for the Web site.
· Put the log file on a disk other than the one with the Web site files.
· Set the following values in the registry key HKLM\SYSTEM\CurrentControlSet\Services\InetInfo\Parameters\
· ObjectCacheTTL=dword: (seconds)
· MaxCachedFileSize=dword: (bytes)
· MemCacheSize=dword: (bytes)
(See the recommendations in Important Web Server Registry Parameters in the Web Server Performance section.)
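The three InetInfo registry values at the end of the checklist can be captured in a .reg file so the same settings are re-applied identically between benchmark runs. The Python sketch below is one way to generate such a fragment; the numeric values are placeholders, to be chosen per the Web Server Performance section, not recommendations.

```python
# Sketch: render the InetInfo parameters from the checklist as a
# REGEDIT4 (.reg) fragment so the same cache settings can be re-applied
# between benchmark runs.  The numeric values are placeholders.

KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\InetInfo\Parameters"

def reg_fragment(values):
    """Build a .reg fragment that sets each name to a DWORD value."""
    lines = ["REGEDIT4", "", "[%s]" % KEY]
    for name, value in values.items():
        lines.append('"%s"=dword:%08x' % (name, value))
    return "\n".join(lines)

print(reg_fragment({
    "ObjectCacheTTL": 600,          # seconds (placeholder value)
    "MaxCachedFileSize": 1048576,   # bytes (placeholder value)
    "MemCacheSize": 268435456,      # bytes (placeholder value)
}))
```

Importing the generated file with regedit applies all three values in one step, which fits the general guideline of starting every benchmark run from a known, reproducible state.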

NetBench Checklist

The following table provides information on setting up, configuring, and tuning Windows 2000 for NetBench testing.

Table 11 Checklist for benchmarking Windows 2000 using NetBench.

Step

Actions

Install Windows 2000 on a new partition.

· Be sure to format the partition with the NTFS file system. This should be a partition separate from your data partitions.

Configure the RAID subsystem.

The optimal configuration will depend on the RAID controller used. We suggest starting with the following:
· Put the controller in the fastest bus slot available, ideally a bus without network adapters or other controllers that generate many interrupts.
· Set cache memory: as large as possible.
· Set cache write policy: write back.
· Set cache read policy: read ahead.
· Set RAID level: 5 for reliability, 0 for speed.
· Set stripe size: 8 KB/16 KB for RAID 5 and 64 KB/128 KB for RAID 0; the optimal size may vary by controller and RAID level.

Create the data partitions.

· Create at least four data partitions or logical drives on the RAID using the disk manager (right-click on the system icon on the desktop and choose Manage). The remainder of this checklist will assume that you are using four partitions. You can change the number of partitions used, the drive letters, and share names as appropriate for your environment.
· We suggest creating batch files for the commands given in this checklist to simplify the testing process and to assure consistency between tests.
· Format each partition and increase its file system log size using the commands:
format <drive>: /fs:ntfs /v:RAID1 /a:16K /q
· Now set log size:
chkdsk <drive>: /l:65536

Install NetBench on the server.

· We strongly recommend that you install NetBench on a disk or partitions other than the ones used for the data, if possible. Doing so will simplify the testing.

Configure the NetBench test suite to use multiple data partitions.

· Create an ASCII file for the paths to the files on the data partitions for each client. It will follow a pattern similar to the four-partition pattern below:
M:\Clients\Client1
N:\Clients\Client2
O:\Clients\Client3
P:\Clients\Client4
· Similarly, create an ASCII file for the alternate directory paths.
· Use the Copy from ASCII File button on the Test Directories tab to change where NetBench looks for its files as you edit a test mix definition.

Configure Windows 2000.

· Set the Application Response to Optimize Performance for Background Services on the server. This is the default setting and can be verified by right-clicking the computer icon on the desktop and selecting Properties. Then select the Advanced tab and click the Performance Options button.
· Also, set the File and Printer Sharing for Microsoft Networks to optimize performance for file sharing. This is the default setting and can be verified by right-clicking My Network Places on the desktop, then right-clicking a network connection. Select File and Printer Sharing for Microsoft Networks, then click the Properties button.
· Set the following registry value under the key HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management: PagedPoolSize = dword:192000000 (decimal)
· Set network adapter tuning (tuning may vary depending on the adapter and driver you use).
· For the IBM and Intel 100Base-TX adapters we set the receive buffers to 1024. To do this, right-click the My Network Places icon on the desktop and choose Properties. Then right-click each network connection in turn and select Properties. Click the Configure button and choose the Advanced tab from the dialog box that appears.

Share the data partitions.

· Choose whether you want to test with the Windows 2000 Offline Files feature enabled or not.
· To share the data partitions with Offline Files enabled, use commands such as:
net share RAID1=<drive/volume>:\ /CACHE:Automatic
· To share the data partitions with Offline Files disabled, use commands such as:
net share RAID1=<drive/volume>:\

Run NetBench.

· Before you start NetBench, you need to connect to the shared data partitions. The following commands will do this in a way consistent with the suggested NetBench configuration:
net use M: \\groucho2\RAID1 /persistent:no
net use N: \\groucho2\RAID2 /persistent:no
net use O: \\groucho2\RAID3 /persistent:no
net use P: \\groucho2\RAID4 /persistent:no
· In order to generate reproducible results, we suggest formatting the RAID before each test. To do so, first you will need to stop sharing the RAID partitions. You can do this with the following commands:
net share RAID1 /delete
net share RAID2 /delete
net share RAID3 /delete
net share RAID4 /delete
· If you enabled Offline Files, you must delete the cache information from each client before running another test. Otherwise, your tests may stop responding or report inaccurate results. See Appendix 3: Offline Files below for directions on removing the cache information.
· Be sure to share the data partitions again either with or without Offline Files enabled.
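The net share and net use commands in this checklist are repetitive and easy to mistype, so the checklist's own suggestion of batch files applies well here. The Python sketch below generates those command lines for pasting into batch files; the server name "groucho2" and the drive letters come from the checklist and should be adjusted for your lab.

```python
# Sketch: generate the repetitive net share / net use lines for the
# four-partition NetBench layout described in the checklist.  The
# server name "groucho2" and drive letters are the checklist's own
# examples; adjust them for your environment.

PARTS = list(zip("MNOP", ("RAID1", "RAID2", "RAID3", "RAID4")))

def share_commands(volumes, offline_files=True):
    """net share lines for each data volume (volumes: share -> drive path)."""
    suffix = " /CACHE:Automatic" if offline_files else ""
    return ["net share %s=%s%s" % (share, path, suffix)
            for share, path in volumes.items()]

def use_commands(server="groucho2"):
    """net use lines mapping each drive letter to its share."""
    return [r"net use %s: \\%s\%s /persistent:no" % (drive, server, share)
            for drive, share in PARTS]

for line in use_commands():
    print(line)
```

Regenerating the batch files from one place keeps the share names, drive letters, and the Offline Files setting consistent between test runs, which the checklist stresses is necessary for reproducible results.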

Appendix 2: Test Lab Configuration

Test Clients

The test lab we used for this paper was made up of three different types of clients shown in Table 12 below. All of the client systems, the server under test, and the test control system were connected by means of an HP ProCurve 4000M switch with two tagless VLANs.

Table 12 Test clients used for this paper.

System Component | Configuration

Type A Client
CPU | Intel Celeron, 466 MHz, 128 KB L2 cache
Memory | 128 MB, 100 MHz ECC SDRAM
Network | 1 Intel PRO/100+ management adapter
Disks | 1 x 4.3 GB ATA/66
Operating System | Windows 2000 Professional

Type B Client
CPU | Intel Pentium II, 266 MHz, 256 KB L2 cache
Memory | 64 MB, 66 MHz EDO
Network | 1 Intel EtherExpress PRO/100+ adapter
Disks | 1 x 4.3 GB IDE
Operating System | Windows 2000 Professional

Type C Client
CPU | Intel Pentium Pro, 200 MHz, 256 KB L2 cache
Memory | 64 MB, 66 MHz EDO
Network | 1 3Com 3C905B adapter
Disks | 1 x 4.3 GB IDE
Operating System | Windows 2000 Professional

Server Configurations

For this paper, we tested two server configurations: a small business/departmental configuration (shown in Table 14 below) and an Enterprise configuration (shown in Table 15 below). In order to simplify our testing procedures, we used the same RAID disk subsystem for both configurations. Table 13 shows the RAID configuration we used.

Table 13 RAID configuration.

RAID Subsystem Component | Configuration
RAID Adapter | IBM ServeRAID-3H adapter in a 64-bit PCI bus: RAID 5, 32 MB cache, 16 KB stripe size; write cache mode: write back, read-ahead cache enabled
Disk Drives | 10 Seagate Cheetah ST34502LC 4.51 GB drives
Partitions/Logical Drives | 4 logical drives of 4 GB each; file system: NTFS; allocation unit size: 16 KB; log file size: 65,536 KB (64 MB)

Table 14 Small business/departmental server configuration (IBM Netfinity 5000).

System Component | Configuration
CPU | Intel Pentium III, 500 MHz, 512 KB L2 cache
Memory | 256 MB, 100 MHz ECC SDRAM
Network | 1 IBM 10/100 EtherJet PCI Adapter, 1 Intel PRO/100+ management adapter
Disks | OS on a 2 GB partition of an IBM-PCCO DDRS-34560Y 9.1 GB disk
Operating System | Windows 2000 Advanced Server

Table 15 Enterprise server configuration (IBM Netfinity 7000 M10).

System Component | Configuration
CPU | 4 Intel Pentium III Xeon, 500 MHz, 512 KB L2 cache
Memory | 2 GB, 100 MHz ECC SDRAM
Network | 2 Alteon WebSystems PCI Gigabit Ethernet adapters
Disks | OS on a 2 GB partition of a Seagate Cheetah ST39102LC 9.1 GB disk
Operating System | Windows 2000 Advanced Server

Appendix 3: Offline Files

Offline Files, sometimes referred to as client-side caching (CSC), is a new feature in Windows 2000 that allows users to keep a local copy of files stored on a network share. This is useful for clients that aren't always connected to the network on which they store their files. It also provides some performance benefits while connected to the network: when a user opens a file from a network share that has been marked for offline use, it can be read from the local cache rather than over the network.

The Offline Files feature is enabled on a file server when a share is created with a command such as:

net share RAID1=g:\ /CACHE:Automatic

When a Windows 2000 client accesses a file on a share with Offline Files enabled, parts of the data are cached in the %Windir%\CSC folder. It is essential that you remove all of the cached information in this folder before starting another test. Otherwise, the benchmark may stop responding or report inaccurate results.

To remove the files and folders in the %Windir%\CSC folder, we recommend that you create a batch file to run on each client that will take the following actions in the order indicated:

  1. Remove all shares currently in use.

  2. Stop the computer browser service.

  3. Stop the messenger service.

  4. Stop the workstation service.

  5. Remove all files and folders in the %Windir%\CSC folder.

  6. Start the workstation service.

  7. Start the messenger service.

  8. Start the computer browser service.
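The eight steps above can be expressed as an ordered list of commands for the recommended batch file. The Python sketch below only builds that command list (it does not run anything); the service short names (browser, messenger, workstation) and the M: through P: drive letters are assumptions based on standard Windows NT service naming and the NetBench checklist, so verify them for your clients.

```python
# Sketch: the eight CSC cleanup steps as an ordered command list that a
# client batch file could run.  Service short names and drive letters
# are assumptions; this code only builds the commands, it does not
# execute them (that requires a Windows 2000 client).

SERVICES = ["browser", "messenger", "workstation"]  # stop order; start order is reversed

def csc_cleanup_commands(shares=("M:", "N:", "O:", "P:")):
    cmds = ["net use %s /delete" % s for s in shares]             # step 1: drop shares
    cmds += ["net stop %s" % svc for svc in SERVICES]             # steps 2-4: stop services
    cmds.append(r"del /s /q %windir%\CSC\*.*")                    # step 5: purge cached files
    cmds += ["net start %s" % svc for svc in reversed(SERVICES)]  # steps 6-8: restart services
    return cmds

for cmd in csc_cleanup_commands():
    print(cmd)
```

Keeping the sequence in one generated script guards against the ordering mistakes (for example, deleting the cache while the workstation service is still running) that can make subsequent benchmark runs stop responding or report inaccurate results.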

Appendix 4: Tuning for Gigabit Adapters

Table 16 Windows 2000 tuning parameters for Gigabit Ethernet adapters.

Registry Parameter | Default Value | Recommended Value | Description

Compaq Gigabit Adapter (Intel-based)
CoalesceBuffers | 8 | 32 | Number of buffers before an interrupt occurs
ReceiveBuffers | 48 | 500 | Number of posted receive buffers on the adapter

Alteon ACEnic
TransmitControlBlocks | 32 | 64 | Number of transmits queued up in the miniport
HostTracing | 1 | 1 | Enable debugging on host system
JumboFrames | Off | - | Enable Jumbo Frame mode
JumboMtu | 1500 | - | Specify MTU size depending on Jumbo Frames entry
LinkNegotiation | On | On | Negotiate link before declaring it active
NicTracing | 0 | 0 | Enable debugging on adapter
PciReadMax | 0 | 0 | Amount of data read per burst from host system to adapter
RecvCoalMax | 3 | 20 | Max number of buffers received before interrupting
RecvCoalTicks | 1000 | 1000 | Max number of clock ticks before sending an interrupt
SendCoalMax | 0 | 20 | Max number of buffers sent before interrupting
SendCoalTicks | 100000 | 1000 | Max number of clock ticks before sending an interrupt

Intel PRO/1000
NumberOfReceiveBuffers | 200 | 768 | Number of posted receive buffers on the adapter
NumberOfCoalesceBuffers | 200 | 512 | Number of buffers before an interrupt occurs
NumberOfTransmitDescriptors | 448 | 512 | Number of transmit buffer descriptors on the adapter
ReceiveChecksumOffloading | Off | On | Task offloading onto hardware in receiver path
TransmitChecksumOffloading | On | On | Task offloading onto hardware in sender path

SysKonnect SK-NET
HardwareChecksumming | On | On | Enable task offloading onto hardware
MaxFragstoDMAperTxFrame | 5 | 5 | The maximum number of allowed fragments per TX frame; set to 0 lets the driver always copy all fragments
MaxIRQperSec | 5000 | 0 | The maximum number of IRQ/sec; values < 1000 will be treated as 1000
MaximumFrameSize | 1514 | - | Maximum Transmission Unit (MTU)
MinFragLengthforDMA | 100 | 100 | Minimum length of a TX fragment to be transferred by means of direct memory access (DMA); set to 0 disables fragment coalescing
NumberOfReceiveBuffers | 100 | 100 | Number of posted receive buffers on the adapter
NumberOfTransmitBuffers | 50 | 50 | Number of posted send buffers on the adapter
WaitforRxResources | On | Off for Jumbo | Checks for receive resources before waiting to get them back from the protocol
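Many entries in Table 16 recommend the default value, so only a few parameters actually need changing per adapter. The Python sketch below reduces one adapter's rows to data and lists the parameters that differ from their defaults; only the Alteon coalescing-related entries are transcribed, purely as an illustration.

```python
# Sketch: a subset of Table 16 (the Alteon ACEnic coalescing entries)
# as data, with a helper that lists which parameters differ from their
# defaults and therefore need changing in the registry.

ACENIC = {
    # parameter: (default, recommended)
    "TransmitControlBlocks": (32, 64),
    "RecvCoalMax": (3, 20),
    "RecvCoalTicks": (1000, 1000),
    "SendCoalMax": (0, 20),
    "SendCoalTicks": (100000, 1000),
}

def changes_needed(table):
    """Parameters whose recommended value differs from the default."""
    return sorted(name for name, (default, rec) in table.items()
                  if default != rec)

print(changes_needed(ACENIC))
```

Keeping the default/recommended pairs in one table also makes it easy to verify, after a driver reinstall or reboot, that the tuned values are still in effect, which Appendix 1 lists as a required pre-benchmark check.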

Appendix 5: Lists of Figures and Tables

List of Figures

Figure 1. On configurations with 64 MB, Windows 2000 Professional is significantly faster than Windows 98 and comparable to Windows NT Workstation 4.0.

Figure 2. Windows 2000 Professional is significantly faster than Windows NT Workstation 4.0 on high-end workloads. The test system was a Gateway PII 500 MHz, ATI Rage Pro AGP, Adaptec AHA-2940U2/U2W PCI SCSI Controller.

Figure 3. Graph showing the time it takes from starting up to logging into a domain. In cold boot, standby, and hibernate modes, Windows 2000 maintains application state.

Figure 4. Chart based on NetBench results showing the difference between running Windows 2000 in a tuned environment with Offline Files (CSC) enabled and directly out of the box without CSC enabled. Hardware was selected and configured based on the guidelines in this document. A higher value demonstrates better file server throughput.

Figure 5. Chart, based on results generated using NetBench, demonstrating the performance benefit of enabling Offline Files, also known as client-side caching (CSC). A higher value demonstrates better file server throughput.

Figure 6. Chart demonstrating the SMP file server scalability of Windows 2000 Server as measured by NetBench. A higher value demonstrates better file server throughput.

Figure 7. Chart showing the performance improvements of Windows 2000 Server on a typical small business file server configuration as measured by NetBench. A higher value demonstrates better file server throughput.

Figure 8. Chart showing the performance improvements of Windows 2000 Server on a typical enterprise file server configuration as measured by NetBench. A higher value demonstrates better file server throughput.

Figure 9. Chart showing the improved responsiveness of Windows 2000 compared to Windows NT Server 4.0 on a typical small business configuration. A lower response time denotes better performance.

Figure 10. Chart demonstrating the improved responsiveness of Windows 2000 compared to Windows NT Server 4.0 on a typical enterprise-level configuration. A lower response time denotes better performance.

Figure 11. Test lab setup with two networks. The even and odd numbered clients are on different networks that are managed by means of the switch.

Figure 12. Comparison of tuned and out-of-box (untuned) Web server request rate for the standard WebBench static-content-only test. Test was performed on a typical enterprise system configured following the hardware guidelines in this document. Higher values represent better Web server performance.

Figure 13. Comparison of tuned and out-of-box (default) Web server request rate for the standard WebBench test using 80 percent static content and 20 percent dynamic content generated using an ISAPI program. The tuned system is running in-process while the out-of-box system is running out-of-process (see the accompanying discussion in the text). Higher values represent better Web server performance.

Figure 14. Chart, based on WebBench results, showing how SMP enterprise systems compare to single-processor, small business systems in terms of their static request rates. Higher values represent better Web server performance.

Figure 15. Chart showing ASP performance scaling for SMP systems as compared to single-processor systems, as measured by WebLoad using a 240 MB file set with 80 percent static requests and 20 percent dynamic content requests that return 6,250 bytes, the same as the average static response size. Higher values represent better Web server performance.

Figure 16. Chart showing the Web server performance improvements of Windows 2000 as compared to Windows NT Server 4.0. This test uses the standard WebBench mix of 80 percent static, 20 percent dynamic content. Higher values represent better Web server performance.

Figure 17. Enterprise Web server CGI, ASP, and ISAPI peak dynamic request rates for WebLoad using a 240 MB file set with 80 percent static requests and 20 percent dynamic content requests that return 6,250 bytes, the same as the average static response size. Higher values represent better Web server performance.

Figure 18. Small business Web server CGI, ASP, and ISAPI peak dynamic request rates for WebLoad using a 240 MB file set with 80 percent static requests and 20 percent dynamic content requests that return 6,250 bytes, the same as the average static response size. Higher values represent better Web server performance.

Figure 19. How the file system uses the networking layer to communicate over a network.

Figure 20. Chart demonstrating the advantage of using a network adapter that supports TCP checksum offloading in Windows 2000. NTttcp was used to test and report the network throughput and CPU usage of a gigabit network adapter. Higher throughput values represent better network performance, while lower CPU usage represents better efficiency.

Figure 21. Chart showing the impact MTU size has on network throughput. NTttcp was used to measure the network throughput of gigabit network adapters (NICs) in different configurations. Higher values represent better network performance.

Figure 22. Web server static request performance for small business- and enterprise-class Web servers.

Figure 23. System Monitor graph showing a CPU bottleneck for a one-processor system. Notice that the top line (in pink here), which represents the CPU usage, is at 100 percent.

Figure 24. Memory limits Web server performance for these ASP tests.

Figure 25. Memory bottleneck in an enterprise-class server with 128 MB memory when accessing a 240 MB Web site with 20 percent of the access through ASP programs.

Figure 26. Memory bottleneck on an enterprise-class Web server with 256 MB of memory. This is based on a run of the same test as shown in Figure 25.

Figure 27. Disk bottleneck in an enterprise-class file server with fewer disks in the RAID array (CSC = Offline Files).

Figure 28. Disk bottleneck in an enterprise-class file server.

Figure 29. The effect a network bottleneck can have on Web server performance.

Figure 30. Network bottleneck on an enterprise-class Web server with a single 100Base-TX network adapter.

List of Tables

Table 1. Recommended tuning for file server hardware.

Table 2. Recommended tuning for Web server hardware.

Table 3. Performance counters that serve to detect basic system bottlenecks for any type of server.

Table 4. Performance counters that serve to detect Web server bottlenecks.

Table 5. Performance counters that serve to detect file server bottlenecks.

Table 6. CPU performance counter thresholds that indicate a bottleneck.

Table 7. Performance counters to watch for memory-related bottlenecks.

Table 8. Performance counters to monitor for a disk bottleneck.

Table 9. Performance counters to monitor for network bottlenecks.

Table 10. Web server benchmark configuration checklist.

Table 11. Checklist for benchmarking Windows 2000 using NetBench.

Table 12. Test clients used for this paper.

Table 13. RAID configuration.

Table 14. Small business/departmental server configuration (IBM Netfinity 5000).

Table 15. Enterprise server configuration (IBM Netfinity 7000 M10).

Table 16. Windows 2000 tuning parameters for Gigabit Ethernet adapters.