Performance Benchmarks Guide - Windows NT 4.0

Archived content. No warranty is made as to technical accuracy. Content may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist.

This document includes information about the performance of Windows NT® Server as a file and print server, an applications server, and a web server. Windows NT Server is the only operating system that offers high-performance support in all of these areas. The file and print tests were conducted by the National Software Testing Laboratories (NSTL) using industry standard benchmark tools such as NetBench. There are two different benchmarks of Windows NT Server as an applications platform, one performed by NSTL using ServerBench and the other by Compaq. The Compaq test used the Transaction Processing Performance Council Benchmark C (TPC-C) and was audited by the TPC. The web server tests were performed by Shiloh Consulting and Haynes & Company using the industry standard web capacity tool, Webstone 1.1.

The Microsoft Business Systems Division series of white papers is designed to educate information technology (IT) professionals about Windows NT and the Microsoft BackOffice™ family of products. While the current technologies used in Microsoft products are often covered, the real purpose of these papers is to give readers an idea of how major technologies are evolving, how Microsoft is using those technologies, and how this information affects technology planners.

For the latest information on Windows NT Server, check out our World Wide Web site at https://www.microsoft.com/backoffice or the Windows NT Server Forum on the Microsoft Network (GO WORD: MSNTS).

On This Page

Introduction
Equipment Used
NetBench 4.01
ServerBench 3.0
TPC-C Benchmark
Web Server Benchmark
Scalability Issues
Conclusion

Introduction

The file and print and applications platform tests reported in this guide were performed by National Software Testing Laboratories (NSTL), a division of The McGraw-Hill Companies. NSTL, established in 1983, is the leading independent hardware and software testing organization in the microcomputer industry, dedicated to providing high quality services and test tools to the PC community. NSTL has extensive experience developing and conducting objective tests to assess new and existing products for compatibility, performance, usability, acceptance (bug) testing, and BIOS evaluation.

NSTL publishes Software Digest and PC Digest, providing test data and methodologies to industry and business publications around the world. NSTL test results appear in publications such as Byte Magazine, Data Communications, and Business Week. NSTL's commercial testing clients include IBM, Intel, Microsoft, AT&T, Lotus, Compaq, and many other leading companies. NSTL also provides testing services to the federal governments of the United States and Canada, as well as to a number of corporations in both countries.

Shiloh Consulting is an independent network consulting company. Shiloh is led by Robert Buchanan, who has over twenty years of experience in product development and testing for ROLM Corporation and 3Com Corporation. From 1990 to 1994, Mr. Buchanan ran the testing and operations of LANQuest, a leading network product testing laboratory. Recently he completed a new book, The Art of Testing Network Systems, to be published by John Wiley & Sons in April 1996.

Haynes & Company provides business planning and program management for high-tech companies. Past clients include ORACLE, Qualcomm, 3Com, Interlink Computer Sciences, Artisoft, and Netscape. Ted Haynes of Haynes & Company was a contemporary of Bob Buchanan at both ROLM and 3Com. He is the author of The Electronic Commerce Dictionary and has spoken on commerce over the Internet at the RSA Data Security Conference.

The purpose of this guide is to help you understand some of the major industry standard benchmarks and how the performance of the Microsoft® Windows NT® Server 4.0 operating system compares with other network operating systems such as Novell® NetWare® 4.1, IBM® OS/2® Warp Server, SCO® OpenServer Release 5.0, and even Windows NT Server 3.51. The results in this guide should not be construed as a capacity planning guide for Windows NT Server 4.0. Rather, this guide serves only to compare and evaluate the performance of different network operating systems under file, application, and web server input/output scenarios.

This guide briefly explains each benchmark, shows sample results, and explains how to interpret these results.

Equipment Used

All file and print and applications platform tests were performed on a Compaq® ProLiant® 5000 with 166MHz Pentium Pro processors. The tests were performed in single- and multiprocessor configurations with 128MB RAM (with 512K L2 cache) for the file and print tests and 256MB RAM (with 512K L2 cache) for the applications tests, 8x2GB HW RAID0, and 1xNetFlex™ FastEthernet card. The clients were Pentium 100MHz machines with 16MB RAM and 1xIntel® Ethernet card each. Microsoft's client software was used in the testing.

NetBench 4.01

What is NetBench Version 4.01?

NetBench® version 4.01 is a portable Ziff-Davis benchmark that lets you measure the performance of servers in a file server environment as they handle I/O requests from MS-DOS®, Windows® for Workgroups, and Macintosh® OS clients. In a file server environment, the client runs an application—such as a word processor or spreadsheet program—and uses the server only to access data. You can use NetBench to measure, analyze, and predict how well a server can handle file requests from clients. Among the results NetBench returns is an overall throughput score for a server.

What are the Differences Between the Previous Version of NetBench, Version 3.01, and the Current Version, Version 4.01?

NetBench 4.01 is the latest release of NetBench and updates the 4.0 release with bug fixes. There are more substantial differences between 4.01 and 3.0/3.01, so it makes more sense to compare those versions. The key difference is that NetBench 4.01 lets you run NetBench with three client types: MS-DOS-based PCs, Windows for Workgroups-based PCs, and Macintosh operating systems. If you restrict the test bed to MS-DOS-based clients, you can compare NetBench 4.01 results with NetBench 3.01 and 3.0 results. However, because these earlier versions of NetBench didn't support Windows for Workgroups and Macintosh operating system clients, you can't compare any results you get using a test bed that includes either of these client types with previous NetBench results.

NetBench 4.01 includes the Disk Mix, which provides the primary measure of the server's network file I/O performance. Unlike the other tests, the Disk Mix performs a wide variety of network file operations. To build the Disk Mix, the Ziff-Davis Benchmark Operation profiled top-selling applications for MS-DOS and Windows-based PCs and Macintosh operating systems to determine the I/O behavior of these applications. A Disk Mix test was then developed for each client type that performs the same kinds of I/O operations that applications for that client typically perform. NetBench 4.01 automatically executes the correct Disk Mix test for each client type. As a result, the Disk Mix exercises the file server in the same manner as people using applications do.
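
As a rough illustration of the kind of work a Disk Mix client performs, the sketch below replays a short, hypothetical sequence of file operations against a work file. This is an illustration only, not Ziff-Davis code, and the particular operations, sizes, and file name are invented.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buffer[4096];
    FILE *f;

    memset(buffer, 'x', sizeof(buffer));

    f = fopen("testfile.dat", "w+b");      /* work file on the server share */
    if (f == NULL)
        return 1;

    fwrite(buffer, 1, sizeof(buffer), f);  /* write, as a file save would   */
    fseek(f, 0, SEEK_SET);                 /* seek back to the start        */
    fread(buffer, 1, 512, f);              /* small read, as an open would  */
    fseek(f, 1024, SEEK_SET);              /* random repositioning          */
    fread(buffer, 1, 2048, f);             /* larger sequential read        */
    fclose(f);
    return 0;
}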

It is important to note that each NetBench client, as in the ServerBench test, actually represents more than a single client. So benchmarks of servers with 48 clients are an accurate measure of many more clients than 48. The exact number is not defined by Ziff-Davis but tests reveal that each client may represent as many as ten real clients.

Results

NetBench 4.01 was run with Windows NT Server 4.0, Windows NT Server 3.51, NetWare 4.1, OS/2 Warp Server and SCO OpenServer Release 5.0. The chart below depicts the results obtained on the following configuration:

  • A Compaq ProLiant 5000 Pentium Pro 166MHz with 128MB RAM (with 512K L2 cache), 8x2GB HW RAID0 and 1xNetFlex3 FastEthernet network interface card.

    [Chart: NetBench 4.01 file and print throughput for Windows NT Server 4.0, Windows NT Server 3.51, NetWare 4.1, OS/2 Warp Server, and SCO OpenServer Release 5.0]

As you can see, Windows NT Server 4.0 outperforms both Windows NT Server 3.51 and NetWare 4.1 by a substantial margin. Windows NT Server 4.0, on a single processor machine, outperforms NetWare 4.1 by as much as 13%, and a dual processor machine running Windows NT Server 4.0 outperforms NetWare 4.1 by as much as 28%. This is an important comparison because it shows that Windows NT Server takes advantage of multiple processors even for file and print performance. Until Novell ships NetWare SMP (other than through hardware manufacturers), it is impossible to benchmark Windows NT Server 4.0 against it.

It is also important to note that Windows NT Server 4.0 outperforms Windows NT Server 3.51 by a substantial margin, showing tremendous gains in file and print performance. Windows NT Server 4.0 running on a single processor machine outperforms Windows NT Server 3.51 by as much as 70%. The dual processor tests show that Windows NT Server 4.0 outperforms Windows NT Server 3.51 by as much as 43%.

These results clearly show that Windows NT Server is an excellent file and print server and can take advantage of multiple processors for added file and print performance. Additionally, Windows NT Server 4.0 outperforms the competition.

ServerBench 3.0

What is ServerBench Version 3.0?

ServerBench® version 3.0 is a Ziff-Davis benchmark and a true client/server application that runs on Windows NT Server 4.0, Windows NT Server 3.51, NetWare SMP, OS/2 Warp Server, and SCO UNIX®. An application optimized for each of these operating systems runs on the server, and the clients run a set of predefined tests. (By comparison, NetBench 4.01 is an operating system-independent benchmark.)

ServerBench 3.0 measures the performance of the processor, disk, and network subsystems by running different types of tests which produce different loads on the server. The processor/memory subsystem test manipulates a data file containing fields (based on a simulated database) such as customer names, addresses, and phone numbers. The record size is 125 bytes, and there are 2800 records per client, for a total of 350K per client. This file is manipulated in memory.
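
To make these numbers concrete, the following hypothetical record layout totals exactly 125 bytes. The benchmark's actual field sizes are not published in this guide; the sizes below are invented only so that they sum to the documented record size.

/* Hypothetical layout -- field sizes invented to total 125 bytes. */
struct CustomerRecord {
    char name[50];       /* customer name  */
    char address[60];    /* street address */
    char phone[15];      /* phone number   */
};                       /* 50 + 60 + 15 = 125 bytes */

/* 2800 records per client: 2800 * 125 = 350,000 bytes, or about 350K,
   matching the per-client figure quoted above. */
struct CustomerRecord clientData[2800];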

The disk subsystem test measures server disk input/output performance and comprises sequential read, sequential write, random read, random write, and append tests performed on the server. The atomic size (how much data is read at one time) and the total file size (the total data transferred to and from the disk) are both user-controllable. The network subsystem test entails sending data from the client to the server and from the server to the client; the amount of data transferred can be tailored to a desired value. This last component resembles the file input/output test described earlier.

The disk subsystem test is an excellent measure of real-world behavior, as it simulates how real users use the network operating system as both an applications server and a file server. ServerBench 3.0 reports results in transactions per second (TPS) for each of the subsystems measured: the number of transactions completed by a client in an allotted amount of time.

It is important to note that each ServerBench client, as in the NetBench test, actually represents more than a single client. So benchmarks of servers with 48 clients are an accurate measure of many more clients than 48. The exact number is not defined by Ziff-Davis.

What are the Differences Between the Previous Version of ServerBench, Version 2.0, and the Current Version, Version 3.0?

In ServerBench 3.0, the clients run either Windows 95 or Windows for Workgroups 3.11. In previous versions, the clients ran only Windows for Workgroups. Ziff-Davis took advantage of the fact that both the clients and the controller are now Windows-based to simplify installation and use a Windows SETUP.EXE program to install the client and controller programs. The installation also creates a ServerBench icon on each client to make starting the client program easier. (See the on-line ServerBench manual for your server platform for more information.)

Additionally, ServerBench 3.0 features the following enhancements:

  • In addition to the x86 architecture, ServerBench now supports all of the other processor platforms that run Windows NT Server: Alpha, MIPS, and PowerPC

  • Support for many Winsock-compliant TCP/IP stacks

  • Unattended mode adds more error-handling features

  • Enhanced results reporting features

  • More stressful server tests

How to Interpret ServerBench 3.0 Results

ServerBench 3.0 was run with Microsoft Windows NT Server 4.0, Windows NT Server 3.51, OS/2 Warp Server (SMP is currently in beta) and SCO OpenServer Release 5.0. NetWare 4.1 is included in the mix by way of comparison. The chart below, comparing Windows NT Server 4.0 to Windows NT Server 3.51, SCO OpenServer 5.0, OS/2 Warp Server and NetWare 4.1, depicts the results obtained on the following configuration:

  • A Compaq ProLiant 5000 with a single Pentium Pro 166MHz processor, 256MB RAM (with 512K L2 cache), 8x2GB HW RAID0 and 1xNetFlex3 FastEthernet network interface card.

    [Chart: ServerBench 3.0 single-processor throughput for Windows NT Server 4.0, Windows NT Server 3.51, SCO OpenServer 5.0, OS/2 Warp Server, and NetWare 4.1]

As you can see, Windows NT Server achieves excellent throughput on a single processor machine, more than keeping up with the competition. Windows NT Server 4.0 achieves higher throughput than any of the other applications platforms tested, besting NetWare 4.1 by as much as 15%, and sustains it very well. Windows NT Server 3.51 achieves excellent throughput as well, keeping up with Windows NT Server 4.0 and NetWare 4.1. It is worth noting here that, while it is important to see how an applications server performs on a single processor, the true gauge of an applications platform is how well it scales when processors are added to the server.

The next chart, comparing Windows NT Server 4.0 to Windows NT Server 3.51, SCO OpenServer 5.0, OS/2 Warp Server and NetWare 4.1, depicts the results obtained on the following configuration:

  • A Compaq ProLiant 5000 with dual Pentium Pro 166MHz processors, 256MB RAM (with 512K L2 cache), 8x2GB HW RAID0 and 1xNetFlex3 FastEthernet network interface card.

    [Chart: ServerBench 3.0 dual-processor throughput for Windows NT Server 4.0, Windows NT Server 3.51, SCO OpenServer 5.0, OS/2 Warp Server, and NetWare 4.1]

As you can clearly see, Windows NT Server 4.0 scales far better than any of the competition and is only matched by Windows NT Server 3.51. Windows NT Server achieves a higher throughput than any other applications server platform, surpassing SCO UNIX by as much as 135% and Windows NT Server 3.51 by as much as 3%. Both NetWare 4.1 and OS/2 Warp Server, running on single processor machines, are shown here by way of comparison. OS/2 Warp Server SMP is currently in beta testing and NetWare SMP is due to ship as a mainstream product in the fall of 1996. With that caveat, Windows NT Server 4.0 outperforms NetWare 4.1 by as much as 53% and outperforms OS/2 Warp Server by as much as 80%.

The next chart, comparing Windows NT Server 4.0 to Windows NT Server 3.51, SCO OpenServer 5.0, OS/2 Warp Server and NetWare 4.1, depicts the results obtained on the following configuration:

  • A Compaq ProLiant 5000 with four Pentium Pro 166MHz processors, 256MB RAM (with 512K L2 cache), 8x2GB HW RAID0 and 1xNetFlex3 FastEthernet network interface card.

    [Chart: ServerBench 3.0 quad-processor throughput for Windows NT Server 4.0, Windows NT Server 3.51, SCO OpenServer 5.0, OS/2 Warp Server, and NetWare 4.1]

This test clearly shows the strength of Windows NT Server 4.0 as more processors are added. Windows NT Server takes advantage of multiple processors better than any other applications platform tested here. The delta between Windows NT Server 4.0 and SCO OpenServer has increased and is now as much as 140%. Again, Windows NT Server 4.0 outperforms OS/2 Warp Server (single processor) by as much as 110% and outperforms NetWare 4.1 (single processor) by as much as 81%. The only other platform that performs comparably to Windows NT Server 4.0 is Windows NT Server 3.51, which actually outperforms Windows NT Server 4.0 at 48 clients by 17%. However, Windows NT Server 4.0 reaches a higher peak than Windows NT Server 3.51 and sustains that level of performance much longer.

TPC-C Benchmark

What is TPC-C?

Transaction Processing Performance Council (TPC)1 Benchmark C is like TPC-A, the older TPC benchmark for transaction processing, in that it, too, is an online transaction processing (OLTP) benchmark. However, TPC-C is more complex than TPC-A because of its multiple transaction types, more complex database, and overall execution structure. TPC-C involves a mix of five concurrent transactions of different types and complexity, either executed online or queued for deferred execution. The database is comprised of nine types of records with a wide range of record and population sizes. TPC-C is measured in transactions per minute (tpm).

TPC-C simulates a complete computing environment where a population of terminal operators executes transactions against a database. The benchmark is centered around the principal activities (transactions) of an order-entry environment. These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses. While the benchmark portrays the activity of a wholesale supplier, TPC-C is not limited to the activity of any particular business segment, but, rather represents any industry that must manage, sell, or distribute a product or service.
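
As an illustration of how such a mix can be driven, the sketch below picks one of the five TPC-C transaction types using the benchmark's published minimum mix percentages (Payment 43%, Order-Status 4%, Delivery 4%, Stock-Level 4%, with the remainder going to New-Order, the transaction the tpm figures count). A real TPC-C driver must also honor keying and think times, which are omitted here; the function name is invented for this sketch.

#include <stdlib.h>

typedef enum { NEW_ORDER, PAYMENT, ORDER_STATUS, DELIVERY, STOCK_LEVEL } TxnType;

TxnType NextTransaction(void)
{
    int roll = rand() % 100;             /* 0..99 */
    if (roll < 43) return PAYMENT;       /* 43%   */
    if (roll < 47) return ORDER_STATUS;  /*  4%   */
    if (roll < 51) return DELIVERY;      /*  4%   */
    if (roll < 55) return STOCK_LEVEL;   /*  4%   */
    return NEW_ORDER;                    /* remaining 45% */
}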

Quad-Processor TPC-C Test

Two audited TPC-C tests were performed on Compaq ProLiant servers. The Windows NT Server 3.51 test used a ProLiant 4500 with four 133MHz Pentium processors, 1GB RAM, and 46 disk drives. The Windows NT Server 4.0 test used a Compaq ProLiant 5000 with four 166MHz Pentium Pro processors, 2GB RAM, and 91 4.2GB disk drives on six two-channel drive controllers. The Windows NT Server 3.51 and SQL Server 6.0 test used the equivalent of 2500 users; the Windows NT Server 4.0 and SQL Server 6.5 test used the equivalent of 5000 users. The results are combined in the following graph to show how they compare.

[Chart: TPC-C throughput and cost per transaction, Windows NT Server 3.51/SQL Server 6.0 versus Windows NT Server 4.0/SQL Server 6.5]

There are two things to note in these results: the number of transactions per minute and the cost of those transactions. Windows NT Server 3.51 and Microsoft SQL Server 6.0 achieved a maximum throughput of 2455 transactions per minute at a cost of $242 per transaction. By comparison, Windows NT Server 4.0 and Microsoft SQL Server 6.5 achieved a maximum throughput of 5677 transactions per minute at a cost of $136 per transaction. This means that the combination of Windows NT Server 4.0 and Microsoft SQL Server 6.5 offers roughly 130% more transactions per minute (5677 versus 2455) than the combination of Windows NT Server 3.51 and Microsoft SQL Server 6.0, at about 56% of the cost per transaction ($136 versus $242). This test also highlights the fact that overall scalability is a function of both the operating system and the application; good scalability is achieved by the two together, not by either separately.

Web Server Benchmark

What is Webstone 1.1?

Webstone is recognized as the current industry standard for measuring Web server performance. It runs exclusively on clients, makes all measurements from the point of view of the clients, and is independent of the server software. Thus, Webstone is suitable for testing the performance of any Web server, regardless of architecture, and for testing all combinations of Web server, operating system, network operating system, and hardware. It was developed by Silicon Graphics (SGI) and is freely available to anyone from the SGI Web site.

The Webstone software, controlled by a program called WebMASTER, runs on one of the client workstations but uses no test network or server resources while the test is running and places only a minimal burden on each client. Each Webstone client is able to launch a number of children (called "Webchildren"), depending on how the system load is configured. Each of the Webchildren simulates a Web client and requests information from the server based on a configured file load. The tests conducted for this report used four workstations to run the client software. Each workstation simulated the same number of clients and used the identical file-list.

The tests for straight HTML performance were all run using the same load distribution (Webstone Silicon Surf model) and using eight different client loads (16, 32, 48, 64, 80, 96, 112, and 128 clients). The load distribution was generated from the filelist.ss file, which is incorporated within Webstone. By means of the CGI and API interfaces, Webstone requests that a small program be run on the Web server to generate a random string of characters that is sent back to the client. Typically a CGI program is a .exe file and an API program is a .dll file; the file is loaded when the Web server starts up under Windows NT. Based upon the Webstone 1.1 specification, the operation of the test program (either CGI or API) is identical for all Web servers, though the actual code will differ between ISAPI and NSAPI and between a Windows NT platform and a UNIX platform. The virtual equivalency of the program across all platforms and servers was assured by inspection of the source code.
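
The following is a minimal, hypothetical stand-in for the kind of CGI test program described above: it emits a CGI response header and then a random string of characters. It is not the actual Webstone test code, and the 1024-character length is arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    int i;

    printf("Content-type: text/plain\r\n\r\n");  /* CGI response header */

    srand((unsigned)time(NULL));
    for (i = 0; i < 1024; i++)
        putchar('A' + rand() % 26);              /* random letters */
    putchar('\n');
    return 0;
}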

The API and CGI tests were run at three load distributions (light, medium, and heavy) for each of the eight client configurations. The load points varied the proportion of client requests between dynamic requests (requiring a call to a CGI or API routine) and static requests for an HTML document.

Webstone is a configurable benchmark that allows performance measurement of the server in the following ways:

  • Average and maximum connect time

  • Average and maximum response time

  • Data throughput rate

  • Number of pages retrieved

  • Number of files retrieved

Single Processor Test

This Webstone test compares Windows NT Server 3.51 and Internet Information Server (IIS) 1.1 to Windows NT Server 4.0 and IIS 2.0, each running on a Hewlett-Packard NetServer LS with a single Pentium 133MHz processor, 1MB L2 cache, 32MB RAM, and a 2GB disk. Both machines are optimized for Web performance: all services are stopped except the ones used by the web server. The test is run with 8 real client systems, and the results represent the maximum connections per second for each set of tests run.

[Chart: Webstone 1.1 maximum connections per second for HTML, 100% CGI, and 100% ISAPI tests, Windows NT Server 3.51/IIS 1.1 versus Windows NT Server 4.0/IIS 2.0]

As you can see, three different tests were run, and the results for Windows NT Server 4.0 with IIS 2.0 are substantially better than those for Windows NT Server 3.51 with IIS 1.1. In the HTML tests, Windows NT Server 4.0 and IIS 2.0 outperformed Windows NT Server 3.51 and IIS 1.1 by as much as 38.5%. In the 100% CGI tests, Windows NT Server 4.0 and IIS 2.0 outperformed Windows NT Server 3.51 and IIS 1.1 by as much as 18%. Likewise, in the 100% ISAPI (Internet Server Application Programming Interface) tests, Windows NT Server 4.0 and IIS 2.0 outperformed Windows NT Server 3.51 and IIS 1.1 by as much as 18%. There are two important things to note in this set of tests. First, Windows NT Server 4.0 and IIS 2.0 have gained tremendously in performance over Windows NT Server 3.51 and IIS 1.1, making the combination the best Internet/intranet platform available. More importantly, however, Windows NT Server and IIS using Microsoft's ISAPI outperform Windows NT Server and IIS running CGI scripts by nearly 400%. ISAPI allows the request-handling code to run as part of a multithreaded application on top of Windows NT Server; CGI does not. As a result, Internet Information Server is better able to exploit the Windows NT Server platform when it is running natively and using ISAPI instead of CGI scripts.
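
To make the CGI/ISAPI distinction concrete, the sketch below shows the skeleton of an ISAPI extension. GetExtensionVersion and HttpExtensionProc are the standard ISAPI entry points; the response body is a simplified placeholder, not the Webstone test program. Because the extension is a DLL loaded into the IIS process and invoked on IIS's own worker threads, no new process is created per request, which is the source of the performance difference described above.

#include <windows.h>
#include <httpext.h>

BOOL WINAPI GetExtensionVersion(HSE_VERSION_INFO *pVer)
{
    pVer->dwExtensionVersion = MAKELONG(HSE_VERSION_MINOR, HSE_VERSION_MAJOR);
    lstrcpyn(pVer->lpszExtensionDesc, "Sample ISAPI extension",
             HSE_MAX_EXT_DLL_NAME_LEN);
    return TRUE;
}

DWORD WINAPI HttpExtensionProc(EXTENSION_CONTROL_BLOCK *pECB)
{
    static const char body[] = "Hello from inside the IIS process\r\n";
    DWORD cb = sizeof(body) - 1;

    /* Send a standard response header, then the body. This runs on one of
       IIS's pool threads; no separate process is spawned per request. */
    pECB->ServerSupportFunction(pECB->ConnID, HSE_REQ_SEND_RESPONSE_HEADER,
                                "200 OK", NULL,
                                (LPDWORD)"Content-Type: text/plain\r\n\r\n");
    pECB->WriteClient(pECB->ConnID, (LPVOID)body, &cb, 0);
    return HSE_STATUS_SUCCESS;
}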

Scalability Issues

How to Achieve Greater Scalability with Windows NT Server

One of the major goals of Windows NT Server is to provide a robust and scaleable symmetric multiprocessing (SMP) operating system. By harnessing the power of multiple processors, SMP can deliver a large boost to the performance and capacity of both workstation and server applications. Scaleable SMP client/server systems can be very effective platforms for downsizing traditional mainframe applications. SMP also makes it easy to increase the computing capacity of a heavily loaded server by adding more processors. The Windows NT operating system provides a great platform to base server applications on, but the operating system is not solely responsible for performance and scalability. Applications hosted on the Windows NT operating system must also be designed with these goals in mind. In addition, the problem the application is trying to solve must be a scaleable problem, and the application must be run on a hardware platform whose raw scalability meets the requirements of the application.

The Windows NT platform provides many advanced features to make the development of efficient, scaleable applications easier. Other SMP operating systems have some of these features, but a few are unique to the Windows NT platform. Understanding and using these features is the key to realizing the full potential of Windows NT SMP in your application.

Note that the full potential of the application does not mean perfect linear scaling. The full potential of an application is very much limited by the problem the application is attempting to solve, and by the hardware platform upon which the application is being run. This is an often misunderstood aspect of scalability and one that this article addresses.

Resource Bottlenecks

A resource bottleneck is a resource in such demand that it prevents an application or a system from reaching its full potential. The easiest cure for a resource bottleneck is to add more of the resource and hope that the application is able to make proper use of the addition. If it is, the result will be either no more resource bottlenecks or a resource bottleneck that has moved elsewhere.

If the resource bottleneck limiting an application from reaching its "full potential" is the processor, there are two ways of eliminating or reducing the bottleneck. The simplest approach is to increase the performance of the processor. The other is to add additional processors.

If the solution involves adding additional processors, the application must be able to properly use the additional processors. On the Windows NT platform, this usually means that the application must be multithreaded.

Sample Application

Suppose that an application had to be written to sum four million 32-bit values. The simplest possible solution would be to execute the following pseudo-code:

Result = 0;                          /* running sum */
CurrentCellVector = &ArrayBase[0];   /* point at the first cell */
NumberOfCells = 4000000;
for (I = 0; I < NumberOfCells; I++) {
    Result += *CurrentCellVector++;  /* accumulate each cell into the sum */
}

It is pretty clear that this simple application will use 100% of a processor. In this case, the application is bumping into a resource bottleneck in the processor. Adding a faster processor will clearly make the application run faster.

If the application is restructured into a multithreaded application, it can execute the above loop in parallel on multiple processors. To do this, the application would determine how many processors were available, divide the array into a portion for each processor and/or thread, and then assign the threads to begin summing their part of the array. After all threads have completed their work, their individual results would be folded together to form the complete answer. Microsoft has created a sample application (SMPSCALE) that performs these exact steps. It counts the number of processors on a system, performs the work using one processor, then using two processors, and continues until it has done the work using all of the processors, timing how long it takes to arrive at the complete sum each time. These times can then be used to see how much faster the problem can be solved with, say, four processors than with one.
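
The following is a minimal sketch of that partitioning scheme using the Win32 thread API. It is illustrative only, not the actual SMPSCALE source; the Slice structure and the SumSlice and ParallelSum names are invented here.

#include <windows.h>

#define NUM_CELLS   4000000
#define MAX_THREADS 32

static long Array[NUM_CELLS];

typedef struct {
    long *base;      /* first cell this thread sums  */
    long  count;     /* number of cells in its slice */
    long  partial;   /* this thread's partial result */
} Slice;

static DWORD WINAPI SumSlice(LPVOID param)
{
    Slice *s = (Slice *)param;
    long sum = 0, i;

    for (i = 0; i < s->count; i++)
        sum += s->base[i];        /* accumulate into a local variable */
    s->partial = sum;             /* publish the partial result once  */
    return 0;
}

long ParallelSum(int threads)
{
    HANDLE h[MAX_THREADS];
    Slice  s[MAX_THREADS];
    long   chunk = NUM_CELLS / threads, total = 0;
    DWORD  tid;
    int    i;

    for (i = 0; i < threads; i++) {
        s[i].base  = Array + i * chunk;
        s[i].count = (i == threads - 1) ? NUM_CELLS - i * chunk : chunk;
        h[i] = CreateThread(NULL, 0, SumSlice, &s[i], 0, &tid);
    }
    WaitForMultipleObjects(threads, h, TRUE, INFINITE);  /* join all threads */
    for (i = 0; i < threads; i++) {
        total += s[i].partial;    /* fold the partial results together */
        CloseHandle(h[i]);
    }
    return total;
}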

Application Results

There is a common belief that adding a processor results in a perfect linear increase in performance; this increase is commonly called scalability. The data below shows that even for the simplest problem, perfect scalability cannot be demonstrated across a wide variety of machines. The following table shows the raw data obtained by running SMPSCALE on a wide variety of machines:

[Table: SMPSCALE raw timing data for a variety of machines and processor counts]

The scalability of this problem from one processor to two processors ranges from 1.48 to a perfect score of 2.0. Scalability from one to three processors ranges from 1.83 to a nearly perfect 2.92. Scalability from one processor to four processors ranges from 2.01 to a nearly perfect 3.89.

The following table is another way to view the same data. It shows how much of an additional processor's capacity is actually realized by adding that processor. In a perfect system, each added processor would contribute exactly one processor's worth of performance; less-than-perfect scaling shows a value of less than one. For example, a two-processor speedup of 1.48 means that the second processor contributed only 0.48 of a processor.

[Table: incremental performance realized per added processor, derived from the SMPSCALE timings]
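
As a sketch of how the values in these tables are derived (the function names are invented here, and times[] would hold the measured run times from the raw data, with times[0] being the one-processor time):

double Speedup(const double times[], int n)          /* n >= 1 */
{
    return times[0] / times[n - 1];                  /* time(1) / time(n) */
}

double AddedProcessor(const double times[], int n)   /* n >= 2 */
{
    /* Contribution of the n-th processor; 1.0 would be perfect scaling. */
    return Speedup(times, n) - Speedup(times, n - 1);
}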

Analyzing the Results

In analyzing the results, we see very clearly that almost all machines scale well up to three processors; scaling beyond three processors is rare. If we look closely at the first two machines, we see that the machine with the slower processors seems to scale better than the machine with the faster processors.

For this particular problem, restructuring the problem into a multithreaded application and running it on SMP hardware initially eliminates the processor as the bottleneck. After three processors, however, the processor is no longer the bottleneck and the scaling drops off sharply. The new bottleneck is the memory-to-processor data bus. This bus has limited bandwidth, and once it becomes saturated, processors start waiting for data and are therefore no longer operating at their maximum speed.

Once the memory bus bottleneck is understood, it helps to explain why slower processors scale better than faster processors (on the exact same bus/machine architecture). The slower processors do not generate the same load on the memory bus that the faster processors do, so it takes more processors to saturate the memory bus. The slower processors compute the results more slowly than the faster processors, but once the fast processor machine saturates the memory bus, it will not speed up. This is what explains the lack of scaling between five and six processors and the poor scaling after three processors.

Why is This Important?

Once the processors in a system can generate enough load on the memory bus to saturate it, scalability is very difficult to achieve. Problems (like SMPSCALE) that sweep through large amounts of data will not scale past the limits of the memory bus. Many database scenarios and artificial server benchmarks do, in fact, degrade into problems that sweep through memory.

In the database scenario, in order to achieve scaling, care must be taken in the layout of the database and in the transaction mix so that you avoid saturating the memory bus. Often, the techniques you use on one database platform will differ from the techniques you would use on another.

Caches

From the preceding sections, it almost sounds as if scaling beyond four processors is not possible. This is not the case at all. Scalability beyond four processors is possible, but it takes a lot of work, and even then scaling is not guaranteed.

Computers running the Windows NT operating system generally have a fast memory cache between the CPU and main memory. This takes advantage of memory access locality to allow most of the CPU's memory references to complete at the speed of the fast cache memory instead of the much slower speed of main memory. Without this cache, the slower speed of DRAM memory would cripple the performance of modern high-speed processors.

In SMP systems, cache memory has an additional function that is vital to system performance. Each processor's memory cache also insulates the main shared memory bus from the full memory bandwidth demand of the combined processors. Any memory access that the cache can satisfy does not burden the shared memory bus, leaving more bandwidth available for the other processors. Of course, for caches to be effective in isolating the memory bus, an application has to have good locality of reference, which allows a large portion of its memory references to be satisfied out of the cache rather than out of main memory. For memory references such as an application's code or a thread's stack, locality is relatively easy to achieve. For the data being accessed by an application, achieving good locality is not always possible. If the problem is structured so that the application must sweep through all of the records in a database, or the problem is a large spreadsheet recalculation, good data locality might be difficult or impossible to achieve, yielding scalability results similar to SMPSCALE.
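
The effect of locality is easy to illustrate. In the sketch below (invented for this guide), both functions touch the same data, but the row-order sweep walks memory sequentially and is served largely from the cache, while the column-order sweep strides thousands of bytes between accesses and pushes far more traffic onto the shared memory bus.

#define ROWS 1024
#define COLS 1024

static long grid[ROWS][COLS];

long SumRowOrder(void)             /* good locality */
{
    long sum = 0;
    int r, c;

    for (r = 0; r < ROWS; r++)
        for (c = 0; c < COLS; c++)
            sum += grid[r][c];     /* consecutive addresses */
    return sum;
}

long SumColumnOrder(void)          /* poor locality */
{
    long sum = 0;
    int r, c;

    for (c = 0; c < COLS; c++)
        for (r = 0; r < ROWS; r++)
            sum += grid[r][c];     /* thousands of bytes between accesses */
    return sum;
}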

SMP systems that provide separate caches for each processor introduce additional issues that affect application performance. Memory caches must maintain a consistent view of memory for all processors. This is accomplished by dividing memory into small chunks (called cache lines) and tracking the state of each chunk that is present in one of the caches. In order to update a cache line, a processor must first gain exclusive access to it by invalidating all other copies in the other processors' caches. Once the processor has exclusive access to the cache line, it can safely update it. If the same cache line is continuously updated from many different processors, the line bounces from one processor's cache to another. Since a processor cannot complete its write instruction until its cache acquires exclusive access to the cache line, it must stall. This behavior is called cache sloshing, since the cache line "sloshes" from one processor's cache to another.

Multiple threads continuously updating global counters commonly cause cache sloshing. Counters can easily be fixed by keeping separate variables for each thread and summing them only when required. SMPSCALE demonstrates this effect by summing each thread's result either into a local variable or into a per-thread global variable that is likely to share a cache line with other threads. Raw data obtained from SMPSCALE shows that with severe cache sloshing, scalability decreases as processors are added. Different hardware platforms degrade differently, some worse, some better.
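
A minimal sketch of the counter fix follows. The padded layout assumes a 32-byte cache line, typical of the processors discussed here (actual line sizes vary by machine), so that each thread's counter lives on its own line and no two threads ever write the same line.

#define CACHE_LINE 32
#define NTHREADS   8

/* Sloshing-prone: adjacent counters share cache lines, so every update
   invalidates the line in the other processors' caches. */
volatile long packedCounters[NTHREADS];

/* Fixed: pad each counter out to a full cache line. */
typedef struct {
    volatile long count;
    char pad[CACHE_LINE - sizeof(long)];
} PaddedCounter;

static PaddedCounter counters[NTHREADS];

/* Fold the per-thread values together only when the total is needed. */
long TotalCount(void)
{
    long total = 0;
    int i;

    for (i = 0; i < NTHREADS; i++)
        total += counters[i].count;
    return total;
}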

Conclusion

It is clear from the results of these NSTL, TPC-C, and Webstone tests that Windows NT Server offers strong scalability: it takes advantage of added CPU power as an applications server, and it scales under heavy client loads both for file and print services and for serving up web pages.

For more information

To access information via the World Wide Web, go to https://www.microsoft.com and select Microsoft BackOffice or the Windows NT Server Forum on the Microsoft Network (GO WORD: MSNTS).

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the publication date.

This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.

© 1996 Microsoft Corporation. All rights reserved.

Microsoft, MS-DOS, Windows, Win32, and Windows NT are registered trademarks and BackOffice and the BackOffice logo are trademarks of Microsoft Corporation.

AT&T is a registered trademark of American Telephone and Telegraph Company. Macintosh is a registered trademark of Apple Computer, Inc. dBASE and Paradox are registered trademarks of Borland International, Inc. Compaq and ProLiant are registered trademarks and NetFlex is a trademark of Compaq Computer Corporation. WordPerfect is a registered trademark of Corel Systems Corporation. DEC is a trademark of Digital Equipment Corporation. Intel and Pentium are registered trademarks of Intel Corporation. IBM and OS/2 are registered trademarks of International Business Machines Company. Lotus, 1-2-3 and Freelance are registered trademarks of Lotus Development Corporation. cc:Mail is a trademark of cc:Mail Inc., a wholly owned subsidiary of Lotus Development Corporation. NEC is a registered trademark of NEC Corporation. NCR is a registered trademark of NCR Corporation. Novell and NetWare are registered trademarks of Novell, Inc. SCO is a registered trademark of Santa Cruz Operation, Inc. Sequent is a registered trademark of Sequent Computer Systems. Harvard Graphics is a registered trademark of Software Publishing Corporation. UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company, Ltd. NetBench and ServerBench are registered trademarks of Ziff-Davis Publishing Company.

National Software Testing Laboratories Inc. (NSTL) is a division of McGraw Hill. NSTL makes no recommendation or endorsement of any product. The results of the NSTL tests as presented in this brochure were prepared by NSTL.

The BAPCo network load used in the testing described in this brochure is a released version of the software which is publicly available. The BAPCo committee makes no recommendation or endorsement of any product.

Part No.

1 tpmC and TPC-C are registered certification marks of Transaction Processing Performance Council.