General Performance Considerations (Web and Application Server Infrastructure - Performance and Scalability)

Applies To: Windows Server 2003 with SP1

This section contains general performance considerations related to processing HTTP.

Networking

Networking is a key consideration for any high-volume server, and recent hardware trends make it even more so. With larger numbers of processors per server and faster individual CPUs, the network I/O subsystem can easily become the bottleneck for a large, powerful Web or application server.

Consequently, the network I/O subsystem of a large server requires design, planning, and monitoring. If a network card becomes a bottleneck, faster processors, more processors, or more RAM will not help your server achieve higher throughput.

How do you know that a network card is a bottleneck on a system?

The most obvious sign can be found in Windows Task Manager, under the Networking tab. Select the network interface you want to view and check its Network Utilization statistic. If it is close to 100 percent, the card is likely a bottleneck for your application.

Next, check the CPU utilization and context switches per second. If CPU utilization is low and context switches per second are also low, the server is likely experiencing a networking bottleneck.
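
As a quick spot-check from the command line, the following typeperf command samples the relevant counters every five seconds (English counter names are assumed; press CTRL+C to stop):

typeperf "\Network Interface(*)\Bytes Total/sec" "\Processor(_Total)\% Processor Time" "\System\Context Switches/sec" -si 5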

What kind of maximum throughput can I get out of my network card?

Network Card Speed      Megabytes / Sec
100 megabit (100Mbps)   ~11
1 gigabit (1000Mbps)    ~60

These general, rule-of-thumb numbers are derived from IIS team experimentation across a wide variety of hardware and network card combinations. Two factors that can also affect the overall throughput of a network card are the speed of the I/O bus and the number of cards placed on that bus. For example, on a standard symmetric multiprocessing (SMP) 8-processor server with multiple cards, gigabit cards plugged into a 33-MHz root bus maxed out at around 30 MB/sec, whereas on a 66-MHz root bus the same gigabit cards could achieve 50-60 MB/sec.

The bottom line is that if network card throughput is vital, then bus speeds, network card selection, and testing of the overall system are equally vital to understanding the possible throughput characteristics of that system.

Windows, Networking, and Large Multiprocessor Servers

As discussed in the Networking section, it is important to plan for the amount of data throughput you expect from your server. When running on a large multiprocessor server, you also need to understand a little about Windows system internals in order to maximize your throughput.

From a design standpoint, the Windows networking stack performs a large part of its processing of a network card's I/O on a single processor. Therefore, if you have an intensive networking load (many Web scenarios fit this category, with thousands of requests per second and very little CPU time spent per request), you may see uneven processor utilization. Worse, one CPU can become the bottleneck for the whole server: it runs at 100 percent utilization while the other processors sit idle, waiting for it to hand off network I/O so they can process the next request.

It can be the case that the CPU handling the network interrupts is at 100 percent even though the card is not physically saturated. The variables here are the efficiency and quality of the network card manufacturer's driver, the speed of the processors in the server, offloading support (or the lack of it) in the network card, and the physical capacity of the card itself. You can rectify this situation by adding an additional network card and splitting the incoming load across the two cards.

There are additional considerations for networking on large multiprocessor servers covered in the Multiprocessor section.

HTTP Compression

HTTP compression is not often considered a server throughput-enhancing tool, but if your sites use large amounts of bandwidth, or if you would like to use bandwidth more effectively, consider enabling HTTP compression. HTTP compression provides faster transmission time between compression-enabled browsers and IIS. You can compress static files only, or both static files and dynamic application responses. If your network bandwidth is restricted, HTTP compression can be beneficial, at least for static files, unless your processor utilization is already extremely high.

HTTP compression can dramatically improve the latency of responses while also improving the throughput capacity of the server.

As shown in the Networking section above, it is very easy for a network card to become a bottleneck on a server. Compressing responses means fewer bytes to send for the same response. Network cards do not discriminate on the actual bytes; therefore, to serve the same load, the network cards can push through many more compressed responses per second before they become a bottleneck.

For more information about enabling HTTP compression, see Utilizing HTTP Compression in the IIS 6.0 Help.

Compressing Static Content

IIS 6.0 stores the compressed representation of a static file securely on persistent storage and serves the compressed version to clients that request the static file and ask for the content in compressed form. This is highly efficient: for the lifetime of the content, the compressed version is served to clients that ask for it without the CPU having to compress it on each request.

A further optimization for static content is that if the heuristic for caching the item in the kernel is met, the compressed version of the item is stored in the kernel. This has the added advantage that less memory is used to cache the item and it is served in the most efficient way possible.

As an example of the gains that compression can provide, the following test was devised. The hardware was a 1-processor, 3-GHz hyper-threaded (HT) server with one 100-megabit network card. The content was a 10K static .xml file.

                 Non-Compressed   Compressed
Requests / Sec   1,170            2,493
CPU %            12.2%            19.7%
Network Card %   99%              99%

In this case, throughput increased by 113 percent. The additional CPU used in the compressed case was not spent on compression (the compression for the file was done only once); rather, it was spent processing the additional 1,323 requests per second over the non-compressed case.

Controlling Static File Types to Be Compressed

When compression is enabled on IIS 6.0, by default only files with the extensions .htm, .html, and .txt are compressed. This is quite a limited list, and other static file types can benefit dramatically from compression, such as .xml, .css, and .js. You should ensure that these extensions are also in the compression list for the IIS compression filter. To do this, execute the following commands at a command prompt:

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs set W3SVC/Filters/Compression/deflate/HcFileExtensions htm html txt css js xml

and

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs set W3SVC/Filters/Compression/gzip/HcFileExtensions htm html txt css js xml

If your implementation has its own static file types, feel free to add their extensions to this list as well.

Note

Images in already-compressed formats (.gif) do not tend to gain much from being further compressed by IIS, hence their exclusion from this list. If an extension is excluded from this list, however, files with that extension will not be cached in the kernel. This is an area where extension inclusion should be considered, dependent on the Web site's content.

Figure 3: IIS 6.0 Compression Dialog

Compressing Dynamic Content

You can also enable compression for the output of dynamic content. This is done through the IIS configuration dialog (see Figure 3 above) by enabling the Compress Application Files option. The downside of compressing dynamic content is that IIS 6.0 must run its compression routines on every response, which has a cost in CPU consumption.

From a business justification perspective, you would enable compression for dynamic content if latency for end clients is an issue and the extra CPU utilization is acceptable within the operational guidelines for the server. Implementations with large numbers of slow client connections, or with mobile device clients on wireless networks that typically have high propagation delays, are scenarios where dynamic compression could be used.

The test data below provides an indication of the impact of switching on dynamic content compression. In the test, an ISAPI extension was written that returned a 10K XML-formatted response. The test with compression on was purposely gated at a maximum of 50 percent CPU utilization to simulate an acceptable level of CPU use on a server. The test without compression was run at the same request rate to show the difference in CPU utilization. Testing was done on a 2-GHz Intel HT computer with 512 MB RAM and one 100-megabit network card.

                 Non-Compressed   Compressed
Requests / Sec   1,051            1,074
CPU %            23%              50%
Network Card %   99%              31%

At roughly the same request rate, dynamic content compression (of an XML-formatted response) consumed approximately twice the CPU of the non-compressed case. In return, less data is sent between client and server, and the network card is utilized far more efficiently.

As you can see, compression for dynamic content is costly and should be carefully evaluated before being enabled on a production server.

Compression Encoder Selection

Compression encoder strength refers to the technique Windows uses to compress the data stream. You have the option of using a very strong encoder, which typically produces a compressed stream smaller than the one produced by the standard encoder; however, a stronger encoder generally consumes more CPU time. The administrator should weigh the overall goals of the Web or application server against the type of content being compressed.

The strength range for the encoder is between 0 and 10, with the following breakdown:

  • 0-3 fast encoder

  • 3-9 standard encoder

  • 10 optimal encoder

By default, static content is compressed at level 10 (optimal) and dynamic content is compressed at level 0 (single pass) to be most efficient in CPU utilization, based on how frequently these routines are called. These parameters can be modified in the IIS metabase. To change the static compression encoder strength, open the MetaBase.xml file with a standard text editor and change all instances of the HcOnDemandCompLevel parameter to the desired value. Similarly, to change the compression encoder strength for all dynamic content, search for all instances of the HcDynamicCompressionLevel parameter and change them to the desired value.
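
Alternatively, the same properties can be set from the command line with adsutil.vbs. The following is a hedged sketch, not a tuning recommendation: the level of 4 is an arbitrary example, and it is assumed that the gzip and deflate compression schemes should be changed together:

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs set W3SVC/Filters/Compression/gzip/HcDynamicCompressionLevel 4

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs set W3SVC/Filters/Compression/deflate/HcDynamicCompressionLevel 4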

Compression at a Site Level Rather Than Global

IIS 6.0 also allows compression to be enabled at the site level rather than globally for the whole server. The properties to manipulate on a per-site basis are:

  • DoStaticCompression

  • DoDynamicCompression

For more information about enabling compression on a per-site basis, see DoStaticCompression and DoDynamicCompression in the IIS 6.0 Help.
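
As a hedged example (the site ID of 1 is an assumption; substitute the ID of the site you want to configure), the following commands enable both forms of compression for a single site:

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs set W3SVC/1/Root/DoStaticCompression TRUE

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs set W3SVC/1/Root/DoDynamicCompression TRUE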

TCP/IP Configuration and Connections

Windows Server 2003 can receive Web traffic via different physical TCP/IP addresses, different TCP/IP ports, or different host headers (DNS names). A host header can be used to differentiate one site (or application) from another.

If you are working in a server consolidation or dense hosting scenario, the favored approach from a system resource usage perspective is to use host headers, rather than a separate TCP/IP address for each site or application. Although it is possible to go higher, you should limit the number of TCP/IP addresses per system to around 2,000, whereas it is possible to support tens of thousands of host headers per system.
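
For illustration, a site can be bound to a host header with adsutil.vbs. This is a hedged sketch, assuming site ID 1 and the hypothetical host name www.example.com; the ServerBindings value uses the IP:Port:HostHeader format, with an empty IP meaning all addresses:

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs set W3SVC/1/ServerBindings ":80:www.example.com"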

Caching

Caching is an important aspect of a Web platform. Caching can greatly improve the throughput of a server and take pressure off back-end systems by avoiding the work of completely recreating the response for every request. Figure 4 represents the caches available in Windows Server 2003 for Web sites and applications.

Figure 4: Windows application server caches (kernel and user-mode)

HTTP.sys Kernel-Mode Response Cache

HTTP.sys can cache responses to HTTP GET requests in its kernel-mode response cache. The kernel-mode response cache is PAE-aware, which means that it can contain a maximum of 64 GB of information on an x86 system. For 64-bit systems, HTTP.sys has access to a full 64-bit address space and, at the time of writing, the Windows Server 2003 64-bit edition supports systems with up to 512 GB of RAM.

If the kernel-mode response cache is enabled (it is enabled by default), HTTP.sys always consults it for each received request to see if a cached response is available. If a cached response is available, it is served directly, without transitioning the processor to user mode.

HTTP.sys employs a scavenger model for the kernel-mode response cache. The scavenger runs periodically and clears out cache items that have not been accessed recently. If an item is cleared from the kernel-mode response cache, HTTP.sys sends the next request for that item to the user-mode program that originated the initially cached response.

Most configuration of HTTP.sys is done through IIS Manager and the IIS metabase. However, the following items can be added to the registry to override the behavior of the kernel-mode response cache. The registry location for these values is:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Http\Parameters

Parameter / Range / Comment

UriEnableCache
(DWORD) 1=on; 0=off
If non-zero, the kernel-mode response cache is enabled.

UriMaxCacheUriCount
(DWORD) 0 - 0xFFFFFFFF
Determines how many responses may be cached. If zero, there is no limit.

UriMaxCacheMegabyteCount
(DWORD) 0 - 0xFFFFFFFF
Determines how much memory may be used by the kernel-mode response cache. If zero, a heuristic is used.

UriScavengerPeriod
(DWORD) 10 - 0xFFFFFFFF, in seconds
Determines the frequency of the cache scavenger. The scavenger runs once every UriScavengerPeriod seconds; any response or fragment that has not been accessed in at least UriScavengerPeriod seconds is flushed.

UriMaxUriBytes
(DWORD) 4096 (4 KB) - 16777216 (16 MB), in bytes
This parameter is used when HTTP.sys tries to cache a response in the Uri cache (hence the name UriBytes). If an item's size exceeds this value, it will not be cached.
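
As a hedged sketch of setting one of these values from the command line (the 120-second scavenger period is only an example, and it is assumed that the HTTP service must be restarted before the change takes effect):

reg add HKLM\SYSTEM\CurrentControlSet\Services\Http\Parameters /v UriScavengerPeriod /t REG_DWORD /d 120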

User-Mode Static-File Cache

In addition to the kernel-mode response cache, IIS 6.0 employs a user-mode static-file cache. The user-mode static-file cache contains all of the policy decisions about what to cache and when to cache it. It has a superset of the kernel cache's functionality and acts as a backup for the HTTP.sys cache in situations where HTTP.sys is unable to cache an item. For example, when an administrator enables authentication, content in the virtual directory cannot be cached in the kernel; therefore, the user-mode static-file cache is used.

From a performance perspective, the user-mode static-file cache cannot match the performance of the kernel-mode response cache, because the code path through the server is longer and the processor must transition between modes on each request. The table below shows data from a simple test performed on a 2-processor, 1-GHz server with one gigabit network card. The test requests the same 2K static file; in one case the item is served from the user-mode static-file cache, and in the second case from the kernel-mode response cache.

                 User-Mode Static-File Cache   Kernel-Mode Response Cache   Difference
Requests / Sec   8,321                         17,689                       +113%
CPU %            100.00                        92.83                        -7.17
KB / Sec         120,906                       257,159                      +113%

The user-mode static-file cache is capable of caching everything that is cached in the kernel-mode response cache, but it has additional caching capabilities that are useful in other scenarios. For example, cacheable static files requested by authenticated clients are served from the user-mode static-file cache. It would be incorrect for IIS to serve a static file to a user without checking that the user has rights to the content, and all access checks are done in user mode.

Caching Policy

Counterintuitively, it turns out that caching everything that can be cached is not as helpful as caching according to need. You should cache only what is actually receiving lots of requests, rather than caching everything. The reason is that populating a cache can be quite an expensive operation, and managing a cache with thousands or tens of thousands of items in it can itself be CPU-intensive and wasteful.

IIS 6.0 tracks the hit distribution of content, and if the content proves popular and worthy of being cached, IIS 6.0 puts the content in a cache. When IIS 6.0 makes this decision, it first populates the user-mode static-file cache with the content and then, optionally, depending on the circumstances, puts the item in the kernel-mode response cache.

When an item is placed in the kernel-mode response cache, all hits for it are served directly by the kernel. What will likely happen then is that the user-mode static-file cache entry will time out (after 30 seconds) and be dropped from the user-mode static-file cache, as the item in the kernel takes the requests.

Hot Content Determination and Tuning

IIS 6.0 determines that content is hot by counting the number of times it is requested. If a cacheable static file is requested twice within a time period (referred to as the ActivityPeriod, 10 seconds by default), IIS 6.0 caches the content. As mentioned above, if the item in the user-mode static-file cache does not receive another hit within 30 seconds (the default), the item is dropped from the user-mode static-file cache.

User-Mode Static-File Cache Parameters

These parameters should be modified with care, because an incorrect setting can have a drastic effect on the performance of the server.

The registry location for these values is:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Inetinfo\Parameters

Parameter / Range / Default / Comment

MaxCachedFileSize
(DWORD) n/a; default: 256 KB
If you are running large, dedicated Web servers, you may want to add this value to the registry to increase the file size that the cache can hold.

MemCacheSize
(DWORD) 0 - 2500 MB; default: approximately one-half of available physical memory
The default value is adjusted dynamically every 60 seconds.

ObjectCacheTTL
(DWORD) 0 - 4,294,967,295 (unlimited); default: 30 seconds
Setting this value to 0xFFFFFFFF allows a cached object to remain in the cache until it is overwritten. This is appropriate if your server has enough system memory and is handling data that is relatively static.

ActivityPeriod
(DWORD) 0 - 4,294,967,295 (unlimited); default: 10 seconds
If this value is set to 0, IIS always caches files.
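
A hedged example of raising the maximum cacheable file size to 1 MB follows; the value is assumed to be expressed in bytes, and the WWW service is assumed to need a restart before the change takes effect:

reg add HKLM\SYSTEM\CurrentControlSet\Services\Inetinfo\Parameters /v MaxCachedFileSize /t REG_DWORD /d 1048576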

Dynamic Response Cache

Web applications built using ASP.NET and ISAPI can cache responses and have subsequent requests for content served directly from the cache (rather than regenerating the content). This can be a huge performance gain, because caching can save significant CPU cycles and markedly improve overall response time.

By default, the kernel-mode cache can store dynamic content responses, but applications must be written (or configured by developers) to take advantage of this cache. For information about the ASP.NET output cache, see the section entitled Kernel-Mode Response Cache; for information about how ISAPI applications can leverage the kernel-mode response cache, see the section entitled Caching Responses from Dynamic Content in the Kernel.

Logging

From an implementation perspective, logging has also changed in IIS 6.0. The log formats have not changed, nor has the general behavior, but the Internet log files are now created by HTTP.sys in kernel mode, rather than by user-mode IIS components. HTTP.sys knows what type of log file to create and which fields are necessary based on information passed down from the IIS components in user mode. The configuration repository for logging stays consistent with previous releases of IIS; log file parameters are stored in the IIS metabase.

New Format in IIS 6.0: Centralized Binary Logging

In addition to the existing W3C, IIS, and NCSA log file formats, IIS 6.0 introduces a new format: centralized binary logging. Centralized means that IIS overrides the existing log file settings on all sites on the server and generates one log file for all of the sites or applications on that server. The log file is generated in a binary format that is optimized to eliminate variable-length strings where possible. The centralized binary log file also has its fields aligned on the right boundaries to maximize I/O performance.

Note

The centralized binary log file does not log all of the fields available in the W3C format. This cannot be changed or overridden, so if you have a need for a specific field that is not included, the binary log file is not an option for you. For more information about enabling centralized binary logging, see Centralized Binary Logging in the IIS 6.0 Help.
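
As a sketch, centralized binary logging is controlled by a single server-wide metabase property, and the WWW service must be restarted for the change to take effect:

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs set W3SVC/CentralBinaryLoggingEnabled TRUE

net stop http /y

net start w3svc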

Why is centralized binary logging interesting?

There are a few reasons to consider using centralized binary logging. For servers that host thousands of sites, having a log file per site can be a significant bottleneck, because disk head access latency can affect the overall throughput of the server (the server struggles to write blocks widely distributed over the physical medium). With centralized binary logging, disk head access time improves: the single log file for the server can be placed on a separate device, and the disk head of that device writes sequential blocks.

The centralized binary logging format is optimized and offers performance gains over the standard text-based logging format, while producing smaller log files.

When IIS is logging in standard text mode, a 64K log buffer is allocated per log file for performance reasons. If you have 20,000 sites on a server, by default you might be allocating over 1 GB of RAM for log file buffers. IIS 6.0 does allow the log file buffer size to be decreased, so with 20,000 sites you might instead peak at a little over 200 MB for the log file buffers. When using centralized binary logging, there is exactly one 64K log buffer for the whole server, regardless of how many sites are on it. Therefore, there can be a significant memory benefit when using centralized binary logging.

Note

The term 'might' is used because the memory allocation numbers quoted above occur only if the server receives hits across all of its sites in a short period of time. Log buffers are timed out depending on how frequently log records are written to the buffer.

Having a single log file can simplify the log file management on a server. You manage one log file rather than tens, hundreds, or thousands of log files per server.

Microsoft has released Log Parser v2.0, which can read the binary log file and generate W3C log files for the individual sites contained within it. With Log Parser, the binary log file can be copied to another computer for processing. Therefore, there is no need to invest CPU cycles formatting the log file at run time; those CPU cycles can be better spent on a different, non-production computer for log file interrogation and reporting. For more information, see the LogParser section of this document.

Logging is an area whose functional importance differs by scenario. For an ISP, logging is a key mechanism for determining how many hits a site has received, for billing and capacity analysis purposes. For an average commerce application, the logs would not be analyzed constantly, but rather stored to help track down issues or perform site analysis if necessary. Based on these differing usage scenarios, operational staff have some options to consider when determining how logging will work for their particular implementation, and some flexibility in how they can affect its overall performance.

The first thing to remember is that the best-performing option is to disable logging altogether; this frees the processing cycles dedicated to generating log files and saves the physical disk space and extra I/O associated with persisting them. Logging is disabled by clearing the Enable Logging option on the Web Site tab of the Web site properties dialog.

Note, however, for many implementations this is not an option.
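
Logging can also be switched off per site from the command line via the DontLog metabase property; a hedged sketch, assuming the site of interest has site ID 1:

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs set W3SVC/1/DontLog TRUE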

Performance

From a performance perspective, the most costly part of logging is the long, variable-length strings. These strings lead to inefficiencies in log buffer use and more physical bytes to write to disk. The main perpetrators in the W3C log are the User-Agent and UserName fields, along with a few others.

If you are running a server with a high throughput rate (peaks of thousands of requests per second), you can get a throughput boost just by removing the variable-length text strings.
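
One hedged way to drop the User-Agent field programmatically is through the LogExtFileFlags metabase property. The flag-name syntax below is an assumption; run the get first and verify the names your system reports before setting anything:

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs get W3SVC/1/LogExtFileFlags

REM The field list below is illustrative and simply omits UserAgent.
cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs set W3SVC/1/LogExtFileFlags "Date|Time|ClientIP|UserName|Method|UriStem|HttpStatus|BytesSent"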

Straight W3C Optimization over W3C Default

The data in the table below is from tests involving the execution of an ASP.NET page on a 900-MHz, 8-processor server. One test was done with the default logging settings; the second test was done with the User-Agent field of the W3C format disabled (Optimized W3C).

                     Default W3C Logging   Optimized W3C
Requests / Sec       5,716                 6,115
CPU %                98.49                 98.93
Log File Size (MB)   263.63                165.02

The difference from removing this one field is quite noticeable: throughput increased by approximately 7 percent, and the resulting log file was 37 percent smaller than the original. Note that the User-Agent value in this case was the string identifying Microsoft Internet Explorer 6.0 (approximately 70 characters).

In this same scenario, the centralized binary log performs very similarly to the optimized W3C log: approximately 7 percent more throughput than the default, and a smaller resulting log file. The reason the binary log file is only similar to the optimized W3C log file in this case is that 100 percent of the requests in the test actually executed the page (in user mode), which means the binary log file has to write the Uniform Resource Identifier (URI) into every line item. When a 4K file was cached and served from the kernel, binary logging increased server throughput by 2 percent over the optimized W3C format, and the log file was 7 percent smaller than the optimized W3C log file.

Parallelism in Logging

When configuring the log file subsystem on a server, you should also consider separating logs for busy sites onto separate devices with separate controllers. The goal is to split the I/O for busy sites so that log buffer writes for one busy site do not queue behind log buffer writes for a different site being written to the same device. If they do, the overall logging process has much higher latency.

To determine the busy sites on a server, there are a few counters you can monitor. In System Monitor, select the Web Service object. In select counters from list, select the Total Method Requests item. Then, in select instances from list, select the sites you want to inquire about.

The Total Method Requests item displays how many requests the site has had since IIS was started. From here you can quickly deduce which sites are the busy sites and start planning the log file subsystem accordingly.
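
The same counter can also be sampled from the command line; a hedged sketch (English counter names assumed):

typeperf "\Web Service(*)\Total Method Requests" -si 10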

Heap Fragmentation

One phenomenon that causes significant performance and scalability issues in ASP/COM+ and ISAPI applications is a condition called heap fragmentation. Heap fragmentation is frequently misdiagnosed as a memory leak when the system seems to be running out of memory resources and long-running applications slow down. Heap fragmentation is not a memory leak: a developer could spend weeks hunting for memory leaks in the application and not find any, and yet the condition would still exist. Heap fragmentation occurs when a memory heap (which is used for memory allocations) is expanded and continues to grow until it hits a system addressing barrier. What usually causes heap fragmentation is memory requests of wildly differing sizes being made against the same memory heap.

When a memory heap receives a request for a small memory block (say 30 bytes) followed by a request for a large memory block (say 100 KB), it satisfies both, and the heap contains a 30-byte allocation with a 100-KB allocation next to it. Now imagine the 30 bytes are de-allocated and a new request for 2,048 bytes comes in. The system cannot place the 2,048 bytes in the space where the 30 bytes were; it must go to the end of the heap, possibly extend the heap's size, and satisfy the allocation there.

Imagine that over a long period of time, Windows has to keep growing the heap. For each allocation request, the heap manager has more and more searching to do, and it will eventually reach a point where it cannot allocate any more addresses (the 2-GB address limit). The heap manager then fails the heap extension and passes a memory allocation error back to the application.

The way to rectify this situation is to bucket allocations by size across a number of heaps, so that each heap serves allocations of a single size. This resolves the heap fragmentation issue. Another alternative is to use the Low Fragmentation Heap, Microsoft's implementation of this approach.

What Can I Do as an Administrator?

The discussion so far has focused on the cause and the remedy; unfortunately, there has historically been little to address the symptoms and keep an application going while it slows down and exhausts memory resources. With IIS 6.0, an administrator can keep this condition at bay without having to baby-sit the server or take it offline.

To detect an application that is fragmenting the heap, you can use System Monitor. The signature of such an application is that the worker process in which it executes increases its use of virtual memory over time while its private bytes stay static. You can detect this by logging the Process(w3wp#)\Virtual Bytes and Process(w3wp#)\Private Bytes counters for the specific worker process instance you are interested in.
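
A hedged command-line equivalent samples both counters every 60 seconds into a CSV file. The instance name w3wp is an assumption; on a server running several worker processes the instances appear as w3wp, w3wp#1, and so on:

typeperf "\Process(w3wp)\Virtual Bytes" "\Process(w3wp)\Private Bytes" -si 60 -o w3wp-heap.csv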

Once the issue is detected, keeping heap fragmentation at bay until developers can rectify the situation means setting the recycle parameters on the application pool to recycle once the application exceeds an acceptable virtual memory use limit.
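
The same limit can be set from the command line through the application pool's PeriodicRestartMemory property; a hedged sketch, assuming a pool named DefaultAppPool and a 500-MB virtual memory limit (the property is expressed in kilobytes):

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs set W3SVC/AppPools/DefaultAppPool/PeriodicRestartMemory 512000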

Note

Avoid setting the virtual memory limit recycle setting on ASP.NET applications; the ASP.NET runtime takes care of heap fragmentation intrinsically.

The Recycle tab of the Application Pool dialog (Figure 5) enables recycling based on Virtual memory use and Real memory use (private bytes).

Figure 5: Parameter to control applications that display heap fragmentation

Multiprocessor Servers

This section contains considerations and areas of focus for situations when Windows Server 2003 is running on large multiprocessor servers.

Localized Loads

When running on a large multiprocessor server, it is important to understand the physical hardware bus architecture and the costs of having a workload distributed across processors. Many large multiprocessor computers are built using a building-block approach. A 32-processor computer might have a four-CPU base unit, with an interconnection architecture that links eight of these four-CPU building blocks to an I/O backplane, a common RAM area, and so on. Note, however, that the interconnections between the building-block processing units (and the caches around each building-block unit) impose a cost when breached.

In such an architecture, a workload that runs for a while on processor 0 and then context-switches to processor 23 incurs a much heftier tax than if it had context-switched to processor 1. By going to processor 23, the hardware has to transfer, and make cache-coherent, any data the workload was referencing in the caches surrounding processors 0-3 to the caches surrounding processors 20-23 before the workload can be re-scheduled to run. Often, depending on the specific hardware architecture, synchronizing the processor caches can also stall other processors in the two building-block units, to ensure that the caches are not randomly polluted before the transfer operation completes. The transfer therefore also has an impact on workloads other than the initial one.

Application Silos

Windows Server 2003 allows an administrator to configure a server to minimize the impact of the processing locality issue described above when running Web services or Web applications. This can be done within a single instance of the operating system.

In order to preserve locality on a large multiprocessor, IIS 6.0 offers the ability to configure an application pool to have affinity for specific processors. In the example above, an administrator could set up eight application pools, each with affinity to the processors of one building block. The metabase configuration below illustrates what this would look like:

<!-- BuildingBlock0: processors 0-3 (affinity mask 15 = 0xF) -->
<IIsApplicationPool Location="/LM/W3SVC/AppPools/BuildingBlock0"
    SMPAffinitized="TRUE"
    SMPProcessorAffinityMask="15"
>
<!-- BuildingBlock1: processors 4-7 (affinity mask 240 = 0xF0) -->
<IIsApplicationPool Location="/LM/W3SVC/AppPools/BuildingBlock1"
    SMPAffinitized="TRUE"
    SMPProcessorAffinityMask="240"
>

Figure 6: IIS 6.0 metabase configuration sample for processor silos
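
A hedged way to apply the same settings without editing MetaBase.xml directly (the pool name BuildingBlock0 is carried over from the sample above):

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs set W3SVC/AppPools/BuildingBlock0/SMPAffinitized TRUE

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs set W3SVC/AppPools/BuildingBlock0/SMPProcessorAffinityMask 15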

By assigning an application to such an application pool, the application automatically maintains its locality. Items of work assigned to that application do not go outside the building block to pollute other caches, which also minimizes the necessary operating system housekeeping.

Note that this is not an optimal setup for all scenarios, because an application could require more than four CPUs of processing power and, in the above setup, would be limited to four. Additionally, the above setup does not address reliability at all. You may want to configure a separate application pool for an application purely to ensure that it is isolated from other applications; of course, you can define an application pool for one application and establish affinity between that application pool and a set of processors.

The other consideration here is networking. If your application is hosted in an application pool locked to processors 20-23, having all of the network I/O for the application come in on processor 0 will still introduce cross-processor transfers. Most of these large multiprocessor servers allow configurable I/O, so you can wire the network card(s) that process load on behalf of the application to one of the physical processors in the processor group where the application executes. Using the above example, if the application was accepting load on two gigabit network adapters, those adapters would ideally be wired to processors 20 and 21. You can also use the Network Interrupt Filter (available with the Windows Server 2003 Resource Guide companion CD) to bind a specific network card to a specific processor.

Software Locks (Resource Contention)

A potential issue to be aware of when putting an application on a large multiprocessor server is software locking inherent in the custom application itself. A software lock exists when two independent pieces of work contend for a certain resource at application run time. The first to ask for the resource gains access to it, and the second item must wait until the first has finished using the resource before it can gain access.

The net result of this contention can be that when the specific application has more processors to execute on, it may not display any additional throughput, because all of the work is being funneled through a single resource.

A classic sign that you might have a contention problem in an application is a high System\Context Switches/sec performance counter. When multiple work items hit the contention point, the typical Windows operating system response is to swap the thread on which the item is executing off its processor and swap a new thread in (a context switch). If all of the work items execute for a short time and then hit the contended resource, you will likely see high context switch rates.

How can you alleviate software locks?

The best way to fix an application that has high contention, unfortunately, is to have a software engineer skilled in performance tuning analyze the common usage scenarios (the common code paths for the application). The engineer would assess the locks (contention points) at run time and re-architect those contention points so that they offer greater parallelism.

Web Gardens

Web gardens are designed to help alleviate the effects of an application with serious software contention issues without requiring an application redevelopment exercise. The approach is relatively simple: a Web garden alleviates contention on a resource by indirectly creating multiple instances of the resource and evenly distributing load across them. The creation of new instances is controlled through the application pool configuration.

To create a Web garden, you set the number of worker processes property in the application pool configuration. The default number of worker processes for each application pool is one; by changing that default, you create a Web garden with however many worker processes you select.
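
Behind the dialog, this setting maps to the application pool's MaxProcesses metabase property. A hedged sketch creating a four-process Web garden on a pool assumed to be named DefaultAppPool:

cscript %SystemDrive%\Inetpub\AdminScripts\adsutil.vbs set W3SVC/AppPools/DefaultAppPool/MaxProcesses 4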

IIS then distributes the load across the instances of the application. In a Web garden, the requests on a specific TCP/IP connection are all sent to one worker process, but every new connection is sent to the next worker process in round-robin fashion. This localizes the workload related to a specific client while spreading the overall workload across all of the worker processes of the Web garden. By spreading out the workload, you also alleviate some of the contention on the shared resources.

Using a Web garden consisting of four worker processes, the following pattern would be achieved:

  • Requests from connection 1 go to Worker Process 1

  • Requests from connection 2 go to Worker Process 2

  • Requests from connection 3 go to Worker Process 3

  • Requests from connection 4 go to Worker Process 4

  • Requests from connection 5 go to Worker Process 1

This pattern is represented in Figure 7.

Figure 7: Assignment of HTTP connections to Web garden process members

Web Gardens and Affinity

A Web garden can help relieve software contention without the use of hard affinity. It is possible, however, to have an application pool with affinity to a set of processors, and then run multiple worker process instances. All worker process instances of the application pool will share the same affinity to processors.

SSL Implementation

The SSL implementation on Windows Server 2003 is architecturally different from SSL in previous Windows Server releases. The good news is that the SSL implementation on Windows Server 2003 is self-tuning and aware of the incoming load.

A frequently asked question is: How much will HTTPS (SSL) cost me, compared to straight HTTP? The following test was devised to demonstrate the cost of SSL. The test hardware was an 8-processor, 900-MHz server with two gigabit network cards. The SSL certificate key length was 1024 bits, the content was an 8K static file, and six requests were sent per connection. Each new connection performed a separate SSL handshake. The results of the test are below:

                 Without SSL   With SSL (1024-bit key)   Difference
Requests / Sec   9,368.30      2,462.10                  ~ -74%
CPU %            47.81         51.72                     ~ +8%

The test above was purposely run at around 50 percent CPU to show what could be achieved in both configurations, from a throughput perspective, at a CPU level that would be acceptable in a number of operational environments. The conclusion, for this load type, is that HTTPS throughput is roughly a quarter of that of the straight HTTP case. Note that throughput will vary depending on the size of the data, the number of full SSL handshakes occurring, and the speed of the CPU.

SSL Accelerator Cards

An SSL accelerator card can be a very effective acquisition to offload the CPU-intensive computations associated with SSL. SSL accelerator cards are now relatively cheap, and can be a great investment for improving the SSL throughput on a server.

A test was performed to gauge the throughput-enhancing capabilities of an SSL accelerator card on a computer with a new-generation, very fast CPU. The target server was a 1-processor, 3-GHz HT machine (2 logical processors) with a 100-megabit network card and 512 MB of RAM.

The test characteristics were six requests per SSL connection, each returning a 2K response, with a full SSL handshake after every six requests. The SSL public key length was 1024 bits. The results are below:

                 SSL in Software   SSL Accelerator Card   Difference
Requests / Sec   1,281.39          3,075.20               +139.99%
CPU %            100.00            76.90                  -23.10%

As represented in the table above, the impact of an SSL accelerator card was quite significant for this load. Even against a high-speed processor, the SSL accelerator card substantially outperforms the software implementation. The accelerator card could have done even better in terms of throughput, but its network interface became the bottleneck in this test.

Note

This test is not a real-world test; it is designed to show the best-case effect of SSL accelerator cards.

Remote (or Centralized) Content

Many implementations prefer to have one copy of content stored on a Network Attached Storage (NAS) device or file server, and to have the Web server or application server access the remote content over a private network between it and the source content (Figure 8).

Figure 8: Remote content scenario

The main advantage of this scenario is that one copy of content is stored and managed for a whole farm of servers. This might not seem like a big administrative chore, but when the server population grows past 10 servers, it can quickly become an administrative concern.

Windows Server 2003 is a very good platform for this physical separation of the server that processes the request from the server that stores the content. However, there are some performance considerations to make, because the latency between when a piece of content is requested and when it is returned grows in this scenario.

Caching

Important

There is a very subtle point here. Given that the IIS static file handler is the component that does the file attribute checks on remote content, all requests for remote content need to be funneled through the static file handler to ensure the content being served is not stale (changed on the remote device).

Because the static file handler is a user-mode component, the implication is that requests for static content hosted on a remote server are not cached in the kernel-mode response cache. Instead, the user-mode static-file cache is used to cache these files.

If the device that is actually storing the files is a Windows file server, IIS 6.0 can be configured to cache the files in the kernel (see below for details).

The remote content scenario typically has a bigger impact on static content than on dynamic content, because dynamic content is generally loaded in memory, or compiled and cached. Therefore, one of the big focus areas for IIS 6.0 was the rework of the caching algorithms for remote static content.

Last-Modified Caching

The user-mode static-file cache implements a new caching algorithm where, by default, instead of relying on change notification for each directory structure (that is, the remote device telling all of the servers that a directory has changed), the user-mode static-file cache simply asks the file system for the last-modified date and time of the cached file. If the file is new, or has changed, the cache entry is updated with the new content, which is then served. If the file is the same as before, the cached version is sent. As a further optimization, IIS adds a staleness check to avoid doing this last-modified check on every request; the last-modified check occurs only every five seconds.

Use of Directory Change Notification for Remote Content

As mentioned above, directory change notification is the more expensive way to configure remote content (if that content resides on a Windows file server), because the change notifications need to be kept active, consuming redirector resources. However, by using this mechanism, remote content can be stored in the IIS kernel-mode response cache; IIS will be informed by the content server when the content has changed and will refresh the kernel-mode response cache.

It is recommended that you test this implementation before using it. If a server is connected to large numbers of UNC paths, this can reduce the server's ability to get change notifications.

To enable the Directory Change Notification, set the following registry key:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\InetInfo\Parameters\DoDirMonitoringForUnc (DWORD). 1=on; 0=off.
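
A hedged sketch of enabling it from the command line (it is assumed the WWW service must be restarted for the change to take effect):

reg add HKLM\SYSTEM\CurrentControlSet\Services\InetInfo\Parameters /v DoDirMonitoringForUnc /t REG_DWORD /d 1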

Implications of Switching to Change Notification-Based Caching

Several significant tradeoffs need to be considered when deciding which caching mechanism to use. The most reliable and secure method is the default, last-modified time-based caching. It is reliable because it does not depend on the change notification functionality of the file server, and it scales well across a wide application structure (where your Web server has many applications pointing at different file structures). Last-modified time-based caching is more secure because share-level permission changes are honored: the cache accesses the file often, which automatically triggers the authorization check against the current file system Access Control List (ACL) and the current share ACL.

However, if you are using a file server that reliably reports change notifications and you have a small number of sites or virtual directories, using change notification is significantly faster. For example, if you have a single Web site with a couple of applications whose content is all stored on a Windows file server, it is recommended that you use the change notification method of caching.

General Performance Tools

This section introduces some basic tools that can be used to diagnose how applications are performing, assess system performance, and extract information from the operating system that identifies its workloads and state.

Event Tracing Enhancements

Event tracing provides a mechanism for managing data that is useful to performance monitors, capacity planning tools, and other applications that analyze system resource utilization. Event tracing is enabled by running new command-line tools that control event trace log scheduling, collection, and analysis on local and remote computers; log files can then be taken from the server for analysis. Event tracing is a very high-performance implementation of tracing through core Windows components. A benefit of event tracing for Windows for Web and application server loads is that an administrator can enable it on a production server and have it generate detailed statistics and information about the running load, without a dramatic impact on the performance of that load.

Listed below is some of the information that can be extracted at run time from a server:

  • Most-requested URLs

  • Slowest URLs

  • URLs with the most CPU usage

  • URLs with the most bytes sent

  • Clients with the most requests

  • Clients with the slowest responses

  • Statistics on all the running processes (kernel CPU, user CPU, number of threads, etc.)

  • Disk statistics

  • Files causing most disk I/Os

For more information about event tracing, see Tracerpt in the Windows Server 2003 product documentation.
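
As a hedged sketch of the capture workflow (the provider name and the tracerpt options shown are assumptions to verify against your system's documentation):

logman start web_trace -p "HTTP Service Trace" -o trace.etl -ets

REM ...apply load to the server, then stop the trace and summarize it...
logman stop web_trace -ets

tracerpt trace.etl -o dump.csv -summary summary.txt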

System Performance Monitor

System Monitor is one of the key tools that display the internal system information of Windows at run time. It is a primary tool to use in performance analysis of the Windows platform and should be used as the master source of information about Windows Server internals when doing performance analysis.

The objects that are of particular interest, in the context of this paper, are counters that belong under the following objects:

  • .NET CLR LocksAndThreads

  • ASP pages

  • ASP.NET v1 applications

  • Network interface

  • Physical disk

  • Process

  • System

  • Web service

  • Web service cache

In addition to offering live information about what is happening in the system, System Monitor also offers the ability to log performance information and generate alerts.

Windows Task Manager

The Windows Task Manager, another useful tool, provides information similar to System Monitor but displays it in a summary fashion. Additionally, the Windows Task Manager displays only Windows system information and does not load any third-party code, whereas System Monitor is extensible and will load third-party libraries when collecting performance information.

The Windows Task Manager gives an administrator a low-cost way to see the process internals, how busy network cards are, etc.

LogParser v2.0

LogParser is a versatile tool that makes it possible to run SQL-like queries against different input sources, and have the results written to the screen or to different output targets.

With LogParser, it is possible to:

  • Quickly search for data and patterns in files of various formats, including IIS log files, Windows event log files, generic comma separated value (CSV) files, W3C files, and text files.

  • Create formatted reports and XML files containing data retrieved from different sources.

  • Export data to SQL tables. You can export entire files or filter the data to obtain only relevant entries.

  • Convert data from one log file format to another.

LogParser is composed of three main components:

  1. An input source generates records from the specified source. The following input sources are implemented:

    • All the IIS log file formats (W3C, IIS, IISMSID, NCSA, centralized binary logging, and ODBC), and the UrlScan and HTTP Error log formats

    • The NT event Log

    • W3C log files (e.g. personal firewall, WMS, etc.)

    • Generic CSV files

    • Directory structure information

    • Generic text files

  2. The SQL engine processes these records; filtering, grouping, and ordering them according to the specified SQL-like query.

  3. An output target writes these processed records to the specified target; the following output targets are implemented:

    • Native log format (a generic format to display results primarily to the screen)

    • Generic W3C log format text files

    • XML files

    • IIS log format text files

    • Direct output to SQL tables

    • Text files formatted according to user-defined templates

    • Generic CSV format text files

In SQL terms, LogParser treats an Input Source as a relational table, whose field types and names depend on the particular Input Source selected.
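
As a hedged illustration of the query model (the file names are hypothetical, and the -i:BIN input format name for centralized binary logs is an assumption to verify against the LogParser documentation):

REM Top 10 most-requested URLs from W3C log files in the current directory.
LogParser -i:IISW3C "SELECT TOP 10 cs-uri-stem, COUNT(*) AS Hits FROM ex*.log GROUP BY cs-uri-stem ORDER BY Hits DESC"

REM Extract one site's entries from a centralized binary log into a W3C text log.
LogParser -i:BIN -o:W3C "SELECT * INTO site1.log FROM raw.ibl WHERE s-sitename = 'W3SVC1'"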

LogParser can be found at https://go.microsoft.com/fwlink/?LinkID=100882.