Chapter 15 - Detecting Cache Bottlenecks

The Windows NT file system cache is an area of memory into which the I/O system maps recently used data from disk. When processes need to read from or write to files mapped in the cache, the I/O Manager copies the data from or to the cache as if it were an array in memory, without buffering or calling the file system. Because memory access is quicker than file operations, the cache provides an important performance boost to these processes.

Because the cache is just a part of physical memory, it is never really a bottleneck (although memory can be). However, when there is not enough memory to create an effective cache, the file system must retrieve more data from disk. This shortage of cache space is known as a cache bottleneck.

The size of the Windows NT file system cache is continually adjusted by the Virtual Memory Manager based upon the size of physical memory and the demand for memory space. In many operating systems, administrators can tune the cache size, but the Windows NT cache is designed to be self-tuning; you cannot change the cache size. For more information about the Cache Manager and the Virtual Memory Manager, see Chapter 5, "Windows NT Workstation Architecture."

Note Cache bottlenecks are rare on workstations. More often the cache is monitored as an indication of application I/O, since almost all application file system activity is mediated by the cache.

Cache bottlenecks are mainly a server problem: workstations rarely generate enough traffic to put pressure on the cache. However, complex programs such as CAD/CAM applications and large databases, which read large blocks of multiple files and benefit from the cache, do suffer when the cache is too small. Note also that cache bottlenecks affect only applications that use the cache effectively, for example, by reading data in the same sequence in which it is stored, so that requested data is likely to be found in the cache.

To monitor the cache, log the Memory, Cache, and Logical Disk objects for several days at a 60-second update interval, then chart the following counters:

  • Memory: Cache Bytes

  • Memory: Cache Faults/sec

  • Memory: Page Faults/sec

  • Cache: Copy Reads/sec

  • Cache: Data Flushes/sec

  • Cache: Copy Read Hits %

  • Cache: Lazy Write Pages/sec

  • Cache: Lazy Write Flushes/sec

  • Cache: Read Aheads/sec

  • Logical Disk: Disk Reads/sec

  • Memory: Pages Input/sec
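
These counters can also be sampled programmatically through the Performance Data Helper library (Pdh.dll), where it is available. The following minimal C sketch, which assumes the PDH header and import library are installed, reads Memory: Cache Bytes once; rate counters would need two collections spaced an interval apart.

    #include <windows.h>
    #include <pdh.h>
    #include <stdio.h>

    /* Link with pdh.lib. Samples a single instantaneous counter;
       rate counters such as Copy Reads/sec need two collections. */
    int main(void)
    {
        PDH_HQUERY query;
        PDH_HCOUNTER counter;
        PDH_FMT_COUNTERVALUE value;

        PdhOpenQuery(NULL, 0, &query);
        PdhAddCounter(query, "\\Memory\\Cache Bytes", 0, &counter);
        PdhCollectQueryData(query);

        PdhGetFormattedCounterValue(counter, PDH_FMT_LONG, NULL, &value);
        printf("Cache Bytes: %ld\n", value.longValue);

        PdhCloseQuery(query);
        return 0;
    }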

The Windows NT File System Cache


Cache is a French word for a place to hide necessities or valuables. In computer terminology, a cache is an additional storage area close to the component that uses it. Caches are designed to save time: In general, the closer the data is, the quicker you can get to it.

Windows NT 4.0 supports several cache architectures: caches on processor chips, caches on the motherboard, caches on physical disks, and caches in physical memory. This chapter describes the file system cache, a cache in physical memory through which data files pass on their way to and from disk or other peripheral devices.

The file system cache is designed to minimize the need for disk operations. When an application requests data from a file, the file system first searches the cache:

  • If the data is found in the cache, the Virtual Memory Manager copies it into the application's buffer and performs no disk I/O.

  • If the data is not in the cache, the Virtual Memory Manager searches elsewhere in memory. As a last resort, it looks on the disk.

When determining what to cache, the Windows NT Virtual Memory Manager tries to anticipate the application's future requests for code and data, as well as its immediate needs. It might map an entire file into the cache, if space permits. This increases the likelihood that data requested will be found there.

The file system cache actually consists of a series of section objects created and indexed by the Windows NT Cache Manager. When the Virtual Memory Manager needs space in the cache, the Cache Manager creates a new section object. The files are then mapped—not copied—into the file system cache, so they don't need to be backed up in the paging file. This frees the paging file for other code and data.
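
Applications can use the same machinery directly by mapping a file into their own address space. The following is a minimal C sketch of Win32 file mapping; the file name is hypothetical:

    #include <windows.h>

    int main(void)
    {
        HANDLE hFile, hMap;
        char *view;

        hFile = CreateFile("C:\\DATA\\INPUT.DAT", GENERIC_READ,
                           FILE_SHARE_READ, NULL, OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL, NULL);
        if (hFile == INVALID_HANDLE_VALUE)
            return 1;

        /* The section object is backed by the file itself, so the
           mapped pages never touch the paging file. */
        hMap = CreateFileMapping(hFile, NULL, PAGE_READONLY, 0, 0, NULL);

        /* Map a view of the whole file. Pages are faulted in on first
           touch rather than copied in advance. */
        view = (char *)MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);
        if (view != NULL) {
            volatile char first = view[0];   /* first touch faults the page in */
            (void)first;
            UnmapViewOfFile(view);
        }

        CloseHandle(hMap);
        CloseHandle(hFile);
        return 0;
    }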

Cache Hits and Misses


The simplest way to judge the effectiveness of the cache is to examine the percentage of cache hits, that is, how often data sought in the cache is found there. Cache misses, however, are even more important. When data is not found in the cache, or elsewhere in memory, the file system must make a time-consuming search of the disk. An application with a miss rate of 10% (a hit rate of 90%) requires twice as much disk I/O as an application with a miss rate of 5% (a hit rate of 95%).

Also, especially on a workstation, you must keep cache rates in perspective. On a system where cache reads are minimal, the hit-and-miss rates are not a significant performance factor. However, when running I/O-intensive applications such as databases, the cache hit-and-miss rates are an important performance measure of the computer and the application.

Cache Flushing

Pages are removed from the cache by flushing; that is, any changes are written back to disk, and the page is deleted. Two threads in the system process—the lazy writer thread and the mapped page writer thread—periodically flush unused pages to disk. The cache is also flushed when the Virtual Memory Manager needs to shrink the cache because of memory constraints.

Applications can also request that a page copied from the cache be written back to disk. With write-through caching, the disk file is updated immediately; with write-back caching (the default), the Virtual Memory Manager waits until a batch of modifications has accumulated and writes them together.
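
In Win32 terms, an application opts into write-through behavior when it opens a file, or forces a flush explicitly with FlushFileBuffers. A minimal sketch, with a hypothetical file name:

    #include <windows.h>

    int main(void)
    {
        HANDLE hFile;
        DWORD written;

        /* Default flags give write-back caching: WriteFile copies the
           data to the cache and returns; the lazy writer flushes it
           later. Passing FILE_FLAG_WRITE_THROUGH here instead would
           update the disk on every write. */
        hFile = CreateFile("C:\\DATA\\RESULTS.DAT", GENERIC_WRITE, 0,
                           NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (hFile == INVALID_HANDLE_VALUE)
            return 1;

        WriteFile(hFile, "sample", 6, &written, NULL);

        /* Force any data still sitting in the cache out to disk now. */
        FlushFileBuffers(hFile);

        CloseHandle(hFile);
        return 0;
    }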

Locality of Reference

Applications use memory most efficiently when they reference data in the same sequence, or a similar sequence, as the order in which the data is stored on disk. This is called locality of reference. When an application needs data, the data page or file is mapped into the cache. When an application's references are localized, that is, when they are to the same data, to data on the same page, or to data in the same file, the data the application seeks is more likely to be found in the cache.

The nature of the application often dictates the sequence of data references. At other times, factors such as usability become more important in determining the sequence. But by localizing references whenever possible, you can improve cache efficiency, minimize the size of the process's working set, and improve application performance.

In general, sequential reads, which allow the Cache Manager to predict the application's data needs and to read larger blocks of data into the cache, are most efficient. Reads from the same page or file are almost as efficient. Reads of files dispersed throughout the disk are less efficient, and random reads are least efficient, as the sketch below illustrates.
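
The contrast is easy to express in code. The following is a hedged sketch of the two extremes using standard Win32 file calls; the block size is illustrative, and the caller is assumed to supply an open file handle:

    #include <windows.h>
    #include <stdlib.h>

    #define BLOCK 4096

    /* Sequential pass: the Cache Manager detects the pattern and issues
       read aheads, so most ReadFile calls are satisfied from the cache. */
    void ReadSequential(HANDLE hFile, DWORD fileSize)
    {
        char buf[BLOCK];
        DWORD bytesRead, offset;

        SetFilePointer(hFile, 0, NULL, FILE_BEGIN);
        for (offset = 0; offset < fileSize; offset += BLOCK)
            ReadFile(hFile, buf, BLOCK, &bytesRead, NULL);
    }

    /* Random pass: jumping around the file defeats read-ahead
       prediction, so far more reads miss the cache and go to disk. */
    void ReadRandom(HANDLE hFile, DWORD fileSize)
    {
        char buf[BLOCK];
        DWORD bytesRead, i, target;
        DWORD blocks = fileSize / BLOCK;

        for (i = 0; i < blocks; i++) {
            target = (DWORD)(rand() % (int)blocks) * BLOCK;
            SetFilePointer(hFile, target, NULL, FILE_BEGIN);
            ReadFile(hFile, buf, BLOCK, &bytesRead, NULL);
        }
    }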

You can monitor the efficiency of your application's use of the cache by watching the cache counters for copy reads, read aheads, data flushes, and lazy writes. Read aheads usually indicate that an application is reading sequentially, although some application reading patterns can fool the system's prediction logic. When data references are localized, fewer pages are changed, so lazy writes and data flushes decrease.

Copy read hits (when data sought in the cache is found there) in the 80-90% range are excellent. In general, data flushes are best kept below 20 per second, but this varies widely with the workload.

Unbuffered I/O

The file system cache is used, by default, whenever a disk is accessed. However, an application can request that its files not be cached by specifying the FILE_FLAG_NO_BUFFERING flag in its call to open a file. This is called unbuffered I/O. Applications that use unbuffered I/O are typically database applications (such as SQL Server) that manage their own cache buffers. Unbuffered I/O requests must be issued in multiples of the disk sector size.
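
A minimal sketch of unbuffered reading, assuming a 512-byte sector size (production code should query the actual size with GetDiskFreeSpace) and a hypothetical file name:

    #include <windows.h>

    int main(void)
    {
        HANDLE hFile;
        void *buf;
        DWORD bytesRead;
        const DWORD sectorSize = 512;   /* assumed; query GetDiskFreeSpace */

        /* FILE_FLAG_NO_BUFFERING tells the system not to cache this file. */
        hFile = CreateFile("C:\\DB\\TABLE.DAT", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
        if (hFile == INVALID_HANDLE_VALUE)
            return 1;

        /* The buffer must be sector-aligned; VirtualAlloc returns
           page-aligned memory, which satisfies the requirement. */
        buf = VirtualAlloc(NULL, 64 * sectorSize, MEM_COMMIT, PAGE_READWRITE);

        /* Transfer sizes and file offsets must be multiples of the
           sector size. */
        ReadFile(hFile, buf, 64 * sectorSize, &bytesRead, NULL);

        VirtualFree(buf, 0, MEM_RELEASE);
        CloseHandle(hFile);
        return 0;
    }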

Cache Monitoring Utilities


In addition to Performance Monitor, several tools and utilities let you monitor the file system cache.

Task Manager

Task Manager displays the size of the file system cache on the Performance tab in the Physical Memory box.

[Screen shot: Task Manager Performance tab showing the file system cache size]

Performance Meter

Performance Meter (Perfmtr.exe), a tool on the Windows NT Resource Kit 4.0 CD in the Performance Tools group, lists current statistics on the file system cache. It runs at the command prompt. Start Performance Meter, then type r for Cache Manager read and write statistics. Type q to quit.

Response Probe

Response Probe, a tool on the Windows NT Resource Kit 4.0 CD, lets you design a workload and test it on your system. When your workload includes file I/O, you can choose whether the files accessed use the cache or are unbuffered. In this way, you can measure the effect of the cache strategy on your application or test file operations directly. For more information, see "Response Probe" in Chapter 11, "Performance Monitoring Tools."

Clearmem

Clearmem, another tool on the Windows NT Resource Kit 4.0 CD, allocates and references all available memory, consuming any inactive pages in the working sets of all processes (including the cache). It clears the cache of all file data, letting you begin your test with an empty cache.

Understanding the Cache Counters


The following Performance Monitor Cache and Memory counters are used to measure cache performance and are described in this chapter.

Important The Hit% counters are best displayed in Chart view. Hits often appear in short bursts that are not visible in reports. Also, the average displayed for Hit% on the status bar in Chart view might not match the average displayed in Report view because they are calculated differently. In Chart view, the Hit% is an average of all changes in the counter during the test interval; in Report view, it is the average of the difference between the first and last counts during the test interval.

Memory: Cache Bytes

How big is the cache? The Virtual Memory Manager regulates the size of the cache, which varies with the amount of physical memory and the demand for memory by other processes.

Memory: Cache Faults/sec

How many pages sought in the cache are not there and must be found elsewhere in memory or on the disk?
This counts numbers of pages, so it can be compared with other page measures, like Page Faults/sec and Pages Input/sec.

Cache: Copy Reads/sec

How often does the file system look in the cache for data requested by applications?
This is a count of all copy read calls to the cache, including hits and misses.
Copy reads are the usual method by which file data found in the cache is copied into an application's memory buffers.

Cache: Copy Read Hits %

How often do applications find what they need in the cache?
Any value over 80% is excellent. Compare with Copy Reads/sec to see how many hits you are really getting. A small percentage of many calls might represent more hits than a higher percentage of an insignificant number of calls.
This is the percentage of copy read calls satisfied by reads from the cache out of all copy read calls. Performance Monitor displays the value calculated for the last time interval, not an average. It also counts numbers of reads, regardless of the amount of data read.

Cache: Read Aheads/sec

How often can the Cache Manager read ahead in a file?
Read aheads are a very efficient strategy in most cases. Sequential reading from a file lets the Cache Manager predict the pattern and read even larger blocks of data into the cache on each I/O.

Cache: Data Maps/sec

How often are file systems reading their directories?
This counts read-only access to file system directories, the File Allocation Table in the FAT file system, and the Master File Table in NTFS.
If this count is high, the Cache Manager might be occupied with directory operations. This is not a measure of cache use by applications.

Cache: Fast Reads/sec

How often are applications able to go directly to the cache and bypass the file system? A value over 50% is excellent.
The alternative is to send an I/O request to the file system.

Cache: Data Flushes/sec

How often is cache data being written back to disk? This counts application requests to flush data from the cache. It is an indirect indicator of the volume and frequency of application data changes.

Cache: Data Flush Pages/sec

How much data is the application changing? This counter measures data flushes in numbers of pages rather than number of disk accesses.
Counts the number of modified pages in the cache that are written back to disk. This includes pages written by the System process when many changed pages have accumulated, pages flushed so the cache can be trimmed, and disk writes caused by an application write-through request.

Cache: Lazy Write Flushes/sec

How much data is an application changing? How much memory is available to the cache?
Lazy write flushes are a subset of data flushes. The Lazy Writer Thread in the system process periodically writes changed pages from the modified page list back to disk and flushes them from the cache. This thread is activated more often when memory needs to be released for other uses.
This counts the number of write and flush operations, regardless of the amount of data written.

Cache: Lazy Write Pages/sec

How much data is an application changing? How much memory is available to the cache?
Lazy Write Pages are a subset of Data Flush Pages.
Counts the number of pages written back to disk by a periodic system thread. Lazy writes are asynchronous file operations that allow the application to update and continue without waiting for the I/O to be completed.

Recognizing Cache Bottlenecks


The file system cache is a part of memory. It can be thought of as the working set of the file system. When memory becomes scarce and working sets are trimmed, the cache is trimmed as well. If the cache grows too small, cache-sensitive processes will be slowed by disk operations.

To monitor cache size, use the following counters:

  • Memory: Cache Bytes

  • Memory: Available Bytes

Tip You can test the effect of a memory and cache shortage on your workstation without changing the physical memory in your computer. Use the MAXMEM parameter in the boot configuration to limit the amount of physical memory available to Windows NT. For more information, see "Configuring Available Memory" in Chapter 12, "Detecting Memory Bottlenecks."
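
For example, a Boot.ini entry that restricts Windows NT to 16 MB of physical memory might look like the following; the ARC path shown is illustrative and will differ on your system:

    [operating systems]
    multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Workstation Version 4.00" /MAXMEM=16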

The following graph shows that a memory shortage causes the cache to be trimmed, along with the working sets of processes and other objects that compete with the cache for space in memory. The memory shortage was produced by running LeakyApp, a test tool that consumes memory.

[Graph: cache size, available bytes, and working sets shrinking as LeakyApp consumes memory]

In this graph, the thick black line represents Process: Private Bytes for LeakyApp. (Note that it has been scaled to 0.000001 to fit on the graph.) At the plateau in this curve, it held 70.4 MB of memory. The white line represents Memory: Cache Bytes. The gray line is Memory: Available Bytes, and the thin black line is Process: Working Set.

In this example, run on a workstation with 32 MB of physical memory, the memory consumption by LeakyApp affects all memory, but not to the same degree. Available Bytes drops sharply then recovers somewhat, apparently because pages were trimmed from the working sets of processes. Cache size falls steadily until all available bytes are consumed, and then it levels off. In addition, page faults—not shown on this already busy graph—increase steadily as working sets and cache are squeezed.

The effect of a smaller cache on applications and file operations depends upon how often and how effectively applications use the cache.

Applications and Cache Bottlenecks

Applications that use the cache effectively are hurt most during a cache shortage. A relatively small cache, under 5 MB in a system with 16 MB of physical memory, is likely to become a bottleneck for the applications that use it.

However, normal rates of reads, hits, and flushes vary widely with the nature of the application and how it is structured. Thus, you must establish cache-use benchmarks for each application. Only then can you determine the effect of a cache bottleneck on the application.

To monitor the effect of a cache bottleneck on an application, log the Cache and Memory objects over time, then chart the following counters:

  • Cache: Copy Reads/sec

  • Cache: Copy Read Hits%

Tip To test your application with different size caches, add the MAXMEM parameter in the Boot.ini file. This lets you change the amount of memory available to Windows NT without affecting the physical memory in your computer.

A cache bottleneck appears in an application as a steady decrease in Copy Read Hits % while Copy Reads/sec remains relatively stable. There are no recommended levels for these counters, but running an application over time on an otherwise idle system with ample memory will demonstrate normal rates for the application. It will also let you compare how effectively different applications use the cache. When you run the same applications on a system where memory is scarce, you will see the hit rate drop if the cache is a bottleneck. In general, a hit rate of over 80% is considered excellent. A 10% decrease from an application's normal hit rate is cause for concern and probably indicates a memory shortage.

The following graph shows a comparison of copy reads and copy hits for several instances of a compiler. Compilers are relatively efficient users of the cache because their data (application code) is often read and processed sequentially. During the short time-interval represented here, the cache size varied from 6.3 MB to 7.3 MB.

[Graph: Copy Reads/sec and Copy Read Hits % for several instances of a compiler]

In this example, the thicker line is Copy Reads/sec and the thin line is Copy Read Hits %. The Copy Reads/sec, averaging 6 per second, are a moderate rate, and the Copy Read Hits %, at an average of 32%, are also moderate. This indicates that, on average, fewer than 2 reads per second are satisfied by data found in the cache. The remainder are counted as page faults and sought elsewhere in memory or on disk.

It is important to put some of these rates in perspective. When copy reads are low (around 5 per second), a 90% average hit rate means that the data for 4.5 reads was found in the cache. However, when reads are at 50 per second, a 40% hit rate means that data for 20 reads was found in the cache.

Accumulating data like this while varying the amount of memory will help you determine the effect of cache size on your application.

Page Faults and Cache Bottlenecks

When memory is scarce, more data must remain on the disk. Accordingly, page faults are more likely. Similarly, when the cache is trimmed, cache hit rates drop and cache faults increase. Cache faults are a subset of all page faults.

Note The operating system sees the cache as the file system's working set, its dedicated area of physical memory. When data isn't found in the cache, the system counts it as a page fault, just as it would when data was not found in the working set of a process.

To monitor the effect of cache bottlenecks on disk, use the following counters:

  • Memory: Cache Faults/sec

  • Memory: Page Faults/sec

  • Memory: Page Reads/sec

  • Logical Disk: Disk Reads/sec

The following graph shows the proportion of page faults that can be traced to the cache. Cache Faults/sec includes data sought by the file system for mapping as well as misses in copy reads for applications. Because both the Cache Faults/sec and Page Faults/sec counters are measured in numbers of pages, they can be compared without conversions.

[Graph: Cache Faults/sec compared with Page Faults/sec]

In this example, the thin black line represents all faulted pages; the thick black line represents pages faulted from the cache. Places where the curves meet indicate that nearly all page faults are cache faults. Space between the curves indicates faults from the working sets of processes. In this example, on average, only 10% of the relatively high rate of page faults happen in the cache.

The important page faults, however, are those that require disk reads to retrieve the faulted pages. Unfortunately, the memory counters that measure disk operations due to paging make no distinction between reads caused by cache faults and reads caused by other page faults.

[Graph: hard page faults (pages read from disk) compared with all page faults]

This graph and the report that follows show that most faulted pages are soft faults. Of the average of 182 pages faulted per second, only 21.586—less than 12%—are hard faults. It is even more difficult to attribute any of the pages input due to faults to the cache.

[Report: paging counters for the same interval]

Applications and the Cache


Cache bottlenecks on workstations are uncommon. More often, the Performance Monitor cache counter values are used as indicators of application behavior. Although some large database applications, such as Microsoft SQL Server, bypass the cache and do their own caching, most applications use the file system cache.

Data requested by an application is first mapped into the cache and then copied from there. Data changed by applications is written from the cache to disk by the Lazy Writer system thread or by a write-through call from the application. Thus, watching the cache is like watching your application I/O.

Remember, however, that if an application uses the cache infrequently, cache activity will have an insignificant effect on the system, the disks, and on memory.

Reading from the Cache

There are four types of application reads:

  • In copy reads, data mapped in the cache is copied into memory so it can be read. An application's first read from a file usually is a copy read.

  • Subsequent reads are usually fast reads, in which an application or other process calls the cache directly rather than calling the file system.

  • With pin reads, data is mapped into the cache just to be changed and is then written back to disk. It is pinned in the cache; that is, it is held at the same address and is not pageable. This prevents page faults.

  • With read aheads, the Virtual Memory Manager recognizes that the application is reading a file sequentially and, predicting its read pattern, begins to map larger blocks of data into the cache. Read aheads are usually efficient and are a sign that data references are localized. However, some application read patterns might fool the prediction logic of the Virtual Memory Manager and do read aheads when smaller reads might be more efficient. Only the application designer knows for sure!

The following graph shows the frequency of different kinds of cache reads during the run of a compiler. The intersecting curves are difficult to interpret, so a second copy of Performance Monitor—a report set to the same Time Window as the graph—is appended.

[Graph and report: rates of copy reads, fast reads, and read aheads during a compiler run]

In this example, copy reads are more frequent than fast reads. This pattern of many first reads and fewer subsequent reads indicates that the application is probably reading from many small files. The rate of read aheads is also low, which is another indication that the application is skipping from file to file. When more fast reads than copy reads occur, the application is reading several times from the same file. The rate of read aheads should increase as well.

Writing to the Cache

Although most of this chapter has described how the cache prevents repeated disk operations for reading, it's important to note that applications also write to data pages in the cache, though not directly. When an application changes data in its memory buffers that was copied from the cache, the changes are copied back to the cache. The application continues processing without waiting for the data to be written back to disk.

The system does not count copies or writes to the cache directly, but these changes appear in Performance Monitor as data flushes and lazy writes when the pages are written back to disk. Changed cache pages are flushed to disk in several ways:

  • The application issues a write-through request, instructing the Cache Manager to write the change back to disk immediately.

  • The Lazy Writer thread in the system process writes pages back to disk. It writes more pages in each disk operation when the cache needs to be trimmed to recover space in memory.

  • The Mapped Page Writer thread in the system process writes pages on the mapped page list back to disk. The Mapped Page List for the cache is like the Modified Page List for the paging file. This thread is activated when the number of pages on the mapped page list exceeds a memory threshold.

To measure the rate at which changed pages mapped into the cache are written to disk, use the following counters:

  • Cache: Data Flushes/sec

  • Cache: Data Flush Pages/sec

  • Cache: Lazy Write Flushes/sec

  • Cache: Lazy Write Pages/sec

In general, lazy writes reflect the amount of memory available to the cache. Lazy writes are a subset of all data flushes, which also include write-through requests from applications and write-backs by the mapped page writer thread.

The following display was made with three copies of Performance Monitor all charting from the same log file with the same time window.

[Graphs and report: Lazy Write Flushes/sec and Lazy Write Pages/sec compared with Data Flushes/sec and Data Flush Pages/sec]

The top graph shows the ratio of Lazy Write Flushes/sec (the white line) to all data flushes, as represented by Data Flushes/sec (the black line). The space between the lines indicates mapped page writer flushes and application write-through requests. In this example, as the report shows, on average, 73.5% of the data flushes were lazy writes, but lazy writes accounted for only 60% of the pages flushed.

The bottom graph shows the relationship between Data Flush Pages/sec (the black line) and Data Flushes/sec (the white line). The points where the curves meet indicate that data is being flushed one page at a time. Space between the curves indicates that multiple pages are written per flush. The report shows that the lazy writer flushed an average of 1.6 pages per write, compared to 1.8 pages for all flushes. These small numbers indicate that the system is writing to many small files. Lazy writes often average 15-16 pages of data.

The spikes in the data probably result from an application closing a file and the lazy writer writing all of the data back to disk. To see just how many pages went back to disk, narrow the time window to a single data point, and add the same counters to a report.

[Report: data flushes and pages flushed at the second spike]

This report on the second spike shows that in that second (averaged for the last two data points), about 101 pages were written back to disk, nearly 40% of which were lazy writes.

Tuning the Cache


This is going to be a short section, because there is not much you can do to tune the Windows NT file system cache. The tuning mechanisms are built into the Virtual Memory Manager to save you time for more important things. Nonetheless, there are a few things you can do to make the most of the cache:

  • Localize your application's data references. This will improve its cache performance and minimize its working set so it uses less space in memory.

  • If you are running Windows NT Server, you can direct the Virtual Memory Manager to give the cache higher priority for space than the working sets of processes. In Control Panel, double-click the Network icon. On the Services tab, double-click Server. To favor the cache, click Maximize Throughput for File Sharing. To favor working sets, click Maximize Throughput for Network Applications.

  • Change the way work is distributed among workstations. Try dedicating a single computer to memory-intensive applications such as CAD/CAM and large database processors.

  • Add memory. When memory is scarce, the cache is squeezed and cannot do its job. After the new memory is installed, the Virtual Memory Manager takes care of expanding the cache to use the new memory. 
