Appendix D: Server Performance Analyzing and Scaling

Article
02/26/2008

Published: February 25, 2008

The following information identifies important monitoring counters used for capacity planning and performance monitoring of a system.

These counters are referenced in Step 7: “Design the Farm,” in which the form factor and size of the terminal servers and the TS Web Access server are determined. They are also referenced in Step 9: “Size and Place the Terminal Services Role Services for the Farm,” where the form factor and sizing for each of the Terminal Services role services is determined.

Proper design and sizing of the terminal server and the Terminal Services role services is critical to the effective operation of the environment, particularly since the servers will host a somewhat unusual workload: multiple copies of applications that were likely originally designed to run on client workstations. The variety of components in this workload means that there is no “one size fits all” answer for terminal server sizing, so careful measurements and testing must be performed in order to arrive at a design that is capable of meeting end users’ expectations.

Processor Utilization

Over-committing CPU resources can adversely affect all the workloads on the same server, causing significant performance issues for a larger number of users. Because CPU resource use patterns can vary significantly, no single metric or unit can quantify total resource requirements. At the highest level, measurements can be taken to see how the processor is utilized within the entire system and whether threads are being delayed. The following table lists the performance counters for capturing the overall average processor utilization and the number of threads waiting in the processor Ready Queue over the measurement interval.

Table D1. Performance Monitor Counters for Processor Utilization

Object	Counter	Instance
Processor	% Processor Time	_Total
System	Processor Queue Length	N/A

Processor\% Processor Time

As a general rule, processors that are running for sustained periods at greater than 90 percent busy are running at their CPU capacity limits. Processors running regularly in the 75–90 percent range are near their capacity constraints and should be closely monitored. Processors regularly reporting 20 percent or less utilization can make good candidates for consolidation.

For response-oriented workloads, sustained periods of utilization above 80 percent should be investigated closely as this can affect the responsiveness of the system. For throughput-oriented workloads, extended periods of high utilization are seldom a concern, except as a capacity constraint.

Unique hardware factors in multiprocessor configurations and the use of Hyper-threaded logical processors raise difficult interpretation issues that are beyond the scope of this document. Additionally, comparing results between 32-bit and 64-bit versions of the processor are not as straightforward as comparing performance characteristics across like hardware and processor families. A discussion of these topics can be found in Chapter 6, “Advanced Performance Topics” in the Microsoft Windows Server 2003 Performance Guide.

System\Processor Queue Length

The Processor Queue Length can be used to identify if processor contention, or high cpu-utilization, is caused by the processor capacity being insufficient to handle the workloads assigned to it. The Processor Queue Length shows the number of threads that are delayed in the processor Ready Queue and are waiting to be scheduled for execution. The value listed is the last observed value at the time the measurement was taken.

On a machine with a single processor, observations where the queue length is greater than 5 is a warning sign that there is frequently more work available than the processor can handle readily. When this number is greater than 10, then it is an extremely strong indicator that the processor is at capacity, particularly when coupled with high cpu utilization.

On systems with multiprocessors, divide the queue length by the number of physical processors. On a multiprocessor system configured using hard processor affinity (that is, processes are assigned to specific CPU cores), which have large values for the queue length, can indicate that the configuration is unbalanced.

Although Processor Queue Length typically is not used for capacity planning, it can be used to identify if systems within the environment are truly capable of running the loads or if additional processors or faster processors should be purchased for future servers.

Memory Utilization

In order to sufficiently cover memory utilization on a server, both physical and virtual memory usage needs to be monitored. Low memory conditions can lead to performance problems, such as excessive paging when physical memory is low, to catastrophic failures, such as widespread application failures or system crashes when virtual memory becomes exhausted.

Table D2. Performance Monitor Counters for Memory Utilization

Object	Counter	Instance
Memory	Pages/sec	N/A
Memory	Available Mbytes	N/A
Memory	Pool Paged Bytes	N/A
Memory	Pool Paged Resident Bytes	N/A
Memory	Transition Faults/sec	N/A
Memory	Committed Bytes	N/A
Process	Working Set	<Process Name>

Memory\Pages/sec

As physical RAM becomes scarce, the virtual memory manager will free up RAM by transferring the information in a memory page to a cache on the disk. Excessive paging to disk might consume too much of the available disk bandwidth and slow down applications attempting to access their files on the same disk or disks. The Pages/Sec counter tracks the total paging rates, both read and writes, to disk.

For capacity planning, watch for upward trends in this counter. Excessive paging can usually be reduced by adding additional memory. Add memory when paging operations absorbs more than 20–50 percent of the total disk I/O bandwidth. Because disk bandwidth is finite, capacity used for paging operations is unavailable for application-oriented file operations.

The Total Disk I/O Bandwidth is a ratio of the Pages/sec and the Physical Disk\Disk Transfers/sec for all disks in the system:

Memory\Pages/sec ÷ Physical Disk (_Total)\Disk Transfers/sec

Memory\Available Mbytes

The Available Mbytes displays the amount of physical memory, in megabytes, that is immediately available for allocation to a process or for system use. The percent Available Megabytes can be used to indicate if additional memory is required. Add memory if this value drops consistently below 10 percent. To calculate the percent of Available Megabytes:

(Memory\Available Mbytes ÷ System RAM in Megabytes) * 100

This is the primary indicator to determine whether the supply of physical memory is ample. When memory is scarce, Pages/sec is a better indicator of memory contention. Downward trends can indicate a need for additional memory. Counters are available for Available Bytes and Available Kbytes

Memory\Pool Paged Bytes and Memory\Pool Paged Resident Bytes

The Pool Paged Bytes is the size, in bytes, of the paged pool, an area of system memory used by the operating system for objects that can be written to disk when they are not being used.

The Pool Paged Resident Bytes is the size, in bytes, of the nonpaged pool, an area of system memory used by the operating system for objects that can never be written to disk, but must remain in physical memory as long as they are allocated.

A ratio of Pool Paged Bytes to Pool Paged Resident Bytes can be calculated by:

Memory\Pool Paged Bytes ÷ Memory\Pool Paged Resident Bytes

This ratio can be used as a memory contention index to help in planning capacity. As this approaches zero, additional memory needs to be added to the system to allow both the Nonpaged pool and Page pool to grow.

The size returned by the Pool Paged Resident Bytes can be used for planning additional TCP connections. Status information for each TCP connection is stored in the Nonpaged pool. By adding memory, additional space can be allocated to the Nonpaged pool to handle additional TCP connections.

Memory\Transition Faults/sec

The Transition Faults counter returns the number of soft or transition faults during the sampling interval. Transition faults occur when a trimmed page on the Standby list is re-referenced. The page is then returned to the working set. It is important to note that the page was never saved to disk.

An upward trend is an indicator that there may be a developing memory shortage. High rates of transition faults on their own do not indicate a performance problem. However, if the Available Megabytes is at or near its minimum threshold value, usually 10 percent, then it indicates that the operating system has to work to maintain a pool of available pages.

Memory\Committed Bytes

The Committed Bytes measures the amount of committed virtual memory. Committed memory is allocated memory that the system must reserve space for either in physical RAM or on the paging file so that this memory can be addressed by threads running in the associated process.

A memory contention index called the Committed Bytes:RAM can be calculated to aid in capacity planning and performance. When the ratio is greater than 1, virtual memory exceeds the size of RAM and some memory management will be necessary. As the Committed Bytes:RAM ratio grows above 1.5, paging to disk will usually increase up to a limit imposed by the bandwidth of the paging disks. Memory should be added when the ratio exceeds 1.5. The Committed Bytes:RAM is calculated by:

Memory\Committed Bytes ÷ System RAM in Bytes

Process\Working Set

The Working Set counter shows the amount of memory allocated to a given process that can be addressed without causing a page fault to occur. To see how much RAM is allocated overall across all process address spaces, use the _Total instance of Working Set. Watch for upward trends for important applications.

Some server applications, such as IIS, Exchange Server, and SQL Server, manage their own process working sets. To measure their working sets, application-specific counters need to be used.

Disk Storage Requirements

The process of planning for storage requirements is divided into capacity requirements and disk performance. Although a total capacity requirement can be determined, the performance requirements, as well as fault tolerance requirements, of the system will have an impact on the implementation of the storage subsystem. For example, a single drive could provide enough storage space, but the performance of that single disk may not meet the performance needs of the system.

Due to this, both capacity and performance requirements need to be met, which may alter the decision around the size, speed, and configuration of the drives in the storage subsystem.

Disk Space Capacity

The amount of storage capacity required can be calculated based on OS requirements as well as any application-specific data that needs to be stored on the system.

Disk Performance

Disk performance is typically expressed as a value of the total number of I/O operations that occur per second (IOPS), measured over some period of time during peak usage.

To determine the number of disks needed to meet a system’s IOPS requirement, the IOPS of a given drive needs to be determined. To further complicate matters, IOPS are very dependent upon the access pattern. For example, a drive will typically have a higher IOPS rating for sequential reads than it will for random writes. For this reason, it is normal to calculate a “worst case” IOPS measurement based on short random input/output operations.

To calculate the IOPS of a drive, information about the drive needs to be collected. The following table lists the information required that normally can be found in the manufacturer’s datasheet about the drive.

Table D3. Information Required for Calculating IOPS

Required Information	Description
Spindle Rotational Speed (RPM)	The spindle speed expressed as RPM.
Average Read Seek Time (ms)	The average seek time for reads.
Average Write Seek Time (ms)	The average seek time for writes.

The first step to calculating the IOPS is to determine the Average Seek Time in milliseconds that the drive is capable of doing. There is an assumption that there will be a 50/50 mix of read and write operations. If the ratio of read and write operations is modified, then the Average Seek Time will need to be adjusted. For example, if a drive has an Average Read of 4.7 ms and an Average Write of 5.3 ms, the Average Seek Time for this drive will be 5.0 ms:

5.0ms = (4.7ms + 5.3ms) ÷ 2

Next, the IO Latency needs to be calculated. This is calculated by adding the Average Seek Time to the Average Latency. The following table lists the Average Latency of common spindle speeds of drives on the market today.

Table D4. Average Latency Based on Spindle Rotational Speeds

Spindle Rotational Speed (rpm)	Average Latency (ms)
4,200	7.2
5,400	5.6
7,200	4.2
10,000	3.0
15,000	2.0

The example drive has a spindle speed of 10,000 RPM. So this drive has an IO Latency of 8.0 ms:

8.0 ms = 5.0ms + 3.0ms

A drive can only perform one IO operation at a time. To calculate the number of IOs that can be performed in a millisecond, 1 is divided by the IO Latency. Finally, this value is converted to IO per Second by multiplying by 1000. The IOPS calculation for the example drive evaluates to 125 IOPS:

125 IOPS = (1 IO ÷ 8.0ms) * 1000 ms/sec

Storage Requirements

To determine storage requirements, additional information is needed to be collected around the system being considered. Some of this information is easily identified and self explanatory, while other information may be more difficult to identify due to lack of quantifiable data. All of the following calculations are based on a per server; although if shared storage systems are being considered, then the information can be scaled up based on the number of systems sharing that storage. The following table shows the information that needs to be collected.

Table D5. Information Required for Calculating Storage Requirements

Required Information	Description	Example
# Users Per Server	Total number of users hosted by that server.	700
% Concurrent Users	The percentage of users connected to the server during peak times.	80%
IOPS per User Required	The number of IOPS generated by a user.	0.5
Storage Capacity in Gigabytes	The calculated disk storage capacity needed.	450
% Buffer Factor (for growth)	The percentage of disk storage growth allowed by the system.	20%
Read % of IOPS	The percentage of IOPS that are Read operations.	50%
Write % of IOPS	The percentage of IOPS that are Write operations.	50%
Disk Size (GB)	The drive size being considered for the storage system.	146
Calculated Drive IOPS	The calculated IOPS of the drive being considered for the storage system.	125

The information in the table above is fairly self explanatory with the exception of the IOPS per User Required. This is a measurement of the number of IOPS that a single user will generate on the system. Most venders do not have this information for their applications unless the application is extremely IO intensive. This information may be calculated from observation of a running system, but it is fraught with a number of challenges and is beyond the scope of this guide. For the purpose of this example, this guide will use the value used with Exchange Server, which is 0.5 IOPS per user.

Based on the information in Table D5, there are a number of derived values that need to be calculated. The following table lists these values.

Table D6. Derived Information Required for Calculating Storage Requirements

Required Information	Description	Example
# of Concurrent Users	The number of concurrent users calculated by calculating the percentage of concurrent users and the number of users per server.	560
IOPS per Server Required	The number of IOPS generated by each user multiplied by the number of concurrent users.	280
Total Storage Requirements	The Storage Capacity increased by the percentage of growth allowed.	540
Number of Read IOPS	The percentage of Reads and the IOPS per Server Required.	140
Number of Write IOPS	The percentage of Reads and the IOPS per Server Required.	140

Required Information	Description	Example
Drive Size Actual (GB)	After formatting, drives capacity is roughly 10 percent less than the advertised total capacity of the disk. This value adjusts for this loss of space. This value is calculated by taking 90 percent of the Disk Size (GB).	132

RAID 0+1 and RAID 5 are the most common drive configurations for storing data in a redundant manner on a system. However, these two RAID systems have different IOPS calculations due to how they operate.

RAID 0+1 Calculations

To calculate the number of drives necessary to meet the storage requirements with RAID 0+1, divide the Total Storage Requirements by the Drive Size Actual. Round the result up and multiply by 2. For this example, 10 drives are needed for RAID 0+1 to meet the storage requirements:

10 = ROUNDUP(540÷132)*2

To calculate the number of drives required to meet the performance requirements with RAID 0+1, multiply the Number of Write IOPS by 2 and add the Number of Read IOPS. Divide this total by the Calculated Drive IOPS and round the result up. For this example, 4 drives are needed for RAID 0+1 to meet the performance requirements:

4 = ROUNDUP(((140*2)+140)÷125)

Although RAID 0+1 can meet the performance requirements with just 4 disks, 10 disks are required to meet the capacity requirements.

RAID 5 Calculations

To calculate the number of drives necessary to meet the storage requirements with RAID 5, the Total Storage Requirements needs to be multiplied by 1.2 to adjust for parity storage requirements. This value is then divided by the Drive Actual Size and then rounded up. For this example, 5 drives are needed for RAID 5 to meet the storage requirements:

5 = ROUNDUP((540*1.2)÷132)

To calculate the number of drives required to meet the performance requirements with RAID 5, multiply the Number of Write IOPS by 4 and add the Number of Read IOPS. Divide this total by the Calculated Drive IOPS and round the result up. For this example, 6 drives are needed for RAID 5 to meet the performance requirements:

6 = ROUNDUP(((140*4)+140)÷125)

Although RAID 5 can meet the storage requirements with just 5 disks, 6 disks are required to meet the capacity requirements.

RAID 0+1 versus RAID 5 Calculations

As can be seen in this example, RAID 5 looks to be the better choice when using 10K 146-GB drives. However, it is important to look at different types of drives when doing these calculations. For example, if a drive that has 300 GB is substituted for the 146-GB drives and if all other characteristics remain the same, then the choices drastically change.

Using 300-GB drives, RAID 0+1 requires just 4 drives to meet both capacity and performance characteristics. RAID 5 will require 3 drives to meet capacity but 6 drives to meet performance requirements. By changing the size of the drive, the best choice changed as well.

Storage Monitoring

IOPS are used to help characterize the performance requirements of the system. However, once the system is up, additional performance monitoring can be utilized to determine if the disk subsystem is slowing the system down.

Table D7. Performance Monitor Counters for Disk Performance

Object	Counter	Instance
Physical Disk	% Idle Time	<All Instances>
Physical Disk	Disk Transfers/sec	<All Instances>
Physical Disk	Avg. Disk secs/Transfers	<All Instances>
Physical Disk	Split IO/sec	<All Instances>

Physical Disk\% Idle Time

The % Idle Time counter is the percent of time that a disk was idle during the sampling interval. An idle time of less than 20 percent indicates that the disk may be overloaded.

Physical Disk(n)\Disk utilization can be derived by subtracting the Physical Disk(n)\% Idle Time from 100%.

Physical Disk\Disk Transfers/sec

The Disk Transfers/sec is the number I/O request Packets (IRP) that has been completed during the sampling interval. A disk can only handle one I/O operation at a time. So the number of physical disks attached to the computer serves as an upper bound on the sustainable disk IO rate. Where disk arrays are used, divide the Disk Transfers/sec by the number of disks in the array to estimate individual disk I/O rates.

The Physical Disk(n)\Average Disk Service Time/Transfer can be calculated by taking the Physical Disk(n)\Disk Utilization and dividing it by the Physical Disk(n)\Disk Transfers/sec for each physical disk. This indicates how fast the drive responds to a request. If this climbs above what is specified for the disk, it can indicate that the subsystem is overloaded.

Physical Disk\Avg. Disk secs/transfers

The Avg. Disk secs/transfer is the overall average of the response time for physical disk requests during the sampling interval. This includes both the time the request was serviced by the device and the time it spent waiting in the queue. If this climbs to over 15–25 disk I/Os per second per disk, then the poor disk response time should be investigated.

The Physical Disk(n)\Average Disk Queue Time/Transfer can be calculated by taking the Physical Disk(n)\Avg. Disk secs/Transfer and subtracting the Physical Disk(n)\Avg. Disk Service Time/Transfer. The Average Disk Queue Time/Transfer indicates the amount of time a request spent waiting in the disk queue for servicing. A high queue time can indicate a poorly responding disk subsystem or specifically a poorly responding physical disk.

Physical Disk\Split IO/sec

Split IO/sec is the rate physical disk requests were split into multiple disk requests during the sampling interval. A large number of split IO/s indicates that the disk is fragmented and performance is being affected. The percentage of Split IOs can be calculated by the following formula where “n” is a specific disk:

(Physical Disk(n)\Split IO/sec ÷ Physical Disk(n)\Disk Transfers/sec) * 100

If this percentage is greater than 10 to 20 percent, check to see whether the disk needs to be defragmented.

Network Performance

Most workloads require access to production networks to ensure communication with other applications and services and to communicate with users. Network requirements include elements such as throughput—that is, the total amount of traffic that passes a given point on a network connection per unit of time.

Other network requirements include the presence of multiple network connections.

Table D8. Performance Monitor Counters for Network Performance

Object	Counter	Instance
Network Interface	Bytes Total/sec	(Specific network adapters)
Network Interface	Current Bandwidth	(Specific network adapters)
Ipv4 & Ipv6	Datagrams/sec	N/A
TCPv4 & TCPv6	Connections Established	N/A
TCPv4 & TCPv6	Segments Received/sec	N/A

Network Interface\Bytes Total/sec and Network Interface\Current Bandwidth

This Bytes Total/sec is the number of bytes transmitted and received over the specified interface per second The Current Bandwidth counter reflects the actual performance level of the network adaptor, not its rated capacity. If a gigabit network adapter card on a segment is forced to revert to a lower speed, the Current Bandwidth counter will reflect the shift from 1 Gps to 100 Mpbs.

Using these two values, the Network interface utilization can be calculated for each interface, designated as “n” with the following equation:

(Network Interface(n)\Bytes Total/sec ÷ Network Interface(n)\Current Bandwidth) *100

If % Busy for a given adapter exceeds 90 percent, then additional network resources will be necessary. Generally, the maximum achievable bandwidth on a switched link should be close to 90–95 percent of the Current Bandwidth counter.

Ipv4 & Ipv6\Datagrams/sec

These counters show the total number of IP datagrams transmitted and received per second during the sampling interval. By generating a baseline around this counter, a trending and forecasting analysis of the network usage can be performed.

TCPv4 & TCPv6\Connections Established

The Connections Established counter is the total number of TCP connections in the ESTABLISHED state at the end of the measurement interval. The number of TCP connections that can be established is constrained by the size of the Nonpaged pool. When the Nonpaged pool is depleted, no new connections can be established.

Trending and forecasting analysis on this counter can be done to ensure that the system is properly scaled to handle future growth. The server can be tuned using the TCP registry entries like MaxHashTableSize and NumTcTablePartitions based on the number of network users seen on average.

TCPv4 & TCPv6\Segments Received/sec

The Segments Received/sec is the number of TCP segments received across established connections, averaged over the sampling interval. The average number of segments received per connection can be calculated. This can be used to forecast future load as the number of users grows. The following formula can be used to calculate the average number of segments received per connection:

TCPvn\Segments Received/sec ÷ TCPvn\Connections Established/sec

Windows Server-based File Servers

For Windows Server-based file servers, additional performance counters are available in addition to the ones mentioned previously, which can be used to monitor the performance of the system. The following table lists the most important counters.

Table D9. Performance Monitor Counters for File Server Performance

Object	Counter	Instance
Server	Work Item Shortages	N/A
Server Work Queues	Available Threads	<All Instances>
Server Work Queues	Queue Length	<All Instances>

Server\Work Item Shortages

The Work Item Shortages counter indicates the number of times a shortage of work items caused the file server to reject a client request. This error usually results in session termination. This is the primary indicator that the File Server service is short on resources.

Server message blocks (SMBs) requests are stored in a work item and assigned to an available worker thread. If there are not enough available threads, the work item is placed in the Available Work Items queue. If this queue becomes depleted, then the server can no longer process SMB requests.

Server Work Queues\Available Threads

The Available Threads reports the number of threads from the per-processor Server Work Queue that are available to process incoming SMB requests. When the number of available threads reaches 0, incoming SMB requests must be queued. This is the primary indicator for identifying if the number of work threads defined for the per-processor Server Work queues is a potential bottleneck.

The MaxThreadsPerQueue registry DWORD value located at HKLM\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters controls the number of threads that are created in the per-processor thread pools. By default, the system will create 10 threads per-processor thread pool. This value should be increased when the following conditions hold true:

Available Threads is at or near zero for a sustained period.
Queue Length of waiting requests is greater than 5.
% Processor Time for the associated processor instance is less than 80.

By increasing the available threads, additional work items can be handled; however, care should be taken not to overstress the system with the addition of the extra threads.

Server Work Queues\Queue Length

The Queue Length counter reports the number of incoming SMB requests that are queued for processing, waiting for a worker thread to become available. There are separate per-processor Server Work queues to minimize inter-processor communication delays.

This is a primary indicator to determine whether client SMB requests are delayed for processing at the file server. It can also indicate that the per-processor Work Item queue is backed up because of a shortage of threads or processing resources.

If any of these queues grows beyond 5, then the underlying reason for this growth should be investigated. File server clients whose sessions are terminated because of a work item shortage must restore their sessions manually.

This accelerator is part of a larger series of tools and guidance from Solution Accelerators.