Site Server - Capacity Model for Internet Transactions

April 1999 

Abstract

The purpose of capacity planning for Internet services is to enable Internet site deployments that accomplish the following:

  • Support transaction throughput targets 

  • Remain within acceptable response time bounds

  • Minimize the total dollar cost of ownership of the host platform 

Conventional solutions to capacity planning often attempt to cost Internet services by extrapolating generic benchmark measurements. To better meet the stated objectives of capacity planning, a methodology based on Transaction Cost Analysis (TCA) has been developed at Microsoft® for estimating system capacity requirements.

Client transactions are simulated on the host server by a load generation tool that supports standard network protocols. By varying the client load, transaction throughput is correlated with resource utilization over the linear operating regime. A profile is then defined based on anticipated user behavior. This usage profile determines the throughput target and other important transaction parameters from which resource utilization and capacity requirements are then calculated.

Introduction

Conventional Approach

Conventional approaches to capacity planning often involve benchmarking the maximum workload of some specific combination of services, each with a particular usage profile on specific hardware. This process is incomplete and time-consuming, because it is not practical to iterate through all possible hardware, service, and usage profile permutations. A common response to this limitation is to determine these benchmarks for only a moderate number of representative deployment scenarios, and then attempt to estimate costs by extrapolating these benchmarks to fit each particular deployment scenario. This approach also increases the risk of over-provisioning hardware, because these benchmarks are designed to capture only the maximum workload attainable.

Transaction Cost Analysis

Transaction Cost Analysis (TCA) attempts to reduce the uncertainty inherent in this process by providing a framework for estimating each resource cost as a function of any usage profile, service mix, or hardware configuration. Hardware resources include, but are not necessarily limited to, CPU, cache, system bus, RAM, disk subsystem, and network. By using TCA methodology, the potential for over-provisioning hardware is also reduced, because the entire (linear) workload range is measured. Finally, unlike some code profiling techniques, TCA captures all costs present during run time, such as operating system overhead.

Figure 1 Comparison between a Conventional Approach and TCA 

The TCA methodology is applicable to any client-server system, but this paper will focus on its application to Internet services.

Two references, Capacity Planning and Performance Modeling (From Mainframes to Client-Server Systems)1 and Modeling Techniques and Tools for Computer Performance Evaluation,2 focus on capacity and performance modeling in general. Another reference, Configuration and Capacity Planning for Solaris Servers,3 focuses on capacity planning for Sun Solaris systems, but offers helpful general insight as well.

Transaction Cost Analysis Methodology

TCA can be used both to detect code bottlenecks and for capacity planning. In many cases, it is used for both purposes simultaneously, so that the final code iteration is accompanied by capacity information. This paper discusses only the use of TCA for capacity planning. For simplicity, the methodology presented here assumes that the service analyzed has no code bottleneck (that is, its implementation scales approximately linearly) and that the bottleneck resides in the hardware running the service.

Usage Profile

The effectiveness and flexibility of the capacity planning model depends on a careful assessment of the expected usage profile for each service. This usage profile consists of both individual and aggregated user behavior, and site profile information. Analysis of transaction logs of similar services provides a helpful starting point for defining this profile. Characteristics derived from this profile are then used to establish performance targets.

Site Profile

A representative site profile must be defined that specifies the following:

  • Services deployed 

  • Number of concurrent users for each service 

  • Expected deployment configuration 

This deployment configuration should include the servers where each software component will reside.

Example 

An Internet Service Provider may be interested in deploying a Web hosting platform that supports browsing of subscriber Web pages over the Internet. Suppose that this platform also provides subscribers with a File Transfer Protocol (FTP) service for transferring these Web pages to and from the host site.

The services required to support this scenario consist of Web (HTTP), FTP, directory (for example, LDAP), and database (for example, SQL Server) services. The Web and FTP services are configured on front-end servers while the directory and database services may each reside on separate back-end servers. In order to simplify subsequent discussion of this example, suppose that only the Web and FTP services are analyzed.

User Behavior

Individual user behavior must be characterized in terms of the following:

  • The fundamental set of transactions that defines the sequence of client-side operations a user may perform. To simplify the analysis, only those transactions that in aggregate consume hardware resources most significantly need be considered. 

  • The expected user session time for each service. 

  • The number of each transaction performed per user during this session time. 

  • The associated transaction parameters that significantly affect system performance, such as file size for transactions that write, read, or transport data. 

 

Example 

Continuing with the previous example, if it is anticipated that, of all possible FTP transactions, "delete," "get," "open," and "put" will generate the most significant load in aggregate, then only these transactions need be considered. In particular, FTP "open" may stress a directory service such as LDAP and its database back end if the connection requires an authentication sequence. FTP "get" and "put" may stress disk input/output (I/O) resources and saturate network capacity, while FTP "delete" may stress only disk I/O resources, and so on. Similar deductions are made for HTTP "get." Therefore, the fundamental transactions for this analysis consist of "delete," "get," "open," and "put" for FTP, and "get" for HTTP.

Clearly, the size of files transferred or deleted using FTP, and the Web page size requested using HTTP are important transaction parameters.

Performance Targets

The performance targets consist of the minimum required transaction throughput and the maximum acceptable transaction latency. The minimum throughput (transaction rate) T_{i,j} of transaction j for service i required to support this usage profile is given by

T_{i,j} = N_i * n_{i,j} / t_i    (Equation 1)

where N_i is the number of concurrent users of service i at peak, n_{i,j} is the number of transactions j performed by each user during a session, and t_i is the user session time for service i.

Example 

Suppose the Internet site has 1 million Web hosting subscribers and that 0.1 percent are concurrently using FTP at peak. Further, suppose that the total Web page audience is 2 million users with 0.45 percent concurrently browsing Web pages at peak. If each Web user performs 5 HTTP "get" operations over 10 minutes, and each FTP user performs 3 FTP "put" and 2 FTP "get" operations together over 5 minutes, then using the notation

N_{HTTP} = 9,000 concurrent users, n_{HTTP,get} = 5, t_{HTTP} = 600 seconds
N_{FTP} = 1,000 concurrent users, n_{FTP,put} = 3, n_{FTP,get} = 2, t_{FTP} = 300 seconds

so that

T_{HTTP,get} = 9,000 * 5 / 600 seconds = 75 HTTP gets/sec
T_{FTP,put} = 1,000 * 3 / 300 seconds = 10 FTP puts/sec
T_{FTP,get} = 1,000 * 2 / 300 seconds ≈ 6.7 FTP gets/sec
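
These targets can also be computed programmatically for any usage profile. The following is a minimal sketch of Equation 1 in Python; the function name and data layout are illustrative, and the profile values are taken from the example above.

# Throughput targets from a usage profile (Equation 1): T_ij = N_i * n_ij / t_i
def throughput_targets(profile):
    # profile maps service -> (concurrent users N_i, session time t_i in seconds,
    #                          {transaction name: count per user n_ij})
    targets = {}
    for service, (users, session_s, transactions) in profile.items():
        for name, count_per_user in transactions.items():
            targets[(service, name)] = users * count_per_user / session_s
    return targets

profile = {
    "HTTP": (9000, 600, {"get": 5}),            # 10-minute session per Web user
    "FTP":  (1000, 300, {"put": 3, "get": 2}),  # 5-minute session per FTP user
}

for (service, name), rate in throughput_targets(profile).items():
    print(service, name, round(rate, 1))        # HTTP get 75.0, FTP put 10.0, FTP get 6.7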

Instrumentation of Probes

Transaction Cost Analysis (TCA) requires correlating resource utilization with service run-time behavior. Although adequate instrumentation to enable measurement of resource utilization is often already built into the operating system, the metrics to measure the performance of service transaction characteristics must also be defined and then built into the probing apparatus. Figure 2 illustrates the intended purpose of these probes in the context of capacity planning and performance management.

 

Figure 2 The Purpose of Probes in Capacity Planning and Performance Management 

Windows NT Performance Monitor

Microsoft® Windows NT® Performance Monitor is equipped with probes (called counters) that measure resource utilization and other system parameters as well as transaction flow, state, queue, and latency parameters for each service. Some Performance Monitor counters commonly collected during TCA (and also as part of regular performance monitoring) include:

  • Windows NT Operating System Counters 

    These include processor utilization, context switching rate, processor and disk queue lengths, disk read and write rates, available memory bytes, memory allocated by process, cache hit ratios, network utilization, network byte send and receive rates, and so on. 

  • Transaction Flow Counters 

    These include transaction rates such as HTTP gets/sec, FTP puts/sec, LDAP searches/sec, and so on. 

  • Service State, Queue, and Latency Counters 

    These include concurrent connections, cache hit ratios (for example, for LDAP and SQL Server transactions), queue length, and latency within each service component, and so on. 

See the Microsoft Windows NT Resource Kit4 for more specific examples of Windows NT counters.

Performance Measurement
System Configuration

To apply TCA measurements made on one hardware configuration to other configurations, it is advantageous to separate service components among servers as much as possible. In an uncached environment, this enables better isolation of the resource costs of each transaction as a function of each component servicing the transaction.

This notion is illustrated in Figure 3, which shows the propagation of a single transaction (for example, LDAP "add") between client and servers and its accumulated resource costs C1 on Server 1 (LDAP Service running here) and C2 on Server 2 (SQL Server running here).

 

Figure 3 Separating Resource Costs of Service Components by Server 

If all of the components servicing this transaction reside on a single server for a particular deployment, then these separate costs may be added together as C1+C2 in order to roughly estimate the integrated cost. Of course, certain performance properties will change in this case. For example, certain costs will disappear, such as the CPU overhead for managing the network connection. On the other hand, other costs may increase, such as disk I/O, because the capacity for caching transaction requests is diminished (which reduces throughput), and so on. When caching plays a particularly important role, variable configurations should be separately analyzed.

As illustrated in Figure 4, the load generation clients and performance-monitoring tools should reside on computers other than the servers under analysis in order to minimize interference with these servers.

Figure 4 Hardware Configuration for Performance Analysis 

The service configuration should be deployed on the highest-performance hardware available. This helps to ensure that resource requirements based on measurements on higher-performance hardware will translate to lower-performance hardware with at worst a linear scale-up. The converse is not necessarily true, however.

For example, multi-processor scaling is often sub-linear, in which case total CPU costs on a 4-way SMP server may be greater than on a 2-way SMP server under the same load (due to additional context switching and bus management overhead). Suppose CPU costs are measured on a 500 MHz 4-way SMP server and the actual deployment hardware consists of a 300 MHz 2-way SMP server. Then if all other hardware characteristics remain unchanged, CPU requirements should increase by no more than a factor of 3.3 = (500 MHz/300 MHz)*(4 processors/2 processors).
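
A quick way to apply this bound is shown below; the helper function is illustrative, assuming at-worst-linear scaling in both clock speed and processor count.

# Upper bound on the growth in CPU requirements when moving from the measured
# hardware to slower deployment hardware, assuming at-worst-linear scaling.
def cpu_scale_factor(measured_mhz, measured_cpus, target_mhz, target_cpus):
    return (measured_mhz / target_mhz) * (measured_cpus / target_cpus)

print(round(cpu_scale_factor(500, 4, 300, 2), 2))   # 3.33: measured on a 4-way 500 MHz server,
                                                    # deployed on a 2-way 300 MHz server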

Load Generation

Load generation scripts must be written to simulate each transaction separately, and have the following form:

do transaction
sleep rand timeInterval

The sleep is intended to reduce load on the client queue, and timeInterval should be chosen randomly5 over a representative range of the client-server transaction latencies. The load must be distributed among enough clients to prevent any single client from becoming its own bottleneck.
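
In practice, any scripting environment can drive such a loop. The sketch below is a minimal per-client example in Python; the transaction callable, latency range, and iteration count are placeholders for whatever protocol operation is being measured.

import random
import time

def generate_load(transaction, min_delay_s, max_delay_s, iterations):
    # Repeatedly issue one transaction type, sleeping a random interval between
    # calls so that arrivals are spread over a representative latency range.
    for _ in range(iterations):
        transaction()                                   # e.g., one HTTP "get" or FTP "put"
        time.sleep(random.uniform(min_delay_s, max_delay_s))

# Example (placeholder URL): drive HTTP "get" requests against a test page.
# import urllib.request
# generate_load(lambda: urllib.request.urlopen("http://server/test.html").read(),
#               min_delay_s=0.1, max_delay_s=1.0, iterations=1000)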

Before each run, the system should be restarted in order to flush caches and otherwise restore it to a consistent state for data collection. Furthermore, each run should continue for at least long enough to reach run-time "steady-state." This state is reached when, for example, initial transient (start-up) costs are absorbed, network connections are established, caches are populated as appropriate, and periodic behavior is fully captured. The measurements collected for each run should be time averaged over the length of the run in "steady-state."
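
The time averaging itself is straightforward once the warm-up interval is discarded; a small sketch follows, with the warm-up fraction as an assumed parameter.

def steady_state_average(samples, warmup_fraction=0.2):
    # Discard the initial warm-up portion of the run, then average the remainder.
    start = int(len(samples) * warmup_fraction)
    steady = samples[start:]
    return sum(steady) / len(steady)

print(steady_state_average([55, 70, 81, 80, 79, 82, 81, 80]))   # averages the post-warm-up samples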

In addition to collecting measurements of resource utilization and throughput, counters that help to isolate points of contention in the system should be monitored. These include queue lengths, context switching, out of memory paging, network utilization, and latencies. In particular:

  • On multiple disk RAID arrays, the average disk queue length per array should not exceed the number of physical disks per array. 

  • On shared Ethernet networks, the layer-2 CSMA/CD (Carrier Sense Multiple Access with Collision Detection) protocol causes effective throughput to degrade as collisions increase; a common guideline is to keep network utilization below roughly 36 percent to obtain maximum shared network throughput. Network utilization in excess of 36 percent should therefore be approached with caution.

For performance considerations specific to Microsoft Windows NT, see Netfinity Performance Tuning with Windows NT 4.0.6 

The user load (concurrent connections) should start small and increase incrementally up to N_{max}, the point at which the transaction throughput begins to decrease from its maximum, T_{max}. This decrease in throughput is due to factors such as high rates of context switching, long queues, and out of memory paging, and often corresponds to the point at which transaction latency becomes nonlinear.

These relationships are depicted in Figure 5. Each circle and the "X" in this figure represent individual runs, that is, points at which data is collected during load generation. C_{max}^{resource} denotes the maximum resource capacity.

Figure 5 Performance Measurement over the Linear Operating Regime (Generating load beyond maximum throughput corresponds to the point at which transaction latency becomes nonlinear) 

Cost Equations
Resource Costs for each Transaction

Each measured transaction rate T_{i,j} corresponds to some measured utilization of each resource, denoted by C_{i,j}^{resource}. These data point pairs are interpolated over the linear operating regime to construct an analytic relationship that expresses this resource utilization as a function of transaction rate. This relationship is shown in Figure 6.

 

For example, C_{i,j}^{CPU}(T_{i,j}; other) denotes the number of CPU cycles consumed by transaction j for service i at transaction rate T_{i,j}, where other is a placeholder for other transaction parameters, for example, file size for HTTP "get."

 

Figure 6 Cost Equations Constructed by Interpolation of Resource Costs as a Function of Throughput (Transaction Rate) 

Strictly speaking, C_{i,j}^{resource} is defined over the transaction rate range 0 <= T_{i,j} <= T_{i,j}^{max}, but if T_{i,j} > T_{i,j}^{max} then the equation for C_{i,j}^{resource} may still be applied with the interpretation that the hardware will need to be scaled up linearly. In this case, it is especially important that the TCA assumptions listed at the beginning of the "Transaction Cost Analysis Methodology" section be satisfied for this interpretation to be valid.
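
A minimal sketch of constructing one such cost equation by a least-squares linear fit over the linear operating regime follows; the measured data points are illustrative only.

import numpy as np

# Measured (transaction rate, CPU cost) pairs over the linear operating regime.
rates    = np.array([10.0, 20.0, 40.0, 60.0, 75.0])       # transactions/sec (illustrative)
cpu_cost = np.array([110.0, 210.0, 430.0, 640.0, 800.0])  # MHz consumed (illustrative)

# Interpolate C_{i,j}^{CPU}(T_{i,j}) = a * T_{i,j} + b by least squares.
a, b = np.polyfit(rates, cpu_cost, 1)

def cpu_cost_equation(rate):
    # Estimated CPU cost (MHz) at a given transaction rate.
    return a * rate + b

print(round(cpu_cost_equation(50.0), 1))   # estimated CPU cost at 50 transactions/sec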

Total Resource Costs

It is assumed that resource utilization adds linearly in a mixed environment in which multiple services are running and multiple transactions are being processed by each service. The total cost for each resource, summed over all services and transactions, is then given by

C_{total}^{resource} = Σ_i Σ_j C_{i,j}^{resource}(T_{i,j})    (Equation 2)

For performance reasons and in order to account for unexpected surges in resource utilization, it is advantageous to operate each resource at less than full capacity. A threshold factor for each resource is introduced to reflect this caution in the provisioning of resources. This factor, denoted q_{resource}, represents the percentage of resource capacity that should not be exceeded, so that provisioning should satisfy

C_{total}^{resource} <= q_{resource} * C_{max}^{resource}

Therefore, the total number of processors required to support this transaction mix without exceeding the CPU utilization threshold is given by

number of processors = C_{total}^{CPU} / (q_{CPU} * C_{max}^{CPU})    (Equation 3)

where C_{max}^{CPU} denotes the peak clock speed (CPU cycles per second) per processor. q_{CPU} is typically chosen between 60 percent and 80 percent.

Similarly, the total number of spindles required is given by

number of spindles = C_{total}^{reads} / (q_{disk} * C_{max}^{reads}) + C_{total}^{writes} / (q_{disk} * C_{max}^{writes})    (Equation 4)

where C_{max}^{reads} and C_{max}^{writes} denote the maximum number of disk reads/sec and disk writes/sec per spindle, respectively. Similar equations can be deduced for the other hardware resources.

It should be noted that the disk subsystem must be calibrated to determine C_{max}^{reads} and C_{max}^{writes}. The disk calibration process performs a large number of uncached reads and writes to the disk to determine the maximum number of reads and writes the disk array can support. In hardware, C_{max}^{reads} and C_{max}^{writes} are most strongly functions of disk seek time and rotational latency.

In light of the layer-2 CSMA/CD protocol, q_{network} is often set to 36 percent on shared Ethernet networks.
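
Putting the cost equations together, the capacity estimates reduce to a few lines of arithmetic. The sketch below follows Equation 3 and the spindle form of Equation 4 as reconstructed above, rounding up to whole units; all numeric inputs are illustrative.

import math

def processors_required(total_cpu_mhz, per_processor_mhz, q_cpu=0.7):
    # Equation 3, rounded up to whole processors.
    return math.ceil(total_cpu_mhz / (q_cpu * per_processor_mhz))

def spindles_required(total_reads, total_writes, max_reads, max_writes, q_disk=0.7):
    # Equation 4, rounded up to whole spindles: reads and writes share each spindle.
    return math.ceil(total_reads / (q_disk * max_reads) +
                     total_writes / (q_disk * max_writes))

print(processors_required(2000, 400))                              # 8 -> two quad-processor servers
print(spindles_required(300, 150, max_reads=120, max_writes=80))   # illustrative disk estimate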

Model Verification

The actual deployment scenario will consist of a mixed transaction environment as defined by the usage profile. The purpose of verification is to simulate this deployment scenario in order to (1) confirm that transactions do not negatively interfere with each other, and (2) verify that the resource costs estimated from the cost equations accurately reflect the actual costs measured.

Verification Script

The usage profile is translated into a verification script for each protocol. (There is nothing in principle to prevent creating a single verification script that invokes all protocols, but current tools do not support this.)

For a single protocol with multiple transactions, this script logic can be written in pseudo-code as:

count <-- 0
while ( count < t_i )
{
    if ( count mod ( t_i / n_{i,1} ) = 0 ) do transaction 1
    …
    if ( count mod ( t_i / n_{i,J} ) = 0 ) do transaction J

    sleep sleepIncrement
    count <-- count + sleepIncrement
}
sleep rand smallNumber

Here sleepIncrement = gcd( t_i / n_{i,1}, …, t_i / n_{i,J} ), where gcd denotes the greatest common divisor and t_i / n_{i,j} is the length of time that must elapse between successive initiations of transaction j using protocol i. For each transaction j, this script initiates the correct number of transactions n_{i,j} uniformly over the length of the service session time t_i as defined by the usage profile. The statement sleep rand smallNumber is included to randomize the transaction arrival distribution (see footnote 5).
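
A runnable sketch of the same logic follows, assuming the per-transaction intervals t_i / n_{i,j} are whole seconds; the transaction callables in the commented example are placeholders.

import math
import random
import time
from functools import reduce

def run_verification(session_s, transactions):
    # transactions maps a callable to its per-user count n_ij for one session of
    # length session_s (= t_i). Each transaction type is issued uniformly over the session.
    intervals = {fn: session_s // n for fn, n in transactions.items()}
    sleep_increment = reduce(math.gcd, intervals.values())

    count = 0
    while count < session_s:
        for fn, interval in intervals.items():
            if count % interval == 0:
                fn()                            # issue one transaction of this type
        time.sleep(sleep_increment)
        count += sleep_increment
    time.sleep(random.uniform(0, 2))            # randomize arrival phase between sessions

# Example: one simulated FTP user (3 "put" and 2 "get" over a 300-second session).
# run_verification(300, {do_ftp_put: 3, do_ftp_get: 2})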

Load Generation

The load generation process is the same as in Load Generation in the "Performance Measurement" section, except that only one run is made for each usage profile simulated. During this run, the number of concurrent connections per service should approximately equal the number of concurrent users using that service as defined by the usage profile. Each load generator instance runs a single script (such as given in the Verification Script topic under "Model Verification") representing the behavior of multiple users using one service. Multiple instances are run to generate the correct aggregate user load.

Comparison with TCA Estimates

For each usage profile simulated, measurements of resource utilization are compared against the utilization estimates calculated using the cost equations from the "Cost Equations" section. The difference between the measured and estimated total costs should be small, falling within the required confidence interval. This notion is depicted in Figure 7.

 

Figure 7 Error Between Cost Equation Estimates and Simulation Measurements for Three Different Usage Profiles A, B, and C 

Example 

The usage profile from the example in the "Usage Profile" section indicates that the required throughput for Web page requests is 75 HTTP gets/sec, and so on for the FTP transactions. The resource costs are then calculated using the equations developed in the "Cost Equations" section.

In particular, suppose for the HTTP and FTP servers that the total CPU cost is calculated as C_{total}^{CPU} = 2000 MHz. Further suppose that the deployment will occur with 400-MHz processor servers and that the requirement is to run these servers at less than 70 percent utilization. Then C_{max}^{CPU} = 400 MHz and q_{CPU} = 70 percent, so that the total number of processors required is 2000 MHz / (0.7 * 400 MHz) = 7.1. In this case, two quad-processor servers will support the required load. Similar estimates are made for the other resources.

For verification, the system is then deployed with two quad-processor servers (as indicated by these calculations) and the appropriate load is generated using the verification scripts. This load should generate 9,000 concurrent HTTP connections and 1,000 concurrent FTP connections, as indicated by the usage profile example. Suppose an average of 77 HTTP gets/sec and average CPU utilization of 1900 MHz are measured. Then the throughput requirements are satisfied and the CPU utilization estimates are in error by 5 percent.
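
The comparison itself is simple arithmetic; a small illustrative check follows, using the figures from this example.

def relative_error(estimated, measured):
    # Relative error of the cost-equation estimate against the measured value.
    return abs(estimated - measured) / estimated

print(relative_error(2000, 1900))   # 0.05 -> CPU utilization estimate in error by 5 percent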

Capacity Planning Using Transaction Cost Analysis

After TCA has been performed, it can be applied to capacity planning using the following procedure: 

  1. Define usage profile and calculate throughput targets using Equation 1. 

  2. Calculate total resource costs using Equation 2. 

  3. Calculate capacity requirements using Equations 3 and 4 and similar equations for other resources. 

Transaction Cost Analysis Requirements

The requirements for performing TCA include:

  • Usage profile 

  • Instrumentation 

  • Performance monitoring tool 

  • Load generation tools that support all network protocols invoked by the services under analysis 

  • Load generation and usage profile scripts 

  • Configuration tool 

Conclusion

This capacity model based on TCA has been successfully applied to estimate hardware platform deployment requirements for Microsoft® Commercial Internet System (MCIS) and Microsoft® Exchange products.

1 Daniel Menasce, Virgilio Almeida, and Larry Dowdy, Prentice Hall, 1994

2 Ramon Puigjaner and Dominique Potier, Plenum Press, 1989

3 Brian Wong, Prentice Hall, February 1997

4 Microsoft Press, 1996

5 The Benchmark Handbook for Database and Transaction Processing Systems, Jim Gray, Morgan Kaufmann Publishers, 1991

6 David Watts, M.S. Krishna, et al., IBM Corporation, October 1998