A Site Server, Commerce Edition Scalability Case Study

August 1999 

Executive Summary

We conducted this study to test concepts for building a highly scalable Electronic Commerce site with as few servers as possible.

A standard Site Server 3.0 Commerce Edition (SSCE) installation was used, with an application code base from the customer's production environment. The study consisted of the following phases:

Phase 1: Baseline and Optimization

Measured the original baseline code, then measured the optimized code.

Phase 2: Platform Upgrades

Measured platform upgrades to obtain insight into performance deltas and benefits of upgrading the platform by each type of upgrade (Windows NT 4.0 Service Pack 4, SSCE Service Pack 2, MDAC 2.1 SP1, ADSI 2.5, Visual Basic Scripting Edition 5.0).

Phase 3: Architectural Optimization

Implemented architectural improvements and measured the throughput of partitioned operations.

The study shows that we can improve performance and increase total site capacity by:

  • Converting product browse Active Server Pages (ASPs) to static HTML pages 

  • Upgrading site platform software to the latest versions and service packs 

  • Upgrading the site platform from a dual-CPU server to a quad-CPU server 

There was an overall increase of 100% in capacity, from 300 shoppers to 600 shoppers, resulting from conversion of ASP pages to HTML pages and site platform software upgrades. Upgrading the platform to a quad-CPU server resulted in a 167% increase in capacity, from 300 shoppers to 800 shoppers.

The study also shows that even more substantial performance gains can be realized by modifying the overall site architecture to render frequently visited ASP pages as static HTML requests. We used this technique in combination with ISAPI.DLL to render the dynamic portions of the now-static HTML pages, then partitioned HTML/ASP requests with different workload characteristics (cost) to different, dedicated servers. This reduced server count by applying vertical scaling to the partitioned request, thus in effect increasing available CPU resources to process a greater number of requests.

In short, the study shows how 100,000 concurrent users theoretically can be supported by as few as 88 Web servers when optimization and vertical scaling techniques are applied to operations that are high consumers of CPU capacity. The improvements in code and architecture will enable the site to increase capacity by 283% (nearly fourfold), from 300 shoppers per server to 1,149 shoppers per server, at a savings of approximately $3,240,000.

These results contrast dramatically with traditional practices of scaling applications vertically (using larger, faster single machines) or horizontally (using hundreds of servers executing the entire base set of application code), which would have required hundreds of machines (333 servers) to support the same workload.

In terms of cost per shopper, the cost of 88 servers is approximately $1,770,000 (43 Compaq Proliant 5500 dual-CPU Xeon Class servers at $15,000 each = $645,000, plus 45 Compaq AlphaServer 1200 dual-CPU Alpha Class servers at $25,000 each = $1,125,000). This is a cost of $17.70 per shopper for 100,000 shoppers, a stark contrast to the cost of a traditional, horizontally scaled server farm of 333 servers, which would have cost 333 x $15,000 = $4,995,000 (or $49.95 per shopper for 100,000 shoppers).

Note that this capacity is at the very extreme end. With assumptions of a 40:1 browse-to-buy ratio, 11-minute average session length, $250 average per checkout, and with a transaction rate of 0.1 checkouts/user/11 minutes, this configuration supports 57 million shoppers/day, 1.7 million checkouts/day, and can generate revenue of $158 billion annually.

In terms of cost per transaction, the checkout transaction server (Compaq Proliant 5500 dual-CPU Xeon Class) has an observed capacity of 216,000 checkouts/day/server, or a checkout transaction cost of approximately $0.07/transaction ($15,000 / 216,000). Amortized over a year, this amounts to approximately $0.0002/transaction ($15,000 / (216,000 x 365)).

With Compaq AlphaServer 1200 dual-CPU hardware, the theoretical capacity is 345,600 checkouts/day/server, or a transaction cost of approximately $0.07/transaction ($25,000 / 345,600). Amortized over a year, this also amounts to approximately $0.0002/transaction ($25,000 / (345,600 x 365)).
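
For reference, this arithmetic can be sketched as follows. This is an illustrative WSH/VBScript fragment using the figures already given; it is not part of the study's tooling.

' Illustrative only: per-transaction cost arithmetic from the figures above.
Dim serverCost, checkoutsPerDay
serverCost = 15000                  ' Compaq Proliant 5500 dual-CPU Xeon Class
checkoutsPerDay = 216000            ' observed checkout capacity per server per day
WScript.Echo "Cost per checkout: $" & FormatNumber(serverCost / checkoutsPerDay, 4)
WScript.Echo "Amortized over one year: $" & FormatNumber(serverCost / (checkoutsPerDay * 365), 5)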

Note that transaction capacity in this architecture can be finely tuned by increasing the number of checkout-transaction servers. You can also add capacity for other operations by adding dedicated servers. In this way, capacity can be increased with fewer servers than when using traditional scaling techniques. The benefit of this architecture, then, is high capacity with minimal management complexity and cost.

Customer Capacity Goals

The study customer asked for 10x capacity by Christmas 1999, followed by 100x capacity for Christmas 2000, but had no figures for actual site traffic. The study showed that there were approximately 30 concurrent visitors to the customer's site, and capacity could be estimated as shown in the following table:

Time | Concurrent users | Checkouts per day | Server capacity required¹ | Projected annual revenue
Current (1x) | 30 | 521 | 0.10 | $47 million
Christmas 1999 (10x) | 300 | 5,210 | 1 | $475 million
Christmas 2000 (100x) | 3,000 | 52,100 | 10 | $4.7 billion

¹ The customer currently has four servers. Although capacity increases for Christmas 2000 initially indicated a requirement for 6 additional servers, performance optimization resulting from this study (platform upgrades, early rendering of product browses, an additional CPU upgrade) reduced the server requirement to 3,000/800 = 3.75. No additional servers are required to meet this goal.

We also discovered that the customer's actual goal was only 4,000 transactions per day for Christmas 2000 (equivalent to 240 concurrent users). This translates to the following requirements for the next three years, using the same 10x, 100x growth rates:

Time | Concurrent users | Checkouts per day | Server capacity required² | Projected annual revenue
Christmas 2000 | 240 | 4,168 | 0.80 | $380 million
Christmas 2001 | 2,400 | 41,684 | 8 | $3.8 billion
Christmas 2002 | 24,000 | 416,840 | 65 | $38 billion

² The customer already has four servers. On this growth schedule, the customer would need four additional servers to accommodate growth planned for Christmas 2001 (for a total of eight servers) and an additional 57 servers to accommodate growth planned for Christmas 2002 (for a total of 65 servers). However, using the performance optimizations recommended in this study, the server requirement can be reduced to 2,400/800 = 3 for Christmas 2001 (no additional servers required) and approximately 30 servers (26 additional servers) to handle projected capacity for Christmas 2002.

During this study, a Compaq Proliant 7000 quad-CPU Pentium II 400 MHz Xeon Class server provided sufficient capacity for 800 concurrent users. Our measurement stopped short of the actual maximum, due to insufficient resources available for testing. However, using this number, the required server count for Christmas 2002 becomes 24,000/800 = 30 servers. Note also that server counts are based on current Transaction Cost Analysis (TCA) costs on Pentium II-class servers. Using Alpha Class servers would further reduce the required number of servers.

If additional capacity is required in the immediate future, the partitioned operations architecture used in this study enables the customer to scale the site horizontally by operation, scale it vertically by installing Compaq Alpha Class servers, or, if necessary, further optimize operations that consume a high amount of CPU capacity. At this revenue level, high-performance servers from Compaq (8-CPU systems), Data General (64-CPU systems), or Unisys (64-CPU systems) can be brought online to further increase capacity and reduce server count.

Standardized Capacity Projections

A much more realistic look at capacity projections can be found in the Forrester Report entitled Retail's Growth Spiral (11/98), which reports that the growth of online commerce sites averages 70% (1.7x) per year.

This figure can be compared to the customer's projected 10x growth rate, which translates to revenue growing to ten times its previous level (a 900% increase) each year. This rate of growth, and perhaps more, might be achievable in the first year after launching the site, when growth is typically high as customers discover the site with the help of advertising campaigns. However, the site's rate of growth will probably track industry averages more closely in following years.

This study projects that with 40 servers, the site could theoretically generate $19 billion in annual revenue by Christmas 1999 (404x current revenue levels). At 400 servers, the site could theoretically generate $190 billion in annual revenue by Christmas 2000 (4,042x current revenue levels).

With optimization, these projected revenue levels begin to look astronomical. With 40 servers, the site could theoretically eventually generate $38 billion in annual revenue (808x current revenue levels). With 400 servers, the site could theoretically eventually generate $507.1 billion in annual revenue (10,789x current revenue levels).

A much more realistic capacity growth projection might look as follows:

Time | Concurrent users | Checkouts per day | Server capacity required | Projected annual revenue
Current (1x) | 30 | 521 | 0.10 | $47 million
Christmas 1999 (10x) | 300³ | 5,210 | 1 | $475 million
Christmas 2000 (17x) | 510 | 8,857 | 1.7 | $808 million

³ This accounts for initial growth following site launch.

Reason for the Study

We have recently received requests for advice on how to scale customer sites for peak shopping periods, such as Christmas and Back-To-School. Some customers are also being pressured by our competitors, who suggest that they might face as much as ten times normal traffic during these peak periods, followed by ten times more traffic during next year's peak periods.

For example, if a customer has four web servers handling the current user load, they would need 40 web servers to handle peak shopping periods this year, and 400 web servers to handle peak shopping periods next year.

However, these figures are probably unrealistic, since only a handful of very successful sites actually experience this type of an increase in traffic volume. Nevertheless, all customers want to become successful electronic commerce sites, and want to be prepared for the resulting increase in volume.

Traditionally, scaling a Web site or electronic commerce site is done by scaling the site either vertically or horizontally. Scaling vertically is done by adding processors, memory, and disk space to an existing server or by purchasing new machines with larger internal capacity. Scaling horizontally is done by adding more servers. See the white paper on "Building High-Scalability Server Farms" for more information on the various methods of scaling sites.

At first glance, these methods appear to provide easy solutions to the problem of increasing a site's capacity. However, some customers object to these methods, either because they have already maximized their current hardware (vertically scaled their sites), or are not enthusiastic about expanding their server farms (scaling horizontally), due to the increased costs of managing large Web site server farms. In effect, these customers do not want to scale vertically or horizontally beyond a certain point because of the cost of the related increase in management complexity.

In order to address customers' requirements for using a smaller group of machines to support their sites, we decided to attempt to improve the architecture of their existing hardware. This approach attempts to utilize the machines in a more efficient manner by dividing up requests with similar workload characteristics (similar CPU cost), then putting similar operations on dedicated servers, rather than having all of the servers handling all types of requests (requests with different CPU costs). We found that this approach can improve site capacity more efficiently and more cost-effectively than traditional horizontal and vertical scaling methods. See the white paper on "Building High-Scalability Server Farms" for more details on how to improve a site's architecture.

Before we could redesign a site's architecture, however, we first needed to understand the current performance of the site and where the bottlenecks were (if any). We then needed to understand how improving the architecture could improve the site's scalability, while reducing complexity and total operating cost more than either of the other two methods.

We chose a live, representative customer site for our study. We audited the site using a measurement technique called Transaction Cost Analysis (TCA). TCA measures the cost of each shopper operation in terms of CPU cycles. (See the white paper "Using Transaction Cost Analysis for Site Capacity Planning" for more information on TCA.)

TCA is a more accurate measure than ASP throughput performance, which is not directly related to the shopper capacity of a site. For example, increasing ASP throughput by 50% does not necessarily increase shopper capacity by 50%, because a site serves a mix of request types and overall capacity can still be limited by a bottleneck in other ASP pages.

Measuring the CPU cost of a shopper operation, however, provides an accurate measure of the cost of that operation. CPU cost of a shopper operation can then be translated to shopper capacity, simply by dividing CPU capacity by the CPU cost of the shopper operation.
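
For illustration, that division can be sketched as follows (WSH/VBScript); the capacity and cost values here are examples only, not measurements from this study.

' Illustrative only: translate a TCA cost into shopper capacity.
Dim availableMcycles, costPerShopper
availableMcycles = 800              ' example: two 400 MHz CPUs
costPerShopper = 2.0                ' example: Mcycles consumed per concurrent shopper
WScript.Echo "Supported shoppers: " & Int(availableMcycles / costPerShopper)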

The Study

This section describes the platform on which we conducted the study, and describes what we did in each phase. The Conclusion section provides a detailed description of our results.

Hardware and Software

This section describes the hardware and software used in this study. The hardware and software match the configuration of the customer site's data center as closely as possible.

Web Servers

The Web servers used in the study were configured as follows:

Hardware

· 4 Compaq Proliant 5500s, Pentium II Xeon 400 MHz, dual processor, 1 GB memory
· 1 9 GB hard drive for base software
· 3 6 GB hard drives for Web server data (Web pages), RAID 5
· Cisco LocalDirector for load balancing

Software

· Microsoft® Windows NT® 4 Server with SP3
· Microsoft® Windows NT® Option Pack (IIS)
· Microsoft® Site Server Commerce Edition 3.0 with SP1
· LDAP hotfix for membership
· Microsoft® Internet Explorer 4.01 with SP1

SQL Servers

The SQL servers used in the study were configured as follows:

Hardware

· 2 Compaq Proliant 5500s, Pentium II Xeon 400 MHz, quad processor, 2 GB memory
· 2 9 GB hard drives for base software
· 4 18 GB hard drives for product catalog and membership (RAID 5)

Software

· Microsoft® Windows NT® 4 Server with SP3
· Microsoft® SQL Server™ 6.5 with SP4
· Microsoft® Windows NT® Option Pack with MTS
· Microsoft® Internet Explorer 4.01 with SP1
· Ad Server Database

Network bandwidth

· 100 Mbps

Clients

Clients used in the study were configured as follows:

Hardware

· 5 Pentium II 233 MHz clients with 128 MB memory

Software

· Microsoft® Windows NT® 4.0 Server with SP4
· Microsoft® Internet Explorer 4.01 with SP1
· InetMonitor

Client Usage

· 4 clients to drive load
· 1 client for response time measurements

Router (Cisco LocalDirector 5200)

The router was configured with four real Web servers mapped to one virtual Web server. Sticky session was enabled, since the customer site makes use of ASP Session variables.

Cisco LocalDirector requires that client machines be configured on a different network card interface from real Web servers, to enable virtual IP addressing of the real Web servers.

Phase 1: Baseline & Optimization

Baseline

Baseline code consisted of the customer's existing production site application code, installed and configured as it is running in the customer data center. The test site duplicated customer IIS/ASP registry entries, Microsoft® SQL Server™ 6.5 configuration parameters (including database device/sizing and database location), and Windows NT page file configuration.

Once the system was set up and configured, we performed a TCA verification, which resulted in the data shown in the following table:

# of Shoppers | CPU Utilization | Context Switches per second | Avg. Operation Latency (ms) | ASP Req. per sec. | ASP Req. Queued | # of CPUs | CPU Capacity (Mcycles) | Cost (Mcycles)
0 | 0.000% | 0.000 | 0.000 | 0.000 | 0.000 | 2 | 400 | 0.000
100 | 14.706% | 1050.279 | 421.024 | 2.659 | 0.000 | 2 | 400 | 117.648
200 | 31.960% | 3177.189 | 439.151 | 5.391 | 0.000 | 2 | 400 | 255.680
300 | 57.939% | 14637.905 | 647.937 | 8.105 | 0.141 | 2 | 400 | 463.512
400 | 90.464% | 47551.680 | 15770.377 | 8.299 | 78.074 | 2 | 400 | 723.712
500 | 94.616% | 49054.379 | 25561.366 | 9.583 | 158.789 | 2 | 400 | 756.928

Note: The 300 concurrent shoppers for this study site translate to 300 * 4 = 1,200 concurrent shoppers for a live production site, since the transaction size of the study site is four times that of the live production site.

The following chart plots Load vs. Latency:

Optimization

The optimized site consisted of the baseline code plus changes that we made to optimize performance.

We had two strategies for optimizing the site:

  • Reduce database fetches by caching frequently used data (such as that used by drop-down list boxes) and otherwise use resources more efficiently. 

  • Reduce ASP CPU utilization overhead by sidestepping execution of ASP pages for operations in which the data is relatively static (such as product pages that change no more than once a day).

The second optimization effort turned out to be a much easier way to improve the performance of an existing site than reviewing and modifying code page by page. The changes we made first, and the result of each, are as follows:

Change: Use Set objSession = Session at the top of the ASP page.
Result: Avoids multiple Microsoft® Visual Basic® Scripting Edition (VBScript) name look-ups of the Session variable. You can also apply the same change to the Application, Request, and Response objects.

Change: Use curSubTotal = Session("subtotal") rather than using Session("subtotal") directly.
Result: Reduces VBScript name look-ups.

Change: Copy session variables to local variables if they are used multiple times (such as within a loop).
Result: Reduces VBScript name look-ups.

Change: Use CreateObject instead of Server.CreateObject unless MTS transaction flow to the ASP page is necessary.
Result: Reduces activation time, since no MTS context wrapper is created for the object when the CreateObject method is used directly.

Change: Use cached data for list/combo boxes, rather than rendering the data every time within the ASP page. You can cache rendered data in an Application variable and reuse it within the ASP page.
Result: Reduces database fetches.

Change: Use a typelib declaration for constants. For example, either of the following declares constants taken from the ADO 2.1 type library (typelib uuid 00000205-0000-0010-8000-00AA006D2EA4):
<!--METADATA TYPE="TypeLib" UUID="00000205-0000-0010-8000-00AA006D2EA4" NAME="ADO21"-->
or
<!--METADATA TYPE="TypeLib" FILE="C:\Program Files\Common Files\system\ado\msado15.dll"-->
Result: Reduces name look-ups for constants.

Change: Do not create a Connection object just for use by the Command object. Use the Connection object directly, or provide the connection string to the Command object directly.
Result: Reduces object-creation time.

Change: Avoid interspersing HTML content with many small ASP fragments. For example, the following code snippet:
<% While Not qtyRS.EOF %>
<% If qtyRS("qty_high") >= 9999 Then %>
<% = qtyRS("qty_low") & "+" %> @ <% = FormatCurrency(qtyRS("price")) %><br>
<% Else %>
<% = qtyRS("qty_low") %> - <% = qtyRS("qty_high") %> @ <% = FormatCurrency(qtyRS("price")) %><br>
<% End If %>
<% qtyRS.MoveNext %>
<% Wend %>
can be optimized greatly by rewriting it as:
<%
Set fldQtyHigh = qtyRS("qty_high")
Set fldQtyLow = qtyRS("qty_low")
Set fldPrice = qtyRS("price")

Do While Not qtyRS.EOF
    If fldQtyHigh >= 9999 Then
        Response.Write fldQtyLow & "+ @ " & FormatCurrency(fldPrice) & "<br>"
    Else
        Response.Write fldQtyLow & " - " & fldQtyHigh & " @ " & FormatCurrency(fldPrice) & "<br>"
    End If
    qtyRS.MoveNext
Loop
%>
Result: Optimizes program execution.

Change: Use stored procedures in place of SQL statements. Stored procedures are stored in compiled form within SQL Server, but SQL statements must be parsed and processed by the query processor prior to execution.
Result: Reduces SQL Server execution time and CPU utilization.

Change: Open a database connection and submit execution as late as possible, then close the recordset and connection as early as possible.
Result: Increases efficiency and results in higher scalability, since more resources are made available for use at any given time.
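
As an example of the caching change above, the list/combo box approach might look something like the following sketch. The connection string, stored procedure, and field names are hypothetical; only the Application-level caching pattern is the point.

<%
' Sketch: cache the rendered <option> list in an Application variable so the
' database is queried only when the cache is empty. Names are hypothetical.
Dim strOptions, cnn, rs
strOptions = Application("CountryOptions")
If IsEmpty(strOptions) Or strOptions = "" Then
    Set cnn = CreateObject("ADODB.Connection")
    cnn.Open Application("ConnectionString")    ' hypothetical connection string
    Set rs = cnn.Execute("sp_GetCountries")     ' hypothetical stored procedure
    Do While Not rs.EOF
        strOptions = strOptions & "<option>" & rs("CountryName") & "</option>"
        rs.MoveNext
    Loop
    rs.Close
    cnn.Close
    Application.Lock
    Application("CountryOptions") = strOptions
    Application.Unlock
End If
%>
<select name="country"><%= strOptions %></select>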

The following change reduces ASP CPU utilization overhead by sidestepping the execution of ASP pages altogether.

Change: Render ASP pages as HTML pages and serve the static HTML pages instead of the ASP pages. IIS serves HTML pages extremely efficiently. To do this, we used XBuilder, a page-rendering tool that converts ASP pages to static HTML pages by crawling HTTP URL links and rendering the pages it finds as static HTML. XBuilder can be given a top-level URL (such as https://www.mysite.com), a directory URL (such as https://www.mysite.com/infodir), or a page URL (such as https://www.mysite.com/infodir/about.asp).
Result: Serves a much higher number of concurrent users browsing static HTML pages.

The following is the data we obtained from running transaction cost analysis verification on the optimized site.

Shoppers | CPU Util. | Context Switches per second | Avg. Operation Latency (ms) | ASP Req. per sec. | ASP Req. Queued | # of CPUs | CPU Capacity (Mcycles) | Cost (Mcycles)
0 | 0.000% | 0.000 | 0.000 | 0.000 | 0.000 | 2 | 400 | 0.000
100⁴ | 28.060% | 19260.459 | 509.217 | 3.751 | 0.000 | 2 | 400 | 224.480
200 | 20.130% | 2174.367 | 264.313 | 4.225 | 0.000 | 2 | 400 | 161.040
300 | 34.016% | 5823.119 | 194.653 | 6.363 | 0.000 | 2 | 400 | 272.128
400 | 72.053% | 35961.031 | 4213.775 | 8.289 | 11.787 | 2 | 400 | 576.424
500 | 93.375% | 56404.863 | 10550.026 | 9.067 | 50.538 | 2 | 400 | 747.000
600 | 93.105% | 57115.512 | 28825.262 | 8.326 | 160.429 | 2 | 400 | 744.840

⁴ Data for this row appears to be anomalous.

The following chart plots Load vs. Latency.

The data shows that, following optimization, site capacity improved by 100 shoppers (400 vs. 300 shoppers with the baseline site), which is a 33% improvement.

Although the data shows that CPU utilization had not reached 100% at a load level of 500 shoppers, ASP request throughput no longer kept pace with the load. That, coupled with much higher latency and a long request queue, would realistically make such a load level unacceptable to users.

Phase 2: Platform Upgrades

After optimizing the site, we tested it on a series of upgraded platforms. This section describes the results of those upgrades on site performance.

Windows NT Service Pack 4

The first platform upgrade was Windows NT Service Pack 4. The following table shows the results of upgrading to Windows NT Service Pack 4.

Shoppers | CPU Util. | Context Switches per second | Avg. Operation Latency (ms) | ASP Req. per sec. | ASP Req. Queued | # of CPUs | CPU Cap. (Mcycles) | Cost (Mcycles)
100 | 9.086% | 857.596 | 340.383 | 2.123 | 0.000 | 2 | 400 | 72.688
200 | 20.352% | 2207.050 | 307.151 | 4.346 | 0.000 | 2 | 400 | 162.816
300 | 32.203% | 4857.834 | 206.434 | 6.477 | 0.000 | 2 | 400 | 257.624
400 | 50.048% | 13150.839 | 625.089 | 8.506 | 0.000 | 2 | 400 | 400.384
500 | 92.995% | 58136.125 | 12444.548 | 9.094 | 64.595 | 2 | 400 | 743.960
600 | 92.876% | 57653.461 | 22288.374 | 9.293 | 140.155 | 2 | 400 | 743.008

Capacity did not significantly increase with this upgrade.

Site Server 3.0 Service Pack 2

Next, we added Site Server 3.0 Service Pack 2 to Windows NT Service Pack 4 and again measured the results.

Shoppers | CPU Util. | Context Switches per second | Avg. Operation Latency (ms) | ASP Req. per sec. | ASP Req. Queued | # of CPUs | CPU Cap. (Mcycles) | Cost (Mcycles)
400 | 42.721% | 9421.509 | 551.356 | 8.526 | 0.175 | 2 | 400 | 341.768
500 | 64.840% | 26913.301 | 2626.537 | 10.144 | 6.479 | 2 | 400 | 518.720
550 | 87.379% | 52156.246 | 6228.486 | 10.543 | 29.900 | 2 | 400 | 699.032
600 | 93.406% | 58294.496 | 17243.443 | 10.363 | 112.800 | 2 | 400 | 747.248

This time, capacity increased by 100 shoppers, to a maximum of 500 shoppers (the upper limit for load level before the average operational latency increased). This is a 25% improvement over the previous platform, and a 66.7% improvement over the baseline site.

MDAC 2.1 Service Pack 1 and ADSI 2.5

Next, we added MDAC 2.1 Service Pack 1 and ADSI 2.5 (required with MDAC 2.x) to the optimized site code platform and previous upgrades, and again measured the results.

Shoppers | CPU Util. | Context Switches per second | Avg. Operation Latency (ms) | ASP Req. per sec. | ASP Req. Queued | # of CPUs | CPU Cap. (Mcycles) | Cost (Mcycles)
400 | 40.612% | 2722.688 | 294.543 | 8.697 | 0.000 | 2 | 400 | 324.896
500 | 53.534% | 4496.735 | 315.334 | 10.847 | 0.045 | 2 | 400 | 428.272
550 | 63.882% | 7257.642 | 543.354 | 12.005 | 0.386 | 2 | 400 | 511.056
600 | 69.591% | 9775.615 | 1007.789 | 12.721 | 1.776 | 2 | 400 | 556.728
700 | 98.259% | 30045.004 | 10279.983 | 13.156 | 76.436 | 2 | 400 | 786.072

The following chart plots Load vs. Latency.

This time, capacity increased to 600 shoppers, an increase of 100 shoppers over the previous platform. This represents a 20% improvement over the previous platform and a 100% improvement over the baseline site.

The data also shows that at a load level of 700 shoppers, the CPU has been fully utilized. At this load level, ASP Requests Queued jumped sharply, from 1.776 at the 600-shopper load level to 76.436, with operation latency at an unacceptable level of 10.3 seconds.

An interesting number to note is the number of Context Switches per second. Prior to reaching maximum shopper capacity, it dropped significantly from the previous platform.

Visual Basic Scripting Edition 5

Finally, we added Visual Basic Scripting Edition 5.0 (VBScript) to the previous platform and measured once again.

Shoppers | CPU Util. | Context Switches per second | Avg. Operation Latency (ms) | ASP Req. per sec. | ASP Req. Queued | # of CPUs | CPU Cap. (Mcycles) | Cost (Mcycles)
600 | 84.146% | 22608.238 | 2153.399 | 12.461 | 7.644 | 2 | 400 | 673.168
700 | 98.187% | 29073.221 | 13558.450 | 12.288 | 102.641 | 2 | 400 | 785.496
800 | 98.455% | 26812.770 | 23504.685 | 12.108 | 186.071 | 2 | 400 | 787.640

Upgrading to VBScript 5.0 did not increase capacity. At a load level of 700 shoppers, CPU utilization increased, but ASP request throughput was lower and latency significantly higher than at a load level of 600 shoppers. Thus, 600 shoppers is the maximum capacity for the platform configured in this way.

It is also interesting to note that ASP Request throughput at the same load level dropped to 12.461 ASP Requests/Second, compared with the previous platform throughput of 12.721 Requests/Second. Note also that Context Switching per second increased by approximately 131.28% over the previous configuration, suggesting a performance issue with VBScript 5 in this configuration. (This issue has since been addressed with Windows NT4 Service Pack 5.)

Quad-CPU Configuration

We previously measured performance and capacity on dual-CPU hardware. Next, we measured performance on quad-CPU hardware, with optimizations and all platform upgrades applied. The results are as follows.

Shoppers | CPU Util. | Context Switches per second | Avg. Operation Latency (ms) | ASP Req. per sec. | ASP Req. Queued | # of CPUs | CPU Cap. (Mcycles) | Cost (Mcycles)
600 | 34.570% | 7872.792 | 286.806 | 12.922 | 0.000 | 4 | 400 | 553.120
700 | 49.040% | 15491.617 | 200.265 | 14.176 | 0.000 | 4 | 400 | 784.640
800 | 57.174% | 24420.549 | 2296.634 | 15.916 | 12.515 | 4 | 400 | 914.784
900 | 89.914% | 53591.469 | 17019.767 | 16.133 | 158.458 | 4 | 400 | 1438.624

The following chart plots Load vs. Latency.

The quad-CPU platform increased capacity by 200 shoppers over the dual-CPU configuration, to a maximum shopper load level of 800 shoppers. At the 900-shopper load level, context switching was excessive and latency significantly higher, although CPU utilization had not reached 100%. If threading or VBScript were causing bottlenecks, this might not be the true maximum capacity: it might still be possible to reduce the IIS thread count to reduce context switching and achieve enough additional ASP request throughput to take capacity beyond 900 shoppers. However, we didn't test those possibilities during this study.

Phase 3: Architectural Optimization

This section describes how we optimized the test site's architecture.

Partitioning HTML vs. ASP

Optimizing a site's architecture is primarily a process of separating (partitioning) operations with significantly different workload (cost) characteristics. For the baseline site, we separated static HTML product browse requests from checkout operations. To completely optimize a site's architecture, every operation identified in a user profile analysis needs to be partitioned. This type of partitioning is a more efficient use of server capacity than a mixed-request operations scenario.

For this study, we dedicated one server to static HTML product browse pages and another server to processing checkout operations. We did this on the assumptions that HTML requests cost very little, and that a dedicated checkout server can serve as many as 1,000 concurrent users. We then measured throughput performance and validated it to see whether our original assumptions were correct.

HTML Partitioning

Since product browse operations were already rendered as static HTML pages, this partition was relatively simple. There are three types of product browse pages in the customer site: department, skuset, and product information.

Product information pages provide dynamic product pricing based on a customer's zip code. For simplicity of the study and in the interest of development time, we generated product information pages with an assumed zip code. One way to implement dynamic product pricing on product information pages with static HTML pages is to use an ISAPI filter to interpret meta tags and retrieve product pricing, based on a zip code stored as a cookie or provided as a parameter to the product information page. In this way, product browse pages can be rendered as static HTML pages, but still provide dynamic product pricing based on zip code.
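
As a sketch of this approach (the page fragment only, not the ISAPI filter itself), the pricing portion of a product page could resolve the zip code as follows. The cookie name, stored procedure, and connection string are hypothetical.

<%
' Sketch only: resolve a zip code-based price for a product page fragment.
' Cookie, stored procedure, and connection-string names are hypothetical.
Dim strZip, cnn, rs
strZip = Request.Cookies("ZipCode")
If strZip = "" Then strZip = "99999"    ' assumed default zip code
Set cnn = CreateObject("ADODB.Connection")
cnn.Open Application("ConnectionString")
Set rs = cnn.Execute("sp_GetPrice '" & Request("SKUSetID") & "', '" & strZip & "'")
Response.Write FormatCurrency(rs("price"))
rs.Close
cnn.Close
%>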

The following table shows HTML Partitioning Data.

Shoppers | CPU Utilization | Number of CPUs | CPU Capacity (Mcycles) | Cost (Mcycles)
5,000 | 2.464% | 2 | 400 | 19.712
10,000 | 4.260% | 2 | 400 | 34.080

Since we were measuring HTML throughput, we didn't collect any data for ASP performance counters (they were all 0); only CPU utilization is relevant. The evidence of load is the number of current connections on the IIS server, which showed 5,000 and 10,000 connections, respectively, for each of the verifications.

The data shows that IIS can process a high number of static HTML requests. The study simulated at most 10,000 shoppers, which validated the assumptions behind HTML partitioning. It's clear that a dual-CPU server can absorb the load of 10,000 concurrent shoppers requesting static HTML product browse pages.

The data also suggests that many more than 10,000 shoppers can be accommodated (perhaps as many as 50,000 shoppers with CPU utilization approaching 50%). The study stopped at 10,000 shoppers, however, since not enough machines were available to generate a larger load (we used five client machines, each generating 2,000 shopper sessions).

Current benchmarks (Microsoft's 100 million hits per day scalability demo site) result in 1,200 hits per second using a 1-CPU Pentium Pro server without reaching capacity. Current Pentium II Xeon Class multiprocessor servers will likely attain 4,000 hits per second quite easily.

Checkout Partitioning

Our objective was to measure the maximum number of checkout transactions. We did this with a script used during the TCA measurement process, which we modified by inserting a sleep time so that it performed a checkout once every minute. Increasing the number of users increases the load level. For example, ten users generate ten checkout transactions every minute.

The following table shows Checkout Partitioning Data.

Shoppers | CPU Util. | Context Switches per second | Avg. Operation Latency (ms) | ASP Req. per sec. | ASP Req. Queued | # of CPUs | CPU Cap. (Mcycles) | Cost (Mcycles)
100 | 29.759% | 4176.904 | 414.139 | 3.294 | 0.000 | 2 | 400 | 238.072
120 | 44.047% | 7060.038 | 451.977 | 3.967 | 0.000 | 2 | 400 | 352.376
150 | 54.970% | 15668.520 | 945.319 | 4.914 | 0.156 | 2 | 400 | 439.760
200 | 95.732% | 44520.559 | 25882.737 | 4.979 | 84.111 | 2 | 400 | 765.856

The data shows that the maximum capacity is 150 shoppers. The latency and ASP Requests Queued at a load level of 200 shoppers are not realistically acceptable. This translates to a measured checkout capacity of 150 checkouts per minute (the verification script executes a checkout transaction per minute per user).

Taking the customer site user profile analysis figure of 0.01 checkout transactions per user per minute, the data shows that the site can absorb 150 / 0.01 = 15,000 concurrent shoppers per server for this user profile.

Conclusion

The customer's requirement for $1.7 billion in annual revenue is supported very easily with existing site capacity. Assuming an average checkout of $250, this translates to approximately 18,669 orders/day.

The existing site, as measured by the TCA process, has a capacity of 300 concurrent shoppers/server or live capacity of 1,200 concurrent shoppers for the site every 11 minutes, with a 40:1 browse-to-buy ratio, transaction rate of 0.1 checkouts/shopper/11 minutes, and an average checkout amount of $250. This means that the existing site should be able to support annual revenues of $1.9 billion.

The existing site supports revenues of $1.7 billion annually with 1,075 concurrent shoppers, well below the site's maximum capacity of 1,200 concurrent shoppers. Note that the customer specified a requirement that is 37 times the existing volume of transactions (well beyond ten times the current volume).

The study shows definitively that the site performs and scales well beyond the customer's requirements.

A site supporting 100,000 concurrent shoppers is theoretically possible with as few as 88 front-end Web servers.

This study validates the concept of scaling a site architecturally.

Shopper operations with similar workload characteristics (cost) can be grouped together. Their respective ASP pages are then processed from their allocated group servers.

Calculations based on TCA measurements show that it is theoretically possible to support 100,000 concurrent shoppers (11-minute average session, 40:1 browse-to-buy ratio, 0.1 checkouts/user/11 minutes) with as few as 88 servers, given targeted optimizations, allocated as follows:

  • Home page: Home page operations would consume 44.23 servers. Although it has a low CPU cost of 53.77 Mcycles, the Home page has a high traffic rate, thereby requiring higher capacity. Since cost is already so low, a way to dramatically reduce server count is to pre-render this page in static HTML and use ISAPI.DLL to provide dynamic content (shopping cart subtotal via stored procedure and Ad Server tags). The required server count would then theoretically drop from 44.23 to 0.38. However, to ensure availability, it's best to use at least two servers or aggregate this operation with others. 

  • Checkout: Checkout consumes 31.2 servers. Upon inspection, the purchase pipeline apparently runs six scriptor components. The liberal use of scriptor components is the likely cause of the high cost of this operation, at 415.22 Mcycles. Converting the logic within the scriptor components to natively compiled pipeline components, and using an Alpha Class server would theoretically reduce the number of servers necessary for checkout operations from 31.2 to 4.57. 

  • Product Browse page: Product browsing consumes 53.06 servers when run as ASP pages. One way to reduce server count, as shown in this study, is to use pre-rendered static HTML pages and ISAPI.DLL to provide dynamic content (such as zip code-based pricing). The number of servers required for product browsing would then theoretically be reduced from 53.06 to 0.61. However, to ensure availability, it's better to use at least two servers or aggregate this operation with others. 

  • Zip code: The zip code operation consumes 28.16 servers, with a cost of 70.10 Mcycles per operation. Although this operation appears to have a small cost, a high rate of traffic causes a high capacity requirement. This operation accepts a zip code entered by the user and stores it in a session variable. Rather than implementing this as an ASP page, it could be implemented as an ISAPI DLL that stores the zip code in a cookie, with the ASP pages modified to read the cookie instead of the session variable (see the sketch following this list). Doing this theoretically reduces the number of servers required to process zip codes from 28.16 to 0.27. Another way to simplify this operation is to assume a default zip code (such as 99999) for non-registered or anonymous users. Prices can then be recalculated in the shopping cart when the user enters a valid zip code to register and check out.

  • Search: The Search operation consumes the most servers, by far, at 70.33. This is due to a moderately high cost of 150.12 Mcycles, coupled with a high rate of traffic. Search cannot be optimized by using static HTML. However, it can be optimized by preprocessing Search logic (the customer site implements vectored search and stemming before submitting to the search engine) by ISAPI.DLL and utilizing Alpha Class servers. Doing so would theoretically reduce the number of servers required for the Search operation from 70.33 to 39.85. 
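
The cookie-based zip code handling referred to in the zip code item above could be sketched as follows. The study proposes an ISAPI DLL for this operation; the ASP fragment below only illustrates the same logic, and the cookie and form field names are hypothetical.

<%
' Sketch only: store the shopper's zip code in a cookie instead of a session
' variable. Cookie and form field names are hypothetical.
Response.Cookies("ZipCode") = Request.Form("zip")
Response.Cookies("ZipCode").Expires = Date + 30
%>

Pages that need the zip code would then read Request.Cookies("ZipCode") and fall back to the default zip code (99999) when the cookie is absent.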

In addition to the front-end Web servers, there may be a requirement for three or more additional back-end SQL Servers (membership database servers, product database servers, and ad-server database servers).

Although this study shows that a single dedicated checkout server can support up to 15,000 shoppers, we don't recommend using a single server for checkout transactions. A single server can become a single point of failure. Instead, we recommend that you use at least two checkout transaction servers, to accommodate periodic peak spikes and to ensure high availability.

An understanding of scalability issues facing the test site is clearly provided by the TCA results. Transactions such as checking out or adding items are expensive in terms of workload. Database fetches such as product browses aren't as expensive; but if they are executed frequently enough, they consume a significant amount of CPU time.

Problems with the customer site as it is currently designed are as follows:

  • High visits to home, product browse, and search pages

  • High cost of checkouts 

  • Unoptimized ASP code 

Available solutions include the following:

  • Upgrading the current platform to Windows NT SP4, Site Server Service Pack 2, MDAC 2.1 SP1, ADSI 2.5 

  • Deploying pre-rendered static HTML product browse pages 

  • Partitioning and dedicating server groups to operations with similar workload characteristics (cost), applying vertical scaling to dedicated server groups 

  • Further performance optimization of the ASP code 

Tools

This section describes the tools used in this study:

  • Automation scripts 

  • Remote command service 

  • InetMonitor 

  • Performance Monitor 

  • XBuilder 

Automation Scripts

The scripts provided in this section are the batch files we used to automate the testing process.

TCA.CMD

TCA.CMD is the root batch file. It takes 3 parameters:

  • Host to run load against 

  • Length of test in seconds 

  • Server name of host (to allow remote command recycle of the World Wide Web service) 

For example: tca www.myhost.com:80 1800 \\mywebserver

TCA launches Performance Monitor and InetMonitor. At the end of the specified test period, TCA kills Performance Monitor and InetMonitor and goes on with the next iteration. The iterations are specified by a collection of InetMonitor parameter files named users*.txt.

@echo off
if "%1" == "" goto error
if "%2" == "" goto error
if "%3" == "" goto error

rem sleep 6 hours
sleep 21600

net use * %3\c$ /u:administrator mypassword

for %%i in (users*.txt) do (dostart.cmd %%~ni && start iload.exe /s %1 /f %%i && sleep %2 && dorestart.cmd %3)

attrib +r *.* /s

goto end

:error
echo don't forget to use server name:port on command line.
echo also the second parameter should be run time per script in seconds.
echo also the third parameter should be the remote web server hostname (e.g. \\csamsiss30).
:end

DOStart.CMD

@echo off
start perfmon %1.pmw
sleep 30

DORestart.CMD

@echo off
kill iload.exe
kill perfmon.exe
rcmd %1 "net stop w3svc /y"
rcmd %1 "net start w3svc"

Sleep.Exe

This is a utility from the Windows NT Resource Kit that is used to pause/sleep within a batch file.

Kill.Exe

This is also a utility from the Windows NT Resource Kit. It is used to terminate an InetMonitor load simulation run and the associated Performance Monitor log.

Remote Command Service

Remote Command Service also comes from the Windows NT Resource Kit. This tool is used to execute character/console commands on a remote server. It is used to recycle Web servers at the end of a test run.

InetMonitor

InetMonitor is a load simulation tool found in the Microsoft® BackOffice® Resource Kit. It is used to simulate load from client machines on the target Web server.

A component of InetMonitor called InetLoad (iload.exe) runs the client process that generates the load. InetLoad's parameter file (load.inp) can be found in the directory in which InetMonitor runs. This file is a text file that can be customized to run iload.exe from the command line (as shown in the TCA.CMD batch file).

The command syntax to run iload.exe is as follows:

iload.exe /s www.myhost.com:80 /f parameterfile.txt

InetMonitor supports script commands to execute HTTP requests, control requests, and load distribution commands. You can distribute load by specifying %NN SKIP X (skip the next X commands NN% of the time). For example, to average 2.1 operations per pass, you can execute 2 operations unconditionally and add 1 additional operation that is skipped 90% of the time.

In order to distribute the load and obtain good data results, you need to distribute users across the average session length. For example, if the session length is 10 minutes, you can distribute 600 users by specifying an InetMonitor client ramp-up delay of (10 min x 60 sec) / 600 users = 1 second. InetMonitor will then pace client ramp-up by separating users by 1 second each, and a full load will be reached after 10 minutes. It is a good idea to run the test for at least several times the average session length. For example, if it takes 10 minutes to ramp up, allow another 10 minutes to create an average load for measurement, then an additional 10 minutes to ramp down.

Performance Monitor

Performance Monitor can automatically start a log if you specify the log name and save the workspace as a file (.pmw extension). You can provide this file to Performance Monitor as its start-up settings, which will autostart logging.

To see the ASP session load, it's best to terminate a session at the end of a script so that the Performance Monitor ASP Session counter does not continue to climb. You can do this with the Session.Abandon ASP statement. The TCA scripts use an ASP page named quitsession.asp to do this.
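
A minimal quitsession.asp, assuming nothing more than what is described above, could be:

<%
' quitsession.asp (sketch): end the simulated shopper's session at the end of a
' test script so that the ASP Session counter does not keep climbing.
Session.Abandon
%>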

XBuilder

One of the best ways to optimize Site Server performance is to avoid the CPU overhead required to execute ASP pages and database operations, if data is relatively static. You can do this by changing the ASP pages to HTML pages. IIS processes HTML pages very efficiently, thus allowing the site to serve a much higher concurrent number of users.

XBuilder is a tool that you can use to render static HTML pages from dynamic ASP pages. It crawls HTTP URL links and renders the pages it finds as static HTML.

XBuilder can work with any of the following types of URLs:

  • A top-level URL (such as https://www.mysite.com) 

  • A directory URL (such as https://www.mysite.com/infodir) 

  • A page URL (such as https://www.mysite.com/infodir/about.asp) 

From this information, XBuilder renders the Web site tree and automatically transforms dynamic links to static links.

One very useful feature of XBuilder is that you can embed a header tag within the ASP page and XBuilder will name the rendered page with the text provided in that header. For example, the following code generates a page named product1234.htm (if the SKUSetID has a value of 1234):

<% Response.AddHeader "XBuilder-FileName","product" & Request("SKUSetID") & ".htm"%> 

You can scope XBuilder to include a narrow set of pages, or allow it to crawl and render all pages except excluded directories or pages. When you scope XBuilder this way, it does not transform links to pages outside its crawl path into static links. In this way, links to dynamic ASP pages (such as checkout and view cart) can remain active.

We used the file naming and scoping features of XBuilder to render only product pages, thus enabling the static rendering of product pages, while maintaining all other links to dynamic pages (view cart, checkout, search, and so on).

Generating product pages requires an intermediate root page to serve as the starting point for XBuilder. The root page simply generates URL links pointing to a child page that renders the dynamic content, using the passed product ID (SKU) as the parameter. The root page shows 1,000 links at a time, with the last link pointing back to itself with a parameter indicating where to start next. XBuilder then follows the links on the root page, generates the product pages, and follows the last link back to the root page, from which it continues with the next set of 1,000 until all of the product links have been exhausted.

An easy way to generate the root and child pages for XBuilder is to take the dir.asp and row.asp pages generated by the Site Server Search Database Catalog Wizard. These pages have URL links ready to crawl. We modified the dir.asp page to point to the real product browse ASP pages while passing the product ID (SKU) as the parameter. We then copied dir.asp and row.asp (the modified product.asp page) to the site directory and pointed XBuilder to crawl, starting at https://www.mysite.com/st/dir.asp, with scoping set to include only the https://www.mysite.com/st/dir.asp and https://www.mysite.com/st/product.asp pages.
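
A sketch of such a root page follows. It matches the description above (1,000 links per pass, with the last link pointing back to the root page with a start parameter); the table, column, and connection-string names are hypothetical.

<%
' dir.asp (sketch): emit 1,000 product links for XBuilder to crawl, then one
' link back to this page with the next starting position.
Dim lngStart, cnn, rs, i
lngStart = CLng("0" & Request("start"))
Set cnn = CreateObject("ADODB.Connection")
cnn.Open Application("ConnectionString")    ' hypothetical connection string
Set rs = cnn.Execute("SELECT sku FROM product ORDER BY sku")
If lngStart > 0 Then rs.Move lngStart
i = 0
Do While Not rs.EOF And i < 1000
    Response.Write "<a href=""product.asp?SKUSetID=" & rs("sku") & """>" & rs("sku") & "</a><br>"
    rs.MoveNext
    i = i + 1
Loop
If Not rs.EOF Then
    ' The last link points back to this page so XBuilder continues with the next set.
    Response.Write "<a href=""dir.asp?start=" & (lngStart + 1000) & """>next</a>"
End If
rs.Close
cnn.Close
%>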

Sample Test Scripts

There were two types of InetMonitor scripts used in this study:

  • Transaction Cost Analysis (TCA) Scripts 

  • Verification Scripts 

Transaction Cost Analysis (TCA) Scripts

We used TCA scripts to exercise load for measuring costs. See "Using Transaction Cost Analysis for Site Capacity Planning" for an explanation of TCA and a complete set of sample TCA scripts.

Verification Scripts

We used verification scripts to exercise load for TCA verification. See "Using Transaction Cost Analysis for Site Capacity Planning" for an explanation of TCA and a complete set of sample verification scripts.

Team

The Microsoft team conducting the case study consisted of the following people:

Ron Bokleman (Microsoft Consulting Services)

Philip Carmichael (IIS Product Group)

Michael Glass (Commerce Product Group)

David Guimbellot (Commerce Product Group)

Ken Mallit (Commerce Product Group)

Doug Martin (Commerce Product Group)

Michael Nappi (Commerce Product Group)

Scott Pierce (Commerce Product Group)

Caesar M. Samsi (Microsoft Consulting Services)

For information on Microsoft Solutions Framework, see https://www.microsoft.com/msf/.