Handling Multiple Bottlenecks


Topic Last Modified: 2005-05-18

It is not uncommon for servers to experience multiple bottlenecks. It is important, however, to understand whether there are any causal relations occurring—that is, where one subsystem's performance issues spills over to another subsystem. For example, a CPU-bound server can build up queues, which causes unusually high use of the SMTP disks.

Because of the possibility of causal relations occurring between subsystems, analyze the performance logs with regard to:

  • The role assigned to the server.

  • The cause or causes that trigger the performance degradation of one or more subsystems.

Generally, it is worth mitigating each bottleneck, and then seeing the effects of removing that malfunctioning piece of the puzzle. Otherwise, enforcing policies may be enough to mitigate issues caused by multiple bottlenecks. For instance, enforcing message sizes for POP3 retrievals can reduce the load on the database disk. However, enforcement may not be enough. There are many cases that will require upgrades or a redesign of the hardware.

In this example, the Exchange server is a mailbox server that hosts 6,000 users. It is connected to three direct attach storage arrays:

  • One array has the database.

  • Another array has the transaction logs.

  • The third array has the SMTP disks.

This Exchange server has two storage groups with two private databases, each database with 1,500 users. The SLA for backing up and restoring limits the number of storage groups to two.

The problem is that, during the daytime, users experience slow response as they use Microsoft® Office Outlook® in online mode.

By collecting a performance log during the eight hours of day-time use for which this server is experiencing degraded performance, it becomes clear that the MSExchangeIS\RPC Requests counter is constantly around 60, and that some clients experience slow responses to the operations requested. Furthermore, the MSExchangeIS\RPC Averaged Latency counter is constantly hitting or going above 100 ms. These are clear symptoms of performance issues that need to be isolated.

Eight hours of day-time performance information


Analysis of the performance logs uncovered problems with the performance of the database drive, the log drive, and the CPU. The following sections indicate which performance counters were used to determine each problem.

The Exchange Server is connected to a Storage Area Network that cannot handle the I/O load. As shown in Figure 10, the write latencies on the database drive (as indicated by the PhysicalDisk\Average Disk sec/Write counter) average 62 ms, with frequent spikes above 80 ms and some above 100 ms.

Performance log of the database drive


By adding another array and controller, and then splitting the storage groups into separate arrays, the performance of the database drive improved.

As shown in the following figure, a slow transaction log drive is causing the Database\Log Record Stalls/sec counter to average 114 stalls/second, with constant spikes above 150 stalls/second. In addition, there are frequent log threads waiting as indicated by the Database\Log Record Stalls/sec counter, with spikes above 20.

Performance log of the transaction log drive


The controller responsible for the transaction logs was experiencing problems. The controller has the write-back cache disabled. The stalls subsided after replacing the old controller with a new controller that had a properly functioning write-back cache.

As shown in the following figure, this server is experiencing high CPU usage (with the Processor\% Processor Time counter averaging 97%) and large processor queues (as indicated by the System\Processor Queue Length counter).

Performance log of CPU usage


The slowness on the database and transaction logs aggravated the CPU utilization, causing more context switches than necessary (an average of 50,000 on this performance log) and consequently over utilizing the CPU. By resolving the database issues in Problem 1 and the transaction logs issues in Problem 2, the CPU utilization problem shown in the figure in this problem is resolved as well.