Get A Traffic Report: Analyze Your Database Usage With System Tables
Matt Evans and Josh Moody
At a Glance:
- Determine who's connected to your SQL Server
- Dealing with blocked processes
- Performance killers
- Clues to SQL Server performance problems
When you need information about the current connections to your SQL Server database, what those connections are doing, and which ones are slowing your system down, who do you call? Why, sp_who and sp_who2, of course..
Sometimes you may also view this information graphically in Enterprise Manager. While this info is good, it could be more helpful if you could do some better reporting on it. And why not get good data analysis? You are using SQL Server™ after all.
It turns out that sp_who, sp_who2, and Enterprise Manager are simply presenting data from the master.sysprocesses table. This means that you can write your own SQL queries to display the information you want, and create some enlightening reports. Suppose, for instance, you want to know which applications are connecting to your database. Sysprocesses contains a column called program_name, so a handy query would be:
SELECT program_name, count(*)
GROUP BY program_name
ORDER BY count(*) desc
It will tell you which apps are connected and how many instances there are of each, listing the application names by connection count, in descending order. If you have a machine that allows for parallelized queries, ecid=0 filters out the associated subthreads and only counts queries running in parallel once rather than once per processor on which they're running.
Our teams have an open policy and a fairly large number of technical users, many of whom are writing tools that access the database directly. The results of this query shed light on the tools in use and the breakdown between custom tools and tools supplied by the development team.
If you have an open policy about who can write tools that connect to your database, eventually you'll find some bad tools causing a variety of problems. One common problem is blocking. Suppose that process A is modifying data that process B wants to use. Process B will be blocked until process A has completed what it is doing. This is only one type of blocking situation; others exist and are common. What matters to you as a DBA is identifying when blocking is a problem and how to deal with it effectively.
We've found that when blocking is bad enough that users are noticing slowdowns and complain about it, Enterprise Manager does not help—it takes minutes to extract and render data from the production machines. When Enterprise Manager fails, we turn again to the handy system tables.
One of the other columns available in the sysprocesses table is "blocked". Normally, this column has a value of zero, indicating that the process in that row of the table is not blocked. However, if a row has a non-zero blocked value, that value is an integer which is the SQL Server process ID number (SPID) of the process that this process is blocked by.
More concretely, each row of sysprocesses corresponds to one connection/process inside SQL Server. Each of these processes has a SPID (this is the same SPID you use with the kill command). SQL Server can also split processing of a task across multiple subthreads, resulting in numerous rows with the same spid but differing ecids. There will always be a spid/ecid pair where ecid=0, but we only count this one when counting processes.
With a large number of users, it is common for tens or hundreds of processes to be blocked when slowdowns are noticed. Killing these processes may or may not solve the problem because 10 processes may be blocked by process B, while process B itself is blocked by process A. Issuing 10 kill statements for the processes blocked by B probably won't help, as new processes will simply become blocked by B. Killing process B may or may not help, because then the next process that was blocked by B, which is given execution time, may get blocked by process A and become the process that is blocking the other 9 remaining processes. When you have lots of blocking that isn't resolving in a reasonable amount of time you need to identify the root blocker, or the process at the top of the tree of blocked processes.
Imagine again that you have 10 processes blocked by process B, and process B is blocked by process A. If A is not blocked by anything, but is itself responsible for lots of blocking (B and the 10 processes waiting on B), then A would be the root blocker. (Think of it as a traffic jam. Figure 1 will help.) Killing A (via kill) is likely to unblock B, and once B completes, the 10 processes waiting on B are also likely to complete successfully.
Figure 1 Process Traffic Jam
Since blocking information is stored in sysprocesses, you can use plain SQL scripts to quickly identify root blocker processes, as shown in the following:
SELECT spid from master..sysprocesses
WHERE blocked = 0
AND spid in (select blocked from master..sysprocesses)
This script asks for the id of a process that isn't blocked, but which is blocking some other process. It is usually the case that killing any process identified by this query will get the database running well again.
So how do you know when to look for blocked processes? Once users are complaining, it's too late. You can use Performance Monitor (Start | Run | perfmon.msc) to monitor, graph, and set alerts on the built-in performance counters installed on the system, or on remote systems. SQL Server installs a good variety of performance counters, but for SQL Server 2000 there isn't one built-in that tells you the number of blocked processes. (For SQL Server 2005, you can use the new built-in counter "SQL Server:General Statistics:Processes blocked".) To solve this problem, we used the .NET Framework to write a performance counter that simply logged the value from
SELECT count(*) from master..sysprocesses
WHERE blocked <> 0
as the counter value. Then we could both graph and record our blocked processes count using the normal Windows Performance Counter framework. You can devise your own mechanism which logs or alerts you based on polling this value.
If this value ever gets too high (which depends on your users, database, and applications), you can have Performance Monitor alert you so you can investigate. Then you can resolve blocking problems before your users even notice.
Note that haphazardly killing processes that look like they're blocking may be a career-limiting move. Sysprocesses will tell you who and what you are about to kill via the columns program_name (the name of the application specified by the application programmer when he wrote the database connection code), hostname (the network name of the computer from which the connection is originating), and loginame (the credentials used to connect to the database). In Windows NT® domain environments, loginame will usually be the user's Windows credentials, expressed as DOMAIN\username. If the connection is using SQL-mode authentication, you'll find the SQL credentials. You may also want to run DBCC INPUTBUFFER(<SPID>) in order to determine what the user is running. Unfortunately, this only returns the first 255 characters. If you want more information, you can use the built-in fn_get_sql function, passing as an argument the sql_handle column from the sysprocesses table.
One final word on killing processes efficiently—suppose that there are quite a few processes you'd like to kill, and you can identify them via a query:
SELECT spid from master..sysprocesses
WHERE loginame = 'FABRIKAM\joe'
You can use select literals to make life easier. If that query spits back 20 SPIDs for you to kill, make SQL do the work by changing the query to:
SELECT 'kill ', spid from master..sysprocesses WHERE loginame = 'FABRIKAM\joe'
It will write lines saying "kill <spid>", one per row, that you can select and paste right back into the tool (osql.exe, isql.exe, or isqlw.exe), and then kill the whole group with one keystroke. Try doing that in Enterprise Manager!
Computationally Expensive Processes
Sometimes the database performs poorly due, not to blocking, but to particularly heavy loads. Often the DBA will determine that the database simply cannot support the work that it is being asked to do and maintain adequate performance. This doesn't necessarily mean it is time to create more indexes or throw more hardware at the problem. And if it is time to throw more hardware at the problem, there may be no money available for new boxes.
You can't always assume that periods of high utilization represent legitimate work. There could be problems in the apps that are running, or even problems caused by the user. Maybe the application has a data paging functionality, but the user has opted to receive the entire 100,000 row DataSet every time, even though she has applied a sort which gives her the one row she needs first with each query. Regardless of the root cause of the problem, your job is to identify performance issues and eliminate them.
Let's take another look at two columns in sysprocesses: cpu and login_time. One way we can measure the pain a SPID is causing is by looking at the cpu column:
SELECT TOP 5 * from master..sysprocesses
ORDER BY cpu DESC
One problem with this is that the cpu value is constantly incremented over the lifetime of the process. You may find that this query tells you about SQL Server internal processes that started when the database booted and are running in the background. These use lots of total CPU time because they've been running for days at a time, but they have spent most of that time in an idle state. These can be filtered out by querying for all spid values less than 50, as all values 50 and below are reserved for internal SQL Server system processes.
It would be more helpful to find a process that uses lots of CPU the majority of the time it is connected. You can do that by dividing the cpu value by the time the process has been connected. You can't get process connect duration directly; instead, you get the time of day that the process connected to the database, as "login_time".
Consider the following select statement:
SELECT spid, program_name, datediff(second,login_time, getdate()) as
ConnectedSeconds FROM master..sysprocesses WHERE spid > 50
This script will give you the SPID, the name of the app, and an integer called ConnectedSeconds, which represents the number of seconds that the SPID has been logged into the database.
Now you've got the pieces you need to find expensive processes, as you see in Figure 2.
Figure 2 Find Bad Processes
convert(varchar(10),db_name(dbid)) as DBName,
datediff(second,login_time, getdate()) as Secs,
convert(float, cpu / datediff(second,login_time, getdate())) as PScore,
convert(varchar(16), hostname) as Host,
convert(varchar(50), program_name) as Program,
convert(varchar(20), loginame) as Login
WHERE datediff(second,login_time, getdate()) > 0 and spid > 50
ORDER BY pscore desc
The output of this script (as illustrated in Figure 3) will be the top 30 processes, sorted by descending PScore. (PScore is the alias we've given to the computed column which results from dividing the cpu time of a process by the connected seconds of that process.)
Figure 3 Processes by PScore
In addition, the script tells you if the process is blocked, what hostname it's running from, who's running it, and what application it is. We're also using the built-in SQL Server function db_name, which takes an integer as its argument and returns the human-readable name of the database. Internally, SQL Server refers to databases by their dbid, and while sysprocesses uses this convention as well, this script converts the dbid to the database's name to make life easier for the human database administrator.
We used the WHERE clause at the end of the query to prevent divide-by-zero errors on processes that connected while the script was still running, and have a value of zero for the number of seconds that the process has been connected.
Running this script will tell you which instance of a process is using lots of CPU on the database server. You can use the loginame or hostname columns to identify the person responsible for the offending process (if your hostnames are related to who owns that workstation, you can often tell who owns a computer just from its hostname), and ask them what they're doing. Often, users have no idea that they're punishing the database. This is a good way to uncover application bugs as well—a program shouldn't necessarily eat up database CPU resources when the user doesn't even realize it's still running!
Computationally Expensive Aggregations
Sometimes you'll need to find an app that by itself doesn't consume many resources, but because the app is widely deployed, the aggregate effect of the instances of this app is really punishing the database server.
For instance, suppose you have an app that polls the database with a simple query every n seconds. The app developer probably chose n to be a compile-time constant in the application. This means that n cannot dynamically change in response to increased database load, increased number of instances of this app connecting to the database, execution time, or database performance degradation.
Suppose this application is set to poll every five seconds. This implies that execution of the query should take less than five seconds, assuming the application is deployed to one machine. If the application is deployed to five machines, then executing its query should take less than one second. It may be the case (but is not always) that executing this query 10 times doesn't take much longer than executing it once because of the caching abilities built into SQL Server, but that's more of a lucky situation than something to count on.
It is often the case that hardcoded polling intervals, especially in database applications, are chosen by the application developer as an afterthought, or because it seemed reasonable. What seems reasonable for one instance of the application, running in isolation against a database with only a few rows of test data in it, quickly becomes unreasonable in production scenarios. One way to look at the situation is, assuming the database server is completely utilized by instances of this application, doubling the length between polling—changing it from 5 to 10 seconds—doubles the number of instances which can be active or cuts the load on the database in half, freeing resources for other applications.
Before you can ask your application developers what their apps are doing, you need to identify the problem apps. In Figure 2, we simply looked at the SPIDs that were using the most CPU resources per unit of time. Now we'll use the same table—sysprocesses—but we'll query on it in a different way. We want to know which programs (program_name) use the most CPU resources (see Figure 4).
Figure 4 Find Resource Hogs
convert(varchar(50), program_name) as Program,
count(*) as CliCount,
sum(cpu) as CPUSum,
sum(datediff(second, login_time, getdate())) as SecSum,
convert(float, sum(cpu)) / convert(float, sum(datediff(second, login_time, getdate()))) as Score,
convert(float, sum(cpu)) / convert(float, sum(datediff(second, login_time, getdate()))) / count(*) as ProgramBadnessFactor
WHERE spid > 50
ORDER BY score DESC
This query does a few new things. First, we're fixing up the column program_name to make it easier to display and then using that as the argument to group by. This script will compute aggregate functions all with respect to a unique list of running program_names currently logged in.
Next we see how many instances of that program are connected, and we present it as CliCount. If it turns out that there are tens or hundreds of instances of a program (CliCount is a large number), then we know we have a program worth investigating first, as the biggest overall gains can be had by getting better efficiency from that app.
The next thing the script does is sum the cpu time and the connected seconds from all instances of this program. Using those two values, it populates the Score column, which is aggregate cpu divided by aggregate connect time (for that program).
The last computed column is called ProgramBadnessFactor. This is the aggregate cpu, divided by the aggregate seconds, then divided by the number of connected instances of the program.
This ProgramBadnessFactor attempts to quantify how bad this program is compared to the others, by dividing the Score by the number of connected instances. A high value for ProgramBadnessFactor would indicate that every instance of the program was CPU-intensive. A lower value would indicate that the program may have some instances that cause performance problems, but also has instances that are mostly idle.
Here we presented just a few examples of the interesting pieces of data you can mine from SQL Server and the performance picture you can paint once you have that data, thanks to system tables. We've looked only at sysprocesses here. However, syslockinfo is an interesting table as well, since it contains information about the objects that are locked. If you find that your database is having lots of trouble with locking, examine syslockinfo in conjunction with sysprocesses to identify programs causing locks.
Of course, these few scripts are certainly not the only tools you should have in your DBA arsenal. While many of them are a good start, you'll find that just having a script handy to run on an ad hoc basis isn't enough. We found that certain scripts were worth automating, and then publishing their results as performance counters. The .NET Framework makes it very easy to execute a SQL query to get a scalar result, and then publish that result as a performance counter value that can then be graphed, recorded historically, and used to trigger alerts. While SQL Server provides a good selection of built-in performance counters, only your specific environment and applications can determine which metrics are especially indicative of the health of your servers. Queries against the system tables are a key first step towards really gaining insight into your databases.
Matt Evans is a software tester working on the Microsoft Business Framework. You can reach him at email@example.com.
Josh Moody is a Software Development Engineer in Test at Microsoft working on Visual Studio.
© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited