You can use the tracking data for SQL Server transaction wait times to find overused resources and eliminate bottlenecks.
Excerpted from “SQL Server DMV Starter Pack,” published by Red Gate Books (2010).
Every time a SQL Server session has to wait for some reason before the requested work can continue, SQL Server records the length of time waited. It also records the resource for which the transaction is waiting.
The sys.dm_os_wait_stats dynamic management view (DMV) exposes these wait statistics, aggregated across all sessions, to provide a summary review of where the major waits are on any given instance. This same DMV also exposes performance (PerfMon) counters, which provide specific resource usage measurements (disk transfer rates, amount of CPU time consumed and so on).
By correlating wait statistics with resource measurements, you can quickly locate the most “contested” resources on your system, and highlight potential bottlenecks.
The use of “waits and queues” as the basis for a performance-tuning methodology is explained in an excellent white paper by Tom Davidson. Essentially, each request to SQL Server will result in the initiation of a number of “worker tasks.”
A SQL Server Scheduler assigns each task to a worker thread. Normally there’s one SQL OS scheduler per CPU, and only one session per scheduler running at any time. It’s the scheduler’s job to spread the workload evenly between available worker threads.
If a session’s worker thread is running on the processor, the session status will be “Running,” as exposed by the Status column of sys.dm_exec_requests DMV. If a thread is ready to go, but the scheduler to which it’s assigned currently has another session running, then it will be placed in the “Runnable” queue. This means it’s in the queue to get on the processor. This is referred to as a signal wait.
The signal wait time is exposed by the signal_wait_time_ms column, and is solely CPU wait time. If a session is waiting for another resource to become available in order to proceed, such as a locked page, or if a running session needs to perform I/O, then it’s moved to the wait list. This is a resource wait and the waiting session’s status will be recorded as “Suspended.”
The reason for the wait is recorded, and exposed in the wait_type column of the sys.dm_os_wait_stats DMV. The total time spent waiting is exposed by the wait_time_ms column, so you can calculate the resource wait time as follows:
Resource waits = Total waits – Signal waits = (wait_time_ms) - (signal_wait_time_ms)
Signal waits are unavoidable in online transaction processing (OLTP) systems, which are comprised of a large number of short transactions. The key metric, with regard to potential CPU pressure, is the signal wait as a percentage of the total waits. A high-percentage signal is a sign of excessive CPU pressure. The literature tends to quote “high” as more than about 25 percent, but it depends on your system.
On our systems, we treat values greater than 10 percent to 15 percent as a worrying sign. Overall, the use of wait statistics represents a very effective means of diagnosing response times in your system. In simple terms, you either work or you wait. Response time equals service time plus the wait time.
If response times are slow and you find no significant waits, or mainly signal waits, then you need to focus on CPU. If you find response time is mainly comprised of time spent waiting for other resources (such as the network, I/O and so on) then you know exactly where to focus your tuning efforts.
Mario Broodbakker has written an excellent introductory series of articles on using wait events to diagnose performance problems. Our first script in the OS category uses the sys.dm_os_wait_stats DMV, which is defined as: Returns information about all the waits encountered by threads that executed. You can use this aggregated view to diagnose performance issues with SQL Server and also with specific queries and batches.
This query calculates signal waits and resource waits as a percentage of the overall wait time, in order to diagnose potential CPU pressure:
-- Total waits are wait_time_ms (high signal waits indicates CPU pressure)
SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms) AS NUMERIC(20,2)) AS [%signal (cpu) waits] , CAST(100.0 * SUM(wait_time_ms - signal_wait_time_ms) / SUM(wait_time_ms) AS NUMERIC(20, 2)) AS [%resource waits]FROM sys.dm_os_wait_stats ;
This query is useful to help confirm CPU pressure. Because signal waits are time waiting for a CPU to service a thread, if you record total signal waits that are roughly greater than 10 percent to 15 percent, then this is a pretty good indicator of CPU pressure.
These wait stats are cumulative since SQL Server was last restarted, so you need to know your baseline value for signal waits, and watch the trend over time. You can manually clear out the wait statistics, without restarting the server, by issuing a DBCC SQLPERF command, as follows:
-- Clear Wait Stats
DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR) ;
If your SQL Server instance has been running for quite a while, and you make a significant change like adding a new index, then you should consider clearing the old wait stats. Otherwise, the old cumulative wait stats will mask whatever impact your change has on the wait times.
Our second sample script using the sys.dm_os_wait_stats DMV will help determine the resources on which SQL Server is spending the most time waiting:
-- Isolate top waits for server instance since last restart
-- or statistics clear
WITH Waits AS ( SELECT wait_type , wait_time_ms / 1000. AS wait_time_s , 100. * wait_time_ms / SUM(wait_time_ms) OVER ( ) AS pct ,
ROW_NUMBER() OVER ( ORDER BY wait_time_ms DESC ) AS rn FROM sys.dm_os_wait_stats WHERE wait_type NOT IN ( 'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK', 'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE', 'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH', 'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE', 'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN' ) ) SELECT W1.wait_type , CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s , CAST(W1.pct AS DECIMAL(12, 2)) AS pct , CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct FROM Waits AS W1 INNER JOIN Waits AS W2 ON W2.rn <= W1.rn GROUP BY W1.rn , W1.wait_type , W1.wait_time_s , W1.pct HAVING SUM(W2.pct) - W1.pct < 95 ; -- percentage threshold
This script will help you locate the biggest bottleneck at the instance level. This helps you focus your tuning efforts on a particular type of problem. For example, if the top cumulative wait types are disk I/O-related, then you’d want to investigate this issue further using disk-related DMV queries and PerfMon counters.
The DMV that exposes the PerfMon counters is sys.dm_os_performance_counters, which is described as: “Returns a row per performance counter maintained by the server.” This is a useful DMV, but it can be frustrating to use. Depending on the value for cntr_type for a given row, you’ll have to go through some interesting gyrations to get meaningful information from this DMV. It’s a replacement for the old sys.sysperfinfo from SQL Server 2000.
This script helps you investigate unusual conditions filling up your transaction log. It returns the recovery model, log reuse wait description, transaction log size, log space used, percentage of log used, compatibility level and page verify option for each database on the current SQL Server instance:
-- Recovery model, log reuse wait description, log file size,
-- log usage size and compatibility level for all databases on instance
SELECT db.[name] AS [Database Name] , db.recovery_model_desc AS [Recovery Model] , db.log_reuse_wait_desc AS [Log Reuse Wait Description] , ls.cntr_value AS [Log Size (KB)] ,
lu.cntr_value AS [Log Used (KB)] , CAST(CAST(lu.cntr_value AS FLOAT) / CAST(ls.cntr_value AS FLOAT) AS DECIMAL(18,2)) * 100 AS [Log Used %] , db.[compatibility_level] AS [DB Compatibility Level] , db.page_verify_option_desc AS [Page Verify Option]FROM sys.databases AS db INNER JOIN sys.dm_os_performance_counters AS lu ON db.name = lu.instance_name INNER JOIN sys.dm_os_performance_counters AS ls ON db.name = ls.instance_nameWHERE lu.counter_name LIKE ‘Log File(s) Used Size (KB)%’ AND ls.counter_name LIKE 'Log File(s) Size (KB)%' ;
This query helps evaluate an unfamiliar database server. It’s also useful from a monitoring perspective. For example, if your log reuse wait description is something unusual and your transaction log is 85 percent full, that should trigger an alarm.
Learn more about “SQL Server DMV Starter Pack” at red-gate.com/our-company/about/book-store.