Inside the SQL Server 2000 User Mode Scheduler
February 24, 2004
Summary: Ken Henderson profiles the User Mode Scheduler (UMS) in SQL Server 2000 that requires developers to write code that runs efficiently, and yields often enough in the appropriate places. UMS provides more control and allows the server to scale better than it could with the Windows scheduler. (10 printed pages)
This column is excerpted from The Guru's Guide to SQL Server Architecture and Internals (Addison-Wesley, 2003) by Ken Henderson. Used by permission. All rights reserved.
Up through version 6.5, SQL Server™ used the scheduling facilities in Windows® to schedule worker threads, switch between threads, and generally handle the work of multitasking. This worked reasonably well and allowed SQL Server to leverage the hard-learned lessons regarding Windows scalability and efficient processor use. Between versions 6.5 and 7.0, however, it became evident that SQL Server was beginning to hit a "scalability ceiling." Its ability to handle thousands of concurrent users and efficiently scale on systems with more than four processors was hampered by the fact that the Windows scheduler treated SQL Server like any other application. Contrary to what some people believed at the time, SQL Server 6.5 made use of no hidden APIs to reach the scalability levels it achieved. It used the basic thread and thread synchronization primitives that all multithreaded Windows applications use, and Windows scheduled SQL Server worker threads on and off the processor(s) just as it did any other process. Clearly, this one-size-fits-all approach was not the most optimal solution for a high-performance application like SQL Server, so the SQL Server development team began looking at ways to optimize the scheduling process.
UMS Design Goals
Several goals were established at the outset of this research. The scheduling facility needed to:
- Support fibers, a feature new in Windows NT® 4.0, and abstract working with them so that the core engine would not need separate code lines for thread mode and fiber mode
- Avoid causing a thread to switch into kernel mode whenever possible
- Avoid thread context switches as much as possible
- Support asynchronous I/O and abstract working with it so that the core engine would not need a separate code line for versions of Windows that do not support asynchronous file I/O (e.g., Windows 9x and Windows ME)
Ultimately, it was decided that SQL Server 7.0 should handle its own scheduling. From that decision, the User Mode Scheduler (UMS) component was born. UMS acts as a thin layer between the server and the operating system. It resides in a file named UMS.DLL and is designed to provide a programming model that's very similar to the Win32® thread scheduling and asynchronous I/O model. Programmers familiar with one would instantly be at home in the other. Its primary function is to keep as much of the SQL Server scheduling process as possible in user mode. This means that UMS necessarily tries to avoid context switching because that involves the kernel. Context switches can be expensive and can limit scalability. In pathological situations, a process can spend more time switching thread contexts than actually working.
User Mode vs. Kernel Mode Scheduling
You may be wondering what the advantage of moving the scheduling management inside the SQL Server process is. Wouldn't SQL Server just end up duplicating functionality already provided by Windows? With all the smart people Microsoft must have working on Windows and its scheduler, how likely is it that the SQL Server team could come up with something that was that much more scalable? I'll address this in detail below, but the short answer is that SQL Server knows its own scheduling needs better than Windows or any other code base outside the product could hope to. UMS doesn't duplicate the complete functionality of the Windows scheduler, anyway—it implements only the basic features related to task scheduling, timers, and asynchronous I/O and, in fact, relies on the thread scheduling and synchronization primitives in Windows. Several Windows scheduling concepts (e.g., thread priority) have no direct counterparts in UMS.
Preemptive vs. Cooperative Tasking
An important difference—in fact, probably the most important difference—between the Windows scheduler and the SQL Server UMS is that the Windows scheduler is a preemptive scheduler, while UMS implements a cooperative model. What does this mean? It means that Windows prevents a single thread from monopolizing a processor. Each Windows thread gets a specified time slice in which to run, after which Windows automatically schedules it off the processor and allows another thread to run if one is ready to do so. UMS, by contrast, relies on threads to yield voluntarily. If a SQL Server worker thread does not voluntarily yield, it will likely prevent other threads from running. You may be wondering why UMS would take this approach. If you're an old-timer like me, you might recall that Windows 3.x worked exactly the same way—it made use of a cooperative scheduler, and it wasn't difficult for a misbehaving application to take over the system. This was, in fact, why Windows NT was designed from the ground up to make use of a preemptive scheduler. As long as a single application could bring down the system, you could never have anything even approaching a robust operating system. UMS takes the approach it does in order to keep from involving the Windows kernel any more than absolutely necessary. In a system where worker threads can be counted on to yield when they should, a cooperative scheduler can actually be more efficient than a preemptive one because the scheduling process can be tailored to the specific needs of the application. As I said earlier, UMS knows the scheduling needs of SQL Server better than the operating system can be expected to.
How UMS Takes Over Scheduling
If UMS is to handle SQL Server's scheduling needs rather than allowing Windows to do so, UMS must somehow prevent the OS from doing what it does with every other process: schedule threads on and off the system's processor(s) as it sees fit. How do you do that in a preemptive OS? UMS pulls this off through some clever tricks with Windows event objects. Each thread under UMS has an associated event object. For purposes of scheduling, Windows ignores threads it does not consider viable—threads that cannot run because they are in an infinite wait state. Knowing this, UMS puts threads to sleep that it does not want to be scheduled by having them call WaitForSingleObject on their corresponding event object and passing INFINITE for the timeout value. When a thread calls WaitForSingleObject to wait on an object and passes INFINITE for the timeout value, the only way to awaken the thread is for the object to be signaled. When UMS wants a given thread to run, it signals the thread's corresponding event object. This allows the thread to come out of its wait state and permits Windows to schedule it to run on a processor. In order to prevent Windows from scheduling multiple threads on the same processor and thereby incurring the overhead and expense of context switches, UMS attempts to keep just one thread viable—that is, not in an infinite wait state—per processor. There are exceptions to this (e.g., full-text queries, security validations, xproc invocations, linked server queries, and so on), but the normal mode of operation is that the system allows one thread per processor to run at a time.
The UMS Scheduler
The UMS mechanism for managing the scheduling process and for ensuring that only one thread per processor is active at any given time is called a scheduler. When SQL Server starts, one UMS scheduler is created for each processor in the machine. These schedulers are not closely associated to specific processors by default, but Windows' scheduling algorithms work out such that, over time, with each UMS scheduler allowing just one thread to run, that thread should end up on its own processor. The worker pool for the server, regardless of whether it consists of threads or fibers, is distributed evenly across the UMS schedulers. This means that if you have the maximum worker threads set to the default of 255 and you have a four-processor machine, SQL Server creates four UMS schedulers, and each can host a maximum of approximately 64 workers.
The UMS Scheduler Lists
Each UMS scheduler maintains five lists that support the work of scheduling threads: a worker list, a runnable list, a waiter list, an I/O list, and a timer list. Each of these plays a different role, and nodes are frequently moved between lists.
The Worker List
The worker list is the list of available UMS workers. A UMS worker is an abstraction of the thread/fiber concept and allows either to be used without the rest of the code being aware of which is actually being used under the covers. As I said, one of the design goals of UMS was to provide support for fibers in such a way that it did not matter to the core engine code whether the system was using fibers or threads. A UMS worker encapsulates a thread or fiber that will carry out tasks within the server and abstracts it such that the server does not have to be concerned (for the most part) with whether it is in thread mode or fiber mode. Throughout this article, I'll refer to UMS workers instead of threads or fibers. If your SQL Server is in thread mode (the default), a UMS worker encapsulates a Windows thread object. If your server is in fiber mode, a UMS worker encapsulates a Windows fiber, the handling of which is actually implemented outside of the Windows kernel.
The Connection Process
When a client connects to SQL Server, it is assigned to a specific UMS scheduler. The selection heuristics are very basic: whichever scheduler has the fewest number of associated connections gets the new connection. Once a connection is associated with a scheduler, it never leaves that scheduler. Regardless of whether its associated scheduler is busy and there are inactive schedulers on the system, UMS will not move a spid between schedulers. This means that it's possible to design scenarios where SQL Server's support for symmetric multiprocessing is effectively thwarted because an application opens multiple persistent connections that do not perform a similar amount of work. Say, for example, that you have a two-processor machine, and a SQL Server client application opens four persistent connections into the server, with two of those connections performing 90% of the work of the application. If those two connections end up on the same scheduler, you may see one CPU consistently pegged while the other remains relatively idle. The solution in this situation is to balance the load evenly across the connections and not to keep persistent connections when the workload is unbalanced. Disconnecting and reconnecting is the only way to move a spid from one scheduler to another. (This movement isn't guaranteed—a spid that disconnects and reconnects may end up on the same scheduler depending on the number of users on the other schedulers.)
Once a spid is assigned to a scheduler, what happens next depends on the status of the worker list and whether the maximum worker threads configuration value in SQL Server has been reached. If a worker is available in the worker list, it picks up the connection request and processes it. If no worker is available and the maximum worker threads threshold has not been reached, a new worker is created, and it processes the request. If no workers are available and maximum worker threads has been reached, the connection request is placed on the waiter list and will be processed in FIFO order as workers become available. Client connections are treated within UMS as logical (rather than physical) users. It is normal and desirable to have a high ratio of logical users to UMS workers. This is what allows a SQL Server with a maximum worker threads setting of 255 to service hundreds or even thousands of users.
UMS processes work requests atomically. This means that a worker processes an entire work request—a T-SQL batch execution, for example—before it is considered idle. It also means that there's no concept of context switching during the execution of a work request within UMS. While executing a given T-SQL batch, for example, a worker will not be switched away to process a different batch. The only time a worker will begin processing another work request is when it has completed its current work request. For example, it may yield and execute I/O completion routines originally queued by another worker, but it is not considered idle until it has processed its complete work request, and it will not process another work request until it is finished with the current work request. Once that happens, the worker either activates another worker and returns itself to the worker list or enters an idle loop code line if there are no other workers to run and no remaining work requests, as we'll discuss in just a moment. This atomicity is the reason it's possible to drive up the worker thread count within SQL Server by simply executing a number of simultaneous WAITFOR queries. While each WAITFOR query runs, the worker that is servicing it is considered busy by SQL Server, so any new requests that come into the server require a different worker. If enough of these types of queries are initiated, maximum worker threads can be quickly reached, and, once that happens, no new connections will be accepted until a worker is freed up. When the server is in thread mode and a worker has been idle for 15 minutes, SQL Server destroys it, provided doing so will not reduce the number of workers below a predefined threshold. This frees the virtual memory associated with an idle worker's thread stack (.5 MB) and allows that virtual memory space to be used elsewhere in the server.
The "Runnable" List
The "runnable" list is the list of UMS workers ready to execute an existing work request. Each worker on this list remains in an infinite wait state until its event object is signaled. Being on the runnable list does not imply that the worker is schedulable by Windows. Windows will schedule it as soon as its event object is signaled according to the algorithms within UMS. Given that UMS implements a cooperative scheduler, you may be wondering who is actually responsible for signaling the event of a worker on the runnable list so that it can run. The answer is that it can be any UMS worker. There are calls throughout the SQL Server code base to yield control to UMS so that a given operation does not monopolize its host scheduler. UMS provides multiple types of yield functions that workers can call. As I've mentioned, in a cooperative tasking environment, threads must voluntarily yield to one another in order for the system to run smoothly. SQL Server is designed so that it yields as often as necessary and in the appropriate places to keep the system humming along. When a UMS worker yields—either because it has finished the task at hand (e.g., processing a T-SQL batch or executing an RPC) or because it has executed code with an explicit call to one of the UMS yield functions—it is responsible for checking the scheduler's runnable list for a ready worker and signaling that worker's event so that it can run. The yield routine itself makes this check.
In the process of calling one of the UMS yield functions, a worker actually performs UMS's work for it—there's no thread set aside within the scheduler for managing it. If there were, that thread would have to be scheduled by Windows each time something needed to happen in the scheduler. We'd likely be no better off than we were with Windows handling all of the scheduling. In fact, we might even be worse off due to contention for the scheduler thread and because of the additional overhead of the UMS code. By allowing any worker to handle the maintenance of the scheduler, we allow the thread already running on the processor to continue running as long as there is work for it to do—a fundamental design requirement for a scheduling mechanism intended to minimize context switches. A scheduler that is to minimize thread context switching must de-couple the work queue from the workers that carry it out. In an ideal situation, any thread can process any work request. This allows a thread that is already scheduled by the operating system to remain scheduled and continue running as long as there is work for it to do. It eliminates the wastefulness of scheduling another thread to do work the thread that's already running could do.
The Waiter List
The waiter list maintains a list of workers waiting on a resource. When a UMS worker requests a resource owned by another worker, it puts itself on the waiter list for the resource and enters an infinite wait state for its associated event object. When the worker that owns the resource is ready to release it, it is responsible for scanning the list of workers waiting on the resource and moving them to the runnable list, as appropriate. And when it hits a yield point, it is responsible for setting the event of the first worker on the runnable list so that the worker can run. This means that when a worker frees up a resource, it may well undertake the entirety of the task of moving those workers that were waiting on the resource from the waiter list to the runnable list and signaling one of them to run.
The I/O List
The I/O list maintains a list of outstanding asynchronous I/O requests. These requests are encapsulated in UMS I/O request objects. When SQL Server initiates a UMS I/O request, UMS goes down one of two code paths, depending on which version of Windows it's running on. If running on Windows 9x or Windows ME, it initiates a synchronous I/O operation (as I said above, Windows 9x and ME do not support asynchronous file I/O). If running on the Windows NT family, it initiates an asynchronous I/O operation. When a Win32 thread wants to perform an I/O operation asynchronously, it supplies an OVERLAPPED structure to the ReadFile/ReadFileEx or WriteFile/WriteFileEx function calls. Initially, Windows sets the Internal member of this structure to STATUS_PENDING to indicate that the operation is in progress. As long as the operation continues, the Win32 API HasOverlappedIoCompleted will return false. (HasOverlappedIoCompleted is actually a macro that simply checks OVERLAPPED.Internal to see whether it is still set to STATUS_PENDING.)
In order to initiate an asynchronous I/O request via UMS, SQL Server instantiates a UMS I/O request object and passes it into a method that's semantically similar to ReadFile/ReadFileScatter or WriteFile/WriteFileGather, depending on whether it's doing a read or a write and depending on whether it's doing scatter-gather I/O. A UMS I/O request is a structure that encapsulates an asynchronous I/O request and contains, as one of its members, an OVERLAPPED structure. The UMS asynchronous I/O method called by the server passes this OVERLAPPED structure into the appropriate Win32 asynchronous I/O function (e.g., ReadFile) for use with the asynchronous operation. The UMS I/O request structure is then put on the I/O list for the host scheduler.
Once an IO request is added to the IO list, it is the job of any worker that yields to check this list to see whether asynchronous I/O operations have completed. To do this, it simply walks the I/O list and calls HasOverlappedIoCompleted for each one, passing the I/O request's OVERLAPPED member into the macro. When it finds a request that has completed, it removes it from the I/O list, and then calls its I/O completion routine (this I/O completion routine was specified when the UMS I/O request was originally created). If you've worked much with asynchronous I/O facilities in Windows, you know that when an asynchronous operation completes, the operating system can optionally queue an I/O completion APC to the original calling thread. As I said earlier, one of the design goals of UMS was to provide much of the same scheduling and asynchronous I/O functionality found in the OS kernel without requiring a switch into kernel mode. UMS' support for I/O completion routines is another example of this design philosophy.
A big difference between the way Windows executes I/O completion routines and the way UMS does is that, in UMS, the I/O completion routine executes within the context of whatever worker is attempting to yield (and, therefore, checking the I/O list for completed I/O operations), rather than always in the context of the thread that originally initiated the asynchronous operation. The benefit of this is that a context switch is not required to execute the I/O completion routine. The worker that is already running and about to yield takes care of calling it before it goes to sleep. Because of this, no interaction with the Windows kernel is necessary. If running on Windows 9x or ME, the I/O completion routine is called immediately after the Win32 I/O API call. Since the operation is handled synchronously by the operating system, there is no reason to go on the I/O list and have perhaps another worker actually run the I/O completion routine. Given that we know the I/O has completed when we return from the Win32 API call, we can go ahead and call the I/O completion routine before returning from the UMS I/O method call. This means that, on Windows 9x/ME, the I/O completion routine is always called within the context of the worker that initiated the asynchronous I/O operation in the first place.
The Timer List
The timer list maintains a list of UMS timer requests. A timer request encapsulates a timed work request. For example, if a worker needs to wait on a resource for a specific amount of time before timing out, it is added to the timer list. When a worker yields, it checks for expired timers on the timer list after checking for completed I/O requests. If it finds an expired timer request, it removes it from the timer list and moves its associated worker to the runnable list. If the runnable list is empty when it does this—that is, if no other workers are ready to run—and the server is running in thread mode it also signals the worker's associated event so that it can be scheduled by Windows to run. If the runnable list is empty and the server is in fiber mode, the yielding worker will simply perform a user-mode switch to the runnable worker.
The Idle Loop
If, after checking for completed I/O requests and expired timers, a worker finds that the runnable list is empty, it enters a type of idle loop. It scans the timer list for the next timer expiration, and then enters a WaitForSingleObject call on an event object that's associated with the scheduler using a timeout value equal to the next timer expiration. The Win32 OVERLAPPED structure contains an event member that can store a reference to a Windows event object. When a UMS scheduler is created, an event object is created and associated with the scheduler itself. When an asynchronous I/O request is initiated by the scheduler, this event is stored in the hEvent member of the I/O request object's OVERLAPPED structure. This causes the completion of the asynchronous I/O to signal the scheduler's event object. By waiting on this event with a timeout set to the next timer expiration, a worker is waiting for either an I/O request to complete or a timer to expire, whichever comes first. Since it's doing this via a call to WaitForSingleObject, there's no polling involved and it doesn't use any CPU resources until one of these two things occurs.
Certain operations within SQL Server require that a worker "go preemptive"—that is, that it get taken off of the scheduler. An example is a call to an extended procedure. As I've said, because UMS is a cooperative multitasking environment, it relies on workers to yield at regular points within the code in order to keep the server running smoothly. Obviously, it has no idea of whether an xproc can or will yield at any sort of regular interval, and, in fact, there's no documented ODS API function that an xproc could call to do so. So, the scheduler assumes that an xproc requires its own thread on which to run. Therefore, prior to a worker executing an xproc, it removes the next runnable worker from the runnable list and sets its event so that the scheduler will continue to have a worker to process work requests. Meanwhile, the original worker executes the xproc and is basically ignored by the scheduler until it returns. Once the xproc returns, the worker continues processing its work request (e.g., the remainder of the T-SQL batch in which the xproc was called), then returns itself to the worker list once it becomes idle. The salient point here is that because certain operations within the server require their own workers, it's possible for there to momentarily be multiple threads active for a given CPU. This means that Windows will schedule these threads preemptively as it usually does, and you will likely see context switches between them. It also means that since executing an xproc effectively commandeers a UMS worker, executing a high number of xprocs can have a very negative effect on scalability and concurrency. Each xproc executed reduces UMS's ability to service a high number of logical users with a relatively low number of workers. Besides xprocs, there are several other activities that can cause a worker to need to go preemptive. Examples include sp_OA calls, linked server queries, distributed queries, server-to-server RPCs, T-SQL debugging, and a handful of others. Obviously, you want to avoid these when you can if scalability and efficient resource use is a primary concern.
When the server is in fiber mode, things work a little differently. A Win32 fiber is a user mode concept—the kernel knows nothing of it. Since a thread is actually Windows' only code execution mechanism, code that is run via a fiber still has to be executed by a thread at some point. The way this is handled is that the fiber management APIs in Windows associates a group of fibers with a single thread object. When one of the fibers runs a piece of code, the code is actually executed via its host thread. Afterward, user code is responsible for switching to another fiber so that it can run—a concept not unlike the cooperative tasking offered by UMS.
Given that when SQL Server is in fiber mode, multiple workers could be sharing a single Windows thread, the process that's followed when taking a worker thread preemptive won't work when we need to switch to preemptive mode with a worker fiber. Because the execution mechanism within Windows is still a thread, the thread that hosted the fiber would have to be taken off of the scheduler, and this would, in turn, take all the other fiber workers hosted by the same thread off of the scheduler as well—not a desirable situation. Instead, what happens here is that a hidden thread-based scheduler is created to service xprocs and other external calls that cause a worker to need to switch to preemptive mode. (The scheduler is hidden in the sense that it does not show up in the DBCC SQLPERF(umsstats) output.) When a worker fiber then needs to switch to preemptive mode to run one of these components, the work request is moved to this hidden scheduler and processed. Once it completes, the fiber is moved back to the original scheduler and processing continues as normal. The upshot of this is that executing things like xprocs and linked server queries can be extremely inefficient in fiber mode. In fact, there are a number of components within the server that aren't even supported in fiber mode (sp_xml_preparedocument and ODSOLE, for example). If you need to run lots of xprocs, linked server queries, distributed transactions and the like, fiber mode may not be your best option.
The server creates hidden schedulers for other uses as well. Other processes within the server require the same type of latch, resource management, and scheduling services that UMS provides for work request scheduling, so the server creates hidden schedulers that allow those processes to make use of this functionality without having to implement it themselves. An example of such a facility is SQL Server's backup/restore facility. Given that many backup devices do not support asynchronous I/O and the fact that doing a large amount of synchronous I/O on a regular UMS scheduler would negatively impact the concurrency of the entire scheduler because it would allow a single blocking synchronous I/O call to monopolize the worker (not unlike calling external code does), SQL Server puts backup/restore operations on their own scheduler. This allows them to contend with one another for processor time and permits Windows to preemptively schedule them with the other schedulers.
I mentioned DBCC SQLPERF(umsstats) earlier, and you may already be aware of it given that, although it's undocumented, it's mentioned in the public Microsoft Knowledge Base. DBCC SQLPERF(umsstats) allows you to return a result set listing statistics for the visible UMS schedulers on the system. It can list the total number of users and workers for the scheduler, the number of workers on the runnable list, the number of idle workers, the number of outstanding work requests, and so on. It's very handy when you suspect you're experiencing some type of issue with a scheduler and need to know what's going on behind the scenes. For example, you should be able to quickly tell from this output whether a scheduler has reached its maximum number of workers and whether they're currently busy.
In order to increase scalability and support Windows fibers, SQL Server has managed its own scheduling since version 7.0 via UMS. UMS serves as a thin layer between the server and the operating system that provides much of the same functionality offered by the Win32 thread and scheduling primitives, but it does so without requiring as many transitions into kernel mode or as many context switches. A key difference between UMS and Windows' scheduler is that UMS is a cooperative scheduler. It relies on workers to voluntarily yield often enough to keep the system running smoothly. By putting control of when a thread is scheduled under the direction of the server, a much greater responsibility is placed on the developers of SQL Server to write code that runs efficiently, and yields often enough, and in the appropriate places. However, this also provides a much finer granularity of control and allows the server to scale better than it could ever hope to using the "one-size-fits-all" scheduling approach in Windows, because SQL Server knows its own scheduling needs best.
SQL Server for Developers
Ken Henderson is a husband and father living in suburban Dallas, Texas. He is the author of eight books on a variety of technology-related topics, including the recently released The Guru's Guide to SQL Server Architecture and Internals (Addison-Wesley, 2003). An avid Dallas Mavericks fan, Ken spends his spare time watching his kids grow up, playing sports, and gardening.