The Perils of Fiber Mode
Summary: Ken Henderson explains the effects that SQL Server fiber mode coupled with the User Mode Scheduler can have on your system, and what to consider carefully before enabling fiber mode. (4 printed pages)
In a previous column, I detailed how SQL Server's User Mode Scheduler (UMS) component schedules work to be performed within the server. I mentioned that UMS can be configured to run in either thread mode or fiber mode, and that thread mode is the default. I also talked about how fiber mode can reduce both context switching between threads and switches between user mode and kernel mode on the CPU. In this column, I'll delve further into why fiber mode isn't generally recommended and the details behind that advice.
As I pointed out previously, fiber mode uses Windows fibers, rather than threads, to service UMS workers. Windows fibers are lighter-weight execution mechanisms than threads, with one thread typically hosting multiple fibers. A thread creates a fiber via the CreateFiber() (or CreateFiberEx()) API and schedules it to run via SwitchToFiber(). Because the base execution facility in Windows is still the thread, Win32 code that executes on a fiber still runs via a thread. It's just that context switches between threads typically occur far less often, and expensive switches to kernel mode occur less frequently, because fibers are user-mode constructs that the kernel knows nothing of. Context switches can occur between multiple fibers hosted by a given thread rather than between threads, and some operations that would normally require a switch into kernel mode can instead be carried out entirely in user mode. Using fibers effectively teaches threads to juggle.
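To make the juggling concrete, here is a minimal sketch of several cooperatively scheduled contexts sharing one thread. It is not Win32 or SQL Server code: Python generators stand in for fibers, a `yield` stands in for a SwitchToFiber() call, and the names `fiber`, `run_round_robin`, and `demo` are invented for the illustration.

```python
import threading

def fiber(name, steps, log):
    """Generator standing in for a fiber: it runs until it gives up control."""
    for i in range(steps):
        # Record which OS thread is actually executing this step.
        log.append((name, i, threading.get_ident()))
        yield  # analogous to switching to another fiber

def run_round_robin(fibers):
    """The host thread switches between fibers itself; the OS scheduler is not involved."""
    pending = list(fibers)
    while pending:
        current = pending.pop(0)
        try:
            next(current)            # resume the fiber until its next yield
            pending.append(current)  # still has work left; requeue it
        except StopIteration:
            pass                     # fiber finished

def demo():
    log = []
    run_round_robin([fiber("A", 2, log), fiber("B", 2, log)])
    return log
```

Calling `demo()` shows A and B interleaving step by step, with every entry carrying the same thread identifier: two logical execution contexts, one thread.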
Given this and SQL Server's questionable name for fiber mode, lightweight pooling, you might be tempted to run out and enable fiber mode on all your production SQL Server machines, especially those facing scalability challenges. After all, isn't "lightweight" better in terms of performance? Wouldn't that be more scalable? Not so fast. As a rule, you should stay away from fiber mode unless a Microsoft engineer or partner tells you otherwise. There are some weighty issues you need to be aware of when contemplating SQL Server's lightweight pooling mode.
Components That Don't Work
The first thing to be aware of is that certain components within SQL Server simply don't work (or behave erratically) when the server is in fiber mode (e.g., SQLXML or SQL Mail). Some components, in fact, aren't supported at all when the server is in fiber mode (e.g., heterogeneous queries and SQLCLR in SQL Server 2005). The reason for this is that they make use of thread-specific facilities such as Thread Local Storage (TLS) or thread-owned objects. As the name implies, TLS is data storage that is private to a given thread. Because multiple UMS workers may share a given thread when in fiber mode, the TLS changes one worker makes may inadvertently affect other workers. If the component in question is supported at all in fiber mode, it may behave erratically, especially when the system is under load.
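The TLS hazard is easy to reproduce in miniature. The sketch below is an analogy only: Python's `threading.local()` stands in for Win32 TLS, generators stand in for fiber-based workers, and `worker`, `current_user`, and `demo` are hypothetical names.

```python
import threading

tls = threading.local()  # stand-in for Win32 Thread Local Storage

def worker(name, seen):
    """Generator standing in for a fiber-based worker."""
    tls.current_user = name         # store "private" state in thread-local storage
    yield                           # give up the thread; another worker runs on it
    seen[name] = tls.current_user   # read back what we assume is still ours

def demo():
    seen = {}
    a, b = worker("alice", seen), worker("bob", seen)
    next(a)   # alice's first half runs on this thread
    next(b)   # bob's first half runs on the same thread and overwrites the slot
    for w in (a, b):
        try:
            next(w)   # each worker resumes and reads the shared slot
        except StopIteration:
            pass
    return seen
```

Here `demo()` returns `{"alice": "bob", "bob": "bob"}`: the first worker reads back the second worker's value, because the storage is private to the thread and not to the worker.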
Another way in which a fiber can make use of thread-specific facilities is to create thread-owned objects such as mutexes (a mutex is a type of Win32 kernel object) and critical sections. Critical sections and mutex objects support the notion of being owned by a particular thread (other kernel objects, such as section objects, semaphores, and files, do not). If a fiber-based UMS worker makes an ODS (Open Data Services) call (for example, an extended procedure making srv_* calls) after taking ownership of a kernel object or critical section, the worker can actually end up on a different thread when it returns from the call. This would orphan the kernel object or critical section, potentially hanging the next worker that attempts to acquire it. If enough of these workers hang waiting on orphaned objects, your SQL Server may appear to be hung.
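The orphaning scenario can be sketched with any lock that tracks an owning thread. In the example below, Python's `threading.RLock` plays the role of a thread-owned mutex. The function and variable names are made up for the illustration, and the timeout exists only so the example returns instead of hanging.

```python
import threading

def demo():
    lock = threading.RLock()       # owned by whichever thread acquires it
    acquired = threading.Event()
    finish = threading.Event()

    def original_thread():
        lock.acquire()    # the worker takes ownership while on this thread
        acquired.set()
        finish.wait()     # the thread stays busy elsewhere and never releases

    t = threading.Thread(target=original_thread)
    t.start()
    acquired.wait()

    # The worker now "returns from its call" on a different thread (this one).
    try:
        lock.release()
        release_result = "released"
    except RuntimeError:           # only the owning thread may release
        release_result = "not owner"

    # The next worker tries to acquire the orphaned lock; time out rather than hang.
    next_worker_got_lock = lock.acquire(timeout=0.2)

    finish.set()
    t.join()
    return release_result, next_worker_got_lock
```

`demo()` returns `("not owner", False)`. The migrated worker cannot release what its former thread owns, and the next worker would wait indefinitely without the timeout.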
Another example is what we might call the GetLastError() problem. The Win32 GetLastError() API makes use of information that is stored locally for each thread (in the thread's Thread Environment Block, or TEB), so external code running within SQL Server that makes use of it can also fall victim to fiber mode. For example, consider an extended procedure that calls a Win32 API that sets the last error code for the host thread. Before retrieving that code via GetLastError(), the extended procedure makes another ODS call (e.g., srv_sendmsg()) in order to report the return value of the API it just called. If running in fiber mode, the procedure may be in for a nasty surprise when it returns from the ODS call. It may actually be running on a different thread! How is that possible? Well, let me explain...
As I detailed in the earlier column, fiber mode poses special challenges to the server when running external code such as extended procedures, COM objects, and OLE-DB providers. These external entities are not typically UMS-aware, so they cannot be depended upon to call the appropriate UMS yield routines when necessary. Because the server can't count on them to yield as all good UMS citizens must (remember: UMS follows a cooperative multitasking model), it must give them their own thread to run on that is outside the regular UMS worker pool. This way, if external code takes a long time to run or attempts to monopolize the CPU, the other fish in the UMS sea can swim along happily, virtually undisturbed.
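A toy scheduler shows why code that never yields is a problem under a cooperative model. This is an illustrative sketch, not UMS: generators stand in for workers, and `cooperative`, `uncooperative`, and `run` are hypothetical names.

```python
def cooperative(name, steps, log):
    """A well-behaved worker: yields after every unit of work."""
    for i in range(steps):
        log.append(f"{name}{i}")
        yield

def uncooperative(name, steps, log):
    """Stand-in for external code: does all of its work before yielding once."""
    for i in range(steps):
        log.append(f"{name}{i}")
    yield

def run(tasks):
    """Cooperative round-robin: a task keeps running until it chooses to yield."""
    pending = list(tasks)
    while pending:
        task = pending.pop(0)
        try:
            next(task)
            pending.append(task)
        except StopIteration:
            pass

def demo():
    log = []
    run([cooperative("A", 2, log), uncooperative("X", 3, log), cooperative("B", 2, log)])
    return log
```

`demo()` returns `["A0", "X0", "X1", "X2", "B0", "A1", "B1"]`. Everyone else waits while X runs to completion, which is exactly the behavior the server avoids by moving external code onto its own thread.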
This is not such a big deal when the server is already running in thread mode. Its thread is simply set aside and ignored for the time being by UMS, and a new worker thread is queued up to begin servicing the UMS scheduler object on which it was running. When the external code finishes, the thread is returned to the UMS pool, and other workers can be scheduled onto it just as they were previously. While it is more expensive than running code within the server engine that cooperates with UMS, external code can still work reliably when the server is in thread mode.
Fiber mode complicates matters significantly, however. When the server is in fiber mode, an entire UMS scheduler object must be created for the express purpose of servicing external code execution requests. This scheduler is not displayed in the output of DBCC SQLPERF(umsstats), and is therefore known as a "hidden" scheduler. The scheduler's workers are always threads; hence, when a fiber-based worker on another scheduler needs to run some external code, its work can be transferred to this scheduler object and executed on a separate thread. When the external code finishes, flow passes back to a regular UMS worker (though not necessarily the originator) and business proceeds as usual.
Problems occur, however, when in the course of running on the hidden scheduler, external code makes use of an ODS function that forces it to call back into the server. Because this call must be scheduled via UMS like any other work within the server, it causes the execution flow to move back into a fiber-based worker, then move back to the hidden scheduler when the ODS call returns. The problem is, there's no guarantee that the thread the work is scheduled on when it returns to the hidden scheduler will be the same as the one it was originally running on. It can be scheduled on any thread on the hidden scheduler, and its original thread may be busy servicing some other external request. When a system is under load, it will likely end up on a different thread, wreaking havoc if it makes use of thread-specific data or constructs and does so in an unsafe manner.
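The effect on a per-thread last-error value can be sketched as follows. This is a simulation under stated assumptions: `threading.local()` stands in for the per-thread slot that GetLastError() reads, two single-thread executors stand in for two threads on the hidden scheduler, and `failing_api` with its error code 5 is hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_tls = threading.local()  # stand-in for the per-thread last-error slot

def failing_api():
    """Hypothetical API: fails and records error 5 for the calling thread."""
    _tls.last_error = 5
    return False

def get_last_error():
    """Return the calling thread's last error, or 0 if none was recorded."""
    return getattr(_tls, "last_error", 0)

def demo():
    # Each executor owns exactly one long-lived thread.
    with ThreadPoolExecutor(max_workers=1) as thread_a, \
         ThreadPoolExecutor(max_workers=1) as thread_b:
        thread_a.submit(failing_api).result()                      # the API fails on thread A
        same_thread = thread_a.submit(get_last_error).result()     # still on A
        other_thread = thread_b.submit(get_last_error).result()    # resumed on B
    return same_thread, other_thread
```

`demo()` returns `(5, 0)`. The error is still there if the code resumes on its original thread, and it is simply absent if the code resumes on another one.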
Seasoned Windows programmers will point out that the first thing good code does when returning from a Win32 API that sets the calling thread's last error is to call GetLastError() and cache the return value in a stack variable or class member of some type so that subsequent code doesn't reset it. They will opine that it's no surprise that code that doesn't do this (as is the case in the example above) behaves erratically. That's a fair point, but keep in mind that not everyone writing extended procedures, COM objects, and the like is aware of this. Novice developers, in particular, may not realize that a simple ODS call may invoke all sorts of Win32 APIs within the server that are likely to reset the last error value of the host thread or make unexpected changes to TLS or other thread-specific stores and constructs.
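The defensive habit described here, copying the error into a local variable before doing anything else, can be sketched in the same simulated setting. As before, `threading.local()` stands in for the per-thread last-error slot, single-thread executors stand in for distinct threads, and `failing_api`, `call_and_capture`, and `report` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_tls = threading.local()  # stand-in for the per-thread last-error slot

def failing_api():
    """Hypothetical API: fails and records error 5 for the calling thread."""
    _tls.last_error = 5
    return False

def get_last_error():
    return getattr(_tls, "last_error", 0)

def call_and_capture():
    """Call the API and copy the error into a local before doing anything else."""
    ok = failing_api()
    err = get_last_error()   # captured immediately, on the same thread
    return ok, err

def report(ok, err):
    """Runs after the simulated ODS call, possibly on a different thread."""
    return "ok" if ok else f"API failed with error {err}"

def demo():
    with ThreadPoolExecutor(max_workers=1) as thread_a, \
         ThreadPoolExecutor(max_workers=1) as thread_b:
        ok, err = thread_a.submit(call_and_capture).result()
        # The cached value travels with the code, so the thread no longer matters.
        return thread_b.submit(report, ok, err).result()
```

`demo()` returns `"API failed with error 5"` even though the report is produced on a different thread, because the value was cached before the switch.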
You might wonder whether it's possible for Windows itself to cause problems by scheduling the hidden scheduler's worker threads on and off of the CPU as it sees fit, as it does with the other threads in the system. Would it be possible, for example, for Windows to cause the last error value for a thread to be overwritten by scheduling the thread off of the CPU after the thread runs a Windows API function, but before it calls GetLastError()? No, it wouldn't. If that were possible, no multithreaded Windows app would work reliably. The problem arises with the SQL Server fiber mode because multiple logical execution contexts—fibers—share a single physical execution context: a thread. When code that was designed to run on a thread makes changes to its execution context under the assumption that it will not affect anyone else because each thread has its own execution context, it will often not behave very well when a fiber is inserted between it and that thread context. To put it succinctly, external code that relies on thread-specific constructs and storage such as the last error value will probably not run in a predictable manner when the SQL Server fiber mode is enabled.
You may be familiar with the concept of thread safety in multithreaded programming. Certain SQL Server components are not supported or do not work reliably in fiber mode because they are not "fiber safe." The same types of issues that afflict multithreaded applications that are not thread safe affect them as well, only with respect to fibers rather than threads.
But Don't Fibers Help Scalability?
In addition to making some components within the server unusable, switching to fiber mode rarely enhances performance or scalability on the typical system. Usually, there are far more pressing issues vis-à-vis performance tuning that should be addressed long before fiber mode is even on the drawing board. Switching to fiber mode is no quick fix, and enabling it may cause critical components of your system to fail. Add to this the fact that thread-based scheduling has been greatly streamlined in Windows Server 2003, and you have very little reason to use lightweight pooling mode in production applications.
So, given all this, you may be wondering why lightweight pooling is in the product in the first place. When is it appropriate to enable it? Generally speaking, I would again say that you should not enable fiber mode unless Microsoft or one of its partners recommends it. As a rule, any recommendation to enable lightweight pooling should be met with a healthy degree of skepticism. Look for other things to tune first. On the typical system, there are many things that will yield greater dividends, and that are safer to try, than switching from thread-based to fiber-based UMS workers.
More specifically, on a system that has been tuned to the maximum degree possible given the hardware and other constraints placed upon it—and on which the fact that components such as SQLXML may not be usable is not an issue—there may be a situation in which it's appropriate to enable lightweight pooling. It depends on the specific system (the hardware, the way it's being used, the type of application, etc.) and on whether the system is otherwise as highly tuned as it can be.
Fiber mode was intended for niche situations in which a scalability ceiling is hit due to UMS workers spending significant amounts of time switching between thread contexts, or switching the CPU into kernel mode and back again. Unless you have encountered this yourself, and have already tuned the system as much as possible using more obvious (and safer) techniques, I recommend that you stay away from fiber mode and focus your tuning efforts on other things. If you run into a situation where you feel strongly you need fiber mode, it's probably worth a call into Microsoft Product Support Services to confirm your diagnosis. Better to get a second opinion than to break your server in subtle and pernicious ways with little hope of determining how or why, and with no likely improvement in overall performance.
SQL Server for Developers
Ken Henderson is a husband and father living in suburban Dallas, Texas. He is the author of eight books on a variety of technology-related topics, including The Guru's Guide to SQL Server Architecture and Internals (Addison-Wesley, 2003). An avid Dallas Mavericks fan, Ken spends his spare time watching his kids grow up, playing sports, and gardening.