
Debugging and Tuning

By Kevin Miller


Chapter 10 from Professional NT Services, published by Wrox Press

If debugging a service were simply a matter of setting a few breakpoints and hitting F5, like you would for a normal executable, there wouldn't be much to talk about on the topic. We've all used Visual C++'s interactive debugger, and by now we're probably experts at doing so. Unfortunately, that experience is not going to be much help here.

The first problem, as you already know, is that you can't run a service simply by running the executable that houses it. Running a service requires interacting with the SCM and launching it using StartService(). The second problem is that even if you've solved the first one, the service runs under a different security context — usually the LocalSystem account. As a consequence, many of the normal feedback mechanisms you might be tempted to use, such as console window output, simply don't work because they're not directed to the desktop of the interactive user — that is, you the developer.

In this chapter, I'll show you a number of different techniques that you can use to work around these problems, and then focus on some more general issues that can help to keep you out of trouble in the first place. Later on, I'll discuss some of the profiling tools that you can use to tune the service (or any application, for that matter) for maximum performance.

Debugging Techniques

To be honest, debugging a service really isn't that much more trouble than debugging a regular Win32 executable, as long as you know the right buttons to push — that is, if you know how to set up your development environment to make debugging work.

The first thing to do is to make sure that whatever interactive account you are logged on under has the SE_DEBUG_NAME (debug programs) right. All members of the local Administrators group have this privilege. If you're developing a service, you're probably a member of Administrators already, but you should keep this point in mind if you need to debug a service running on a remote machine.


Prying into a Running Service

Sometimes, you just need to get into a service that's already functioning to troubleshoot a minor bug. This is one of the easiest debugging problems, because it doesn't require that you customize the development environment or your machine in any special way. You simply use the ability of Visual C++ to step into and debug a running process.

To perform this task, select the Build | Start Debug | Attach to Process... menu item in Visual C++. When you do that, you'll be presented with the following dialog:


You'll need to check the Show System Processes box at the bottom in order to be able to see the module that contains your service; this checkbox allows you to see processes running in the system context. Highlight the name of the executable that houses your service, then hit OK.

Developer Studio will immediately attach itself to the process. To go into debug mode, select the Debug | Break menu item. The debugger will stop on the very next line of code in the service that it possibly can — this is usually somewhere in the middle of some assembly language code. If you're an assembler buff, you'll know what to do from here. If not, you just need to hit F10 a few times to single-step through assembly statements until you get to one of the lines of code you can recognize (helpfully intertwined into the disassembly window, so you can tell where you are). Then you can use the call stack to view the source code window where the break occurred, and then set your breakpoints or single step from there.

Using Task Manager

The other method of worming your way into a running process is to use the Windows NT Task Manager's process pane to find the name of the module you want, and then right click on it. Next, select the Debug item from the context menu. This will bring up the default debugger for the current Windows NT installation (usually Visual C++ if it's installed), or whatever debugger the current module is set up to launch when debugged (I'll show you how to do this in the next section). Then you can just proceed as above, by selecting Debug | Break to break into the module.

Of course, this doesn't give you the ability to set a breakpoint on a particular line of code before the service starts, and then to start the service and have it stop on the breakpoint. In other words, the only way to really use this debugging technique is if the function you are debugging is not on the main execution path of the running service. If the place you want to debug is in the 'initialization' or 'run' segments of the service, there's no way to stop it with a breakpoint before the function runs. You'd have to have pretty fast hands to start a service and then break into it during its initialization phase using this technique!

Setting Breakpoints Before the Service Starts

Instead of the above, what you may need to do is set breakpoints and have the service step into debug mode when it hits one. To do this, you must use the registry to attach a debugger to your module name. The example below assumes you want to use the Visual C++ debugger, but you're completely free to substitute it with your favorite.

  • In the registry path HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion, create a subkey called Image File Execution Options. (This subkey may already exist.)

  • Beneath that key, create a key with the same name as the executable that houses the service you want to debug. If the service is housed in myservices.exe, then create a key with exactly that name.

  • Beneath the module name key, create a value of type REG_SZ named Debugger, with a string that's the fully qualified path to the debugger. On my machine, this would be d:\devstudio\sharedide\bin\msdev.exe.

  • Your service is now ready to be debugged using calls to DebugBreak().
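Collected into a .reg script, the steps above might look like the following sketch. The myservices.exe key and the msdev.exe path are from the examples on my machine; substitute your own module name and debugger path:

```
REGEDIT4

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\myservices.exe]
"Debugger"="d:\\devstudio\\sharedide\\bin\\msdev.exe"
```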

Alternatively, you can just set up msdev.exe as your default debugger for all modules. Visual C++'s installer normally sets this up for you, but if that's not true in the case of your machine, you can use the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug registry key to set it up that way.

The key contains two values: Auto and Debugger. Auto can be set to either 0 or 1. If it is set to "0", the system will generate the usual popup message box that states that an exception has occurred and gives you the choice of either terminating the application by pressing OK, or debugging it by choosing Cancel. If the Auto value is set to "1", then the system dispenses with the dialog box and simply launches right into the debugger specified in the Debugger value. When Visual C++ is installed, it changes the Debugger value to the path name of msdev.exe.
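As a sketch, the corresponding .reg script might look like this. The -p and -e switches are the placeholders into which the system substitutes the process ID and event handle; the msdev.exe path will vary on your machine:

```
REGEDIT4

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug]
"Auto"="1"
"Debugger"="d:\\devstudio\\sharedide\\bin\\msdev.exe -p %ld -e %ld"
```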

Invoking the Debugger

Now you can use the DebugBreak() Win32 API function in your code to invoke the debugger. Simply put a call to DebugBreak() anywhere you would normally use a red dot breakpoint from the interactive debugger. When it's encountered at runtime, the function causes a breakpoint exception that launches the debugger.

It wouldn't feel right if everything were as easy as it looks, and of course this method of debugging has its own set of annoyances. Every time you want to change the locations of breakpoints, you have to move the calls to DebugBreak() around in the code, and then rebuild the application. Then, you have to go out to the command line and start the service, hit Cancel when the exception box pops up (spawning a new instance of Visual C++), and so on. It's just not as easy or as much fun as using F5 to start debugging straight from the editor.

A Quick Demonstration

We can look at this in action by placing a call to DebugBreak() in the Business Object project from Chapter 9. A reasonably good first place to put it would be right before the _Module.Start() call in _tWinMain(), like so:

   _Module.m_bService = false;
   if(lRes == ERROR_SUCCESS)
      _Module.m_bService = true;

   lRes = _Module.ReadParameters();
   if(lRes != ERROR_SUCCESS)
      return lRes;

   DebugBreak();   // Break into the debugger before the service starts

   _Module.Start();

   // When we get here, the service has been stopped
   return _Module.m_status.dwWin32ExitCode;

Now, when you attempt to start the service from the Services applet, you quickly hit the exception breakpoint, as shown below:


Click the Cancel button and a new instance of the debugger will launch, putting you right into the thick of some assembly code:


When you hit F10 a couple of times, you'll start to recognize where you are. You can select the function you want to step into using the Context dropdown:


Once you select a function from here, you'll be placed into the familiar source file that the function lives in.


Now, you can use F10, F11, set breakpoints, and do whatever you are used to doing when debugging. You can also load other source files in the project, set a breakpoint, and then allow execution to run to that breakpoint. For instance, if you wanted to step into the debugger when a client called the IAuthor::GetAuthorList() method, just set a breakpoint there, like so, and then press F5 to let the project run:


If a client calls the method, the debugger window will immediately pop up at the breakpoint, allowing you to debug the method call in action.

When you've finished and you close the workspace, save your interactive debug environment. The next time you debug the service, it will reload the old workspace with all your breakpoints intact and source files loaded.

DebugBreak and Structured Exception Handling

When you're using DebugBreak() to cause an exception in your service so that it can be debugged, be aware that if the function that calls DebugBreak() is nested inside a structured exception handling framework that wraps unhandled exceptions, things won't work in quite the way you expected. Instead of getting an exception dialog box that allows you to click Cancel to spawn the debugger, your service will halt on the breakpoint but never give you the option of spawning the debugger.

This happens because the DebugBreak() call is really an EXCEPTION_BREAKPOINT exception. If a generic exception handler traps it, the exception message gets quashed; the only time that the dialog is raised is when the exception goes unhandled. There are a couple of ways around this problem:

  • You can use a special preprocessor symbol to selectively omit the exception handler that's wrapping the DebugBreak() call. When debugging is needed, just build with the preprocessor symbol defined. For instance:

    #ifndef _DEBUGBREAK
    __try {
    #endif
       // Some code, including the call to DebugBreak()
    #ifndef _DEBUGBREAK
    } __except(EXCEPTION_EXECUTE_HANDLER) {
       // Handler
    }
    #endif
  • A second option is to fool the system into thinking that the exception is not being handled, so the message box appears. You can do that by wrapping the DebugBreak() call in its own structured exception handling block, giving the inner block the chance to handle the call. The trick is to use the Win32 API function UnhandledExceptionFilter() to tell the system that the call was unhandled and that it should therefore display the 'unhandled exception' message box:

    #ifdef _DEBUGBREAK
    __try {
       DebugBreak();
    } __except(UnhandledExceptionFilter(GetExceptionInformation())) {
       // Do nothing
    }
    #endif

DebugBreak and Non-LocalSystem Accounts

The DebugBreak() method has another little hitch. If the 'log on as' account for the service is not the LocalSystem account but a member of the local Administrators group, the following error will pop up when you hit Cancel to enter the debugger:


If the 'log on as' account is not a member of the local Administrators group (but still isn't LocalSystem), the message will be different, although it indicates the same problem:


This should look familiar to you from Chapter 7 — the problem here is that the account you're running the service under doesn't have access to the interactive user's window station and desktop. Incidentally, this will happen even if the interactive and 'log on as' accounts are the same account, because the two logon sessions get separate window stations. To solve this problem, you can do one of two things:

  • Do your debugging while the service is running under the context of the LocalSystem account, and then switch to the real 'log on as' account when the problem is solved.

  • If the problem the service is having is actually related to the account it is being run under, and it's necessary to debug the service while it is running under that account, you can run the following short program. It applies a null DACL to the interactive window station and desktop (winsta0\default). Running this program before debugging will set the DACL for the duration of the user logon, and reset it after log off. Be aware that your screensaver will not work after you do this.

    #include <windows.h>

    int main()
    {
       HDESK hdesk = NULL;
       HWINSTA hwinsta = NULL;
       SECURITY_DESCRIPTOR sd;
       SECURITY_INFORMATION si = DACL_SECURITY_INFORMATION;

       __try
       {
          // Open the window station and desktop
          hwinsta = OpenWindowStation("winsta0", FALSE, WRITE_DAC);
          if (hwinsta == NULL)
             __leave;

          hdesk = OpenDesktop("default", 0, FALSE, WRITE_DAC);
          if (hdesk == NULL)
             __leave;

          // Create a security descriptor with a NULL DACL
          if(!InitializeSecurityDescriptor(&sd, SECURITY_DESCRIPTOR_REVISION))
             __leave;
          if(!SetSecurityDescriptorDacl(&sd, TRUE, (PACL)NULL, FALSE))
             __leave;

          // Set the security descriptors on the winstation and desktop
          //  to the new one you just created
          if(!SetUserObjectSecurity(hwinsta, &si, &sd))
             __leave;
          if(!SetUserObjectSecurity(hdesk, &si, &sd))
             __leave;
       }
       __finally // Close the handles on error or as normal flow
       {
          if(hdesk != NULL)
             CloseDesktop(hdesk);
          if(hwinsta != NULL)
             CloseWindowStation(hwinsta);
       }
       return 0;
    }

Other Debugging Strategies

When it comes to debugging a service during development, there are a couple of other techniques you might want to consider. In this section, I will discuss some of the overall strategies that can help make debugging a service a little bit easier. I won't go into general debugging techniques, for which every experienced programmer has his or her own preference.

Write the Base Portions of your Code Before Creating a Service

In order to work around some of the annoyances of having to debug the service by starting it outside of the interactive debugger, you can choose another technique. Write the basic functionality of the service as a good old executable server, troubleshoot and debug the code, and then move the whole architecture into the service infrastructure. Of course, how well this technique works varies according to what your service is doing; sometimes, it's the very act of trying to put in proper SCM control handling that turns a working executable server into an ugly beast with lots of thread synchronization and communication problems.

Adding the service-specific stuff at the end of the development process usually works best for simpler services, but you should be wary of unwittingly introducing the race conditions or thread communication problems we discussed at the end of Chapter 2 by adding service code as an 'afterthought'. Furthermore, if you want to use the C++ service class from Chapter 4, then merging the code from the two sources might take a while anyway. In summary, and as so often, this technique can sometimes work well, while at other times it may be more trouble than it's worth.

Use a Switch to Start the Service as a Regular Application

The second overall strategy is to structure the code so that by using a special command-line switch, the service can be started either as an executable server or as a service. For instance, in the main loop of the code, you might select different code paths based on the switch. You might also create an additional project configuration, say Executable Server, which has different settings and #define statements so that service infrastructure code is excluded from this build type. The advantage of the special build is that checking for the switch doesn't weigh down your production code too much.

This technique has similar drawbacks to the previous strategy, although it does at least allow you quickly to isolate problems in the logic of the server from problems in the service code itself, and to debug the server logic without too much trouble. On the downside, it's sometimes a bit hard to extricate the service infrastructure and the synchronization mechanisms from the real processing logic. It takes a bit of up-front planning, but it may be worth the extra work if you can do it without overburdening the performance of the main service code.
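A minimal sketch of such a switch check might look like the following. The -exe switch name and the RunAsExecutable helper are illustrative, not taken from the book's code:

```cpp
#include <cstring>

// Hypothetical helper: scan the command line for an "-exe" switch that
// tells the process to run as a plain executable server instead of
// handing control to the SCM dispatcher.
static bool RunAsExecutable(int argc, char* argv[])
{
    for (int i = 1; i < argc; ++i)
        if (std::strcmp(argv[i], "-exe") == 0)
            return true;
    return false;   // default: start as a service
}
```

In _tWinMain() you would then branch on the result: call StartServiceCtrlDispatcher() in the service case, or run the server loop directly in the executable case.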

ATL Debugging

The default implementation provided for services by Microsoft's Active Template Library provides most of the necessary infrastructure for COM services to be started as COM servers if you desire it. However, you need to be careful about a couple of things if you intend to write the service so that it can be started both ways.

First, keep any 'real' initialization code (that's needed for both servers and services) out of ServiceMain() — ServiceMain() is skipped if the executable is started as a COM server. Second, only perform status updates outside of the Run() function. Since this is really a hard rule to follow if the service does anything interesting besides serving up COM objects, you can conditionally bracket the calls to SetServiceStatus() so that they are not called if the process wasn't started as a service.
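The conditional bracketing can be sketched like this. The m_bService flag mirrors the one in ATL's CServiceModule, but the class here is a simplified, portable stand-in rather than the real ATL code:

```cpp
// Simplified stand-in for ATL's CServiceModule: status updates are only
// forwarded to the SCM when the process was actually started as a service.
struct ServiceModule
{
    bool m_bService = false;   // set true only when launched via the SCM
    int  m_dwCurrentState = 0;
    int  m_statusCalls = 0;    // stands in for real SetServiceStatus() calls

    void SetStatus(int state)
    {
        m_dwCurrentState = state;     // always track state locally
        if (m_bService)
            ++m_statusCalls;          // real code: ::SetServiceStatus(...)
    }
};
```

Started as a COM server, the module never touches the SCM; started as a service, every status change goes through.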

Using the Event Log

One of the most useful ideas I can give you is to use the event log to track problems with API calls and your own functions. If the return code on a function is an error, jump out to the error handler, call GetLastError(), and format a message string to output what happened to the event log. Being conscientious about doing this can save lots of time tracking down problems in the long run — especially security issues, which always tend to happen at installation ("Well, it ran on my machine!").

For instance, you could use a variation of the following function to send Win32 errors to the event log, which is taken from the classes we developed in Chapter 4:

// Generic error handler that gathers the last error, looks up the description
//  string, and optionally prints the string to the event log and/or raises an
//  exception to stop the service
DWORD CService::ErrorHandler(const TCHAR* psz, bool bPrintEvent,
                                            bool bRaiseException, DWORD dwErr)
{
   LPVOID lpvMsgBuf = NULL;
   TCHAR sz[512 + 50]; // Max message len + pre-string

   if(dwErr != 0)
   {
      // Look up the system description string for the Win32 error code
      if(!FormatMessage(FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM,
                        NULL, dwErr, MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT),
                        (LPTSTR)&lpvMsgBuf, 0, NULL))
         wsprintf(sz, _T("%s failed: Unknown error %lu"), psz, dwErr);
      else
         wsprintf(sz, _T("%s failed: %s"), psz, (LPTSTR)lpvMsgBuf);
      if(lpvMsgBuf != NULL)
         LocalFree(lpvMsgBuf);
   }
   else
      // This is a custom error that is application-specific
      wsprintf(sz, _T("%s\n"), psz);

   if(bPrintEvent)
      PrintEvent(sz);   // event-log helper (name assumed from the Chapter 4 classes)

#ifdef _DEBUG
   OutputDebugString(sz);
#endif

   if(bRaiseException)
      RaiseException( dwErr, EXCEPTION_NONCONTINUABLE, 0, 0 );

   return dwErr;
}

If you require it, this function even raises an exception to stop execution and invoke your exception handler. You might call it like so:

   m_hServiceStatus = RegisterServiceCtrlHandler(m_szName, lpHandlerProc);
   if(m_hServiceStatus == NULL)
   {
      ErrorHandler(_T("RegisterServiceCtrlHandler"), true, false, GetLastError());
      return FALSE;
   }

OutputDebugString and DBMON

Sometimes, it can be handy simply to send a trace of what's happening in the service to an output window, rather than to the event log. To do that, use the OutputDebugString() function. (You saw this technique in Chapter 8; the DbMon output below is from the Quartermaster sample):

VOID OutputDebugString(
   LPCTSTR lpOutputString   // Pointer to string to be displayed
);

This function simply accepts any string to be printed to the system debug window. If the service is running in debug mode inside a Visual C++ 5.0 debug session, it causes messages to be printed to the output window. However, the debug messages can also be read from DbMon, which is a handy tool for monitoring debug messages that's shipped with the Platform SDK. It allows you to read the output without launching the debugger and stepping into the service. DbMon provides a simple console window where the messages are displayed:


This function will work whether the application is in debug mode or not, so it's usually a good idea to surround your calls to OutputDebugString() with #ifdef _DEBUG statements to compile out the code when building for release. However, it can be useful to be able to 'trace' the application when the release build is running on a customer's machine, so you may wish to provide a command-line parameter that allows an administrator to start the service in 'verbose' mode. This would enable debug strings to be output to the window when (and only when) more detailed information was desired.
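One way to sketch that 'verbose' switch is a small trace helper that only formats a line when verbose mode is on. The g_verbose flag and the TraceLine name are illustrative; in the service itself, you would hand the resulting string to OutputDebugString() so DbMon or the debugger can display it:

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical flag, set when the administrator starts the service
// with a -verbose command-line parameter
static bool g_verbose = false;

// Format a trace line, or return an empty string when verbose mode is off
static std::string TraceLine(const char* fmt, ...)
{
    if (!g_verbose)
        return std::string();
    char buf[512];
    va_list args;
    va_start(args, fmt);
    vsnprintf(buf, sizeof(buf), fmt, args);
    va_end(args);
    return std::string(buf);   // real code: OutputDebugString(buf);
}
```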


To show a message box from a service, even though it is running in a different window station from the interactive user, call the MessageBox() function, specifying the MB_SERVICE_NOTIFICATION flag. This will display a message box in the current desktop, whether or not there is a user logged on at the moment. However, making this call stops all activity in the service, so it should only be used for debugging, and certainly not for a production service.

Debugging Thread Routines

Debugging multithreaded applications can be incredibly difficult, at best. Because of the rapid context switches between breakpoints, it can be particularly hard to step through a worker thread routine that several threads are using at once — the debugger will cycle among the multiple threads that are using the same function. Furthermore, the very act of debugging changes the balance of the ordering of context switches, timings, and so forth. If your problem is a race condition, simply examining the problem in the debugger will probably solve it! When you go back to a test run with no breakpoints, the problem crops up again.

The Visual C++ debugger has one relatively simple tool to try to help you debug multithreaded apps: the Debug | Threads... window, which is only available when the application is stopped at a breakpoint. It shows the currently running threads and their IDs, as well as the names of the functions they are currently in, if available.

The screenshot below shows the threads running in the Business Object service from the previous chapter. At the time of this snapshot, there are six threads. The main thread is in CServiceModule::Start(), waiting for the control dispatcher to release it. The ServiceMain() thread is cranking along in CServiceModule::ServiceMain(). The two worker threads — the allocator and the dead session poller — are both alive in the _threadstartex threads. The other two threads are unknown, but they were probably created by the ODBC driver manager to manage the open connections.


One nice feature of this tool is that threads can be suspended from execution to help slow down context switches while you are stepping through. Simply suspend all threads that you are not interested in looking at, and the cycling will stop. The focus of the debugger can also be switched to other threads to help get you back into the right execution context (the thread with the asterisk is the one with the current debugging focus). This is handy when you launch the debugger on a running service and don't exactly know where you jumped into the code. Remember, though, that debugging a service using this technique actually changes its behavior, and it may be difficult to duplicate the errors you are seeing in non-debug mode in the debugger. That's one of the things that make debugging multithreaded applications so difficult.

Another useful way to debug complex threading interactions is to use a logging technique. You can use OutputDebugString() to show the activities of individual threads, and the order they occurred in. Alternatively, you can use stream output functions to write the information to a file. This usually works pretty well, because the debug version of the C runtime serializes calls to I/O functions, to make sure that only one function can call them at a time — the output is therefore in order. The drawback is that these serialization mechanisms themselves take time, and can change the behavior of the routine you're trying to debug.
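The serialization the debug CRT performs can be sketched with an explicit mutex. The Log helper and g_log buffer here are illustrative; a real service might append to a file instead:

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Shared, serialized trace buffer: the mutex guarantees each message is
// appended whole, in the order the threads acquired the lock
static std::mutex g_logMutex;
static std::vector<std::string> g_log;

static void Log(const std::string& msg)
{
    std::lock_guard<std::mutex> lock(g_logMutex);
    g_log.push_back(msg);   // real code might write to a file instead
}
```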

As a final word on this topic, it is sometimes difficult to know when different threads are changing the values of certain member or global variables. Visual C++'s Breakpoints window can help you here — simply select Edit | Breakpoints... and switch to the Data tab. Then, enter the expression or variable you want to break on if it changes:


This little technique can give you a handy warning when your memory is getting trashed and you don't know by whom. You can also specify raw addresses and cast to the type of the variable you want to watch, like so:


Using this technique is faster because it uses the hardware debug registers instead of doing a software evaluation.

Stopping a Dead Service

Sometimes during development and debugging, you will get a service into a state such that it cannot be stopped. You don't have to reboot! The command-line kill utility is ideal for such situations. If your service is housed in myservice.exe, for example, then just issue 'kill myservice.exe'. Remember to use the executable name, not the service name.

The other alternative is to run the 'special' version of Taskmgr.exe that I showed you in Chapter 7. This version has the SE_DEBUG_NAME privilege enabled for the process, so that the End Process command inside the task manager actually works.

Other Miscellaneous Tips

A few other tips you might find handy:

  • Install the .dbg symbol files, which are very helpful when you're debugging and need to step into a Windows API call. They are usually installed using an icon provided by the SDK or Visual C++ itself, and they're shipped with the operating system or service pack version you are using. Unfortunately, because of the rapid changes in certain software for Windows NT, they can be very difficult to keep up to date. Many system DLLs are changed much more frequently than with a new service pack, and new debug symbols are not usually shipped with them. The hallmark of an incorrect debug symbol version is the message, "No matching symbolic information found," which appears when the debugger loads a DLL.

  • If you're really serious, you can install the checked build of Windows NT. Essentially, this is NT compiled with the _DEBUG symbol, and it's available with MSDN. The checked build contains all of the ASSERTs and has debug information built in, with the inevitable result that it's substantially larger and slower than the 'normal' version. It is used primarily by device driver developers.

  • You can set breakpoints inside a system call by specifying the function to break on in the Breakpoints dialog. If the function you want to break on is a __stdcall function, simply preface the function name with an underscore, and add @nn onto the end. (nn is the number of parameters multiplied by four; if any of the parameters are WORDs, just multiply those by two.) If the call type is __cdecl, leave out the @nn.

    If you have trouble figuring a function out, you can also look for the proper name by using Dumpbin.exe on the module. Look up the parameters to the call in the Windows API documentation.

  • To debug the initialization routine of an auto-start service properly, you may need to change your load time to demand-start. If the problem goes away when you do so, you probably have a load-ordering group or dependency problem. Re-think the services the problematic service depends on to discover the problem.

  • A variety of other applications you already know about are also available to help you debug, including PView, PWalk, Spy++, and so on.

  • The Dependency Walker is a handy SDK tool that gives you a tree view of all the DLL dependencies an executable has, and all the functions the executable calls from each dependency. It also shows a great deal of handy information about the DLL itself, such as the load address, version and so forth. It is useful for debugging load address problems and DLL dependency version problems (you think the executable is using one version of a DLL, but it is really using another).
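The @nn arithmetic from the breakpoint tip above can be sketched as a little helper. DecorateStdcall is just an illustration of the rule, taking the parameter sizes in bytes:

```cpp
#include <numeric>
#include <string>
#include <vector>

// Build the decorated name for a __stdcall function: underscore prefix,
// then '@' plus the total size in bytes of all the parameters
static std::string DecorateStdcall(const std::string& name,
                                   const std::vector<int>& paramSizes)
{
    int total = std::accumulate(paramSizes.begin(), paramSizes.end(), 0);
    return "_" + name + "@" + std::to_string(total);
}
```

For example, MessageBoxA takes four 4-byte parameters, so you would set the breakpoint on _MessageBoxA@16.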


I've given you a number of tools and techniques that will help you to debug services and (hopefully) to write services with fewer bugs to begin with. I think that the best tip I can give you going forward is to single-step through your code in debug mode, even if there is nothing wrong with it that you know of. You'll catch more problems by doing that than you could ever imagine. In a service, the way to do this is to put a DebugBreak() call into the execution path early, and just start single-stepping from there. Do this at least once for the whole project and for any major code changes.

Performance Tuning

Tuning your service for maximum performance is every bit as important as making sure that it is bug-free. Often, the best way to tune a service is to get feedback on exactly what the code is doing, and how long each thing that it does takes. The Win32 SDK and the NT Resource Kit provide a variety of profiling and tuning tools that can help you to understand what your code is really up to. Then, you can use the feedback to make adjustments and optimize those areas that are least efficient.

Tuning the Code

Let's start, though, with the code. When optimizing code, it's important to know what you are tuning for. It's usually the case that you're forced to make a trade-off between execution speed and memory efficiency. These are not always mutually exclusive, but often one is achieved at the expense of the other, at least in terms of the code that you write. Your mission as a software designer is to decide which of these is more important before trying to tune at all. In fact, I would argue that the best time to decide about memory versus speed is when you are still in the design phases of the project. The decisions you make there will not only influence how you tune; they will affect you long before tuning by directing your choice of underlying algorithms and architectural strategies.

Think about the question, and commit your decision to paper. If you are on a large project, everyone should know what the strategy is, and be aware of its ramifications. When discussing the trade-offs, don't accept criteria like, "It should be as fast as possible," or, "It should use as little memory as possible." (Or worse, both!) These are non-statements. Have specific numbers in mind, like, "Service 500 requests per second," or, "Use a working set of only 18K." Then, if you're not meeting the parameters that were agreed upon, you can tune to meet your design goals.

Software development is too difficult and too time-consuming for you to be able to tune your programs to the edges of possibility, but there is nothing wrong with using profiling tools to expose and repair misuses of system resources, even if the component is functioning within specifications. Here, though, are some techniques that you can employ before resorting to those tools.

Use Fast Algorithms

This goes without saying, I suppose, but it emphasizes the point: The greatest performance gains or losses are going to come from your algorithms, not from optimizing tools. The nice thing about the profiling tools you'll see in the next section is that they will help you to find those poor algorithms and fix them. Choose fast algorithms to begin with, and performance will almost always be within a few percent of what you needed. Choose slow algorithms, and nothing can help you.


Use the Compiler's Optimizer
While many people have questioned the quality of code that optimizers produce, it is generally much better than the code that the non-optimized (debug) build would spit out. The commonest complaint from programmers is that the optimizer 'broke' code that was already tested and functioning correctly.

First of all, it is very rare for the 'optimize for size' option ever to break anything that wasn't already broken. If you are careful to unit-test your algorithms and build at Visual C++'s warning level 4, you will seldom (if ever) have a problem. Speed optimizations are a slightly different story: occasionally, I've heard of circumstances where the compiler has been over-aggressive and introduced bugs. These sometimes have to do with optimizations that keep temporary copies of variables in registers that may get changed by other threads. This can happen in situations where pointers to memory locations are passed to functions, and then the function uses the pointer multiple times — the optimizer generates assembly code that only reads the value once and never reads it again. If another thread changes the value of the memory location being pointed to, the value might not get updated from the perspective of the optimized function.

If the desired effect is that the value should never seem to change in mid-stream, you can wrap the function that causes the read operation with a critical section, or you can make a temporary copy. On the other hand, it may be that the function in question is called from inside a loop in a thread function. In this case, you probably want to know about variable changes, and you can use the C++ volatile keyword to tell the compiler not to optimize a particular variable by keeping temporary copies of it.

Presuming you're able to 'fix' your code so that optimizations will work, the question now is which kind to use. As it turns out, the two optimizations the compiler produces are very similar: a size optimization aims at cutting down the number of assembly instructions, which usually trims down execution time as well. (Contrast this with your own 'algorithmic' design decisions, which affect things at a higher level: "Do I cache this recordset from the database, or re-read it each time?")

The big difference is that when optimizing for speed alone, the compiler usually inlines functions (and uses intrinsics), which means that instead of calling out to a function, it inserts the function's code right in the middle of the calling function. However, if the increased code size pushes your code over a 4kb page boundary, this 'improvement' can actually be more costly than not doing it: be careful that your 'speed' optimization doesn't turn out to be slower than the 'size' optimization!
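In Visual C++ terms, the two modes correspond to compiler switches; the same choices appear in the Project Settings dialog. A command-line sketch (the file name is reused from the /INC example later in the chapter, and the exact set of sub-options each switch bundles varies by compiler version):

```shell
rem Optimize for size: minimize the number of instructions generated
cl /O1 /c pooler.cpp

rem Optimize for speed: permits aggressive inlining of small functions
cl /O2 /c pooler.cpp
```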

Incremental Linking and Debug Information

Before shipping, you'll want to turn off incremental linking on the release build. Incremental linking works by inserting INT 3 instructions at various points inside the executable to make extra space for inserting changes. On a re-link, any changes are just inserted into the padded areas, but this convenience can fatten up your executable size by more than 30%. (The size of Quartermaster.exe went from 30kb to 55kb with incremental linking turned on.) There is simply no need to ship a file full of padding.

In a similar vein, debug information (if it is used with a release build) should be kept in a PDB file, since it stores the debug information separately and doesn't load down the file with extra information. The only extra overhead in the executable is then a simple filename pointer to the PDB file.

Rebasing DLLs

To improve the load time of your service executable, it is always a good idea to rebase the dependent DLLs so that they don't conflict and end up being relocated dynamically by the Win32 loader. You can do this either by using the /BASE:nnnn linker switch when you build your own DLLs, to move them to an unused address, or by using the Rebase utility. Rebase takes a list of the files in a project and figures out where they should all load so that they don't conflict with the other DLLs in the process. Usually, rebasing is done as a build step after the link step. Take a look at the SDK documentation for Rebase for more information.
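As a sketch, the two approaches look like this (the addresses are arbitrary examples and the DLL names are hypothetical; see the linker and Rebase documentation for the exact switch spellings):

```shell
rem Option 1: choose a non-conflicting base address at link time
link /DLL /BASE:0x61000000 /OUT:MyHelper.dll helper.obj

rem Option 2: let the Rebase utility assign non-overlapping addresses
rem to a whole set of DLLs in one pass, starting at the given base
rebase -b 0x61000000 MyHelper.dll MyOther.dll
```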

Tuning Tools

In this section, we'll examine some of the different tools that are available to help applications run faster and more efficiently. As it turns out, there are a whole host of different tools and gadgets that give detailed information about applications while they're running. Most of these tools are categorized as profilers, or tools that give you information about:

  • How long an application takes to run a function

  • The number of times a function was called

  • The percentage of function coverage, or the overall percentage of code that was hit when the application ran

We'll also look briefly at the NT Performance Monitor and examine how to use it to expose resource bottlenecks in the system. Throughout this section, I'll use the Quartermaster example from Chapter 8 as a guinea pig. Most of the tools we will use are installed by the January '98 version of the Platform SDK, so if you haven't installed that yet, you will need to do so in order to work along with the book.

The Visual C++ Profiler

The Profiler is a handy tool that's available from within Visual C++. In fact, Visual C++ is just wrapping up the command line interface available in the PREP, Profile and PLIST tools (Microsoft's command-line preparation, run, post-run, and formatted output tools for doing code profiling), automating the run cycle, and formatting the results into a special pane in the Output window.

To use the Profiler:

  • Go to the Project | Settings... dialog and then to the Link tab. Select General in the dropdown and check the Enable profiling box. You may also want to check the Generate mapfile box while you're in there (you'll see why very shortly).

  • Check the Generate debug info box on the same tab. (This enables you to do line-by-line profiling as well.)

  • Select the Build | Profile... item from the menu to bring up the Profile dialog:


From here, you can select several options:

  • Function timing. Outputs function times and counts for specified functions.

  • Function coverage. Outputs degree of code coverage in specified functions.

  • Line coverage. Profiles each line, with counts, and shows code coverage (that is, the percentage of code that was 'covered' by the profiling run; usually this will be a number between 75% and 100%, depending on what activity you were profiling and the amount of error-handling code you have).

  • Merge. Advanced. Allows merging of the results of several runs into one output sample.

  • Custom. Allows you to use a custom setting file, written as a batch, to drive the profiler.

  • Advanced settings. This box allows you to specify any of the PREP command line options as additional settings. The PREP options are shown below:


For instance, you can use the /INC option to restrict profiling to particular files or source-code lines. To profile all the source lines in a particular module, specify its OBJ file:

     /EXCALL /INC pooler.obj

Alternatively, the following options will profile lines 0-50 of the source file:

     /EXCALL /INC pooler.cpp(0-50)

Running the Visual C++ Profiler actually executes a series of commands, like so:

  • A PREP (phase I) call, to set up the profiling run and specify which files to run against, etc.

  • Profile, to execute the profiling session.

  • PREP (phase II), which assembles the results.

  • PLIST, to display the results from the output (.pbt) file.
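From the command line, that same cycle looks something like this (the file name is a placeholder; the switch details are covered in the PREP, Profile, and PLIST documentation):

```shell
rem Phase I: instrument the executable for function timing
PREP /OM /FT MySvc.exe

rem Execute the profiling session; produces MySvc.pbo
PROFILE MySvc

rem Phase II: merge the run's results into MySvc.pbt
PREP /M MySvc

rem Format the .pbt file into a readable report
PLIST MySvc > results.txt
```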

Profiler has a great many other options, which are best discovered by exploring the documentation.

Tips on Using Profiler

Usually, it is not very helpful just to do a function profile of an entire application run. The information that gets returned is too bulky and too difficult to sort through. A better approach is often to profile specific functions, to see whether they are giving the degree of performance expected. Choose the place you think a bottleneck might be occurring and start from there.

To do that, you need to use the /SF 'advanced' setting, which allows you to profile a specific function by specifying its name. Be aware, though, that you can't simply use the plain C function name when you're compiling a C++ program, and this is where the mapfile I recommended you build comes in handy. Open the .map file, find the mangled name of the function, and specify it as the parameter, as shown below:


To make best use of this tool, you should know that while the line-profiling and line-counting features work fine in a multithreaded environment, the timing functions can be problematic when the application is multithreaded. Generally, avoid trying to profile functions whose execution cuts across multiple threads: their processing may be interrupted, producing hard-to-interpret results. In addition, functions profiled from the main thread when most of the work occurs in worker threads will not have the proper degree of timing granularity. A better idea is to profile at most a single thread's work at a time. This is possible by using the /SF flag to specify the thread's starting function, while including all the program's functions.

Incidentally, there is a macro called Profiler.xlm available in the Vc\Bin directory that you can use to do more sophisticated reporting in Excel based on the contents of the PBT file produced.

Profiling a Service

Now that I've explained a few of the niceties of profiling, it's time for an admission: profiling an NT service while it's actually running as such is a bit more trouble than profiling the executable alone. This is a good reason to check your algorithms out in a standard executable before building a service wrapper around them, if possible. However, if you do need to profile a running service, you can do it like so:

  • Add a couple of new system environment variables:

    __ProfilePBI=<full path of executable>
    __ProfilePBO=<full path of executable>
  • Make sure that Profile.dll is in the path.

  • Build the executable with profiling turned on in the linker settings.

  • Figure out where to profile from. If the work you are profiling occurs in ServiceMain(), you'll need to specify ServiceMain in the command below. Otherwise, specify the name of the worker thread.

  • Run the PREP command to create an EXE modified for profiling, like so:

    PREP /om /ft /sf <name of the thread function to time>
  • Copy MySvc._xe to MySvc.exe.

  • Run the service and stop it when timing is complete.

  • Run the following commands to prepare the output file (MySvc.pbt):
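With the profiler's command-line tools, these steps are typically the phase II merge and the report pass (file names here assume the MySvc example from the PREP step above):

```shell
rem Merge the profiling results (.pbo) into MySvc.pbt
PREP /M MySvc

rem Format MySvc.pbt into a readable report
PLIST MySvc > MySvc.txt
```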



FIOSAP

One of the more interesting profiling tools available is FIOSAP, the File I/O and Synchronization Profiler, which looks specifically at synchronization and file I/O calls. It is particularly useful for services, because it tabulates all the activity going on in all of the synchronization objects. This tool shows timings per call type, the number of calls, and the average time spent per call.

FIOSAP works by routing each call that would normally go to Kernel32.dll through a different library known as Fernel32.dll, which allows the profiler to track statistics on the time spent in each call. To activate the FIOSAP profiler, issue the following command in the directory containing the executable to be profiled:

apf32cvt fernel32 <appname>.exe

To undo the operation, use

apf32cvt undo <appname>.exe

Each command is only good for a particular build of the executable: if you build, use FIOSAP, and then build again, you'll have to re-run the first command, because when the executable is deleted and rebuilt, the mapping to Fernel32.dll goes away.


When the application terminates, a file called fernel32.end is created in the directory where the executable is located. It's a simple text file with formatted output containing timings and other information about what the application did when it made calls to various synchronization objects. An example of the output is shown below; I ran the profile that generated this file from the Quartermaster example in Chapter 8.

              (Note: All times are in microseconds)
 1. Event Profiler
Event:           Type: Manual Reset
     Operation      |  Total   |Number of | Average  |Successful
        Name        |   Time   |operations|   Time   |  Waits   
Overall             |     79934|       464|       172|         - 
Create              |        38|         1|        38|         - 
Set                 |        17|         1|        17|         - 
Wait                |4387779010|       462|   9497357|         2.
Single              |4387779010|       462|   9497357|         2.
Statistics for all event activity  (Number of handles used: 1)
     Operation      |  Total   |Number of | Average  |Successful
        Name        |   Time   |operations|   Time   |  Waits   
Overall             |     79934|       464|       172|         - 
Create              |        38|         1|        38|         - 
Set                 |        17|         1|        17|         - 
Wait                |4387779010|       462|   9497357|         2.
Single              |4387779010|       462|   9497357|         2.
 2. Semaphore Profiler
Semaphore:               Max Count: 255
     Operation      |  Total   |Number of | Average  |Successful
        Name        |   Time   |operations|   Time   |  Waits   
Overall             |      1009|     14993|         0|         - 
Create              |        45|         1|        45|         - 
Release             |8590014710|      7494|   1146252|         - 
Wait                |3.86766e10|      7497|   5158950|      7497.
Single              |3.86766e10|      7497|   5158950|      7497.
Close               |        11|         1|        11|         - 

Statistics for all semaphore activity  (Number of handles used: 1)
     Operation      |  Total   |Number of | Average  |Successful
        Name        |   Time   |operations|   Time   |  Waits   
Overall             |      1009|     14993|         0|         - 
Create              |        45|         1|        45|         - 
Release             |8590014710|      7494|   1146252|         - 
Wait                |3.86766e10|      7497|   5158950|      7497.
Single              |3.86766e10|      7497|   5158950|      7497.
Close               |        11|         1|        11|         - 

Notice that for each synchronization object, there is a table showing how much time was spent in each operation on the object. In the case of the semaphore, that means creating, releasing, waiting for, and closing it. The table also shows the raw number of calls to each function, and the average time per call. Finally, it shows how many successful waits occurred. The objects are listed by name, if they were given names when created (I chose not to). There is also a summary table for each type of object that was tracked.

Unfortunately, since critical sections are not kernel objects, no timings for critical sections are available using FIOSAP.

The nice thing about FIOSAP is that you can even use it on a running service: simply set up the apf32cvt mapping, then start the service. When you have completed your testing, stop the service and the summary file will be created. FIOSAP is one of the few profilers that you can use on your service while it is actually working in service mode, and it provides very useful information on where the service might be having performance problems because of contention for resources.

Interpreting the Results

If you are unfamiliar with what the Quartermaster does, or you could use a reminder, now may be the time to take a quick glance at Chapter 8 to understand what is going on.

In profiling the Quartermaster, I had a couple of things in mind: First, I wanted to make sure that the data structures used inside the allocation mechanism would not bottleneck under a very high load (several thousand transactions per second). So, I was interested in raw performance. I was also interested in seeing what impact changing the number of available resources would have on allocation and the wait times involved. To determine these things, I profiled with no 'check-out' time (HANDLEALLOCCOST = 0), and with two different 'available resource' numbers.

The sample figures above were generated in a simulation that contained 25 client threads, each of which ran 300 iterations of the function. It was given 20 resources to start off with, and each resource was 'held' by each client for 250 milliseconds. The handle re-allocation mechanism (the ability to create more handles on the fly when the handle count equals zero) was left intact and functioning, and the simulation took around one-and-a-half minutes to run. You can see that the Quartermaster served these clients fairly well; each semaphore wait was successful.

The second test run that I performed was radically different from the first. It had the same 25 client threads running 300 iterations, and the resource handles were held on to for 250ms each time. However, this time I allowed only 2 resources and turned off the dynamic reallocation mechanism. The same test run took almost 15 minutes to complete, and FIOSAP produced the following output:

2. Semaphore Profiler
Semaphore:               Max Count: 255
     Operation      |  Total   |Number of | Average  |Successful
        Name        |   Time   |operations|   Time   |  Waits   
Overall             |   2000002|     11142|       179|         - 
Create              |        35|         1|        35|         - 
Release             |     92444|      3652|        25|         - 
Wait                |2.71781e10|      7488|   3629554|      3657.
Single              |2.71781e10|      7488|   3629554|      3657.
Close               |        11|         1|        11|         - 

You can see the rather predictable results. There were only 3,657 successful waits out of 7,488 attempts. Most of the waits for the semaphore (set for a maximum of two seconds) timed out, and would have returned an error to the caller.

For the third test run, I kept all variables the same as for the second run, except that I assumed there was no real cost associated with work done on the handle. I did this to test whether the allocation data structures themselves (rather than the fact that each client holds a resource for a relatively long period) were the bottleneck. The parameters were as follows:


This scenario took roughly one second to run, and the following FIOSAP results were generated:

Semaphore:               Max Count: 255
     Operation      |  Total   |Number of | Average  |Successful
        Name        |   Time   |operations|   Time   |  Waits   
Overall             |2147483910|     15003|    143136|         - 
Create              |        68|         1|        68|         - 
Release             |5.28570e10|      7501|   7046660|         - 
Wait                |7.75390e10|      7500|  10338544|      7500.
Single              |7.75390e10|      7500|  10338544|      7500.
Close               |2147483661|         1|2147483661|         -

As you can see, all the waits were successful, even under this highly loaded scenario. This validates the assumption that the primary reason for bottlenecking in the Quartermaster will be that the clients are checking out resource handles for proportionally long periods of time, rather than that the allocation mechanism is not fast enough.


ApiMon

The ApiMon tool has a similar use to the FIOSAP profiler, but it counts and times the calls that an application makes to the various system API functions. It also has a nice graphical interface that allows you to view the final analysis as a list and to sort on the different columns. You use the tool by opening ApiMon and loading an executable. Next, press the start button, and the application will run. As it runs, you can watch the timings and counts for the various API calls go by in the window. Another pane shows the DLLs upon which the executable is dependent.


To illustrate the use of ApiMon, I've included screenshots of several scenarios that show the results of changing various parameters. Once again, I've used the Quartermaster application as my guinea pig, but make sure that you undo the mapping to fernel32.dll before you try this example. The first of these test runs is the same as the third of the FIOSAP samples above — 300 iterations on 25 client threads, with no work done on the handles. The output window is below, sorted by the Time column:


You can see that most of the application's time is spent in single object wait operations. At the very top of the list is the operation in which a client waits for the semaphore before grabbing a free handle. (I know that because that's the only single object wait call I make in Quartermaster.exe!)

The next scenario involves 25 clients being checked out for a 250ms period, for 100 iterations each. As you can see from the output, there was a lot more waiting going on, for a much longer period of time. Clients were simply bottlenecked here, waiting for other clients to give back the handles:


The parameters for the third and final run were the same as those for scenario 1: 25 clients, no work time, and 300 iterations. However, instead of a standard critical section, a critical section initialized with InitializeCriticalSectionAndSpinCount() was used. As you can see, not only is this radically faster in terms of wait times, but it's also noticeably quicker to the naked eye, by at least a second or two.

My evaluation is that, with a spin count, threads contending for the heavily burdened internal data structures are rarely forced into an expensive kernel-mode wait (the way they would be each time an ordinary critical section blocked), and so less time is spent waiting for the data structures to free up. This translates into less time waiting for a lock on the semaphore.


I present this example not to prove to you that the spin count critical section is faster than an 'ordinary' one, but to show you how having real timings on real function calls can have a big effect on how well you can tune the service. The only drawback to ApiMon (apart from the fact that you have to restart it each time you want to run another profile) is that you can't use it on a running service. Because of this, critical data structures that need tuning should be profiled separately, outside the service, possibly using a custom load-generating harness to pound on them (like the one the Quartermaster has).

The Performance Monitor (PerfMon)

The performance monitor is a rich tool that allows you to view and chart any counter or other feedback metric exposed by any process on the system. I'm not even going to attempt to describe all the functionality of the tool here; I just want to show you how to analyze your service in a couple of areas that are easy to understand. After that, you'll be on your own to explore it at your leisure. The Windows NT Resource Kit includes complete documentation on using the tool and on the meanings of each of the standard counters.

Working Set

The first of the two areas I would like to explore is the working set. I chose this because it demonstrates one of the areas where your service can be tuned not only to improve its performance, but also to use fewer precious system resources, thereby making more resources available to other applications.

The working set of a process is the physical memory assigned to it by the operating system. It contains the code and data pages most recently referenced by the process. Whenever a process needs code or data pages that are not in the working set, a page fault occurs, causing the system to load the pages from virtual memory. The larger the working set requirement (the more data and code pages required to run a process during 'normal' operation), the less likely that the system will keep all the code and data pages for the process in memory. In turn, this means that there will be proportionally more page faults occurring.

When memory is not a scarce resource on the system, the memory manager leaves older pages in the working set when new pages are loaded, growing the size of the working set. As free memory becomes scarcer (when it falls below a certain threshold), the memory manager will move older pages out of the working sets and perform page replacement. Under these circumstances, a process can incur many expensive page faults, substantially slowing its operation.

An efficient application's working set can shrink quite small (within reason) without causing an excessive number of page faults. Generally speaking, the more you can arrange for data to be stored in the sequence in which it is used, the smaller the working set can be, because pages that are needed together tend to be loaded together.

Let's examine the PerfMon counters necessary for looking at the working set of our Quartermaster application. We will chart the following:

  • The process's working set

  • The number of page faults per second in the process

  • The number of free bytes available in memory


The PerfMon trace above shows that the working set is relatively small, though not extremely so, at about 1Mb. Because the system is not heavily loaded (it's a 128Mb system), there is really no page faulting going on at all. I then stressed the system memory by opening about twenty other applications, and recorded the following PerfMon results:


You can see that when memory requirements increased (as demonstrated by the top trace), the working set shrank, as shown by the downward spike in the middle trace at around the center of the chart. At the same time, the number of page faults increased, indicated by the upward spike at the same point in the bottom trace. Later on, at the far right of the screen, the system attempted to shrink the working set some more, but was greeted by a sharp rise in the rate of page faults, which it countered by growing the working set again. After that, the process hummed along with no more page faults, so 364,544 bytes would appear to be a ballpark estimate of the minimal working set size.

It is difficult to say what constitutes a 'good' working set for your service, because it mostly depends on the sizes of internal data structures, code size, and so forth. For guidance, try to compare the working set size for your service to that of a known entity, such as SQL Server. The way it's configured on my machine, SQL Server uses about 8Mb of working set. If, for instance, you wrote a service on the scale of the Ping Monitor in Chapter 4 and determined that its working set was 30Mb, then you'd probably draw the conclusion that tuning was needed!

If, having discovered the size of the working set, you feel that it needs some tuning, you can use the Working Set Tuner that's available with the SDK. (This comprises several applications whose names begin with Wst, stored in the bin\winnt directory). This tool attempts to tune your EXE file by rearranging the functions in the executable image so that they appear in an order that reflects which ones are used together most often. By placing these functions close to one another, code page paging is reduced.

The tuner does this by analyzing the application and tracking how often and in what order various functions are used. It then figures out an ordering sequence based on the statistics. You have to be sure to construct your usage scenario carefully, because the order in which you call the functions will determine the ordering of the pages in the working set. Choose your most common usage scenario, and perform it realistically. In a service like the Business Object, the scenario is fairly straightforward: start the service and let it instantiate COM objects for clients. A GUI application would be much more difficult to model.

Setting up the tuner for a run test is a several-step process. Instead of reprinting those steps here, I'll refer you to the SDK documentation for precise usage.

Processor Usage

The second metric I want to measure is the load of the service or application on the processor. If the service uses too much processor time, it is probably hogging cycles from other applications. We want our services to be efficient! The way to track processor usage is to chart the following items; the results are shown in the screenshot:

  • The system's processor queue length. This shows the number of threads that are waiting for time on the processor; a persistent queue indicates that the processor is a bottleneck.

  • The percentage of processor time being taken by a process.

  • The 'thread count' counter, just as a baseline to make sure that things were working properly.


This paints a clear picture. Even with a substantial thread count (the flat line in the middle), the service is using an insignificantly small percentage of processor time (the flat line at the bottom). The queue length spikes that you can see occurred when I opened another instance of PerfMon and switched tasks between several different applications.

The next chart tells a very different story. This time, I ran 30,000 iterations on 25 client threads with 5 initial sessions and no SLEEP time. The application took about 30 seconds to complete its test run:


As you can see, the process ate up all the processor time for the duration of its execution, and this is entirely due to the fact that there was no wait state between calls. The length of the queue also built up substantially. If you introduce even a 1-millisecond sleep time between calls, utilization of the processor drops back to virtually zero.

It's very tough to state any general rules about what might be tunable in your service if, after checking the results from PerfMon, it seems to be hogging the processor. The problem could truly be any of a number of things, but 'busy loops' (while loops that spin, perhaps waiting for a flag to get set, with no wait call in them) should certainly be avoided if possible. Alternatively, you may have too many worker threads in your service, loading down the processor with excessive context switching. If you determine that the service uses too much CPU time, you can home in on the culprit by using the Thread | % Processor Time counter to narrow the high usage down to a few threads in the service. Knowing that, it may be easier to find the offending code.

Again, I show this as an example of how to use the tools to help you find problems in your applications, rather than to prove anything significant about the test subject. If your application used different resources, such as file handles or network pipes or sockets, you would want to do PerfMon statistics on those counters as well as the ones examined here.


Summary

In this chapter, we've reviewed the tools and strategies that you can use to debug and tune your services. While the coverage has been, of necessity, quite discursive, the usefulness of these techniques in improving the performance of the Quartermaster service should be obvious.

About the Author

Kevin Miller works for Microsoft Corporation as a Consultant in the Southwest District MCS practice in Phoenix, Arizona. He is a Microsoft Certified Solutions Developer, has an MBA in Technology Management, and an undergraduate degree in Philosophy. Kevin works with a variety of Fortune 500 clients, helping them to architect and develop systems using the latest Microsoft technologies.

© 1999 Wrox Press Ltd.

We at Microsoft Corporation hope that the information in this work is valuable to you. Your use of the information contained in this work, however, is at your sole risk. All information in this work is provided "as is", without any warranty, whether express or implied, of its accuracy, completeness, fitness for a particular purpose, title or non-infringement, and none of the third-party products or information mentioned in the work are authored, recommended, supported or guaranteed by Microsoft Corporation. Microsoft Corporation shall not be liable for any damages you may sustain by using this information, whether direct, indirect, special, incidental or consequential, even if it has been advised of the possibility of such damages. All prices for products mentioned in this document are subject to change without notice.

International rights = English only.

© 2014 Microsoft. All rights reserved.