SQL Server 2000 Operations Guide: Monitoring and Control

Article
01/28/2010

Monitoring and Control

Updated : October 26, 2001

Abstract

This chapter presents the tools and procedures used to perform proactive monitoring in an enterprise data center. Alerts, the current activity window, the event log, and other monitoring technologies are discussed. Reactive monitoring and problem solving techniques are also presented. After implementing these recommendations, the DBA will have a monitoring regimen that is tailored to the individual operating environment.

Introduction

Data center operations is at the heart of Information Technology (IT) systems and bears the responsibility of ensuring that all of the IT services promised and committed to by IT management to its customers are delivered according to the specific service level agreements (SLAs). These SLAs are negotiated with and agreed to by IT management and its various customer business units (see Chapter 8). To perform this very important task, it is imperative that the operations team implement a capability commonly referred to as service monitoring.

Service monitoring allows the operations team to observe the health of a service in real-time. Real-time, as defined here, is essentially timing dictated by the context of the source of data and the need for management information or corrective action. That is, determining issues such as whether a service is up, down, or marginal. While important, however, this type of observation is purely reactive in nature. In today's competitive marketplace, it is not enough for service monitoring to allow operations to merely react to service problems in the IT environment. In addition, it is extremely important that service monitoring provide the operations team with the ability to observe service behavior proactively. Proactive monitoring means finding problems and potential service outages before they occur. Providing operations with both a sound reactive and proactive monitoring capability is a service monitoring best practice.

Knowing the current health of a service or determining that a service outage may occur is not worth much unless the operations team has the ability to do something about it. This may mean taking action themselves, or at the very least, notifying the appropriate group that a specific type of reactive or proactive action needs to occur. This is what is meant by the term "control." When combined and implemented properly, service monitoring and control gives the operations team the critical capability it needs to ensure that service levels are always in a state of compliance. Without proper service monitoring and control, SLAs are not enforceable, are unproductive and therefore unusable.

This chapter aims to provide you with all of the knowledge you need to create a comprehensive monitoring system. This chapter looks at the process you need to go through, the technologies you need to evaluate, and the counters you should ultimately monitor.

Design Considerations

A good monitoring system bridges the gap between the development and operations teams. It provides information that is useful to both teams. The monitoring system suggested in this chapter attempts to please both ongoing operations and the application development team. This assumes that development personnel are not performing ongoing application monitoring. This separation of responsibilities has guided the design of the following monitoring system. The system described is intended to provide the kind of information a developer would need to fix a problem, while making the best use of the operations personnel that are in the field watching the application.

Resource Requirements

At a minimum, a database administrator (DBA) and a staff of monitoring personnel are required for a proper monitoring system. The monitoring personnel must be familiar with the monitoring tools used on-site. The DBA must be familiar with the application being monitored, as well as the various analytical tools used in the investigation of application problems.

System Prerequisites

This chapter does not require an existing monitoring system to be in place (the development of your monitoring system will be discussed over the course of the chapter). It does require that you be familiar with each of the components that make up your application and the level of service that you are required to provide as documented in your SLA.

Process Flowchart

Instituting a monitoring system for Microsoft® SQL Server™ is a relatively simple process. There is a requirements gathering phase, a design phase, and a deployment phase. As you go through the phases you will be required to make decisions about the technologies to employ, the counters to monitor, and the method of deployment. If you get any of these decisions wrong, you can always go back and rework your solution. Your monitoring solution should be an ever-evolving system.

The one element of your system that you must get right is the operational regimen. The regimen you employ will determine the success or failure of your system. If operational personnel show resistance to the monitoring procedures you have developed, application problems may go completely unnoticed. To ensure that you implement a good operational regimen follow these two guidelines:

Avoid complacency.
Be proactive.

Avoid Complacency

If it is possible for monitoring personnel to skim over your reports or status indicators without checking the details, then over time, that is what will happen. Do not allow this kind of complacency to set in. Encourage your monitoring personnel to act as if they were doctors, and to treat your reports and indicators as if they were symptoms. Allow the monitoring personnel to make conclusions regarding the application's health. Though the application may appear to be healthy, allow the monitoring personnel to discuss at length any issues that occur, even if the issue is very small. It will engage the monitoring personnel in the process and prevent them from missing the details. Receive this feedback from the monitoring personnel in the form of either a regularly scheduled meeting or e-mail.

Be Proactive

Ensure that your monitoring personnel are rewarded for heading off problems before they occur. One way to give monitoring personnel this ability is to provide them with tools that allow them to further investigate symptoms as they appear. While the initial symptom might have indicated a temporary network outage, further investigating might have found a faulty hard disk drive in the Redundant Array of Independent Disks (RAID) cage. If your monitoring personnel are not encouraged and enabled to do this kind of investigating, you may not hear about this problem until the entire disk array goes down.

After you establish a good regimen, you are ready to develop the other components of your monitoring solution. First, you need to document the need that the solution meets. This stage in the development process should not be skipped. Understanding and documenting your system requirements helps set the scope of your project. Will this be a one-day project or will you be working on this for the next two months?

Next, it is imperative you design a solution that best meets the need you previously documented. The most important elements of your design will be the regimen and the technologies that you use. Make sure to test the final solution before progressing to the next phase.

Finally, you need to determine the best way to execute your solution. You should not disrupt the operations of your application in any way. You should also have a contingency plan, in case the introduction of the monitoring solution does cause a problem and must ultimately be removed.

The flowchart shown on the following pages documents this development process from start to finish (see Figures 5.1, 5.2, and 5.3).

Figure 5.1: Monitoring and Alerting Solution Design and Development Flowchart

Approaches to Monitoring

The issues surrounding application monitoring can easily be viewed as religious in nature. While one person could be convinced that a large number of counters need to be monitored on a daily basis, another could just as easily be convinced that excessive memory paging and a poor cache hit ratio are the only indicators that need to be watched. Each person will have drawn their respective conclusions from personal experiences. Each will contend that their system works best. Which system will you choose?

The monitoring system that you implement simply needs to produce information that allows operations and development personnel to be notified of problem states before they impact service. If your application is rarely down, then you might not need to spend too much time analyzing health data. If you have hundreds of servers to monitor but no operational support, necessity may dictate that you limit your monitoring regimen. On the other hand, if your application has frequent code changes, is located in a chaotic data center, or is the center of a critical business process, it might be a good candidate for a robust, proactive monitoring system.

This chapter outlines two monitoring approaches. The monitoring approach you select will depend completely upon the environment you work in. Determine how chaotic your data center is, how much down-time your application experiences and how important your application is. Use these factors to make your choice. If you find that the approach you selected is not helping you provide continuous, reliable service, you should refine your approach until it provides you with the appropriate information. This may be the other approach that is outlined here, or a hybrid of the two approaches.

The monitoring approaches that are outlined in this chapter are generally referred to as proactive monitoring and exception management.

Proactive Monitoring

There are numerous reasons to monitor your application in a proactive manner. Foremost is that proactive monitoring has the best chance of catching errors before they occur. This type of monitoring can usually keep you out of "fire fighting" mode (a mode in which you are continually chasing down new errors as they occur). Finally, this kind of monitoring provides you with a considerable amount of information concerning your operating environment and the way in which your application works in that environment. This will allow you to make better decisions regarding your application in the long run. So why would anyone choose not to proactively monitor an application?

One reason is that many DBAs do not have the time required to do this kind of monitoring. Each individual application in your data center must be closely analyzed on a daily basis if you are to catch problems before they occur. If you are the only person responsible for hundreds of applications, this is simply not going to occur.

There are also deep philosophical divisions between many DBAs. Where some would fight for the right to examine graphs and charts, others feel this is a waste of time. You may be completely happy doing nothing more than responding to errors as they appear. If this is the case, proactive monitoring may not be for you. If you would like to learn more about your application, however, and have the time to do so, it is recommended that you employ a proactive approach to monitoring.

Proactive Monitoring Regimen

Where does it all begin? First, you will need to set a baseline for each application. You can do this prior to reviewing the technologies involved or developing your regimen, so that developing a baseline is a good place to start. With this baseline, you will begin to become familiar with the operational behaviors of each application.

The goal of baseline monitoring is to document clearly how the application acts under normal conditions. Does the application eat up all the CPU cycles made available to it? Does it need so much memory that paging occurs? Are all the user requests serviced? After you understand how the application acts under normal load you will be able to answer these questions. It is important that you understand the application in this way so that future problems can be identified when they occur.

You must be able to identify problems that can arise under normal load. Furthermore, these problems need to be divided into two groups: one group of problems that require a response and a group of problems for which there is no response. The group of problems for which there is no response, a group that will be referred to as "background problems," can cause an enormous amount of disruption to the monitoring process if not properly identified.

Background Problems

One of the key elements of any monitoring system is the response the monitoring personnel must take when a symptom of a problem appears. It could be argued that this is the most critical element of the system, because if the response has no impact on the function of the application, all of the monitoring prior to the response was wasted effort. Unfortunately, all possible symptoms and their appropriate responses cannot be documented in advance. Many responses are learned responses, developed by the monitoring personnel over the lifespan of the application. This process of learning, then, could be one of the most crucial elements in your monitoring system.

To enable this process of learning, the one thing you must do from the very start is to document as many background problems as possible. If you can catch as many "do not respond" scenarios as possible at the beginning of this process, you will ensure that your monitoring personnel do not waste their time when dealing with your application. You want your monitoring personnel continually to come across new and interesting situations to which they must respond. You want to encourage the attitude that troubleshooting your application will lead to operational improvement. You do not want the opposite to occur. You do not want your monitoring personnel to spend countless hours researching symptoms that simply end up needing no response. If your monitoring personnel learn on the job that a large number of symptoms exhibited by your application simply need to be ignored, it will cause the monitoring personnel to not pay as close attention to your application and may result in them missing a real error.

To keep your monitoring personnel proactive and alert you want them to know in advance which symptoms to ignore. If the learning process simply teaches them that there are many new symptoms to ignore, it is likely your entire application will be ignored as well. Document the background problems at the start as part of your baseline.

Creating The Baseline – Suggested Counters

The baseline should begin with a single System Monitor chart accompanied by extensive documentation interpreting the results. This section will suggest the performance counters to include in the chart, along with the steps needed to create the chart. Many elements of the chart, such as the time frame, will depend on your environment. Also, because you know your application better than anyone, the interpretation of the chart will be up to you.

The System Monitor chart can be created in real-time (and saved as an .htm file) or based on a performance log file. It is recommended that you base your chart on a log file. This allows you to record statistics for as long as you want. When you are finished, you can view either a specific time slice or the entire sample. If you simply save a System Monitor graph (based on current activity) as an .htm file, you see only the data that was most recently sent to the screen.

There are two configuration options for which you must determine the correct value. The first is the sampling interval. Use of the performance counters should not add any overhead to your SQL Server (except for the disk I/O required to record them). If you find that your baseline data is taking up too much disk space, you can use a larger interval between data samples. Be warned, however, that the larger the interval, the less accurate your graph.

Secondly, you must determine which server you will use to monitor your SQL Server. You can monitor remotely, but use of the counters across a network connection for an extended period of time could congest traffic on your network. If you have space on your SQL Server for the performance log files, it is recommended that you record performance log information locally.

Because of these performance concerns, the use of performance counters needs to be properly implemented. You will need to test the number of counters and frequency of collection that best suites your environment. For the initial baseline, however, it is recommended that as many counters as desired be used with the highest frequency available.

The following counters should be included in your baseline:

Memory – Pages/sec. Defined as the number of pages read from or written to disk to resolve hard page faults. (Hard page faults occur when a process requires code or data that is not in its working set or elsewhere in physical memory, and must be retrieved from disk). This counter was designed as a primary indicator of the kinds of faults that cause system-wide delays. It is the sum of Memory: Pages Input/sec and Memory: Pages Output/sec. It is counted in numbers of pages, so it can be compared to other counts of pages, such as Memory: Page Faults/sec, without conversion. It includes pages retrieved to satisfy faults in the file system cache (usually requested by applications) non-cached mapped memory files. This counter displays the difference between the values observed in the last two samples, divided by the duration of the sample interval.
Network Interface – Bytes total/sec. Defined as the number of bytes traveling over the network interface per second. If this rate begins to drop, you should investigate whether or not network problems are interfering with your application.
PhysicalDisk - Disk Transfers/sec. Defined as the rate of read and write operations on the disk. You should define a counter for each physical disk on the server.
Processor - % Processor Time. Defined as the percentage of time that the processor is executing a non-Idle thread. This counter was designed as a primary indicator of processor activity. It is calculated by measuring the time that the processor spends executing the thread of the idle process in each sample interval, and subtracting that value from 100 percent. (Each processor has an idle thread that consumes cycles when no other threads are ready to run). It can be viewed as the percentage of the sample interval spent doing useful work. This counter displays the average percentage of busy time observed during the sample interval. It is calculated by monitoring the time the service was inactive, and then subtracting that value from 100 percent. After all of the processors devoted to SQL Server have reached 100 percent utilization, it is likely that end user requests are being ignored.
SQLServer:Access Methods - Full Scans/sec. Defined as the number of unrestricted full scans. These can either be base table or full index scans.
SQLServer:Buffer Manager – Buffer Cache Hit Ratio. Defined as the percentage of pages that were found in the buffer pool without having to incur a read from disk. When this percentage is high your server is operating at optimal efficiency (as far as disk I/O is concerned).
SQLServer:Databases - Log Growths (run against your application database instance). Defined as the total number of log growths for the selected database.
SQLServer:Databases Application Database - Percent Log Used (run against your application database instance). Defined as the percentage of space in the log that is in use.
SQLServer:Databases Application Database - Transactions/sec (run against your application database instance). Defined as the number of transactions started for the database.
SQLServer:General Statistics - User Connections. Defined as the number of users connected to the system. Dramatic shifts in this value should be researched.
SQLServer:Latches – Average Latch Wait Time. Defined as the average latch wait time (in milliseconds) for latch requests that had to wait. If this number is high, your server may be facing contention for its resources.
SQLServer:Locks – Average Wait Time. Defined as the average amount of wait time (milliseconds) for each lock request that resulted in a wait.
SQLServer:Locks – Lock Waits/sec. Defined as the number of lock requests that could not be satisfied immediately and required the caller to wait before being granted the lock.
SQLServer:Locks - Number of Deadlocks/sec. Defined as the number of lock requests that resulted in a deadlock.
SQLServer:Memory Manager - Memory Grants Pending. Defined as the current number of processes waiting for a workspace memory grant.
SQLServer:User Settable – Query. The final counter that should be included in your baseline requires some explanation. The counter, SQLServer:User Settable – Query, is an application-specific counter. The value displayed by this counter is set by your application. To set this value, your application needs to call sp_user_counter1 and provide a numeric value. The suggested manner in which you should employ this counter is as follows:
- Determine an inexpensive way to check the health of your application with a single SQL statement (perhaps count the number of items that were sold in the last hour). The health check must not degrade the performance of your application in any way.
- Write a stored procedure that first checks the health of your application, putting the result into a variable. The stored procedure should then call sp_user_counter1 and provide that health check variable as the input parameter.
- Set up a scheduled job that runs this stored procedure every 15 minutes.
- Include the SQLServer:User Settable – Query User Counter 1 in your baseline and ongoing monitoring regimen. (You can create up to 10 user settable counters.)

Creating A Baseline Chart

Now that you have reviewed the counters that will be included in your baseline, it is time to walk through the creation of the baseline chart. The following steps should be performed on your production server during a time of normal server activity.

To create the baseline chart, perform the following steps on your production server during a time of normal server activity:

Open the Administrative Tools program menu and select Performance.
Double click Performance Logs and Alerts in the left window. Type the name of your baseline in the name field and select "OK". In the Select Counters window that appears, select your first Performance Object, and then select the related Counter. Select "Add".
Repeat this step for each counter you are tracking. For counters that allow you to track by disk, or by instance, and where you have more than one, repeat for each disk number or instance that appears in the Instance list.
In the Select Counters window select "Close".
Set the sample interval that you have decided upon. The time interval that you choose will depend on the amount of data that you want to collect, and the disk space available for that data. A longer sampling interval will allow you to collect less data over a longer time in the same amount of file space. For this reason you should try a short baseline first and reevaluate. (Note that the default value is "every 15 seconds". For a very detailed baseline this is ideal. But, many DBAs will find that this produces more data than they want to store.)

Select the Log Files tab and enter the file location in the "Location" field. Congratulations! You have scheduled a baseline log .

After the log is created, you can click the System Monitor option in the Performance window and then click View Log File Data in the right pane (the icon is a gray cylinder and is the fourth icon from the left). By opening the log file in the Select Log File window that appears, you will be able to view a graph of the logged results.

Now that you have created your baseline, you can begin to develop your ongoing monitoring solution. Clearly System Monitor will play an important role in your regimen. But what other technologies will you use? The next sections detail the tools at your disposal and the ways in which future System Monitor graphs can be compared to the baseline graph will be explored.

Ongoing Use of Charts

Not only are System Monitor charts useful for the initial profiling of your application, they should be a major part of your ongoing monitoring solution as well. This section discusses how to properly add System Monitor charts to your ongoing monitoring regimen.

You should have a Performance log running against your application at all times. To keep these logs a manageable size, you should cut them off at the end of each day (or each shift), save them and then start a new one. You should also reduce the sampling rate of the counters in the graph, from every 15 seconds to every 15 minutes. Finally, you should limit the number of counters used during your ongoing monitoring. The following counters are recommended for your daily monitoring regimen:

Memory – Pages/sec
Network Interface – Bytes total/sec
Physical Disk – Disk Transfers/sec
Processor - % Processor Time
SQLServer:Access Methods - Full Scans/sec
SQLServer:Buffer Manager – Buffer Cache Hit Ratio
SQLServer:Databases Application Database - Transactions/sec
SQLServer:General Statistics - User Connections
SQLServer:Latches – Average Latch Wait Time
SQLServer:Locks - Average Wait Time
SQLServer:Locks - Lock Timeouts/sec
SQLServer:Locks - Number of Deadlocks/sec
SQLServer:Memory Manager - Memory Grants Pending

The chart that you generate each day (from the log file data) should be reviewed (the following day) to determine the health of your application. Any strange behavior should be investigated. If possible, you should attempt to correspond dramatic swings on the chart to real-world causes. Did a new promotion begin that day? Did it overwhelm the server for any period of time? Some amount of time should be set aside each day for this analysis. It will provide you with a deeper understanding of how your application responds to load, which will be invaluable when a problem state arises.

When you find that there is a problem, it is a good idea to go back to the original baseline-monitoring mode to troubleshoot the problem. The baseline monitoring mode produces much more information, primarily because the counters are queried every 15 seconds. This is great when you want to scrutinize a short time span of graph data. When you are troubleshooting, you should operate only in the baseline-monitoring mode for short periods of time. After you have sufficient data recorded in baseline mode, you should save the log and start a new one with the limited set of counters used on an ongoing basis. A good practice to remember is "ongoing monitoring when things look healthy, baseline monitoring when problems occur."

After you have a few performance logs saved away, you can begin to delve into the art of graph trend analysis. This process involves comparing charts over a given time period (for example, a week or a month). You need to ask questions such as "Why is the CPU utilization so high on Fridays?" and "Does that reoccurring blip on the graphs occur because the server is falling down?" Do not always assume that your server went down simply because the counters registered no value for a period of time. When there is network congestion, it sometimes seems that the data sent by your counters is the first thing to get lost.

Use corroborating evidence to help you determine the cause of certain recurring graph behaviors. If users complain that the server is slow in the morning, closely analyze the value of the graph during that time period. Compare graphs to one another. Get to know your graphs, it will serve you well in the long run.

Alerts

Another great monitoring technology you have at your disposal is the alert. An alert is a defined event that causes a notification of some kind. Are you worried that when your CPU utilization hits 100 percent you may lose users? Set up an alert for that. Let the system tell you when it is in trouble.

There is extensive documentation on the correct procedure for defining alerts and notifications. There is, however, less information on the conditions for which you should have an alert defined. This section lists the conditions that a production environment should be concerned with. These conditions are similar to the conditions that your System Monitor graphs should watch for. This is yet another way to monitor the same critical health indicators.

The following conditions should send an alert to your monitoring personnel:
Errors affecting service. Each known critical error should have an alert associated with it, if possible. If the error itself cannot be tied to an alert, it is possible to write a query to test for the error state, run the query on an ongoing basis, and alert when the query finds the error.
Deadlocks. Any SQLServer:Locks – Number of Deadlocks over a given threshold should generate an alert. This threshold should be set to 1 or 2 if your application does not usually cause deadlocks to occur.
CPU utilization. If your application freezes or degrades at a certain CPU utilization level, you need to set an alert when the CPU utilization exceeds that level.
Disk utilization. If your application freezes or degrades at a certain disk utilization level (or queue level), you need to set an alert when the disk utilization exceeds that level.
Scans. Your application database should generate an alert if excessive table scans take place. The definition of excessive, however, is completely up to you. You may have many small tables that do not require indexes, which would cause many acceptable table scans to occur. Monitor the SQLServer:Access Methods - Full Scans/sec counter for a period of time to determine the baseline scan value before you set an alert to watch for the excessive value.

SQL Server Log

Another critical component of your ongoing monitoring efforts should be the SQL Server Log, also referred to as the SQL ErrorLog (which is found in Enterprise Manager). The SQL Server Log is an excellent resource for mining information about the health of your application. If you can read it, that is. The SQL Server Log is written to from the moment of start up, until the service is terminated. And endless arcane messages are written to it over that time span.

Because your monitoring regimen must engage the monitoring personnel (and not put them to sleep), try to specify what you want them to find in this log. This log will record all errors with a severity level of 19–25. As part of the monitoring regimen the SQL Server Log should be checked for errors that have a severity level of 19–25. Errors with this severity level have caused a transaction to fail and thus impact the proper functioning of your application. Errors with severity levels from 20 through 25 are considered fatal. If a fatal severity level is encountered, the client connection is terminated after receiving the message.

Other critical pieces of information that should be closely looked for are the application specific error messages that have been included in your code. Each error condition that your code can test for should employ the RAISERROR … WITH LOG syntax, after the error is caught. This will cause the error information to appear in the SQL Server Log (and the Event Viewer Application Log, discussed later in this chapter). It cannot be stressed how important the use of RAISERROR is. No matter what your application does, it absolutely must use RAISERROR … WITH LOG when an error condition is encountered. This is the only efficient, standard and sensible method for both alerting your operations personnel, and leaving a historical record of the problem.

SQL Agent Log

The SQL Agent service runs your jobs, maintains your replication and basically interacts with the outside world. It also has a log that can be an invaluable tool in your monitoring arsenal.

To access the SQL Agent log:

In Enterprise Manager, expand a server group, and then expand a server.
Expand Management, right-click SQL Server Agent, and then click Display Error Log.
In the Type list, click a type of logged item to filter the log contents.
Optionally, in the Containing text box, enter the message text to filter the log contents.
Click Apply Filter if you have selected filter parameters.
Under Filtered contents, view the log contents.

The default log location is C:\Program Files\Microsoft SQL Server\Mssql\Log\ Sqlagent.out. The current log has the "out" extension, while the previous nine logs have extensions "1" through "9".

Event Viewer

The Event Viewer allows users to monitor events recorded in the Application, Security, and System logs. These logs are separate from the SQL Server Log and SQL Agent Log, and thus provide additional information when it comes to your application. You will be primarily interested in the Application Log because SQL Server messages can be found there. SQL Server messages can be identified as those messages with a source of "MSSQLSERVER" or "SQLSERVERAGENT." Your RAISERROR messages will also appear here.

To view the details associated with the messages in the Application Log:

Open Event Viewer.
In the console tree, click the Application Log.
In the Details pane, right-click the event you want, and choose Properties.

There are many third-party monitoring applications that offer the ability to route these messages to a central console. In fact, these applications are the backbone of most enterprise-scale monitoring systems. Your company may already use one of these applications. Though this chapter is not intended to cover all of the console applications out there, a brief look at these applications is possible.

If you are concerned about SQL Server security and want to ensure that individuals with expired login accounts or bogus credentials are kept out of your SQL Server, you can monitor the Security Log for failed login attempts. This information can be useful for spotting patterns of fraudulent entry. It can also alert you to applications that are using bad credentials and need to be properly configured.

Monitoring Job Run-time Trends

It is likely that your application includes one or more scheduled jobs, such as jobs for backing up the database, processing files, or clearing out staging tables. It is also very likely that the health of these jobs directly impacts the health of your application. Luckily, SQL Server provides an excellent way to monitor the health of your scheduled jobs. Associated with each scheduled job is a history of job execution. This history should be checked on a regular basis.

To check the job history:

Open the Microsoft SQL Server program menu and select Enterprise Manager.
Inside Enterprise Manager, find the server group that includes your server and click on the group and then your server. Right click on the first job associated with your application. A menu will appear. Select "View Job History" from the pop-up menu.

In the Job History window, the result and duration of each job execution is listed. Pay close attention to the job run-time trends. Does the job always fail at 1:00 A.M.? Is the job taking longer to run each time it's executed? These trends can warn you of a failure prior to the failure's occurrence. As you click each execution, detailed information about the result can be found in the Errors and/or messages from the job/step run at (execution date) field.

If you want more information, select the Show Step Detail check box in the upper right hand corner of the window. This will give information on every step in the job.

Monitoring the job histories associated with your application can help you spot a problem before it causes any damage. Reviewing the job histories is also a great way to troubleshoot a problem once damage has been done. But the job histories supply only some of the information you need when managing a complex application. The procedures run by your jobs should be written so that they raise errors in the SQL Server Log when they encounter a difficulty. This will help ensure that the correct message gets to your monitoring staff. It is possible for the job history to provide no further detail than "job was terminated." This is why it is critical that the code run by your jobs include clear, descriptive error messages when error conditions arise.

Other Monitoring Applications

You will find that most enterprise-scale monitoring applications focus on the network and mostly ignore your SQL Server application. These monitoring tools focus on error messages that appear in the Event Viewer and have logic to determine the severity of the error and routing to a central console. This is why it is so critical that you employ RAISERROR … WITH LOG in your code, so that at least some portion of your health information gets to the central console.

Though these enterprise-scale monitoring applications know little about your application, they know a lot about your server. These applications come with a pre-defined monitoring regimen meant to catch the first sign of trouble on your server. One of the outstanding members in this family of monitoring tools is the Microsoft Operations Manager (MOM).

Microsoft Operations Manager (MOM)

NetIQ AppManager for SQL Server, NetIQ Operations Manager, and Microsoft Operations Manager (MOM) are enterprise-scale monitoring tools that come from the same code base. These tools monitor the following items and events, which are critical to good SQL Server and application performance:

ServerDown. This immediately generates an alert if SQL Server or related services such as SQL Agent fail and can automatically restart them.
TopCPUUsers. This monitors the CPU being consumed by SQL Server users and correlates CPU usage and SQL Server statements being executed by users.
TopIOUsers. This tracks I/O consumption by SQL Server users and correlates I/O with SQL Server statement activity.
TopLockUsers. This identifies locking activity by user and identifies SQL Server statements causing locks.
TopMemoryUsers. This monitors memory usage by SQL Server users and correlates memory utilization to SQL Server statement activity.
DBSpace. This monitors and generates alerts on the available space for your SQL Server databases.
DevSpaceAvail. This monitors any database device's available space.
LogSpace. This detects if you are running low on SQL Server log space and provides an option to automatically truncate the transaction log file.
NearMaxConnect. This indicates whether or not you are running low on available connections.
NearMaxLocks. This identifies if SQL Server is almost out of locks.
ServerThroughput. This measures key I/O statistics such as transactions per second.
UserConnections. This tracks who is connecting into SQL Server and how many connections each person is using.
CacheHitRatio. This monitors the SQL Server buffer cache, generating an alert if it is necessary to increase the memory allocated to SQL Server.

Proactive Hardware Monitoring

Your primary responsibility as a DBA when it comes to hardware is to notify hardware personnel when there is a problem. An excellent way to meet this responsibility is take a proactive monitoring approach, identify problem situations early on, and hand the problems off to operations at the first opportunity. The following section discusses this monitoring process

One of the primary goals of a proactive hardware monitoring system is to monitor your hardware in order to better understand and manage the capacity of the system. This helps to prevent problems before they occur. An equally important goal is create a system that will allow you to be forewarned of any problems that may arise.

The following list provides the basic types of information that you will need to reach these goals. You can gather this information using tools such as Microsoft Operations Manager (MOM) or Windows® Management Instrumentation (WMI). You should consider gathering information on the following:

CPU cache sizes
CPU model names
CPU speeds
Disk drive capacities
Disk drive firmware revisions
Disk drive models
Hardware error messages
Hardware error timestamps
IP numbers
Memory module sizes
Model name
Network interface descriptions
PCI board names
ROM version
Serial number
Server name
System description
Total memory

In some data centers, there is a team whose sole job is to upgrade hardware when required. If this describes your working environment, you may need to concern yourself with only a subset of this data. If not, then you should choose to track all the information listed. After you have the information that you need, you should evaluate the ability of your environment to meet the current application load.

Monitoring for Capacity Planning

After you have an application environment that meets the load requirements, you can begin monitoring the changes in application load. You must watch the application load closely as you plan for future capacity. This process is essentially monitoring for the purpose of capacity planning. To begin monitoring for capacity planning, choose a set of measurable objects that relate directly to the hardware performance of your system. The objects you measure should include:

Disk drive capacities
Size of databases, size of free space inside the database
Difference between the database size compared to the disk drive space
The rate of database growth
Location (drives) where data files are located (*.MDF, *.NDF, *.LDF)

A great way to monitor these indicators over time is to create an ongoing performance log regimen (much like the application monitoring regimen).

Use of Performance Log

Hardware problems can be spotted in a performance log, as long as you include the appropriate counters. You should consider monitoring the following counters on a regular basis:

Network Interface()\Packets Outbound Errors. The number of outbound packets that could not be transmitted because of errors.
Network Interface()\Packets Received Errors. Shows the number of inbound packets that contained errors preventing them from being deliverable to a higher-layer protocol.
Server\Errors System. Shows the number of times that an internal server error was detected. Errors can reflect problems with logon, security, memory allocation, disk operations, transport driver interface operations, communication [such as receipt of unimplemented or unrecognized Server Message Blocks (SMBs)], or I/O Request Packet stack size for the server. Many of these errors are also written to the System log and Security log in Event Viewer. The server can recover from most the errors displayed by this counter, but they are unexpected and should be reported to Microsoft Product Support Services.

The Monitoring Process

You should list the thresholds for each of the monitored objects. A threshold is a level at which you will want to take action. If no specific threshold is appropriate, indicate which status of the monitored object indicates that an alert should be generated. If desired, enter one or more corrective actions that should take place when an alert condition occurs. If the corrective action is "simple," it will be automated. If it cannot be automated, the action details will be required in a Troubleshooting Guide.

Enter the time frequency in which you want this monitoring to be executed, for example, every 10 minutes. You can request a "complex" schedule, such as twice a day at 2:00 A.M. and 4:00 P.M. Keep in mind that as monitoring is executed more frequently, more processor utilization is required on the server to perform that monitoring function.

Enter the number of CONSECUTIVE alert condition occurrences that must occur before an alert, and corrective action if specified, is executed. The default is one, meaning that the first time an alert condition is detected to create an alert and, if specified, initiate the specified corrective action.

If the monitoring is complex, such as correlating two or more objects, you must indicate exactly what the expected outcome should be. For example, suppose that you wanted to monitor the Memory\Available Megabytes counter: If that counter dropped below 20 MB, you want to look for Windows event 1365 from source "XYZ," and generate an alert if such an event has been received after the threshold was crossed. You must specify each part of the inter-related monitoring information as has been detailed in this example.

You should not only monitor point-in-time data, but perhaps more importantly, you should monitor rate of change. If your processor usage is slowly increasing at a steady rate, and the rate of increase goes from 2 percent to 10 percent in a short period of time, your data collection and analysis process should bring this to your attention. Often it is the change in the rate of change that warns you that a critical shortage is going to occur sooner than anticipated. This warning will allow you to avert the otherwise unexpected problem.

Exception Management

The alternate approach to monitoring your application is referred to as "exception management." When exception management is used to monitor an application, only the exceptions to normal service are monitored (thus the name). An exception could be anything from hardware failure to job failure to performance degradation. Your system would gather only data concerning exceptions and would only notify you when an exception occurred. This system allows you to have the most impact for the least investment of time.

Exception management may need to be introduced if your monitoring personnel are currently being overwhelmed with monitoring information. In an enterprise-scale data center using a central console, it is easy to get to a point at which the sheer volume of application messaging becomes unmanageable. When this occurs, the messages sent to the console are either ignored or dealt with much later in the day than required. This breakdown can only be resolved by limiting the messages that are sent to the console. Attacking this problem with exception management can produce a more optimal monitoring system.

One of the drawbacks of exception management is the knowledge you must have concerning your application's exceptions. You must be familiar with every single problem your application could encounter. If you do not know what problem states to look for and how to catch every single state (as it occurs) your system will fail to warn you when one of these problem states is encountered.

Exception Management Regimen

The exception management regimen depends on limited performance monitoring, properly defined alerts, and extensive use of error log information. The details of these system elements are discussed in this section.

Performance Monitoring

Performance monitor should be employed only as a tool to catch problem states. The performance monitor data should still be stored in the performance log format, but instead of viewing the counters in a graph, the results could be imported into Microsoft Excel for quicker analysis. In Excel, you would search for any individual values that indicated a problem had occurred. If no problem values were found, you would simply close the Excel file without saving.

The counters that should be monitored are as follows:

Memory – Pages/sec
Network Interface – Bytes total/sec
Physical Disk – Disk Transfers/sec
Processor - % Processor Time
SQL Buffer Manager – Cache Hit Ratio

Alerts

Alerts should be set up for all known failure states. Each job should generate an alert when a failure is encountered. Failed login attempts should also generate an alert. If there is a known limit to the number of users that can access the database, an alert should be launched when that user count is exceeded.

SQL Server Log

The one element of the exception management approach that is time-consuming is the use of the SQL Server log. The SQL Server log is the place where all errors are trapped (expected and unexpected). Because the exception management approach demands that you be aware of all known problem states, you must familiarize yourself with each error message that appears here. How are the errors grouped? How do they escalate over time? What real-world events might correspond to the errors? Do you have alerts associated with each error? Do you have resolutions for each error?

Your ability to solve the problems presented in the SQL Server log will in part determine the suitability of the exception management approach for your environment. Is the log enough? Are you comfortable finding out about errors after they have occurred? Do you find the text in the log sufficient for problem solving? If you are comfortable starting the problem solving process with nothing more than the error messages stored in the SQL Server log, you should consider this approach.

Hardware Exception Management

The time you spend dealing with hardware issues should be somewhat limited. Unless you built the servers in the first place, hardware maintenance should not be a part of your job description. Because of your limited role, it is possible to monitor your hardware using the exception management approach. Following this approach dictates that you react only after hardware problems arise. This approach is also useful if you find that you are not allowed to monitor the hardware on an ongoing basis, but are frequently brought in to help after there is a problem.

Reacting to Hardware Exceptions

The tools associated with reactive hardware monitoring assume that you have limited use of your problem hardware. As you use these tools, you will determine how much use of the hardware you still have. The goal of reactive hardware monitoring is to determine which parts of the server are dysfunctional and which are healthy. As you use the tools, if you determine that no part of your hardware shows a clear sign of dysfunction you should fall back to application troubleshooting.

The tools you have at your disposal are as follows:

Lights. The most basic tool at your disposal. The lights found on servers, external storage components and individual drives indicate more than you would think about server health. Take a full inventory of the lights on your hardware. Do all the pieces have power? If a light has the ability to indicate a problem (with a red or yellow color indicating a problem versus a green color indicating health) does it indicate a problem? Also, if the lights indicate activity (as most drive lights do) does the light indicate normal activity? If you are booting a server the boot drive should show activity. Use this inventory of light information to begin to isolate your problem.
Boot up messages. Probably the most important tool when dealing with a server down state. The messages that appear at the time of server boot indicate the health of various server components. You must know the order and content of the expected boot messages to identify the problem messages that may be flashing in front of you. If you are unsure of this information, try rebooting an identical piece of hardware that is healthy and write down its boot messages.

Often the unexpected messages you receive do not directly point to a problem. Try to determine the area from the message. Then do further research into that area.
Boot system choice. All Microsoft Windows 2000 server operating systems provide a "clean boot" option. This option loads the minimum set of drivers required to boot the server. If you are still attempting to isolate the area of problem, this "clean boot option" is recommended. When this option is used, drivers will not be loaded for a great deal of your peripheral hardware. If you perform a "clean boot" and your problem seems to go away, you have isolated the area of the problem to either the drivers or the actual hardware that is not currently in use.
Boot from disk. Another tool at your disposal is the use of a system-formatted floppy disk at boot time. As long as your BIOS searches drive A prior to searching drive C, you can use this tool. If your BIOS does not search drive A during the boot process, it may be a simple matter of changing a BIOS setting to accomplish this.

Booting from disk has limited applications. You will only learn whether the CPU can access the BIOS, whether the BIOS can access drive A, and whether the keyboard, mouse, and monitors work without the usual operating system drivers. If you are unsure if any of these components work, boot from a floppy disk.
BIOS and SCSI configuration tools. Many servers have BIOS and SCSI configuration tools that are available at boot time, prior to the launch of the operating system. These tools will be useful in many situations. These tools have the power, however, to completely change the way in which your server works. Do not access these tools if you are unfamiliar with how they work. You will certainly do more damage than good if, in your attempt to troubleshoot a hardware problem, you accidentally erase your BIOS settings.
Event Viewer. If you can actually get your operating system to boot, you will have numerous tools at your disposal to help troubleshoot your problem. The first of these is your Event Viewer log. After the operating system boots, always check this log first (in a problem situation). Review all messages that appear in the System and Application logs. These messages are dated, so the time your problem first surfaced and the time of various reboots are critical milestones in the logs. Are there errors that correspond to problem times or boot times?

There is an important thing to note about log messages and boot times. All of the messages that were written after a boot time apply to that boot. The groups of messages between boots apply to that period of uptime. Always keep a record of which hardware problems were on display during each period of uptime. By doing this, you can match the log messages associated with the uptime to the hardware symptoms displayed during the uptime. This can often be hard to keep track of, but it is critical if you want to isolate your hardware problems.

Often, when you are trying to solve hardware problems, you will try numerous changes to the server to isolate (and hopefully solve) the problem. After you have implemented a change, you may need to reboot the server for the change to take affect. This causes there to be many uptime periods during your troubleshooting time. If you don't keep track of the server changes made, each uptime period that took place after a change, and the log messages that were generated during each uptime period, you will not be able to isolate your problem.
Device Manager (in Control Panel, double-click the System icon, click the Hardware tab, and then click Device Manager). Anther excellent tool at your disposal is the Device Manager that is provided with Windows 2000. Device Manager lets you look at the devices currently recognized by your operating system and check the status of the device. By clicking the icon associated with the device, a window with a Troubleshooter button appears. The "troubleshooter" will attempt to determine the kind of problem your hardware is experiencing. It is recommended that you use this tool often.

Identifying CPU and Memory Constraints

Some problems that appear to be hardware related are, in fact, caused by resource constraints. The resources that you should be most aware of are CPU and memory. If your CPU resource is close to fully utilized, there is a chance that some hardware requests are being forced to wait too long for satisfactory performance. This is also the case with memory utilization. If an operation that requires the use of hardware has to wait for memory allocation, this wait may be longer than is tolerable for your application's satisfactory performance. In both cases, it can seem like the request for hardware failed because the hardware itself failed. Make sure that you have a clear indication that the hardware did fail before you come to this conclusion. Always examine the load being placed upon your CPU and memory as a possible cause.

If CPU and memory issues are continually popping up, consider switching to a more proactive approach. Regardless of your approach, be sure to employ good capacity planning and management when it comes to your CPU and memory resources to avoid wait states. Try to be aware of the load being placed on your CPU and memory resources at all times. A good indicator of this load is user count. Another good indicator is the total number of user requests. If either your user count or total user requests begin to climb, review your CPU and memory capacity to ensure that your hardware can handle the additional load.

Clustering Hardware

Clustered database servers also introduce an additional class of hardware into your data center. Cluster database servers required shared storage to grant the multiple servers access to the same database. This storage is most frequently accessed via a fiber switch.

If the cluster service is non-functional, you will have difficulty accessing the shared storage. This makes it difficult to determine the state of either the fiber switch or the shared storage during problem times. Make sure that your operations personnel are thoroughly trained in the administration of cluster specific hardware.

Reacting to Problems

There will come a time when your monitoring system has properly alerted you of a problem. Now what will you do? As you will see in Chapter 7, "Problem and Incident Management," you have only just begun the troubleshooting process. In fact, the alert that started off this series of events was merely a symptom. You may have many more steps to go before you identify the actual problem.

For in-depth troubleshooting you should follow the methodology for the isolation and management of application problems outlined in Chapter 7. There are some tools, however, that should be part of your basic monitoring regimen, which allow further investigation of problem states. Remember, even if your organization's monitoring personnel cannot solve the problems that occur, it is important that they feel empowered to find out more about the problem, gather additional symptom data, and stay involved in the resolution process.

Tools

Some of the tools that are used in problem analysis are the same tools used to monitor the application in the first place. When it comes to reactive monitoring, the tools are used much more for diagnosis, rather than reporting.

SQL Error Log and SQL Agent Log

As discussed earlier, there is a lot of debris in any given SQL Error Log. The same holds true for the SQL Agent Log. When it comes time to analyze a problem situation, however, this debris can unlock the mystery surrounding your problem state. Use the date and time stamps in the logs to review the history of the problem. When did the SQL Service boot? How long did the boot process take? Were any databases left unrecovered? Did any full text indexes fail to finish building?

Now review the processing time that led to the problem state. What kinds of messages were entered into the log? Is there anything in the log that correlates with the Event Viewer or other data source? By using the logs as a historical map, you can begin to isolate and locate the source of your problem.

SQLDiag.exe

To get a good idea of the current state of your SQL Server, including current user information, DLL versions, configuration information, and database size information, you can run a utility called SQLDiag.exe. The default location for this utility is C:\Program Files\Microsoft SQL Server\Mssql\Binn\SQLDiag.exe. When you run this utility, the following messages appear in a command prompt window:

Connecting to server (YOUR SERVER NAME) 
Getting file C:\Program Files\Microsoft SQL Server\MSSQL\log\ERRORLOG
Getting file C:\Program Files\Microsoft SQL Server\MSSQL\log\ERRORLOG.1
Getting file C:\Program Files\Microsoft SQL Server\MSSQL\log\ERRORLOG.2
Getting file C:\Program Files\Microsoft SQL Server\MSSQL\log\ERRORLOG.3
Getting file C:\Program Files\Microsoft SQL Server\MSSQL\log\ERRORLOG.4
Getting file C:\Program Files\Microsoft SQL Server\MSSQL\log\ERRORLOG.5
Getting file C:\Program Files\Microsoft SQL Server\MSSQL\log\ERRORLOG.6
Getting registry information
Getting library version information
Getting configuration information 
Getting current user information 
Getting lock information 
Getting database information 
Getting product information 
Getting extended procedures information 
Getting process information 
Getting input buffers 
Getting head blockers 
Getting machine information. Please wait; this may take a few minutes 
Data Stored in C:\Program Files\Microsoft SQL Server\MSSQL\log\SQLdiag.txt

After this file runs, it leaves a file on your server that lists all the detailed information about your server that you might ever need. If you contact Microsoft Product Support Services at some point, the information in this file will be useful them.

If you intend to run SQLDiag.exe more than once and would like to save the output from each run, make sure to rename the SQLDiag.txt file (to something unique) after each run. Otherwise, the contents of SQLDiag.txt will be overwritten each time you run the executable file.

Profiler

Profiler is perhaps the best tool to use when gathering detailed information about a problem state. In fact, Chapter 7 includes step-by-step instructions for using Profiler.

When performing your initial analysis of a problem state, it is best to limit the events that are traced with Profiler. While Profiler is the best tool at your disposal, it is also the most expensive. Introducing Profiler into a problem state can easily aggravate that state.

You can take steps to make your Profiler investigation easier. First, run Profiler locally on your SQL Server. Next, select only events found in the "Errors and Warnings" Event Class. Select only one or two events that you think might be helpful in your investigation. Finally, run your first trace for only a moment or two. Check with operational personnel to determine if running profiler has aggravated the problem state.

If your use of Profiler is not making things worse, try a longer trace with the same limited counter set. Always run the trace locally on your SQL Server. If this trace has not disrupted operations, consider slowly adding the events recommended in Chapter 7.

Current Activity

Another great tool that you should use in your reactive monitoring is the Current Activity window, found in Enterprise Manager. This tool has no cost associated with it, so it should be one of the first tools you use during a problem state. The Current Activity window displays a snapshot of information regarding processes, user activity, and locks held by processes and locks held on objects. It is important to note that this window offers only a snapshot. If you are interested in the ongoing state of your server, you must refresh this snapshot regularly.

You can use the Current Activity window to monitor blocked and blocking transactions. You can view currently connected users and their last executed statement. You can view all locks by database object. You can also end a selected process or send a message to the user who is executing a problematic transaction. Keep in mind that if you decide to end a process, there may be a lengthy rollback associated. Sparingly use the power to end processes.

The following values can be found in the Current Activity window for each process (the values are sampled when the window is first opened, or subsequently refreshed):

Process ID. SQL Server Process ID.
Context ID. Execution context ID used to uniquely identify the sub-threads operating on behalf of a single process.
User ID. User ID of the user who executed the command.
Database. The database currently being used by the process.
Status. The status of the process (for example, running, sleeping, runnable, and background).
Open Transactions. The number of open transactions for the process.
Command. The command currently being executed.
Application. The name of the application program being used by the process.
Wait Time. The current wait time in milliseconds. When the process is not waiting, the wait time is zero.
Wait Type. Indicates the name of the last or current wait type.
Wait Resources. Textual representation of a lock resource.
CPU. Cumulative CPU time for the process. The entry is updated only for processes performed on behalf of Transact-SQL statements executed when SET STATISTICS TIME ON has been activated in the same session. The CPU column is updated when a query has been executed with SET STATISTICS TIME ON. When zero is returned, SET STATISTICS TIME is OFF.
Physical IO. Cumulative disk reads and writes for the process.
Memory Usage. The number of pages in the procedure cache that are currently allocated to this process. A negative number indicates that the process is freeing memory allocated by another process.
Login Time. The time at which a client process logged into the server. For system processes, the time at which SQL Server startup occurred is displayed.
Last Batch. The last time a client process executed a remote stored procedure call or an EXECUTE statement. For system processes, the time at which SQL Server startup occurred is displayed.
Host. The name of the workstation.
Network Library. The column in which the client's network library is stored. Every client process comes in on a network connection. Network connections have a network library associated with them that allows them to make the connection. For more information, see "Client and Server Net-Libraries" in Books On Line.
Network Address. The assigned unique identifier for the network interface card on each user's workstation. When the user logs in, this identifier is inserted in the Network Address column.
Blocked By Process ID. SPID of a blocking process.
Blocking Process ID. SPID of processes that are blocked.

SP_WHO

There may be times when Enterprise Manager cannot access your SQL Server but a Query Analyzer session can. In these instances, SP_WHO can be used to discover information about current connection count and connection details.

Summary

It is important that you follow through with the monitoring system development process outlined in this chapter. It is easy to ignore the need for monitoring until a dramatic outage makes it clear how valuable a monitoring system is. Be sure to first document all the application components that are involved. Determine which monitoring technologies are best suited to meet your needs. Then create a strict regimen that is suited to your working environment. Finally, implement the regimen and track its success.

Regardless of the monitoring approach you take, it is recommended that you have numerous views of your application's health available to you. While one tool may miss a critical symptom, other tools may catch it. Use all the tools at your disposal in concert to determine the health of your application. Even if your application seems completely healthy, investigate the minutia concerning user load, CPU utilization, and disk access. The ongoing analysis of this data is the key to preparedness. Be continually on the watch for symptoms of a problem. If you catch the symptoms early enough, you may be able to prevent a serious problem from occurring.