SQL Server 2000 Operations Guide: Problem and Incident Management

Article
01/28/2010

Problem and Incident Management

Updated : October 26, 2001

Abstract

This chapter describes briefly the best practices for problem and incident management. It provides tips on identifying problems that appear to be database problems but could be connectivity problems or, when clustering is used, problems that could be clustering issues. Review of log files and the tools to document problem resolution are discussed briefly. After implementing the best practices for problem and incident management, the database administrator (DBA) should have the documentation and procedures in place to support problem solving so that recurring incidents can be prevented, and successful incident resolution steps can be used at the next occurrence, once the cause has been correctly identified.

Introduction

The procedures outlined in this chapter will aid the troubleshooting and resolution of any application problem you encounter. Although these procedures focus on troubleshooting your Microsoft® SQL Server™, the close relationship that SQL Server maintains with other application components dictates that you be involved in all instances of problem management.

The following procedures will help you properly gather symptom data, isolate problem location, determine the type of fix required, and use the lessons learned to improve the operational environment altogether.

Design Considerations

This chapter details a methodology for the isolation and management of application, and specifically SQL Server, problems. This methodology is meant to deal with the main stumbling block to good problem management: confusion. When a wide variety of symptoms are generated at many application layers, it is difficult to separate the meaningful symptoms from the red herrings. It is also difficult to know where to begin when a number of possible courses of action present themselves.

The problem management methodology was designed to eliminate this confusion. The methodology first deals with the types of symptoms that are given weight. With this methodology, the user starts by going through a set of questions that touch upon the most common problem areas. As possible problem areas are eliminated, the problem at hand becomes isolated. Once the problem has been completely isolated, the methodology recommends an appropriate solution.

This systematic approach was designed to deal with the chaos of a large, multi-tier application environment so that the problem management process does not degrade into confusion.

Resource Requirements

At a minimum, a database administrator is required for proper problem management. Depending on the type of problem, client, middleware, and SQL developers may need to assist with troubleshooting and resolution.

An application debug tool or operational mode may also be required. This would allow you to monitor the health of the application at each application layer between the client and the SQL Server. If you have no other way of determining the health of the application between the client and the SQL Server, then this tool or operational mode is needed.

A configuration problem that can be resolved after an analysis of Tier 1 symptom data (defined later in this chapter) should require no more than two to four hours of work on the part of the database administrator.

A code problem that can be resolved after an analysis of Tier 1 symptom data could require two to four hours of work on the part of the database administrator and 2-16 hours on the part of the original developer.

A problem that requires the use of Tier 2 analysis (defined later in this chapter) could potentially require numerous days on the part of the database administrator. The analysis and resolution of this problem might also require assistance from the development team and the network administration team throughout that time period.

It is important that the database administrator understand the fixes that are associated with various problem states. A wide variety of resources are available that provide this guidance. Microsoft SQL Server 2000 Reference Library, Microsoft SQL Server 2000 Administrator's Companion and Inside Microsoft SQL Server 2000 (all published by Microsoft Press) are highly recommended.

System Prerequisites

At the onset of any incident, a specific data set, referred to as Tier 1 symptom data, should be gathered. This data set will be discussed at length in the following chapter.

Process Flowcharts

There are a number of steps to the problem management methodology detailed in this chapter:

Gathering symptoms
Classifying symptoms
Connecting direct symptoms to the problem
Isolating the problem
Determining solution alternatives
Attempting to solve the problem

Each step will be dealt with in greater detail later in this chapter. An overall process flow has been developed that will help walk you through these steps. The process flow is described by three process flowcharts, which are described here: the Application Flowchart, the Data Tier Flowchart, and the Data Tier Operations Flowchart.

Application Flowchart

The Application Flowchart (Figure 7.1) describes the process of gathering and analyzing symptom data. Following this flowchart will also allow you to determine whether the problem is related to SQL or not.

Figure 7.1: Application Flowchart

The Tier 1 symptom data referred to in this flowchart comes from five sources:

SQL Error Log
Event Viewer - System Log
Event Viewer - Application Log
The application behavior
Any errors handled by an independent process, including the following:
- Screen errors
- DTS (data transformation service) errors
- bcp errors
- Application log errors

Often it is possible to determine the root cause of the symptoms from the Tier 1 symptom data. For example, when the DTC (distributed transaction coordinator) service is first started in an improperly configured clustered environment, the error message found in the Event Viewer – System Log will state that comclust.exe must be run to configure the DTC service for use with the cluster. This is a rather clear statement of problem and resolution!

When the exact problem cannot be determined from the initial Tier 1 symptom data, the Application Flowchart requires that you confirm the symptoms exist at each layer of the application, starting with the client. As you move toward the SQL Server, you may be able to isolate the problem to a middle tier or connectivity component. Only once you are sure the symptoms are being generated by the SQL Server itself are you ready to move on to the next flowchart.

The Data Tier Flowchart

The Data Tier Flowchart (Figure 7.2) begins by asking the question "Has the application ever worked in this environment?" The answer to this question will determine the kind of troubleshooting that needs to be done. If the application was previously working and only recently encountered a problem, then this flowchart can be skipped altogether.

If the application has yet to work in the current production environment, you will need to become familiar with the release process that led to this state. The Data Tier Flowchart will walk you through the elements of the release process that might be the source of your problem. Your primary goal is to find any changes that were introduced between development, test and production.

If the changes that occurred during the release process have a high correlation to your symptoms, eliminating the changes should eliminate the problem. This exercise is also a good way to improve the release process so that similar problems do not crop up in the future. If conforming your environment to a known working environment still does not alleviate your problem, you will have to move on to the final flowchart.

Figure 7.2: Data Tier Flowchart

Data Tier Operations Flowchart

The Data Tier Operations Flowchart is the final flowchart in the problem-management process. This flowchart focuses on the operations of your SQL Server. It determines whether your problem is based on faulty change management or one of the seven common problem areas that plague SQL Server-based application components.

When looking at the common SQL Server problem areas, these flowcharts make use of Tier 2 analysis. Tier 2 analysis is composed of three types of in-depth analysis:

Profiler
Query Analyzer - Execution Plan
Other trace utilities, including the following:
- ODBC Data Source Administrator – Tracing
- Network trace

Tier 2 analysis is the next stage in developing symptom data. By monitoring all activity on the SQL Server with Profiler, it is possible to isolate the problematic behavior to a specific query or set of queries. Once the problematic queries have been isolated, it is possible to delve even further into their workings with the Query Analyzer's Execution Plan option. Other trace utilities also provide this enhanced ability to isolate the problematic behavior occurring on your SQL Server.

Figure 7.3: Data Tier Operations Flowchart

Performing Tier 2 analysis can be exhausting, but it is well worth it. Becoming adept at isolating your problem and properly defining its scope will help ensure that the solution you come up with actually addresses the problem.

It is important to note that the Data Tier Operations Flowcharts should not be used until a preceding flowchart instructs you to do so. It is extremely important that you rule out client and middle tier components as the source of your problem before you focus on the SQL Server. It is also important that you first rule out installation problems when it comes to recently installed application environments. Only by starting the process at the beginning (with the Application Flowchart) can you ensure that all possible options are pursued in the proper order.

Problem Management

The process of dealing with application problems can be viewed at many different levels of granularity. As a program manager, you will want to ensure that all the personnel involved in the process are following the right methodology. As a database administrator, you will want to make sure that all symptom data has been retrieved from each application layer. To assist each of these roles, a number of step-by-step processes have been developed that, together, make up the recommended problem management methodology.

Terminology Used

A symptom is any behavior, message, or error that is not expected in the normal operation of your application. A problem is the actual deleterious code, configuration, or data that is causing the symptom. The difficult thing about problem management is connecting the symptom to the actual problem. Once the problem has been discovered, there is often a vast array of known solutions to choose from. It is the misdiagnosis of the problem that causes the problem resolution to drag on. Misdiagnosis, and the steps taken because of it, can easily make the problem worse.

High-level Process

At a high level, the process of managing problem resolution is as follows:

Analysis phase: Gather the initial set of symptoms.
Separate symptoms into the following three categories: symptoms directly caused by the problem, symptoms related to the problem, and symptoms not related to the problem.
Determine a working theory of the problem. Try to base your theory on the symptoms classified as being directly caused by the problem. If the problem is apparent, go to step 8.
If the problem is not apparent, gather additional symptom data and separate it into the three symptom categories as it is logged. Keep reworking the theory of the problem as new symptom data is logged. When gathering additional symptom data be sure to focus on the theorized problem area. Your research should allow you to rule out certain areas and isolate the problem area.
After isolating the problem area, determine if the problem is apparent. If so, go to step 8.
If the problem is still not apparent, use in-depth analysis tools such as Profiler and Query Analyzer's Execution Plan option to determine the source of the symptoms. When the problem is apparent go to step 8.
If the problem still cannot be determined after in-depth analysis, re-examine your classification of the symptoms. You may be placing too much emphasis on a non-related symptom and ignoring the symptoms that point straight to the problem. If continued analysis of the symptoms does not reveal an apparent problem, go to step 8 anyway. By going through the solution phase you will gain new data with which to re-analyze the situation.
Solution phase: Once you have a clear view of the problem, there are bound to be a number of solutions that can be attempted. Prioritize your solution alternatives. Choose the solution that best addresses your theory of the problem and the directly related symptoms you are encountering, and that is considered a best practice.
If possible, test your solution in a test environment. If the test of the solution indicates that it is not the best alternative, go back to your original set of solution alternatives and test the next most desirable solution. Continue testing until an appropriate solution presents itself. If you run out of solution alternatives or at any time one of your solutions makes matters worse, go to step 13.
Plan the implementation of your solution. Attempt to minimize the impact of the implementation on the operation of your application (that part that is still functional). Have all parties involved in the operation of your data center provide feedback on the impact of your solution and your proposed implementation of it.
Implement your solution according to the final plan.
Gather a new set of symptom data after the solution has been implemented. Compare the new set of symptom data to the original set of symptom data. If the symptoms have been alleviated, the solution worked. If the symptoms have only been somewhat reduced, go back to your original set of solution alternatives and choose the next most desirable one to implement. Determine how that solution can be implemented in light of the first implementation you executed (you may need to roll back your current solution). Repeat the implementation process with each known solution until the symptoms are alleviated. If at any point the symptoms get worse go to step 13.
Re-examination phase: The act of implementing a failed solution presents a whole host of new symptom data. As symptoms disappear or reappear between solution attempts you are given greater insight into the nature of the problem. Closely analyze the changes that occurred to your application after the failed solution attempt. Let the differences between symptom data sets help determine if your original classification of symptoms was correct. Use all this new data to develop a better theory of the problem. If necessary, go back to step 6 and perform in-depth analysis again, this time focusing on the new problem theory. If the new data makes the problem apparent, go to step 8.

Detailed Analysis Process

Once the problem management process has been initiated, the most important task will be the discovery and isolation of the problem. If the symptoms can be traced back to the SQL Server, the best tool for providing detailed analysis of the problematic behavior is Profiler. The following steps outline the use of Profiler for problem detection and isolation:

On the SQL Server program menu, select Profiler.
Inside Profiler, select File, then New, then Trace.
Logon to your SQL Server.
On the General tab, specify the Trace name and Save to file location , and disable Enable file rollover.
On the Events tab, remove all existing events from Selected event classes. Add the following events to Selected event classes:
- Errors and Warnings – Attention
- Errors and Warnings – ErrorLog
- Errors and Warnings – EventLog
- Errors and Warnings – Exception
- Errors and Warnings – Execution Warnings
- Errors and Warnings – Hash Warnings
- Errors and Warnings – Missing Column Statistics
- Errors and Warnings – Missing Join Predicate
- Errors and Warnings – OLEDB Errors
- Errors and Warnings – Sort Warnings
- Stored Procedures – SP:Completed
- Stored Procedures – SP:StmtCompleted
- TSQL – SQL:StmtCompleted
On the Data Columns tab, add the following data to Selected data:
- DatabaseName
- DBUserName
- HostName
On the Filters tab, select DatabaseName, then select Like, and add the name of your database.
Click the Run button to begin the trace. When you feel you have captured enough data, click the Stop button.
Analyze the rows captured during the trace. Were there errors generated by the client activity? There will be a huge amount of data to sift through, so attempt to focus on the rows generated while the problematic code ran. If you know the exact statements that you are looking for, you can use the search functionality provided by Profiler. Click the first row of the trace, select Edit, and then Find.

Enter a few words from the statement you are searching for in the Search Value text box. Select TextData from the Data column drop-down box. Click the Find Next button. Profiler will display the first instance of the search text.

Problem Management Versus Incident Management

At some point in the management of your problem, you may find that the symptoms disappear and the problem seems to go away magically. This is often the case with load-based problems. When the user load subsides, the symptoms may simply vanish.

When this occurs, it is very difficult to continue the problem management process. Unless you have a way to generate load that does not result in actual data changes to the production environment, you can no longer gather symptom data. It also difficult to measure the effectiveness of new fixes in this situation when no symptoms exist to correct.

In this situation, you may need to switch over to an incident management approach. An incident management approach is much simpler than a problem management approach. With an incident management approach, you are simply attempting to limit the damage caused by the incident for the duration that it occurs. Your response may be as simple as rebooting servers the moment they freeze. Or you may have to engage in an extensive data-cleansing operating that returns the production environment to a working state once the symptoms subside.

It is always better to approach a problematic situation with the notion that you can fix the problem. If you are simply trying to get through the current incident, your efforts will accomplish nothing in the long run. If the randomness of your production environment is convincing you that you need to try the incident management approach instead of the problem management approach, you should consider building out a test environment with load tools that can allow you to continue with your problem management.

Problems That Appear as Database Problems

One of the goals of the problem management methodology is to isolate database problems from other application problems. This section looks at two specific problem areas that are frequently misdiagnosed as database problems. By understanding the nature of these problems, you can identify them when they appear.

Connectivity

Connectivity may be one of the hardest application components to troubleshoot because of the wide variety of technologies used to connect the client tier to the data tier. It is clear when the client is not connecting; but it can be difficult to determine the source of the connection problem.

The first step is to stop using any connectivity technology that is no longer supported. If the client uses RDO or dblib to connect to your SQL Server, there is a good chance that your connectivity problem is being caused by an incompatibility. If possible, replace the data access library that is being used with one that is supported and test the connectivity of the application.

Next, try connecting to the SQL Server using a mechanism other than the data access layer that is employed by the application. This will confirm whether the SQL Server is responding at all. If you are unable to ping the SQL Server, the network and/or name resolution server (DNS) may be the problem. If you can ping the server, but cannot open a Query Analyzer session pointed at the server, the SQL service itself may be the problem. If you are able to open a Query Analyzer session pointed at the server, the session should attempt to handle your application's connection request. When testing connectivity with Query Analyzer, make sure you attempt to connect to SQL Server using the same network library that your application employs.

Then check the credentials that your application is using at connect time. Is a valid login/password combination being used? If you were able to open a Query Analyzer session with the same credentials, the credentials are probably not the source of the problem.

Next, be sure that your application is handling any connection errors that your data access layer may be encountering. When using ActiveX® Data Objects (ADO), be sure to check the Errors collection of the Connection object in Performance Monitor. The errors returned to the data access layer may provide the best clue as to the cause of the connectivity problem.

Another possible problematic state is when your application succeeds in connecting to the SQL Server, but is unable to use the connection (because of problematic client side code). You can determine if this is the case by opening SQL Enterprise Manager and selecting your SQL Server, Management, Current Activity, and Process Info. If your client's PC name can be found in the Host column (and no other process on your client's PC has access to SQL) then your client has successfully connected to the SQL Server. If you double-click the row with the correct "Host" value you will be able to see "Last Transact SQL command batch." This will show you what queries, if any, your client has attempted to execute against the SQL Server. You may have to right-click Current Activity, and select Refresh to see the most recent query.

You can also confirm whether or not your client is sending queries to the SQL Server by running Profiler and filtering out all other hosts except your client. If there is evidence that queries are being sent to the SQL Server, you do not have a connectivity problem.

The final connectivity issue to watch for is extensive use of server-side cursors. If your application client is attempting to work with a large recordset, but is not opening a client-side cursor, the cursor will be located at and managed by the SQL Server. Asking the SQL Server to manage a cursor that needs to walk through a large number of rows and asking it to hold locks on those rows so the cursor can make changes as it goes, could end up causing a massive amount of locking. Though this might appear to be a SQL Server issue, it is a connectivity problem. The slow link that your client is working over will not free the locks in a timely manner. Also, if your user walks away from the interface without ending the session, the connection will stay open and the locks will remain. This problem can be avoided easily by using a client-side cursor in the cursor declaration syntax. This will cause SQL Server to move the entire recordset across the connection when the rows are first requested. From this point on the client, not the SQL Server, will manage the recordset.

Clustering

Extensive documentation exists concerning the administration of Windows clustering in the Microsoft Windows® 2000 operating system and failover clustering in Microsoft SQL Server. For the purpose of troubleshooting, however, only a general understanding of these technologies is required. The primary aspect of the technologies that must be understood is the concept of the stand-alone restart.

A cluster provides an automated failover system that, in essence, appears to the application layer as a series of restarts. While the failover is immediate and appears seamless, the transactions in memory fail forward or fail back much in the way that a SQL Server restart fails forward or fails back the transactions in its log (depending on completeness). An application must be able to handle these restarts appropriately to gain the most benefit from the cluster.

Most Web-oriented applications have no problem within this environment. The restarts do not affect performance when a database is accessed frequently, state information is pushed down to the table level (to be retrieved by a subsequent call) or kept off the database altogether (in a cookie or URL string), or recordsets are either disconnected or maintained without server-side locking. If a restart were to occur at the exact moment a page was being rendered, and that page could not be rendered because of a lost database connection, the user could simply refresh the page to re-establish a connection to the new primary database server.

If your application has lost contact with the SQL Server and cannot re-establish it, however, your application may not be tolerating the restarts. If your application is managing a connection pool used to access the SQL Server, it may not be re-establishing connections once they become invalid. The proper way to process the discovery of an invalid connection handle is to release it and establish a new connection (using the same virtual server name).

The best way to determine if the cluster is the source of your problem is to remove the active node from the cluster, shut down the cluster service, and access the previously active node directly. This will require that your application use a new server name for the SQL Server connection because the virtual server name will no longer be valid. After configuring your application to use the actual server name of the previously active node, restart your application and check to see if the symptoms persist. If the symptoms have been alleviated, your problem involves the cluster, rather than the SQL element of your application.

Summary

You have learned some strategies for isolating and solving problems with the SQL Server. How can you prevent such problems from occurring? In analyzing your current problem, you may have encountered some problematic operational processes. With your newfound knowledge of these problematic processes, you would greatly benefit your organization by leading the reform and introducing better operational control. This should be the next step that you take upon resolution of the current problem. By closing the gaps in the operations of your organization, you will help ensure that it will be a long time until the next disruption.