Process 3: Research the Problem

04/27/2008

After you decide to solve the problem, the next process is to research it so that you can find a fix or workaround.

Figure 5. Research the problem

Activities: Research the Problem

For effective and meaningful Problem Management, researching a problem must follow disciplines similar to those of the scientific method. This includes:

Reproducing the problem in a test environment.
Observing the symptoms of the problem and noting your observations.
Performing root cause analysis.
Developing a hypothesis and testing it.
Repeating this process until the root cause has been determined.

At first glance, this may seem overly complicated. However, it is critical that Problem Management identifies the root cause of the problem and determines the exact steps to eliminate it. This process lets you examine one variable at a time. That matters because changing several variables at once can make it impossible to isolate the ones that actually contribute to the problem, which can lead to ineffective or overcomplicated fixes being deployed.

The output of this process is a known error record. However, because it can be difficult to predict when the data required to create the record will become available, ask the following questions during each activity in this process. If the answer to any of them is yes, it is time to create the known error record:

Has any information been discovered that would aid others in resolving incidents or events matching this specific problem?
Has a definitive root cause been identified?
Have any actions been uncovered that would reduce the frequency or impact of the error?
Can a date be projected for when the error will be resolved?
Is there meaningful information available to share about the progress of resolving the error?
Are there actions that Problem Management needs individuals to take to aid in the research efforts?
Has a workaround been discovered?
Has a fix been designed?
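The rule above (create the known error record as soon as any one question is answered yes) can be sketched as a simple check. This is a minimal illustration only: the class name, the field names, and both function names are hypothetical and do not come from any Problem Management tool.

```python
from dataclasses import dataclass, fields


@dataclass
class ResearchFindings:
    """One flag per question in the list above (field names are illustrative)."""
    aids_incident_resolution: bool = False
    root_cause_identified: bool = False
    mitigating_actions_found: bool = False
    resolution_date_projected: bool = False
    progress_to_share: bool = False
    actions_needed_from_others: bool = False
    workaround_discovered: bool = False
    fix_designed: bool = False


def should_create_known_error(findings: ResearchFindings) -> bool:
    """Return True if any one of the questions has been answered yes."""
    return any(getattr(findings, f.name) for f in fields(findings))


def reasons(findings: ResearchFindings) -> list:
    """List the questions answered yes, in the order they are asked."""
    return [f.name for f in fields(findings) if getattr(findings, f.name)]


# Example: a workaround was found while observing symptoms.
findings = ResearchFindings(workaround_discovered=True)
if should_create_known_error(findings):
    print("Create or update the known error record:", reasons(findings))
```

Running the same check at the end of every activity keeps the decision consistent, whichever activity happens to surface the information first.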

The following table lists the activities involved in researching the problem. These include:

Reproducing the problem.
Observing the symptoms of the problem.
Performing root cause analysis.
Developing a hypothesis.
Testing the hypothesis.

Table 6. Activities and Considerations for Researching the Problem

Activities

Considerations

Reproduce the problem

Key questions:

Can the problem be reproduced at will?
What user context or security access is required to reproduce the problem?
Will special lab equipment be required or can this be reproduced on any system?

Input:

Problem record

Outputs:

An environment where the problem can be studied and observed
A new or updated known error record

Best practices:

Production systems should not be used for Problem Management work if at all possible. In scenarios where the problem can only be reproduced in production, extreme care must be taken so that the act of observation does not affect the system. System monitoring and debugging tools can cause drops in performance. In some cases, the service might have to be taken offline to use these tools. Activities like this must be treated as changes and should be passed through the Change Management and Change Control processes. See the Change and Configuration SMF for more information.
It may be tempting to introduce small changes to systems and services, disguising them as “Break-Fix” activities. This should not be allowed. Change and problem processes should work hand-in-hand to drive stability and reliability into production services. Circumventing reviews and approval activities for changes can have negative impacts.
The steps discovered to reproduce the problem should be documented in full detail in the problem record. In the event that others get involved in working on the problem, this ensures that the steps are reproduced exactly the same way each time.

Observe the symptoms of the problem

Key questions:

What are the symptoms of the problem?
How can they be observed?
What tools are required to capture and record the occurrence of the problem?

Inputs:

Problem record
Lessons learned during reproduction

Outputs:

An understanding of the timing, triggers, and results of the problem
New or updated known error record

Perform root cause analysis

Key question:

What technique should be used for performing root cause analysis? To learn about some of the available techniques, see “Root Cause Analysis Techniques” in this guide.

Inputs:

Selected root cause analysis technique
Problem record

Outputs:

Hypothesis to test
New or updated known error record

Develop a hypothesis

Key questions:

What actions might work around this problem?
What actions might fix this problem?
Could this problem be the result of another problem?
Have changes been made to the service or system recently that may have created the problem?

Inputs:

Output from root cause analysis
Problem record

Outputs:

Hypothesis to test
New or updated known error record

Best practice:

Document, document, document! Effort documented today is effort avoided tomorrow. As the Problem Management process is repeated, it can become increasingly efficient by reusing data created during previous efforts. This means that all hypotheses should be documented in the problem record, including the reasoning behind each hypothesis, how to test it, and what the expected results are.

Test the hypothesis

Key questions:

What actions might work around this problem?
What actions might fix this problem?
Could this problem be the result of another problem?
Have changes been made to the service or system recently that may have created the problem?

Inputs:

Hypothesis to test
Problem record

Output:

New or updated known error record

Best practices:

Keep a control system in place to compare results of the testing. This system should remain unmodified during the testing of any hypothesis. This enables the testers to determine if their actions have resolved the problem or if some other uncontrollable factor has been introduced.
Test only one hypothesis at a time, and test each hypothesis one step at a time. Introducing complex modifications can make it difficult to pinpoint the actual workaround or fix.
Document all of the results, both positive and negative.
If circumstances force these activities to take place on production systems, make sure that proper change procedures have been followed and that back-out plans are tested and in place.
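The best practices for developing and testing a hypothesis can be sketched together: record each hypothesis with its reasoning, test procedure, and expected result, then judge the test against an unmodified control system. This is a sketch under stated assumptions. The `Hypothesis` class, the `evaluate` function, and the three outcome labels are hypothetical names chosen for illustration, and the two Boolean inputs stand in for whatever observation method your environment uses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Hypothesis:
    """What the problem record should capture for each hypothesis."""
    statement: str
    reasoning: str
    test_procedure: str
    expected_result: str
    outcome: Optional[str] = None          # set by evaluate()
    notes: list = field(default_factory=list)


def evaluate(hypothesis: Hypothesis, control_has_problem: bool,
             test_has_problem: bool) -> str:
    """Compare the unmodified control system with the modified test system."""
    if not control_has_problem:
        # The control changed on its own, so an uncontrolled factor is in play.
        outcome = "inconclusive"
    elif test_has_problem:
        # The change made no difference; record the negative result too.
        outcome = "rejected"
    else:
        # Only the modified system stopped showing the problem.
        outcome = "confirmed"
    hypothesis.outcome = outcome
    hypothesis.notes.append(
        f"control_has_problem={control_has_problem}, "
        f"test_has_problem={test_has_problem} -> {outcome}"
    )
    return outcome
```

Because every call appends a note, negative and inconclusive results stay in the record alongside the confirmed ones, and testing one hypothesis per call keeps each result attributable to a single change.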

Root Cause Analysis Techniques

A difficult area of Problem Management for most organizations is analyzing the root cause of a problem. Root cause analysis techniques are used to identify the conditions that initiate an undesired activity or state. Since problems are best solved by attempting to correct or eliminate their root causes, this is a critical part of resolving any problem.

There are many techniques available for performing root cause analysis. Two of the most popular are:

Fishbone diagrams.
Fault tree analysis.

Fishbone Diagrams

Visual techniques are often used to help IT professionals determine the root cause of a problem. One useful tool for diagramming the analysis is the Ishikawa, or fishbone, diagram. The following figure illustrates a fishbone diagram.

Figure 6. Fishbone diagram
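When a drawing tool is not at hand, the same structure can be kept as plain data: the effect under study at the head, with candidate causes grouped under category "bones". The sketch below is illustrative only. The function name `render_fishbone` and the categories and causes in the example are assumptions, not content taken from the figure.

```python
def render_fishbone(effect: str, causes_by_category: dict) -> str:
    """Render a fishbone as an indented text outline, one category per bone."""
    lines = [f"Effect: {effect}"]
    for category, causes in causes_by_category.items():
        lines.append(f"  {category}")
        for cause in causes:
            lines.append(f"    - {cause}")
    return "\n".join(lines)


# Hypothetical example; replace the categories and causes with your own.
outline = render_fishbone(
    "Nightly backup fails",
    {
        "People": ["Operator skipped tape rotation"],
        "Process": ["No check of free space before the job starts"],
        "Technology": ["Backup agent out of date", "Disk nearly full"],
    },
)
print(outline)
```

Keeping the diagram as data makes it easy to attach to the problem record and to extend as new candidate causes emerge.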

Fault Tree Analysis

Fault tree analysis is another visual technique used to assist with root cause analysis. It is a top-down approach to identifying all potential causes leading to a defect. In the final stage of diagnosis, the root cause is identified and the problem is moved from an unknown state to a known state. The following figure shows an example of fault tree analysis.

Figure 7. Example of fault tree analysis
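A fault tree can also be represented in code, which makes the top-down walk explicit. The sketch below assumes the common convention of AND and OR gates: an OR event occurs if any child occurs, and an AND event occurs only if all children occur. The `Event` class, both function names, and the example events are hypothetical and are not taken from the figure.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    gate: str = "OR"   # "OR": any child suffices; "AND": all children required
    children: list = field(default_factory=list)


def basic_events(event: Event) -> list:
    """Walk the tree top-down and collect the leaf events (candidate root causes)."""
    if not event.children:
        return [event.name]
    found = []
    for child in event.children:
        found.extend(basic_events(child))
    return found


def occurs(event: Event, observed: set) -> bool:
    """Return True if the event would occur given the observed basic events."""
    if not event.children:
        return event.name in observed
    results = [occurs(child, observed) for child in event.children]
    return all(results) if event.gate == "AND" else any(results)


# Hypothetical tree: the service fails if the web server is down, or if the
# primary database fails while failover is not configured.
tree = Event("Service unavailable", "OR", [
    Event("Web server down"),
    Event("Database unreachable", "AND", [
        Event("Primary database failed"),
        Event("Failover not configured"),
    ]),
])
print(basic_events(tree))
print(occurs(tree, {"Primary database failed"}))
```

Evaluating the tree against the symptoms actually observed helps rule out branches, which narrows the hypotheses worth testing.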

This accelerator is part of a larger series of tools and guidance from Solution Accelerators.
