Failure Mode Analysis

Applies To: System Center Operations Manager 2007

Failure Mode Analysis is a process for identifying the different issues that an application may experience and providing instrumentation within the application that can be used for monitoring. It is primarily used by software developers when designing their application, but the basic concepts may also be valuable to an author designing a management pack. The process is discussed in detail in MP University, a training program on designing and building management packs that is targeted at software developers. It is only briefly described here; refer to the training program for more information.

This topic presumes that the management pack author has no control over changes to the application itself, and that the health model must be designed according to the information that the application makes available. The quality of a management pack, though, is directly affected by the quality of this available information. If an application has features that cannot be monitored, then the management pack will be unable to provide a complete health measurement. Even if a basic failure can be detected, the management pack may be unable to identify the underlying cause of this failure unless the application makes that information available. If the management pack author can collaborate with software developers, then the author may be able to influence modifications to the application itself that allow it to be monitored more effectively.

Basic Concept

The basic concept of failure mode analysis is to analyze what can fail in an application and provide a means for detecting that failure. By implementing such instrumentation within the application, a management pack can accurately measure health of the application and detect any potential failures. Failure mode analysis resembles threat modeling because it is a proactive effort to identify potential weaknesses in the application. It is an attempt to identify the components of the application that are vulnerable to failure, what those failures may be, and how they might be detected.

Steps in Failure Mode Analysis

List what can go wrong

This list should go beyond software issues and also consider specific implementations and issues that can arise in the environment. It includes each component of the application and each component that the application relies on – such as network connectivity, server resources, and dependent software components. The underlying cause should be considered in this analysis to distinguish a failure from a symptom of that failure. For example, a database may run out of space, causing other errors in the application. The full database is the failure to list; the resulting errors are only its symptoms.

As soon as a complete list is established, it should be prioritized according to the following criteria. This will assist in prioritizing instrumentation for the different scenarios.

  • Probability of the failure occurring

  • Impact to overall system health

  • Potential cost to the customer for a failure
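The prioritization above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1 to 5 rating scale, the multiplicative score, and the sample ratings are hypothetical choices, not part of the failure mode analysis process itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One entry in the list of things that can go wrong."""
    name: str
    probability: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int       # assumed scale: 1 (minor) to 5 (system unavailable)
    cost: int         # assumed scale: 1 (negligible) to 5 (severe)

    @property
    def priority(self) -> int:
        # A simple product of the three criteria; the weighting is an assumption.
        return self.probability * self.impact * self.cost


def prioritize(modes):
    """Return the failure modes ordered from highest to lowest priority."""
    return sorted(modes, key=lambda mode: mode.priority, reverse=True)


# Sample ratings are made up for illustration.
modes = [
    FailureMode("Disk is full", probability=3, impact=5, cost=4),
    FailureMode("Configuration file missing", probability=1, impact=5, cost=3),
    FailureMode("Average network latency exceeds tolerance", probability=4, impact=2, cost=2),
]
ranked = prioritize(modes)
for mode in ranked:
    print(mode.priority, mode.name)
```

Any scoring scheme that orders the list consistently would serve the same purpose; the point is that instrumentation work starts with the entries at the top.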

Identify a detection strategy for each failure mode

Identifying a potential failure is useless if there is no way of detecting it. Each failure mode needs at least one means of detection, and issues with especially high impact should have multiple means of detection. The detection could be a predictable error in code that generates an event, or it may require additional code that continuously watches a particular operation.
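One way to check this rule during design is to record the planned detections for each failure mode and look for gaps. The following Python sketch assumes a hypothetical planning structure (a dictionary of failure mode names to detection descriptions) and assumes that "multiple" means at least two; neither is prescribed by the process.

```python
def find_detection_gaps(detections, high_impact):
    """Return failure modes that lack enough planned detections.

    detections:  dict mapping failure mode name -> list of detection descriptions
    high_impact: set of failure mode names that should have at least two detections
    Returns a dict mapping failure mode name -> number of detections still needed.
    """
    gaps = {}
    for mode, methods in detections.items():
        required = 2 if mode in high_impact else 1  # assumed minimums
        if len(methods) < required:
            gaps[mode] = required - len(methods)
    return gaps


# Hypothetical plan for three failure modes.
planned = {
    "Database offline": ["connection error event", "periodic connection probe"],
    "Transaction log full": ["error event on write"],
    "Configuration file corrupt": [],
}
gaps = find_detection_gaps(planned, high_impact={"Database offline", "Transaction log full"})
print(gaps)
```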

Add detection elements to application code

Each of the detections has to be added to the code of the application in order to expose this information to a management pack. Some of these elements may be monitors watching for a particular event to occur, or they may be probes periodically performing some test or collecting some information. The result of the detection should be information that can be accessed by a management pack. This might be an event in the Windows event log or a performance counter providing a numeric measurement of the application’s health.
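The two kinds of detection element can be sketched as follows. This is a minimal illustration, not a definitive implementation: the application name, the event IDs, and the 90 percent threshold are hypothetical, and Python's standard logging module stands in for whatever channel the application actually uses, such as the Windows event log.

```python
import logging
import shutil

logger = logging.getLogger("ExampleApp")  # hypothetical application name

EVENT_DISK_ABOVE_THRESHOLD = 1001  # hypothetical event ID
EVENT_CONFIG_MISSING = 1002        # hypothetical event ID


def disk_utilization_percent(path="."):
    """Measure how full the disk holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def probe_disk(path=".", threshold=90.0, measure=disk_utilization_percent):
    """Probe: intended to be run periodically to test a condition.

    Returns the event ID that was reported, or None if the disk is within limits.
    """
    percent = measure(path)
    if percent > threshold:
        logger.warning("event_id=%d disk utilization %.1f%% on %s",
                       EVENT_DISK_ABOVE_THRESHOLD, percent, path)
        return EVENT_DISK_ABOVE_THRESHOLD
    return None


def load_config(path):
    """Event-style detection: report a specific event when a predictable error occurs."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError:
        logger.error("event_id=%d configuration file missing: %s",
                     EVENT_CONFIG_MISSING, path)
        raise  # the caller still sees the original error
```

The `measure` parameter exists only so that the probe can be exercised with a fixed value, for example `probe_disk(measure=lambda path: 95.0)`.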

Plan management pack content

The end result of failure mode analysis is to design the management pack based on the information exposed by the application. This will include defining monitors that use the events and performance counters exposed by the application in order to set appropriate health states on classes defined in the application’s service model.
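As a planning aid, the relationship between events, classes, and health states can be modeled before any management pack is authored. The Python sketch below is purely conceptual and is not management pack syntax: the class names, event IDs, and the worst-state rollup are assumptions chosen for illustration.

```python
from enum import IntEnum


class Health(IntEnum):
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


# Hypothetical plan: event ID -> (class in the service model, resulting health state).
MONITOR_PLAN = {
    2001: ("ExampleApp.Database", Health.CRITICAL),  # database offline
    2002: ("ExampleApp.Database", Health.HEALTHY),   # database connection restored
    1001: ("ExampleApp.Server", Health.WARNING),     # disk utilization above threshold
}


def apply_events(event_ids, plan=MONITOR_PLAN):
    """Replay events in order; the latest planned event for a class sets its state."""
    states = {}
    for event_id in event_ids:
        if event_id in plan:
            target, state = plan[event_id]
            states[target] = state
    return states


def rollup(states):
    """Overall health is the worst state of any class (an assumed rollup policy)."""
    return max(states.values(), default=Health.HEALTHY)


print(rollup(apply_events([2001, 1001])).name)
print(rollup(apply_events([2001, 2002, 1001])).name)
```

Pairing each failure event with a recovery event, as with the two database events here, gives a monitor a way to return to a healthy state once the condition clears.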

Operational failures

The following table lists a set of operational failures that impact most applications. This list can be referenced in identifying a set of potential failures for a particular application.

Category                Failure Mode

Access Control Lists    Insufficient permissions accessing resource
                        Run as account has expired
                        Run as account locked out
                        Run as account soon to expire
                        ACL on configuration resource too permissive

Capacity                Disk is full
                        Disk utilization above threshold
                        Critical resource starved
                        Critical resource above threshold
                        Capacity trend predicts coming failure
                        Queue reads falling behind queue writes
                        Restart under load fails
                        Component is hung

Configuration           Configuration file missing
                        Configuration file corrupt
                        Connection string incorrect
                        Critical setting missing or incorrect

Database                Database offline
                        Login denied
                        Execute permission denied
                        Database doesn’t exist
                        Timeout on database connection
                        Index corrupt
                        Table or database corrupt
                        Transaction log full
                        SPID blockage
                        Incorrect results in configuration table

Network                 Access to critical network resource denied
                        Critical network resource timing out
                        Average network latency exceeds tolerance
                        Throughput diminished causing backlog
                        Packet loss rate on subnet exceeds tolerance
                        Resend rate on subnet exceeds tolerance
                        Latency exceeds tolerance

Transaction             Transaction rate below minimum threshold
                        Transaction response time above threshold
                        Transaction rate exceeds engineering limits
                        Incorrect response being returned
                        Failure response rate exceeds threshold

As an example, consider an application that fails to connect to its database. The Database category above lists several possible base causes for this failure, such as the database being offline, a denied login, or a timeout on the connection. With these base causes identified, the part of the application that performs the database connection should include code to detect each failure and provide a unique event that can be accessed by monitors in the management pack. If the application just reports that the database connection failed, then the management pack cannot identify the underlying cause. Only if the application provides this detailed information can the management pack accurately detect the issue and identify the cause to the user.
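A minimal Python sketch of this idea follows. It assumes hypothetical exception types for three of the database failure modes, hypothetical event IDs, and the standard logging module as a stand-in for the real event channel; a real application would map the errors raised by its own database driver instead.

```python
import logging

logger = logging.getLogger("ExampleApp.Database")  # hypothetical source name


# Hypothetical exception types, one per base cause of a failed connection.
class DatabaseOffline(Exception):
    pass


class LoginDenied(Exception):
    pass


class ConnectionTimedOut(Exception):
    pass


# Hypothetical event IDs; each base cause gets its own ID.
EVENT_IDS = {
    DatabaseOffline: 2001,
    LoginDenied: 2003,
    ConnectionTimedOut: 2005,
}
EVENT_UNKNOWN_FAILURE = 2099  # fallback when the cause cannot be identified


def event_id_for(error):
    """Map a connection error to the unique event ID for its base cause."""
    return EVENT_IDS.get(type(error), EVENT_UNKNOWN_FAILURE)


def connect_with_detection(connect):
    """Call `connect` and report a cause-specific event if it fails."""
    try:
        return connect()
    except Exception as error:
        logger.error("event_id=%d database connection failed: %s",
                     event_id_for(error), error)
        raise  # the application still handles the failure as before
```

Because each cause produces a distinct event ID, a monitor can target the specific failure rather than a generic "connection failed" message.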