Failure Mode Analysis

04/16/2013

Applies To: System Center Operations Manager 2007

Note

This topic has been updated in the latest version of the System Center Management Pack Authoring Guide on the TechNet Wiki.

Failure Mode Analysis is a process for identifying the different issues that an application may experience and for providing instrumentation within the application that can be used for monitoring. It is primarily used by software developers when designing their application, but the basic concepts may also be valuable to the management pack author designing a management pack. The process is discussed in detail in MP University, a training program for designing and building management packs that is targeted at software developers. It is described briefly here; refer to the training program for more information.

This topic presumes that the management pack author has no control over changes to the application itself, so the health model must be designed according to the information that the application makes available. The quality of a management pack, though, is directly affected by the quality of this available information. If an application has features that cannot be monitored, then the management pack will be unable to provide a complete health measurement. Even if a basic failure can be detected, the management pack may be unable to identify the underlying cause of this failure unless the application makes that information available. If the management pack author can collaborate with software developers, then they may be able to influence modifications to the application itself that allow it to be monitored more effectively.

Basic Concept

The basic concept of failure mode analysis is to analyze what can fail in an application and provide a means for detecting that failure. By implementing such instrumentation within the application, a management pack can accurately measure the health of the application and detect any potential failures. Failure mode analysis resembles threat modeling because it is a proactive effort to identify potential weaknesses in the application. It is an attempt to identify the components of the application that are vulnerable to failure, what those failures may be, and how they might be detected.

Steps in Failure Mode Analysis

List what can go wrong

This list should go beyond software issues and also consider specific implementations and issues that can arise in the environment. It includes each component of the application and each component that the application relies on – such as network connectivity, server resources, and dependent software components. The underlying cause should be considered in this analysis to distinguish a failure from a symptom of that failure. For example, a database may run out of space causing other errors in the application.

After a complete list is established, it should be prioritized according to the following criteria. The resulting order helps decide which scenarios should receive instrumentation first.

Probability of the failure occurring
Impact to overall system health
Potential cost to the customer for a failure
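
The prioritization above can be sketched in code. The following is only an illustration: the 1-to-5 scales and the rule of multiplying the three criteria together are assumptions made for the example, not something this topic prescribes, and the scores given to each failure mode are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One entry in the list of things that can go wrong."""
    name: str
    probability: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int       # assumed scale: 1 (minor) to 5 (severe)
    cost: int         # assumed scale: 1 (low) to 5 (high)

    @property
    def priority(self) -> int:
        # Assumed scoring rule: the product of the three criteria.
        return self.probability * self.impact * self.cost


def prioritize(modes):
    """Return failure modes ordered from highest to lowest priority."""
    return sorted(modes, key=lambda m: m.priority, reverse=True)


# Illustrative scores only; real values come from the analysis itself.
modes = [
    FailureMode("Configuration file missing", probability=1, impact=5, cost=3),
    FailureMode("Disk is full", probability=3, impact=5, cost=4),
    FailureMode("Transaction log full", probability=2, impact=4, cost=4),
]
ranked = prioritize(modes)

for mode in ranked:
    print(f"{mode.priority:3d}  {mode.name}")
```

Any scoring rule that combines the three criteria consistently would serve; the point is that the ranking is explicit and repeatable.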

Identify a detection strategy for each failure mode

Identifying a potential failure is useless if there is no way of detecting it. Each failure mode needs at least one means of detection, and high-impact issues in particular should have multiple means of detection. The detection could be a predictable error in code that generates an event, or it may require additional code that continuously watches for a particular operation.

Add detection elements to application code

Each of the detections has to be added to the code of the application in order to expose this information to a management pack. Some of these elements may be monitors watching for a particular event to occur, or they may be probes periodically performing some test or collecting some information. The result of the detection should be information that can be accessed by a management pack. This might be an event in the Windows event log or a performance counter providing a numeric measurement of the application’s health.
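
A probe of this kind might look like the following minimal sketch, which checks for the "Disk is full" and "Disk utilization above threshold" failure modes. The event IDs, the logger name, and the 90 percent threshold are hypothetical, and Python's standard `logging` module stands in for whatever event log or performance counter the application actually writes to.

```python
import logging
import shutil
from typing import Optional

logger = logging.getLogger("MyApp.DiskProbe")  # hypothetical event source name

# Hypothetical event IDs, one per failure mode so a monitor can tell them apart.
EVENT_DISK_FULL = 1001
EVENT_DISK_ABOVE_THRESHOLD = 1002


def classify_disk_usage(free: int, total: int, threshold: float = 0.90) -> Optional[int]:
    """Return the event ID for the detected failure mode, or None if healthy."""
    if total <= 0:
        raise ValueError("total must be positive")
    if free <= 0:
        return EVENT_DISK_FULL
    used = total - free
    if used >= threshold * total:
        return EVENT_DISK_ABOVE_THRESHOLD
    return None


def probe_disk(path: str = ".", threshold: float = 0.90) -> Optional[int]:
    """Run one probe cycle: measure the disk and emit an event on failure."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    event_id = classify_disk_usage(usage.free, usage.total, threshold)
    if event_id is not None:
        logger.warning(
            "event_id=%d path=%s free=%d total=%d",
            event_id, path, usage.free, usage.total,
        )
    return event_id
```

Keeping the classification separate from the measurement makes the detection logic easy to test without filling a disk, and a scheduler of the application's choosing can call `probe_disk` periodically.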

Plan management pack content

The end result of failure mode analysis is to design the management pack based on the information exposed by the application. This will include defining monitors that use the events and performance counters exposed by the application in order to set appropriate health states on classes defined in the application’s service model.
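
As a conceptual model only, the relationship between exposed events and health states can be sketched as follows. Real monitors are defined in the management pack itself, not in application code; the `EventMonitor` class and the event IDs here are hypothetical and exist purely to show the idea of one event setting an unhealthy state and another resetting it.

```python
from enum import Enum


class HealthState(Enum):
    HEALTHY = "Healthy"
    WARNING = "Warning"
    CRITICAL = "Critical"


class EventMonitor:
    """Tracks the health state of one class instance from observed event IDs."""

    def __init__(self, transitions):
        # transitions maps an event ID to the health state it sets.
        self._transitions = dict(transitions)
        self.state = HealthState.HEALTHY

    def observe(self, event_id: int) -> HealthState:
        # Events the monitor does not know about leave the state unchanged.
        self.state = self._transitions.get(event_id, self.state)
        return self.state


# Hypothetical event IDs for illustration.
monitor = EventMonitor({
    2001: HealthState.CRITICAL,  # e.g. "database offline" reported by the application
    2000: HealthState.HEALTHY,   # e.g. "database connection restored"
})
```

The model only works if the application emits both the failure event and the recovery event, which is one reason the instrumentation and the management pack content are planned together.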

Operational failures

The following table lists a set of operational failures that impact most applications. This list can be referenced in identifying a set of potential failures for a particular application.

Category	Failure Mode
Access Control Lists	Insufficient permissions accessing resource
	Run as account has expired
	Run as account locked out
	Run as account soon to expire
	ACL on configuration resource too permissive
Capacity	Disk is full
	Disk utilization above threshold
	Critical resource starved
	Critical resource above threshold
	Capacity trend predicts coming failure
	Queue reads falling behind queue writes
	Restart under load fails
	Component is hung
Configuration	Configuration file missing
	Configuration file corrupt
	Connection string incorrect
	Critical setting missing or incorrect
Database	Database offline
	Login denied
	Execute permission denied
	Database doesn’t exist
	Timeout on database connection
	Index corrupt
	Table or database corrupt
	Transaction log full
	SPID blockage
	Incorrect results in configuration table
Network	Access to critical network resource denied
	Critical network resource timing out
	Average network latency exceeds tolerance
	Throughput diminished causing backlog
	Packet loss rate on subnet exceeds tolerance
	Resend rate on subnet exceeds tolerance
	Latency exceeds tolerance
Transaction	Transaction rate below minimum threshold
	Transaction response time above threshold
	Transaction rate exceeds engineering limits
	Incorrect response being returned
	Failure response rate exceeds threshold

As an example, consider a failed database connection, which may have any of several base causes listed under Database in the table, such as the database being offline, the login being denied, or the connection timing out. With these base causes identified, the part of the application that performs the database connection should include code to detect each failure and provide a unique event that can be accessed by monitors in the management pack. If the application just reports that the database connection failed, then the management pack cannot identify the underlying cause. Only if the application provides this detailed information can the management pack accurately detect the issue and identify the cause to the user.
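
A minimal sketch of this idea follows. The exception classes stand in for whatever errors the application's real database driver raises, and the event IDs and logger name are likewise hypothetical; the built-in `TimeoutError` is used for the timeout case.

```python
import logging
from enum import IntEnum

logger = logging.getLogger("MyApp.Database")  # hypothetical event source name


class DbEvent(IntEnum):
    """Hypothetical event IDs, one per base cause."""
    DATABASE_OFFLINE = 2001
    LOGIN_DENIED = 2002
    DATABASE_NOT_FOUND = 2003
    CONNECTION_TIMEOUT = 2004
    UNKNOWN_FAILURE = 2099


# Hypothetical exceptions standing in for a real driver's error types.
class DatabaseOfflineError(Exception):
    pass


class LoginDeniedError(Exception):
    pass


class DatabaseNotFoundError(Exception):
    pass


_CAUSES = {
    DatabaseOfflineError: DbEvent.DATABASE_OFFLINE,
    LoginDeniedError: DbEvent.LOGIN_DENIED,
    DatabaseNotFoundError: DbEvent.DATABASE_NOT_FOUND,
    TimeoutError: DbEvent.CONNECTION_TIMEOUT,
}


def classify(exc: Exception) -> DbEvent:
    """Map a connection error to the event that names its base cause."""
    for exc_type, event in _CAUSES.items():
        if isinstance(exc, exc_type):
            return event
    return DbEvent.UNKNOWN_FAILURE


def connect_with_instrumentation(connect):
    """Call connect(); on failure, log a cause-specific event and re-raise."""
    try:
        return connect()
    except Exception as exc:
        event = classify(exc)
        logger.error("event_id=%d cause=%s detail=%s", int(event), event.name, exc)
        raise
```

Because each base cause produces its own event ID, a monitor can report "login denied" rather than a generic "connection failed", and the catch-all ID makes unanticipated failures visible instead of silent.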
