Failure Mode Analysis
Applies To: System Center Operations Manager 2007
Note
This topic has been updated in the latest version of the System Center Management Pack Authoring Guide on the TechNet Wiki.
Failure Mode Analysis is a process for identifying the different issues that an application may experience and providing instrumentation within the application that can be used for monitoring. It is primarily used by software developers when designing their application, but the basic concepts may also be valuable to the management pack author designing a management pack. The process is discussed in detail in MP University, which is a training program on designing and building management packs that is targeted at software developers. It is briefly described here, and the training program should be referenced for more information.
This topic presumes that the management pack author has no control over changes to the application itself, and that the health model must be designed according to the information that the application makes available. The quality of a management pack, though, is directly affected by the quality of this available information. If an application has features that cannot be monitored, then the management pack will be unable to provide a complete health measurement. Even if a basic failure can be detected, the management pack may be unable to identify the underlying cause of this failure unless the application makes that information available. If the management pack author can collaborate with software developers, then they may be able to influence modifications to the application itself that allow it to be monitored more effectively.
The basic concept of failure mode analysis is to analyze what can fail in an application and provide a means for detecting that failure. By implementing such instrumentation within the application, a management pack can accurately measure the health of the application and detect any potential failures. Failure mode analysis resembles threat modeling because it is a proactive effort to identify potential weaknesses in the application. It is an attempt to identify the components of the application that are vulnerable to failure, what those failures may be, and how they might be detected.
The list of potential failures should go beyond software issues and also consider specific implementations and issues that can arise in the environment. It includes each component of the application and each component that the application relies on, such as network connectivity, server resources, and dependent software components. The underlying cause should be considered in this analysis to distinguish a failure from a symptom of that failure. For example, a database may run out of space, causing other errors in the application. The full database is the failure, and the other errors are its symptoms.
As soon as a complete list is established, it should be prioritized according to the following criteria. This will assist in prioritizing instrumentation for the different scenarios.
Probability of the failure occurring
Impact to overall system health
Potential cost to the customer for a failure
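The prioritization step can be sketched in a few lines of Python. This is a minimal illustration, not part of the documented process: the 1 to 5 rating scale, the `FailureMode` class, and the choice to multiply the three ratings into a single score are all assumptions made for the example, and the ratings shown are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureMode:
    """One entry in the failure list (hypothetical structure)."""
    name: str
    probability: int  # 1 (rare) to 5 (frequent); assumed scale
    impact: int       # 1 (minor) to 5 (system unusable); assumed scale
    cost: int         # 1 (negligible) to 5 (severe customer cost); assumed scale

    @property
    def priority(self) -> int:
        # Assumed scoring rule: the product of the three criteria.
        return self.probability * self.impact * self.cost


def prioritize(modes: List[FailureMode]) -> List[FailureMode]:
    """Return the failure modes ordered from highest to lowest priority."""
    return sorted(modes, key=lambda mode: mode.priority, reverse=True)


# Illustrative ratings only; real values come from the analysis itself.
modes = [
    FailureMode("Configuration file missing", probability=1, impact=5, cost=3),
    FailureMode("Disk is full", probability=3, impact=5, cost=4),
    FailureMode("Average network latency exceeds tolerance", probability=4, impact=2, cost=2),
]

ranked = prioritize(modes)
for mode in ranked:
    print(f"{mode.priority:3d}  {mode.name}")
```

A weighted sum or a simple high/medium/low ranking would serve equally well; the point is only that each failure mode receives a comparable score so that instrumentation work can be ordered.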
Identifying a potential failure is useless if there is no way of detecting it. Each failure mode needs at least one means of detection, and high-impact issues in particular should have multiple means of detection. The detection could be a predictable error in code that generates an event, or it may need additional code that continuously watches for a particular operation.
Each of the detections has to be added to the code of the application in order to expose this information to a management pack. Some of these elements may be monitors watching for a particular event to occur, or they may be probes periodically performing some test or collecting some information. The result of the detection should be information that can be accessed by a management pack. This might be an event in the Windows event log or a performance counter providing a numeric measurement of the application’s health.
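As a rough sketch of what a probe might look like, the following Python checks disk utilization and produces an event record when a failure mode is detected. The `DetectionEvent` structure, the event IDs, and the 90 percent threshold are hypothetical; a real application would write the result to the Windows event log or publish it as a performance counter through whatever mechanism its platform provides.

```python
import shutil
from dataclasses import dataclass
from typing import Optional

# Hypothetical event IDs; a real application would reserve its own range.
DISK_FULL_EVENT_ID = 1001
DISK_ABOVE_THRESHOLD_EVENT_ID = 1002


@dataclass
class DetectionEvent:
    """Information a probe exposes for a management pack to consume."""
    event_id: int
    severity: str
    message: str


def evaluate_disk_utilization(used_fraction: float,
                              threshold: float = 0.90) -> Optional[DetectionEvent]:
    """Map a utilization measurement to a detection event, or None if healthy."""
    if used_fraction >= 1.0:
        return DetectionEvent(DISK_FULL_EVENT_ID, "Error", "Disk is full")
    if used_fraction >= threshold:
        return DetectionEvent(
            DISK_ABOVE_THRESHOLD_EVENT_ID,
            "Warning",
            f"Disk utilization {used_fraction:.0%} is above threshold {threshold:.0%}",
        )
    return None


def probe_disk(path: str = ".", threshold: float = 0.90) -> Optional[DetectionEvent]:
    """Probe: measure the volume holding `path` and evaluate the result."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Nothing meaningful to measure on this volume.
        return None
    return evaluate_disk_utilization(usage.used / usage.total, threshold)


if __name__ == "__main__":
    # A scheduler or background thread would call this periodically.
    print(probe_disk("."))
```

Separating the measurement (`probe_disk`) from the evaluation (`evaluate_disk_utilization`) keeps the threshold logic testable without touching a real disk.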
The end result of failure mode analysis is to design the management pack based on the information exposed by the application. This will include defining monitors that use the events and performance counters exposed by the application in order to set appropriate health states on classes defined in the application’s service model.
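The logic of such a monitor can be illustrated with a short Python sketch. In Operations Manager the monitor is defined in the management pack rather than in application code, so this is only a model of the decision it makes: the `HealthState` names, the function, and the threshold values are assumptions chosen for the example.

```python
from enum import Enum


class HealthState(Enum):
    """Illustrative health states for a monitored class."""
    HEALTHY = "Healthy"
    WARNING = "Warning"
    CRITICAL = "Critical"


def health_from_counter(value: float,
                        warning_threshold: float,
                        critical_threshold: float) -> HealthState:
    """Three-state threshold check on a performance counter value.

    Assumes that higher values are worse, as with response time or
    queue length.
    """
    if value >= critical_threshold:
        return HealthState.CRITICAL
    if value >= warning_threshold:
        return HealthState.WARNING
    return HealthState.HEALTHY


# Example: transaction response time in milliseconds (invented thresholds).
state = health_from_counter(750.0, warning_threshold=500.0, critical_threshold=2000.0)
print(state.value)
```

The monitor can only be as precise as the counter the application exposes, which is why the instrumentation decisions made during failure mode analysis drive the design of the health model.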
The following table lists a set of operational failures that impact most applications. This list can be referenced in identifying a set of potential failures for a particular application.
| Category | Failure Mode |
|---|---|
| Access Control Lists | Insufficient permissions accessing resource |
| | Run as account has expired |
| | Run as account locked out |
| | Run as account soon to expire |
| | ACL on configuration resource too permissive |
| Capacity | Disk is full |
| | Disk utilization above threshold |
| | Critical resource starved |
| | Critical resource above threshold |
| | Capacity trend predicts coming failure |
| | Queue reads falling behind queue writes |
| | Restart under load fails |
| | Component is hung |
| Configuration | Configuration file missing |
| | Configuration file corrupt |
| | Connection string incorrect |
| | Critical setting missing or incorrect |
| Database | Database offline |
| | Login denied |
| | Execute permission denied |
| | Database doesn’t exist |
| | Timeout on database connection |
| | Index corrupt |
| | Table or database corrupt |
| | Transaction log full |
| | SPID blockage |
| | Incorrect results in configuration table |
| Network | Access to critical network resource denied |
| | Critical network resource timing out |
| | Average network latency exceeds tolerance |
| | Throughput diminished causing backlog |
| | Packet loss rate on subnet exceeds tolerance |
| | Resend rate on subnet exceeds tolerance |
| | Latency exceeds tolerance |
| Transaction | Transaction rate below minimum threshold |
| | Transaction response time above threshold |
| | Transaction rate exceeds engineering limits |
| | Incorrect response being returned |
| | Failure response rate exceeds threshold |
As an example, consider a failed database connection, which could result from several of the failure modes listed under Database, such as the database being offline, a denied login, or a connection timeout. With these base causes for the failure identified, the part of the application that performs this database connection should include code to detect each failure and provide a unique event that may be accessed by monitors in the management pack. If the application just reports that the database connection failed, then the management pack cannot identify the underlying cause. Only if the application provides this detailed information can the management pack accurately detect the issue and identify the cause to the user.
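The difference between a generic failure report and a unique event per cause can be sketched as follows. Everything named here is hypothetical: the exception classes stand in for whatever errors the application's database driver actually raises, the event IDs are invented, and Python's standard `logging` module stands in for the Windows event log.

```python
import logging
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


# Hypothetical exception types, one per underlying cause.
class DatabaseOfflineError(Exception): pass
class LoginDeniedError(Exception): pass
class DatabaseNotFoundError(Exception): pass
class DatabaseConnectionTimeoutError(Exception): pass


# Hypothetical event IDs: one unique ID per cause, plus a generic fallback.
GENERIC_CONNECTION_FAILURE_EVENT_ID = 2000
EVENT_IDS = {
    DatabaseOfflineError: 2001,
    LoginDeniedError: 2002,
    DatabaseNotFoundError: 2003,
    DatabaseConnectionTimeoutError: 2004,
}


def event_for_failure(exc: Exception) -> Tuple[int, str]:
    """Translate a connection failure into a cause-specific event."""
    for exc_type, event_id in EVENT_IDS.items():
        if isinstance(exc, exc_type):
            return event_id, f"{exc_type.__name__}: {exc}"
    return GENERIC_CONNECTION_FAILURE_EVENT_ID, f"Database connection failed: {exc}"


def connect_with_instrumentation(connect: Callable[[], T],
                                 logger: logging.Logger = logging.getLogger("app.db")) -> T:
    """Run `connect`, emitting a unique event for each failure cause."""
    try:
        return connect()
    except Exception as exc:
        event_id, message = event_for_failure(exc)
        # Stand-in for writing an event that a monitor can match on its ID.
        logger.error("EventID=%d %s", event_id, message)
        raise
```

With this in place, a monitor can match on event 2002 and report a denied login specifically, whereas an application that only ever emitted event 2000 would leave the operator to find the cause manually.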