Event Monitors

Article
04/16/2013

Applies To: System Center Operations Manager 2007

Note

This topic has been updated in the latest version of the System Center Management Pack Authoring Guide on the TechNet Wiki.

Event monitors use one of the event data sources to identify a particular event that indicates an issue. After the data source that holds the required information is identified, the logic used to determine the different health states must be defined. In addition to the logic that indicates whether an error condition has occurred, logic must be defined to determine when the state should change back to healthy.

Detection Logic

The different kinds of logic that can be used to detect an error condition by using events are listed in the following table. As noted in the table, some logic can only be used with Windows events.

Logic	Data Sources	Description
Simple Event	All	Detects an error state from the occurrence of a single event.
Repeated Events	All	Detects an error state from one or more occurrences of a particular event in a specified time window.
Correlated Events	Windows Events	Detects an error state from the occurrence of two events in a specified time window.
Correlated Missing Events	Windows Events	Detects an error state from an expected event not being detected in a particular time window after the occurrence of another event.
Missing Event	Windows Events	Detects an error state from an expected event not being detected in a particular time window.

Simple Event

Simple detection refers to a state change being triggered immediately after a single occurrence of the specified event. This is the most basic kind of detection and will apply to most scenarios.

Repeated Events

Repeated event detection uses one or more occurrences of a particular event in a time window to indicate an error condition. This typically applies to conditions in an application where a single event on its own can be ignored, but multiple occurrences of that event in a particular time window indicate a potential error. There are different algorithms that can be used for this detection, depending on the logic that best identifies the specific application issue. The following are details of the different algorithms:

Trigger on Timer

Trigger on timer consolidation of events uses a specified time window and is not dependent on the number of events received. A single event can trigger an error in the health state, as in simple detection. Unlike simple detection, however, which sets the health state immediately upon detection of the specified event, trigger on timer consolidation waits until the specified time window elapses to set the health state of the monitor. The time window can be a rotating time duration of specified length or a specific window based on day of the week.

Trigger on timer consolidation is useful for errors that should only be detected in a certain time window. Used with a time window based on a specific time of day, this disables the monitor outside that time period. It can also have the effect of delaying the change of state for a particular time during which an event that indicates a healthy state could be received. In this case, the health state would never be changed.

Trigger on Count

Trigger on count consolidation of events lets a monitor require multiple occurrences of the same event in a specified time window before it changes the health state to an error. The time window can be a rotating time duration of specified length or a specific window based on day of the week.

Trigger on count consolidation resembles trigger on timer consolidation except that multiple occurrences of the event are required instead of just one. When the time window expires, the event count is returned to zero, and the specified number of events must be detected before the time window expires again for the health state to be changed.

Trigger on Count, Sliding

Trigger on count, sliding consolidation of events is similar to trigger on count consolidation except that the time window is reset every time that the specified event is received. The time window only expires if the time is reached after the occurrence of the last event.

Trigger on count, sliding consolidation is useful for error conditions that are detected by a certain number of events in a particular length of time. With trigger on count consolidation, some events could be received in one time window and the remaining events in the next time window, with the result that the health state is never changed. With trigger on count, sliding consolidation, the time window depends on when the events occur, which prevents this condition.

Repeated Events Example

To help with understanding the different algorithms used for repeated event detection, the following table shows the effect on health state for monitors based on the different kinds of consolidation. This is based on a repeated event monitor that uses the following details:

Consolidation interval: 2 minutes
Compare count: 3 (ignored by Trigger on Timer)
Health state on repeated event: Critical
Reset Logic: Event reset using Event 3

Time	Event	Trigger on Timer	Trigger on Count	Trigger on Count, Sliding
00:00:00	-	Healthy	Healthy	Healthy
00:01:00	Event 1	Healthy	Healthy	Healthy
00:02:00	-	Healthy	Healthy	Healthy
00:02:30	-	Healthy	Healthy	Healthy
00:03:00	-	Critical	Healthy	Healthy
00:03:30	Event 3	Healthy	Healthy	Healthy
00:04:00	Event 1	Healthy	Healthy	Healthy
00:04:30	-	Healthy	Healthy	Healthy
00:05:00	Event 1	Critical	Healthy	Healthy
00:05:30	-	Critical	Healthy	Healthy
00:06:00	-	Critical	Healthy	Healthy
00:06:30	Event 1	Critical	Healthy	Healthy
00:07:00	Event 1	Critical	Healthy	Critical
00:07:30	-	Critical	Healthy	Critical
00:08:00	Event 1	Critical	Healthy	Critical
00:08:30	-	Critical	Critical	Critical
00:09:00	Event 3	Healthy	Healthy	Healthy

Using trigger on timer, a critical state is set at 00:03:00 even though the event is received at 00:01:00 because the time window starts when the monitor is loaded. The state is reset to healthy at 00:03:30, but the critical state is again triggered at 00:05:00 from the time window started at 00:03:00.
Using trigger on count, the event at 00:05:00 does not trigger a critical state because the time window started by the event at 00:01:00 would have expired at 00:03:00. This event is instead part of the time window started by the event at 00:04:00, which expires at 00:06:00. The monitor triggers a critical state at 00:08:30 because of the 3 events detected in the time window started with the event at 00:06:30.
Using trigger on count, sliding, each occurrence of Event 1 starts its own window. The critical state is triggered at 00:07:00 from the 3 events detected in the time window started with the event at 00:05:00.
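
The difference between the two count-based algorithms can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Operations Manager code: the function names are hypothetical, times are seconds counted from 00:00:00, reset events are ignored, and the sliding variant follows the reading used in the example, where the required events must all fall within one interval of each other.

```python
from collections import deque


def trigger_on_count(event_times, interval, compare_count):
    """Return the start time of every fixed window that collected enough events.

    Assumption: a window opens with the first event seen after the previous
    window expired, and the count restarts whenever a new window opens.
    """
    qualifying_windows = []
    window_start = None
    count = 0
    for t in sorted(event_times):
        if window_start is None or t >= window_start + interval:
            window_start = t  # previous window expired, open a new one
            count = 0
        count += 1
        if count == compare_count:
            qualifying_windows.append(window_start)
    return qualifying_windows


def trigger_on_count_sliding(event_times, interval, compare_count):
    """Return every event time at which enough events lie within one interval."""
    recent = deque()
    trigger_times = []
    for t in sorted(event_times):
        recent.append(t)
        # Drop events that are more than one interval older than this event.
        while t - recent[0] > interval:
            recent.popleft()
        if len(recent) >= compare_count:
            trigger_times.append(t)
    return trigger_times


# Event 1 times from the table: 00:01:00, 00:04:00, 00:05:00,
# 00:06:30, 00:07:00 and 00:08:00.
event_1_times = [60, 240, 300, 390, 420, 480]
print(trigger_on_count(event_1_times, 120, 3))          # [390]
print(trigger_on_count_sliding(event_1_times, 120, 3))  # [420, 480]
```

With these inputs, the fixed-window function reports the window opened at 00:06:30 (390 seconds), and the sliding function first fires at 00:07:00 (420 seconds). The table shows the trigger on count monitor turning critical only at 00:08:30, when that window closes; the sketch identifies the qualifying window but does not model that timing.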

Correlated Events

A correlated event monitor uses two separate events in a particular time period to detect a single issue. This kind of monitor supports conditions where an issue cannot be identified by a single event alone.

When the first event is detected, a timer is triggered. If the second event is received within that period, the state change is triggered. If the second event is not received in the period, the timer is reset until the first event is received again. The monitor may be configured to better tune the specific conditions that must be met in order to perform correlation. These options include the following:

Whether the events must be in chronological order. One of the events may always be expected before the other one, or they may be expected in either order.
Whether the first or last occurrence of the first event should be used. If the first occurrence is specified, then each occurrence of the first event will have its own time window and search for corresponding occurrences of the second event. With the last occurrence specified, if the first event reoccurs within the time window, then the time window is extended based on the last event. The monitor can also be configured to reset the time window every time that the first event occurs. When the time window is reset, all previous occurrences of both events are ignored.
The number of occurrences of the second event that must be received to trigger the state change. Instead of changing the health state after receiving a single instance of the two events, multiple instances of the second event may be required.
Properties of the first and second event that must match for correlation to be performed. Instead of simply detecting an occurrence of each event, additional comparison may be required to determine whether the events are related. The monitor can, for example, confirm that a particular parameter has the same value in both events to make sure that they are related.

Correlated Events Example

The following table provides an example of a correlated event monitor by using the first and the last occurrence of the first event. The monitor uses the following details:

Event Log A: Event 1
Event Log B: Event 2
Correlation interval: 2 minutes
Number of occurrences of Event 2: 3
Health state on correlation: Critical
Reset Logic: Event reset using Event 3

Time	Event	First Occurrence	Last Occurrence
00:00:00	-	Healthy	Healthy
00:01:00	Event 1	Healthy	Healthy
00:01:30	Event 2	Healthy	Healthy
00:02:00	Event 2	Healthy	Healthy
00:02:30	-	Healthy	Healthy
00:03:00	Event 1	Healthy	Healthy
00:03:30	Event 2	Healthy	Healthy
00:04:00	Event 2	Healthy	Healthy
00:04:30	Event 1	Healthy	Healthy
00:05:00	Event 2	Critical	Healthy
00:05:30	Event 3	Healthy	Healthy
00:06:00	Event 1	Healthy	Healthy
00:06:30	Event 2	Healthy	Healthy
00:07:00	Event 1	Healthy	Healthy
00:07:30	Event 2	Healthy	Healthy
00:08:00	Event 2	Critical	Healthy
00:08:30	Event 2	Critical	Critical
00:09:00	Event 3	Healthy	Healthy

The First Occurrence does not trigger a critical state when Event 2 is detected at 00:03:30 because the timer was reset at 00:03:00, which is 2 minutes after the first occurrence of Event 1 at 00:01:00.
The First Occurrence triggers a critical state at 00:05:00 because Event 2 is detected 3 times within the 2 minutes since the first occurrence of Event 1 at 00:03:00. Event 1 starts a new time window at 00:03:00 because the time window from Event 1 at 00:01:00 would have expired.
The First Occurrence triggers a critical state at 00:08:00 because Event 2 is detected 3 times within 2 minutes from Event 1 at 00:06:00.
The First Occurrence resets its state to healthy at 00:05:30 and 00:09:00 because Event 3 is detected.
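
The two correlation modes in this example can be approximated with the following Python sketch. It is one interpretation of the behavior described above rather than the actual monitor implementation: the function names are hypothetical, times are seconds counted from 00:00:00, events are assumed to arrive in chronological order (Event 1 before Event 2), and Event 3 resets are left out.

```python
def correlate_first_occurrence(first_times, second_times, interval, required):
    """Each Event 1 opens its own window; fire when a window holds enough Event 2s."""
    firsts = sorted(first_times)
    seconds = sorted(second_times)
    triggers = []
    for t in seconds:
        for s in firsts:
            if s <= t <= s + interval:
                # Count the Event 2 occurrences in this window up to time t.
                matched = [x for x in seconds if s <= x <= t]
                if len(matched) == required:
                    triggers.append(t)
                    break
    return triggers


def correlate_last_occurrence(first_times, second_times, interval, required):
    """Only the most recent Event 1 counts; each new Event 1 restarts the window."""
    firsts = sorted(first_times)
    seconds = sorted(second_times)
    triggers = []
    for t in seconds:
        earlier = [s for s in firsts if s <= t]
        if not earlier:
            continue
        s = earlier[-1]  # latest Event 1 at or before this Event 2
        if t <= s + interval:
            matched = [x for x in seconds if s <= x <= t]
            if len(matched) == required:
                triggers.append(t)
    return triggers


# Event times taken from the table above.
event_1 = [60, 180, 270, 360, 420]
event_2 = [90, 120, 210, 240, 300, 390, 450, 480, 510]
print(correlate_first_occurrence(event_1, event_2, 120, 3))  # [300, 480, 510]
print(correlate_last_occurrence(event_1, event_2, 120, 3))   # [510]
```

Under these assumptions, the first-occurrence mode fires at 00:05:00 and 00:08:00 (and again at 00:08:30, when the monitor is already critical), while the last-occurrence mode fires only at 00:08:30, which agrees with the table.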

Correlated Missing Events

A correlated missing event monitor determines an error by the absence of a particular event after the occurrence of another. This resembles the missing event monitor except that instead of searching for the missing event in a particular time window, the monitor searches for the event in a particular time after another event is first detected.

For example, consider an application that performs a backup each evening and creates an event when it starts and a second event when it has completed successfully. A correlated missing event monitor could be created that searches for the completion event within a particular time after the start event. When the first event is found, the timer starts. If the second event is detected before the time is reached, the monitor remains in a healthy state. If the time is reached before the second event is detected, the state change is triggered to indicate that the last backup did not complete successfully.

Correlated Missing Events Example

The following table provides an example of a correlated missing event monitor by using the first and the last occurrence of the first event. The monitor uses the following details:

Missing Event Log A: Event 1
Missing Event Log B: Event 2
Correlation interval: 2 minutes
Number of occurrences of Event 2: 3
Health state on correlation: Critical
Reset Logic: Event reset using Event 3

Time	Event	First Occurrence	Last Occurrence
00:00:00	-	Healthy	Healthy
00:01:00	Event 1	Healthy	Healthy
00:01:30	Event 2	Healthy	Healthy
00:02:00	Event 2	Healthy	Healthy
00:02:30	Event 1	Healthy	Healthy
00:03:00	-	Critical	Healthy
00:03:30	Event 2	Critical	Healthy
00:04:00	Event 2	Critical	Healthy
00:04:30	-	Critical	Critical
00:05:00	Event 3	Healthy	Healthy

The First Occurrence triggers a critical state at 00:03:00 because Event 2 has not been detected 3 times in the 2-minute interval since the first occurrence of Event 1 at 00:01:00.
The Last Occurrence does not trigger a critical state at 00:03:00 because Event 1 occurs at 00:02:30, resetting the timer. The critical state is not triggered until 00:04:30, when Event 2 has not been detected 3 times in the 2-minute interval since the last occurrence of Event 1 at 00:02:30.
The single occurrence of Event 3 at 00:05:00 resets both monitors to healthy.
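
The timing in this example can be reproduced with a short Python sketch. It rests on assumptions that the text does not confirm: the function name and the `use_last` flag are hypothetical, times are seconds counted from 00:00:00, every Event 1 is given its own deadline in first-occurrence mode, and in last-occurrence mode a later Event 1 that arrives before the deadline cancels the earlier timer.

```python
def missing_event_deadlines(first_times, second_times, interval, required,
                            use_last=False):
    """Return the deadlines at which too few Event 2s followed an Event 1."""
    firsts = sorted(first_times)
    alerts = []
    for i, s in enumerate(firsts):
        deadline = s + interval
        if use_last and i + 1 < len(firsts) and firsts[i + 1] < deadline:
            continue  # a later Event 1 restarted the timer before it expired
        seen = sum(1 for x in second_times if s <= x <= deadline)
        if seen < required:
            alerts.append(deadline)
    return alerts


# Event times taken from the table above.
event_1 = [60, 150]
event_2 = [90, 120, 210, 240]
print(missing_event_deadlines(event_1, event_2, 120, 3))                 # [180, 270]
print(missing_event_deadlines(event_1, event_2, 120, 3, use_last=True))  # [270]
```

The first-occurrence result begins at 00:03:00 (180 seconds) and the last-occurrence result at 00:04:30 (270 seconds), matching the times at which each column in the table turns critical.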

Missing Event

Instead of detecting a particular event to identify an error condition, a missing event monitor uses the absence of a particular event in a particular time window to determine an error. This supports applications that are expected to generate an informational event that indicates a successful operation or the success of a particular action.

For example, consider an application that performs a scheduled data transfer each evening and creates an event when it has completed successfully. A missing event monitor could be created that searches for the event in a particular time window each evening. If the event is detected, then the monitor remains in a healthy state. If it is not found, then the monitor enters an error state that indicates that the last transfer did not occur successfully.

Missing Event Example

The following table provides an example of a missing event monitor by using the following details:

Event: Event 1
Fixed Schedule: Su-Sa 2:00 AM – 3:00 AM
Health state on missing event: Critical
Reset Logic: Event reset using Event 3

Time	Event	Health State
00:00:00	-	Healthy
00:01:00	Event 1	Healthy
00:02:00	-	Healthy
00:03:00	-	Critical
00:04:00	-	Critical
00:05:00	Event 3	Healthy

The critical state is triggered at 00:03:00 when Event 1 is not detected within the specified window.
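
The check that a missing event monitor performs when its window closes can be sketched as follows. This is an illustration only: the function name is hypothetical, and times are seconds on the same scale as the table, with the window assumed to run from 00:02:00 (120) to 00:03:00 (180).

```python
def is_event_missing(event_times, window_start, window_end):
    """Return True when no expected event falls inside the window."""
    return not any(window_start <= t <= window_end for t in event_times)


# Event 1 arrives at 00:01:00, before the window opens, so it does not count.
print(is_event_missing([60], 120, 180))   # True
# An Event 1 at 00:02:30 would fall inside the window.
print(is_event_missing([150], 120, 180))  # False
```

A result of True at the end of the window corresponds to the monitor changing to the configured health state, which is Critical in this example.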

Health Reset Logic

The previous detection criteria describe the conditions under which a monitor changes to a warning or critical state. In addition to detecting an error state, each monitor must have logic defined to determine when the state should be returned to healthy. The different methods for resetting state are shown in the following table:

Reset Logic	Description
Event Reset	A single specific event indicates that the monitor should be reset.
Manual Reset	The monitor is never automatically reset. The user must manually reset the monitor.
Timer Reset	The monitor is automatically reset after a specified time.

Each of these methods is discussed in the following sections.

Event Reset

With event reset, the monitor is reset when a single occurrence of a specific event is detected. The event must be of the same type as the event used for detecting the error condition. For example, a Windows event monitor might specify an event with a particular event source and number to indicate an error condition. Another Windows event with the same event source but a different number might indicate that the error in the application was corrected.

Event reset can only be used if the application provides an event indicating the particular error was corrected. Many applications create an event when an error occurs but may not create a corresponding event that indicates that the error was corrected. Event reset cannot be used in this case.

Manual Reset

With manual reset, the monitor never returns to a healthy state automatically. The user must determine whether the problem was corrected and then select the monitor in the Health Explorer and select Reset Health.

The advantage to this strategy is that a monitor can be used for issues that do not create an event that indicates a healthy state. The monitor can affect the health state of the managed object instead of creating a simple alert from a rule. The downtime will be recorded for the object in the State Change Events in the Operations Console and in any availability reports.

There are multiple implications of this strategy that should be considered. The first is the additional work required from the user because the monitor will never automatically reset. It can also result in too much downtime being recorded if the user waits a long time before performing the reset. The problem may have been corrected fairly quickly, but the healthy state will not be recorded until the user performs the reset.

Manual reset should be used with particular caution for monitors where a single problem could affect multiple instances of the target class. Because users cannot reset the monitor for multiple instances at once in the Operations Console, the user would have to open the Health Explorer for each instance to perform this action. Depending on the number of instances, this could result in significant effort for the user.

Timer Reset

A timer reset acts the same as a manual reset except that, if the user does not manually reset the monitor within a specified time, the monitor resets automatically. One use of this kind of reset is for issues that continuously log error events until the problem is corrected. Instead of using another event to indicate that the problem was corrected, the absence of the previously detected error event for a specified period can be used as the success criteria.

The timer reset can be used in place of a manual reset, with the advantage that the monitor resets automatically after a while if the user does not perform a manual reset.
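
The three reset methods can be compared in a small Python sketch. The class and method names are hypothetical and are not part of Operations Manager. The sketch also assumes that each new error event restarts the reset timer, which is one way to model the case where the absence of further error events is treated as success.

```python
class EventMonitorSketch:
    """Two-state monitor supporting event, manual, and timer reset."""

    def __init__(self, reset_after=None):
        # reset_after is the timer reset period in seconds; None disables it.
        self.reset_after = reset_after
        self.last_error = None

    def on_error_event(self, now):
        self.last_error = now  # assumption: every error restarts the timer

    def on_reset_event(self):
        self.last_error = None  # event reset

    def manual_reset(self):
        self.last_error = None  # user selects Reset Health

    def state(self, now):
        if self.last_error is None:
            return "Healthy"
        if self.reset_after is not None and now - self.last_error >= self.reset_after:
            return "Healthy"  # timer reset: no error seen for the whole period
        return "Critical"


monitor = EventMonitorSketch(reset_after=300)
monitor.on_error_event(0)
monitor.on_error_event(200)
print(monitor.state(400))  # Critical
print(monitor.state(500))  # Healthy
```

Without `reset_after`, the same object stays critical until `on_reset_event` or `manual_reset` is called, which corresponds to the event reset and manual reset cases.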