Chapter 9 - Working with Monitors and Events

08/31/2009

This chapter focuses on using the Microsoft Application Center 2000 (Application Center) event and health monitoring features, as well as creating custom monitors to meet your specific monitoring needs.

Application Center continuously records and displays information about the functional state of your cluster and its members. Event and monitoring data, which is gathered from several sources, provides immediate information about member health and availability. In addition, this information can be used to flag potential problems that are developing on a cluster or cluster member.

The following data sources constitute the core of cluster monitoring:

Events, which are generated by a range of objects, services, and applications within the Application Center environment.
Microsoft Health Monitor 2.1 data collectors, which are management objects that collect data for a particular process or service.
Performance counters, which gather data about how often, or how much of, a specific resource is used.

These data sources are implemented through:

Microsoft Windows 2000 events, which you can use to determine the functional state of Application Center and Windows 2000.
Windows Management Instrumentation (WMI), which provides a comprehensive and extensible management infrastructure for applications, services, and objects in Windows 2000.
Health Monitor, which sends events and executes actions in response to monitoring policies.
Application Center, which generates its own events about numerous cluster activities, such as synchronization and cluster membership.

Note Application Center also collects and persists all the data gathered by the preceding data sources.

Before dealing with monitoring and monitors in more detail, let's look at the event and monitors schema that you have to work with if you decide to implement custom monitors for your cluster.

The Event Schema

Application Center events are WMI classes that inherit from a base class. The top parent class in the Application Center event schema (namespace) is MicrosoftAC_Base_Event, which inherits from the WMI class __ExtrinsicEvent. Base, or container, classes are not fired directly as events and do not appear in the Windows Event Log or the Application Center Event view; however, they can be used in a WMI event query to receive events for any child class.

Note The event schema presented in Figures 9.1 through 9.5 shows only the container classes. To view the events within these classes, refer to the application event schema information in the "References" section of the Application Center online Help, or see Appendix D, "Application Center Events."

Figure 9.1 shows the top-level container classes in the Application Center namespace.

Figure 9.1 The ApplicationCenter Base Event class

The next series of figures (Figures 9.2 through 9.5) shows the Application Center event classes expanded down to the lowest level in each of the four major event class nodes.

Figure 9.2 The MicrosoftAC_Replication_Event class

Figure 9.3 The MicrosoftAC_Cluster_Event class

Figure 9.4 The MicrosoftAC_RequestForwarding_Event class

Figure 9.5 The MicrosoftAC_Monitoring_Event class

Note For detailed event information, including the event identifier, event type, event name, and event description, see Appendix D, "Application Center Events."

Querying a Container Class

You can query a container class and receive event information for all of the class's children. The following examples illustrate container class queries that use a WMI Query Language (WQL) statement in Wbemtest.

Suppose that you want to obtain information about the success or failure of these online actions:

8024—MicrosoftAC_Monitoring_OnlineAction_Success_Event
8025—MicrosoftAC_Monitoring_OnlineAction_Failure_Event

You would execute this query:

Select * from MicrosoftAC_Monitoring_OnlineAction_Event

Your query for the container class will return events for any of its children because of the following class hierarchy:

MicrosoftAC_Event
    MicrosoftAC_Monitoring_Event
        MicrosoftAC_Monitoring_OnlineAction_Event
            MicrosoftAC_Monitoring_OnlineAction_Success_Event

Now suppose that you want to query for specific online and offline events (for example, 4015 and 4016). The query statement would be:

Select * from MicrosoftAC_Cluster_Loadbalancing_Event where EventId=4015 or EventId=4016
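
The way a container class query picks up events from its children can be sketched in Python. This is only an illustrative model, not how WMI itself is implemented: the PARENT table copies the class hierarchy listed earlier in this section, and the helper names is_a and query_matches are hypothetical.

```python
# Illustrative model of container-class matching (not the WMI implementation).
# The parent table mirrors the class hierarchy listed in the text.
PARENT = {
    "MicrosoftAC_Monitoring_OnlineAction_Success_Event": "MicrosoftAC_Monitoring_OnlineAction_Event",
    "MicrosoftAC_Monitoring_OnlineAction_Failure_Event": "MicrosoftAC_Monitoring_OnlineAction_Event",
    "MicrosoftAC_Monitoring_OnlineAction_Event": "MicrosoftAC_Monitoring_Event",
    "MicrosoftAC_Monitoring_Event": "MicrosoftAC_Event",
}

def is_a(event_class, queried_class):
    """Return True if event_class is queried_class or one of its descendants."""
    current = event_class
    while current is not None:
        if current == queried_class:
            return True
        current = PARENT.get(current)
    return False

def query_matches(event, queried_class, event_ids=None):
    """Model 'SELECT * FROM queried_class [WHERE EventId=a OR EventId=b]'."""
    if not is_a(event["class"], queried_class):
        return False
    return event_ids is None or event["EventId"] in event_ids
```

In this model, an event of the Success class matches a query against MicrosoftAC_Monitoring_OnlineAction_Event, but a query against the Failure class does not match it, because the two are siblings rather than parent and child.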

The Monitors Hierarchy

By default, the Application Center Setup program creates a collection of monitors that are synchronized across the cluster. As we indicated in Chapter 7, "Monitoring," the Health Monitor node represents the top of the monitoring hierarchy.

The next level is All Monitored Computers, which lists all the computers that Health Monitor has information for; by default, this is only the server that you're currently logged on to. You can add computers to this view (right-click the node, and then on the pop-up menu, click Connect to another computer), provided that you have sufficient privileges.

The third level of the hierarchy lists specific computers by name. By default, the name of the server that you're currently logged on to or have made a connection to in the Application Center snap-in will appear here. You can expand this level, represented by localsystem in Figure 9.6, to reveal the actions and monitors that are available after Application Center is installed on a server.

Figure 9.6 shows the entire monitors hierarchy and identifies the Data Groups (DG) and Data Collectors (DC) that are installed.

Note The Sample Monitors collection is only available if you do a custom installation and choose to install the samples. However, you can install these monitors after the initial set up by running the Application Center Setup program, and then, in the Setup dialog box, clicking Modify. For more information about these additional monitors, see "The Sample Monitors" later in this chapter.

Figure 9.6 ApplicationCenter Monitors schema for the Synchronized Monitors collection

Anatomy of a Monitor

A monitor, which belongs to a data group at some level, is made up of the following elements:

A data collector
Zero or more thresholds
Zero or more action associations

Let's examine data groups before covering the elements that make up a monitor.

Data Groups

A data group's primary purpose is to let you organize data collectors into a structure by using the data group as a container. A second but equally important function of a data group is that it enables you to treat more than one collector as a single entity.

A data group reflects the worst state of any of its children; therefore, an action can be associated with the data group rather than the data collectors it contains. The purpose is to trigger one event when one of a number of things goes wrong, and then trigger a second event when all the collectors return to the OK state.

For example, assume that we have a data group containing three data collectors. When the first data collector, which could be any one of the three, exceeds its threshold, the data group is flagged as unhealthy and the action associated with the data group is triggered. In cases where more than one data collector exceeds its threshold, all the data collectors have to return to a healthy state before the data group itself is flagged as healthy.
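
The roll-up behavior described above can be sketched in a few lines of Python. This is a simplified model, not Health Monitor code: the severity ordering (Ok, then Warning, then Critical) and the DataGroup class are assumptions made for illustration.

```python
# Sketch of "a data group reflects the worst state of any of its children".
# The severity order is an assumption for illustration: Ok < Warning < Critical.
SEVERITY = {"Ok": 0, "Warning": 1, "Critical": 2}

def group_state(collector_states):
    """Return the worst state among the collectors in the group."""
    return max(collector_states, key=SEVERITY.__getitem__, default="Ok")

class DataGroup:
    def __init__(self, collectors):
        self.collectors = dict(collectors)   # collector name -> state
        self.state = group_state(self.collectors.values())
        self.fired = []                      # group-level state changes, in order

    def update(self, name, new_state):
        """Record a collector's new state; note a group-level change if any."""
        self.collectors[name] = new_state
        rolled_up = group_state(self.collectors.values())
        if rolled_up != self.state:
            self.state = rolled_up
            self.fired.append(rolled_up)
```

With three collectors, the first one to go Critical changes the group state once; a second failure changes nothing at the group level, and the group returns to Ok only after every collector has recovered.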

The Online/Offline Monitors data group, presented in "Online/Offline Monitors" later in this chapter, illustrates how an action, or in this case, actions, can be associated with a data group.

Data Collectors

Data collectors provide the fundamental mechanism for collecting data that can be used by a monitor. Every collector is configurable through a datacollectorname Properties dialog box (Figure 9.7), which you can use either to configure a new collector or to modify an existing collector.

The Memory Properties dialog box presents five tabs that are used to provide configuration information for a data collector. The General, Actions, Schedule, and Message tabs are common to all the collectors, whereas the information required for the Details tab varies according to the type of data collector that you're configuring.

Figure 9.7 Memory Properties dialog box

General Tab

The General tab is used to provide the collector name, which is the collector type by default, and has a Comments box that can be used to provide descriptive information about the collector.

Note If it isn't provided, the collector's name will default to the data collector's query, which reflects what the data collector is collecting.

Details Tab

As indicated earlier, the Details tab for the various collectors varies according to the type of collector that you're modifying or creating. Table 9.1 summarizes the different types of collectors and the configuration information that's available—and in some cases mandatory—for each collector.

Actions Tab

The Actions tab is used to identify the action to take, the condition that will trigger the action, and to enable a reminder message. The following default actions (for more details, see "Thresholds and Actions" later in this chapter) are available:

Bring Server Online
Email Administrator
Log to offline.log
Log to websitefailures.log
Take Server Offline

The three execution conditions that are available for an action are Ok, Warning, and Critical. Critical is the default condition when a threshold is reached. The reminder option can be configured for n seconds, minutes, or hours. The action that you identify is fired when a data group or data collector changes state. In addition to these default actions, you can also create custom actions. For more information, see "Modifying and Creating Actions" later in this chapter.

Schedule Tab

The Schedule tab is used to establish the collection days and collection times for the data collector, which by default is 7-day/24-hour. In addition, you can establish the collection interval as well as the total samples that should be used for threshold measurement here. The available settings are:

Collection days—Every day of the week is marked by default.
Collection times—All day is set by default, and you specify periods during the day by using Only from: hh:mm to hh:mm or by exclusion using All day except: hh:mm to hh:mm.
Collection interval—Collect every n period, where period is expressed in seconds, minutes, or hours. The minimum, except for event queries, is 10 seconds. A short interval can degrade system performance if the query is expensive and is made too frequently. Also, requests can become backlogged if the time needed to complete the query is longer than the collection interval, which is to say, the next query is made before the previous query has finished.
Total samples for average calculation—Select an integer value.
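
To make the last two settings concrete, here is a minimal Python sketch. It assumes that the average is taken over the most recent N samples, which the text does not state explicitly; SampleWindow and will_backlog are hypothetical names, not part of Health Monitor.

```python
from collections import deque

class SampleWindow:
    """Keep the most recent `total_samples` values and report their average."""
    def __init__(self, total_samples):
        self.values = deque(maxlen=total_samples)  # older samples drop off

    def add(self, value):
        self.values.append(value)

    def average(self):
        return sum(self.values) / len(self.values) if self.values else None

def will_backlog(query_seconds, interval_seconds):
    """A query that takes longer than the collection interval falls behind."""
    return query_seconds > interval_seconds
```

For example, a query that needs 15 seconds to complete cannot keep up with a 10-second collection interval, so either the interval should be lengthened or the query narrowed.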

Message Tab

Two messages are available. The first is triggered when the collector's status changes to Critical or Warning. It takes the following syntax, which performs a string substitution for the values enclosed in percent (%) symbols and embeds the collector name, state, error code, and error description in a text message:

%EmbeddedCollectedInstance.Name% service is %EmbeddedCollectedInstance.State%: %State% condition. (WMI Status: %CollectionErrorCode% %CollectionErrorDescription%)

Note The properties contained in the insertion strings (%%) are filled out when the message is sent. This enables you to include additional tracking information, such as the server name, date and time, or any data that was retrieved by the monitor.

The second message is displayed when the collector's status is healthy and takes the following form:

%Name% is Ok.

These are default messages, and you can create any message that you want to have displayed or sent in these areas.
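
The insertion-string substitution can be illustrated with a short Python sketch. It is a hypothetical stand-in for what Health Monitor does when it sends a message, not its actual implementation; the expand function and the sample values are assumptions.

```python
import re

# Matches %PropertyName%, tolerating spaces just inside the percent signs.
TOKEN = re.compile(r"%\s*([A-Za-z_][\w.]*)\s*%")

def expand(template, values):
    """Replace each %Name% token with its value; leave unknown tokens as-is."""
    def substitute(match):
        key = match.group(1)
        return str(values[key]) if key in values else match.group(0)
    return TOKEN.sub(substitute, template)
```

Calling expand("%Name% is Ok.", {"Name": "Memory"}) yields "Memory is Ok.". Leaving unknown tokens untouched is a design choice in this sketch that makes a mistyped property name easy to spot in the resulting message.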

Types of Data Collectors

You have eleven different types of data collectors at your disposal for creating custom data collectors. Table 9.1 provides information about each of these collectors, including their configuration options on the Details tab. The Performance, Service, Process, Windows Event Log, and COM+ application monitors all use WMI and, like the WMI Instance, WMI Event Query, and WMI Data Query collectors, are limited in scope to the local server. The Ping, TCP/IP, and HTTP monitors extend monitoring capability to the network.

Table 9.1 Data Collector Types and Configuration Options

Data collector	Details tab configuration options	Default configuration
Performance Monitor	Identify the object, select the counter to use, and if applicable, identify the instance.
Service Monitor	Identify the service.	Properties: Display Name, Started, State, and Status
Process Monitor	Identify the process.	Properties: Status
Windows Event Log Monitor	Identify the event type from: Information, Success audit, Warning, Failure audit, and Error. Choose one of three log file options: Application, Security, and System. Identify the Source, and if necessary the Category, Event ID, and User.	Event type: Warning, Failure audit, and Error; Log file: Application
COM+ Application	Identify the application name.	Properties: Aborted Transactions Per Second, Admin Shutdowns, Application Name, Committed Transactions Per Second, Failure Shutdowns, Handle Count, Object Activations Per Second, Object Creations Per Second, Object Pool, Thread Count Timeouts, Timeout Shutdowns, Total Aborted Transactions, Total Committed Transactions, Total Shutdowns, Virtual Size, Working Set Size
HTTP Monitor	Identify the URL, and specify its timeout period. If necessary, provide the following logon information: authentication (None, Clear Text, Windows Default, NTLM, Digest, Kerberos), User name, and Password. If a proxy server is involved, provide its address and port number; and if necessary, the credentials to use with it.	Timeout: 30 seconds; Authentication: None
TCP/IP Monitor	Identify the system, the system's port number, and the timeout, in milliseconds.	Timeout: 10000
Ping (ICMP) Monitor	Identify the system and timeout, in milliseconds.	Timeout: 1000
WMI Instance	Identify the namespace, and select a class and instance.	Namespace: root\CIMV2
WMI Event Query	Identify the namespace and class. Specify the type of WQL event query (Intrinsic or Extrinsic), and provide the query.	Namespace: root\CIMV2; WQL event query: Extrinsic
WMI Data Query	Identify the namespace and class, and provide the query.	Namespace: root\CIMV2

Note A data collector functions as both a consumer and as a provider. As a consumer, it gathers data from events or properties and tests values against its threshold. The data collector becomes a provider when a threshold is crossed and it changes state. The data collector fires a status change notification that an action—a WMI consumer—is listening for.

Let's examine one of the default data collectors to see how it's configured. We'll use Synchronization Session Failure and review its configuration for each tab of its properties dialog box.

Synchronization Session Failure Properties Dialog Box Configuration

General tab
Name: Synchronization Session Failure
Details tab

Namespace: root\MicrosoftApplicationCenter

Class: MicrosoftAC_Replication_Session_General_Event

Properties:

EventId

ReplicationJobID

StatusMessage

WQL event query:

Type=Extrinsic

"SELECT * FROM MicrosoftAC_Replication_Session_General_Event WHERE EventId=5037 OR EventId=5038"

Requires manual reset to return to Ok status: cleared

Status reset

You can use either an automatic or manual reset to return a data collector to Ok status.

Automatic reset

By default, a data collector will reset its state to Ok when the values it collects return below the specified thresholds. For example, if an HTTP monitor data collector gets an "access denied" error while attempting to access a Web page, the state of the data collector will change to Critical. However, if the next attempt to access the Web page is successful, the collector's state returns to Ok. In most cases, this is the desired behavior because it ensures that the Health Monitor snap-in displays the most current information about the status of monitored applications and components.

Manual reset

In some cases an automatic reset is not desired. You may want to manually verify the condition of a component before declaring it fixed, or have a threshold determine that its status should reset to Ok. If you've enabled manual reset on a data collector, the collector remains in a Warning or Critical state until you do a manual reset.

A manual reset might be required in monitoring environments where it isn't possible to verify a successful operation automatically. Although most Health Monitor data collectors poll at regular intervals to detect fixed problems, there are a few collectors that are event-based (such as Windows Event Log Monitor and WMI Event Query). Therefore, when Health Monitor receives an event indicating failure, there is no way for Health Monitor to recheck the status to determine when the failure condition has changed.
Actions tab

Actions: Email Administrator

Execution condition: Critical

Reminder: Null
Schedule tab

Collection days: Every day of week

Collection times: All day

Collection interval: 1 second

Total samples for average calculation: 6
Message tab

When status changes to Critical or Warning: %Name%: %State% condition. WMI Status: %CollectionErrorCode% %CollectionErrorDescription%

When status is Ok: %Name% is Ok.

Synchronization Session Failure Thresholds

Two thresholds are set for the Synchronization Session Failure collector:

The WMI status check (Error Code (from WMI) !=0), which is common to all the default collectors, watches for WMI errors—the assumption is that a WMI failure should be flagged as a critical condition.
A check is made to see if events 5037 (replication job succeeded), 5038 (replication job failed-source only), or 5043 (replication session commit-target only) were fired. If EventID is 5038, the collector's status is set to Critical and the send e-mail to the administrator action is triggered. The next time the data collector makes its query and receives event 5037 or 5043, the collector's status returns to Ok.
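
The event-driven behavior of this collector, together with the manual reset option described earlier, can be modeled with a small Python state machine. The event identifiers come from the text; the EventCollector class, its method names, and the assumption that the action fires once per change to Critical are illustrative only.

```python
# Event IDs and their meanings are taken from the text; the class is illustrative.
FAILURE_IDS = {5038}          # replication job failed
SUCCESS_IDS = {5037, 5043}    # job succeeded / session commit

class EventCollector:
    def __init__(self, manual_reset=False):
        self.manual_reset = manual_reset
        self.state = "Collecting"   # no matching event received yet
        self.actions = []           # names of actions triggered so far

    def on_event(self, event_id):
        if event_id in FAILURE_IDS:
            # Assumption: the action fires only on the change to Critical.
            if self.state != "Critical":
                self.actions.append("Email Administrator")
            self.state = "Critical"
        elif event_id in SUCCESS_IDS and not (
            self.manual_reset and self.state == "Critical"
        ):
            self.state = "Ok"

    def reset(self):
        """Manual reset performed by an administrator."""
        self.state = "Ok"
```

With manual_reset left at False, a 5038 event followed by a 5037 event takes the collector to Critical and back to Ok. With manual_reset set to True, the collector stays Critical until reset() is called.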

Thresholds and Actions

Application Center automates several aspects of cluster administration by using thresholds and actions. A threshold changes the state of a data collector or data group. The purpose of a threshold is to evaluate the data or properties returned by the collector. Subsequent actions are triggered by this change in state. An e-mail notification is an example of one of these actions.

A threshold is a monitoring rule that is applied to a property or value returned by a data collector. When the collected value satisfies the rule, an action, such as sending an e-mail notification, is initiated. As you will see in Table 9.2, several kinds of actions can be associated with a threshold.

Table 9.2 ApplicationCenter Actions

Action	Description
Notification	An e-mail message is sent to the administrator or another designated recipient. This e-mail is sent automatically when the threshold is exceeded and can include information about the event, such as event severity and the time at which the event occurred.
Restarting the server (1)	The affected member, or even the cluster, is restarted if this is the designated action.
Running a batch or executable file (1)	A batch (.bat) file or any executable (.exe) file that is compatible with Windows 2000 is run automatically.
Generating a Windows 2000 event	You can configure Health Monitor to generate a Windows 2000 event, which will be recorded in the Windows 2000 Event Log. Using WMI, this event is also available to other applications and services.
Writing text to a log	The occurrence and related information is recorded in a log file. This log can be in any supported log file format.
Running a script (1)	Scripts written in Microsoft Visual Basic Scripting Edition (VBScript) or Microsoft JScript development software can be run automatically in response to an exceeded threshold.

1 These actions could have security implications, so they should be accessible to administrators only.

To learn how you can customize the default actions that Application Center provides, or create new actions to automate your monitoring, see "Modifying and Creating Actions" later in this chapter.

The Default (Synchronized) Monitors

The default monitors that are installed during Setup fall in the category of Synchronized Monitors, which means that they are replicated to all the cluster members. If you delete or disable a monitor in this category, that action will be reflected across the cluster, provided that you do so on the cluster controller. By the same token, if you create a new monitor on the controller and then add it to the Synchronized Monitors group, your new monitor will be synchronized across the cluster. Table 9.3 summarizes the Application Center Synchronized Monitors category at the data collector level.

Note As shown in Figure 9.6, the parent data group for the Synchronized Monitors is Application Center Monitors. Some, but not all, data collectors are members of a child data group.

Table 9.3 Synchronized Monitors Installed During Setup

Name	Type	Enabled	Threshold	Description
Log Agent Job Failure	Windows Event Monitor	No	# of instances collected > 0	Looks for event identifier 208, which indicates a job failure in the Event Log for SQLAgent$MSAC.
Log Agent Service	Service Monitor	No	Started != true	Monitors the Started, State, and Status properties of the SQLAgent$MSAC Service.
Log Database Service	Service Monitor	No	Started != true	Monitors the Started, State, and Status properties of the MSSQL$MSAC Service.
Log Database Size	Performance Monitor	No	Data file(s) size (KB) > 100000	Monitors the Data File(s) Size (KB) counter for the ACLog instance of the MSSQL$MSAC:Databases object.
Cluster Service	Service Monitor	Yes	Started != true	Monitors the Started, State, and Status properties of the ACCLUSTER Service.
Health Monitor Action Failure	WMI Event Query	Yes	# of instances collected > 0	Monitors the ErrorCode, ErrorDescription, and Event properties of the __ConsumerFailureEvent class in the root\CIMV2\MicrosoftHealthMonitor namespace.
Request Forwarding Failure	WMI Event Query	Yes	Type = 1	Monitors the Type property of the MicrosoftAC_RequestForwarding_Initialize_Event class in the root\MicrosoftApplicationCenter namespace.
Server Offline	WMI Event Query	Yes	EventId = 4015 or 4016	Monitors the EventId property of the MicrosoftAC_Cluster_Loadbalancing_Event class in the root\MicrosoftApplicationCenter namespace.
Synchronization Session Failure	WMI Event Query	Yes	EventId=5037 or 5038	Monitors the EventId property of the MicrosoftAC_Replication_Session_General_Event class in the root\MicrosoftApplicationCenter namespace.
W3Svc (Web Service)	WMI Instance	Yes	Started != true	Monitors the Started, State, and Status properties of the W3SVC instance of the Win32_Service class in the root\CIMV2 namespace.
Logical Disk	Performance Monitor	No	% free space < 10	Monitors the Logical Disk object.
Memory	Performance Monitor	Yes	Pages/sec > 500	Monitors the Pages/sec counter for the Memory object.
Processor	Performance Monitor	Yes	% Processor time > 90	Monitors the % Processor Time counter for the _Total instance of the Processor object.
https://127.0.0.1	HTTP Monitor	No	HTTP monitor response time > 30 seconds; HTTP monitor status code >= 400	Monitors the Response Time and Status Code for the URL https://127.0.0.1/ with a timeout of 30 seconds.

Note All of the default monitors that are installed for Application Center also have a threshold that checks whether the value of the Error Code (from WMI) property is not equal to zero. A nonzero error code indicates a failure in WMI, which is considered a critical failure.
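
The thresholds in Table 9.3 all follow the same pattern: a property, a comparison, and a limit. The following Python sketch evaluates rules of that shape. It is a simplified model rather than Health Monitor's own logic, and using CollectionErrorCode as the key for the Error Code (from WMI) value is an assumption borrowed from the message insertion strings.

```python
import operator

# Comparison operators used by the thresholds in Table 9.3.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "=": operator.eq, "!=": operator.ne}

def evaluate(thresholds, sample):
    """Return 'Critical' if any (property, op, limit) rule is met, else 'Ok'."""
    # Every default monitor also treats a nonzero WMI error code as critical.
    # The key name "CollectionErrorCode" is an assumption for this sketch.
    if sample.get("CollectionErrorCode", 0) != 0:
        return "Critical"
    for prop, op, limit in thresholds:
        if prop in sample and OPS[op](sample[prop], limit):
            return "Critical"
    return "Ok"
```

For instance, the Memory monitor's rule can be written as ("Pages/sec", ">", 500), and a service monitor's rule as ("Started", "!=", True).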

Online/Offline Monitors

By default, Application Center creates the Online/Offline Monitors data group. This group contains the W3SVC data collector and its threshold, which triggers the AC.exe command-line action to set a member offline when the collector's threshold is exceeded. Conversely, when the monitored value returns to within operational parameters, the AC.exe action sets the member back online. If you want to configure other data collectors as online/offline monitors, simply copy or move them to the Online/Offline Monitors folder. (These data collectors and their associated actions are synchronized across the cluster.)

Note This data group cannot be removed. If it is deleted, the replication service will create it on the next full synchronization.

Viewing Current Monitor Status

If you click on the Monitors node for a member in the console tree, the current state of the member's monitors is presented in the Monitors view. As you can see in Figure 9.8, the Monitors view provides both summary (the name, status, and last modified date for the monitor), as well as more detailed information for a specific monitor. This detailed information, displayed in the lower part, comprises the monitor's thresholds, threshold status, and threshold values (Last, Average, Minimum, and Maximum). In addition to providing a report on a member's monitors, the Monitors view also allows you to disable/enable a monitor and force an immediate evaluation of a monitor's state.

Figure 9.8 Monitors view for a cluster member

If you look at the Synchronization Session Failure monitor in the upper part of the details pane shown in Figure 9.8, you'll see that its current status is Collecting. By design, the default event-driven monitors might remain in a collecting state.

Because their threshold is set for a specific condition, these monitors will continue collecting until a specific EventID or Type is received. When this condition is met, the monitor is put into a Critical state and will stay there until the threshold condition no longer exists, at which point the monitor returns to normal. The collecting state should be considered Ok.

A good example of this continuous collecting state is the Request Forwarding monitor. This monitor listens for events from the MicrosoftAC_RequestForwarding_Initialize_Event class, which is a container class for seven events. The first six of these events represent failure conditions; the seventh is the Request Forwarder started event. The events in this class are:

MicrosoftAC_RequestForwarding_Initialize_OutOfMemory_Event
MicrosoftAC_RequestForwarding_Initialize_MissingWebCluster_Event
MicrosoftAC_RequestForwarding_Initialize_ThreadPool_Event
MicrosoftAC_RequestForwarding_Initialize_Computer_Name_Event
MicrosoftAC_RequestForwarding_Initialize_Missing_W3SVC_Event
MicrosoftAC_RequestForwarding_Initialize_WrongProcess_Event
MicrosoftAC_RequestForwarding_Initialize_Started_Event

There is one threshold for this class, the property Type = 1, which is an error event. The data collector stays in the collecting state until it receives an event in this class. When an error event comes in, the monitor turns Critical until its state changes, at which time the system will automatically reset its status to Ok. After the reset, the monitor receives the MicrosoftAC_RequestForwarding_Initialize_Started_Event event, which is not an error event, and then collection resumes. The monitor will also go to Ok if the started event is received first.

The Sample Monitors

A collection of sample monitors is also provided that you can use as is or tailor to meet your monitoring needs. These monitors can be installed by using the Custom Setup option, either during the initial product installation or after you've installed Application Center. The MOF file that's used to define and create these monitors is named Samples.mof (located in Program Files\Microsoft Application Center\Samples). The next collection of figures (Figures 9.9 through 9.16) describes the hierarchy for the sample monitors.

Note that this collection of monitors includes data groups and data collectors for products other than Application Center, such as Microsoft BizTalk Server 2000, Microsoft Commerce Server 2000, and Microsoft SQL Server 2000.

Figure 9.9 The top level of the Sample Monitors hierarchy

Figure 9.10 The Application Center Monitors hierarchy under Sample Monitors

Figure 9.11 The BizTalk Server 2000 Sample Monitors hierarchy

The BizTalk Server monitors collection allows you to monitor BizTalk servers on your cluster. The monitors include the following data collectors:

BizTalk Messaging Service Monitor—checks the status of the BizTalk Messaging Service. This data collector monitors the Display Name, Started, State, and Status properties of the BTSSvc Service.
BizTalk Messaging Service Performance Counters—monitors performance counters that are enabled for the BizTalkServer object.
Suspended Queue Event Monitor—checks for documents in the suspended queue by querying the DocSuspendedEvent class in the root\MicrosoftBizTalkServer namespace.

Some of these monitors do not have thresholds; you will have to create the required thresholds according to the specific needs of your environment.

Figure 9.12 The COM+ Application Monitors hierarchy

Figure 9.13 The Commerce Server 2000 Sample Monitors hierarchy

The Commerce Server monitors collection allows you to monitor Commerce Server installations on your cluster. The monitors include the following data collectors:

Commerce Server Error Event—checks for Commerce Server error events in the Application event log.
Commerce Server Marketing and Catalog Performance Counters—monitors the following counters for a Retail and Supplier instance of the CS2000MarketingAndCatalog object: LRUCache: Cache size, LRUCache: Flushes/sec, LRUCache: Hits/sec, and LRUCache: Misses/sec.

Note By default, all sites are monitored. This data collector can be limited to a single site instance.
Commerce Server Pipeline Performance Counters—checks the Executions Total and Errors Total counters for the CS2000Pipelines object.

Note By default, all pipelines are monitored. However, if more than one pipeline is being monitored, the data collector returns an error. In that case, create multiple copies of this monitor, one for each pipeline instance.

Some of these monitors do not have thresholds; you will have to create the required thresholds according to the specific needs of your environment.

Figure 9.14 The Monitors requiring additional configuration hierarchy

Figure 9.15 The SQL Server Monitoring Samples hierarchy

Figure 9.16 The System and Web Site Monitors hierarchy

Creating a Custom Monitor

Before you create a custom monitor, you will have to decide whether the monitor is going to be specific to a single member or common to all cluster members. If the latter case is true, you have to create the monitor in the Synchronized Monitors data group; if the former case is true, you can create the monitor in any other data group, such as the Non-Synchronized Monitors group. After making this decision and deciding what it is that you want to monitor, creating a monitor involves three basic steps: creating a data group, which is optional, creating the actual data collector for the monitor, and setting a threshold for the data collector.

In the following example, we're going to create a monitor that checks to see if the latest version of a specific DLL file has been installed on a new member. This is useful in cases where hot fixes have been provided and you'd like to verify that they've all been installed before putting your new member into production.

Caution Every monitor has a performance cost. It's important to create monitors that are essential to supporting your operations. You need to consider:

Scoping data collection to ensure that data is collected efficiently and quickly.
Scheduling the monitor so that it runs only when needed, and collects data only often enough to achieve its purpose.

The Software Version Checker

For this example, we'll assume that one of the hot fixes we have to track is the Replication Service, and that one of the updates was to the Replication Service Library DLL, called Replib.dll. We'll create a data collector that will test to see if an instance of the newest version of this DLL exists. This data collector is associated with two actions if its threshold is crossed—which is to say, an instance of the correct file version isn't found.

We'll start by creating a data group named Version Checker in the Synchronized Monitors group.

Create the Data Group

Because we want to test for the presence of the latest hot fix on servers as they're added to the cluster, we'll create the new monitor in the Synchronized Monitors group. As soon as the member joins the cluster, our new monitor will be synchronized to the new member. (We're making the assumption that a new member will be synchronized to the controller, but will not automatically be brought online for load balancing.)

Use the following steps and settings to duplicate our process for creating a new data group.

Expand the Health Monitor node down to the Synchronized Monitors node.
Right-click Synchronized Monitors; on the pop-up menu, point to New; and then click Datagroup.

In the Data Group Properties dialog box, click the General tab, and then enter the following information:
- Name: Version Checker
- Comment: This group contains collectors that check software versions to see that the latest hot fix is applied on a server.
Leave the default settings for actions—we'll let the data collector handle that aspect of the monitor—and click OK to save the new data group.

Now we'll create the data collector.

Create the Data Collector

Right-click the Version Checker data group; on the pop-up menu, point to New; point to Data Collector; and then click WMI Instance.

In the WMI Instance Properties dialog box, click the General tab, and then enter the following information:
- Name: Replication Service Library
- Comment: This data collector checks the Replication Service Library dll to see that the newest version is installed on the local system.
Click the Details tab, and then provide the following information (illustrated in Figure 9.17):
- Namespace: root\CIMV2
Use the default namespace for the system for which you are configuring the collector.
- Class: CIM_DataFile
Before creating the collector, we determined that this particular class captured the information that we wanted to test for.
- Instance: CIM_DataFile.Name="D:\\Program Files\\Microsoft Application Center\\replib.dll"
You can also obtain this information by using the Browse button, but this is very time consuming because it enumerates all of the files on the server. It's quicker to provide a complete path and file name.

Note Make sure that you use two backslashes (\\) in the path statement.

- Properties: FileName, FileSize, LastModified, Version
The properties list determines what is shown in the statistics details pane. You need to select at least one property.

Figure 9.17 The Details tab for the Replication Service Library Properties dialog box
Click the Schedule tab, and then change the schedule so that the collector runs from 12:00 midnight through 4:00 A.M.

We've made the assumption that under normal conditions, new members will be brought online only during nonpeak periods, which in our hypothetical environment fall between midnight and 4:00 A.M. Set the collection interval to 60 minutes (check once per hour), and set the number of samples to 1 because we don't need any averaging; we only need one instance.
Click the Action tab, and then click New Action Association.
In the Execute Action Properties dialog box, add the Email Administrator action.
Click the Message tab, and then expand the default message by adding the following text:

%EmbeddedCollectedInstance.FileName% is not the latest version. %SystemName% will be taken offline.

You can add these or similar insertion strings by clicking the right angle bracket (>) to the right of the message area.
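The Instance value entered on the Details tab is easy to mistype because every backslash in the path has to be doubled. The following Python helper is a minimal sketch of that escaping rule only; the function name is our own and is not part of Health Monitor or WMI.

```python
def wmi_datafile_path(windows_path: str) -> str:
    """Build a CIM_DataFile object path, doubling each backslash.

    Hypothetical helper for illustration; it only formats a string
    and does not contact WMI.
    """
    escaped = windows_path.replace("\\", "\\\\")
    return f'CIM_DataFile.Name="{escaped}"'


if __name__ == "__main__":
    # Prints the string that would be typed into the Instance box.
    print(wmi_datafile_path(
        r"D:\Program Files\Microsoft Application Center\replib.dll"))
```

Generating the value this way is mainly useful when you have several files to check and want to paste the Instance strings rather than type them.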

The final task is creating a threshold for the data collector because the only default threshold created is for WMI errors.

Create the Threshold

Follow these steps to create the data collector's threshold.

Right-click the Replication Service Library node; on the pop-up menu, point to New; and then click Threshold.

In the Threshold Properties dialog box, click the General tab, and then provide the following information:
- Name: Date and Time
- Comment: This threshold uses the WMI date and time stamp for the LastModified property.
We decided to use the WMI date and time stamp for this threshold, but file version would have worked as well.

Click the Expression tab, and then create the following expression by using the lists and boxes:
- If this condition is true: If the current value for LastModified (Date/Time) Is less than 20000920100400.
- Duration: Any time this occurs
- The following will occur: The status changes to Warning

Note The data collector's Statistics tab (Figure 9.18) shows the value for the LastModified property in WMI date format. To use this date, we right-clicked the last modified item, made a copy, and pasted the line into Notepad. The WMI date, 20000920100455.620875-420, breaks out as follows:
- Year 2000
- Month 09
- Day 20
- Time 10:04:55 (entered as 10:04:00 in the threshold value)
Click OK to save the threshold information.

As soon as this action is completed, the data collector will start collecting data and testing the data against the threshold.

Figure 9.18 shows a typical statistical report that Application Center provides for data collectors that are installed on a cluster member. In this example, information about our new data collector is displayed.

Figure 9.18 The statistics provided for a data collector

For our sample data collector, we used a value for LastModified that was older than the current version of the DLL. Notice in Figure 9.18 that the LastModified date is different, and that a Warning was generated.
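To make the WMI date format used in the threshold easier to read, here is a small Python sketch that splits a value such as 20000920100455.620875-420 into its parts. It assumes the standard CIM datetime layout (14 digits of date and time, six digits of microseconds, then a signed offset from UTC in minutes). The comparison helper is our own simplification of an "Is less than" test on the 14-digit prefix; it is not a description of how Health Monitor evaluates thresholds internally, and the older sample value in the demo is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_wmi_datetime(value: str) -> datetime:
    """Parse a CIM datetime string such as 20000920100455.620875-420."""
    stamp, rest = value.split(".")
    micros = int(rest[:6])                    # fractional seconds
    sign = 1 if rest[6] == "+" else -1        # direction of the UTC offset
    offset = timedelta(minutes=sign * int(rest[7:]))
    base = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return base.replace(microsecond=micros, tzinfo=timezone(offset))


def is_less_than(value: str, threshold: str) -> bool:
    """Compare the yyyymmddHHMMSS prefixes as text (an assumption)."""
    return value[:14] < threshold[:14]


if __name__ == "__main__":
    print(parse_wmi_datetime("20000920100455.620875-420").isoformat())
    # A hypothetical older file stamp compared with the chapter's threshold.
    print(is_less_than("20000815093000.000000-420", "20000920100400"))
```

Because the fields run from most significant to least significant, comparing the digit strings directly gives the same ordering as comparing the dates, provided both values use the same UTC offset.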

Throttling Notifications

When creating data collectors, you have to be careful about setting up a situation where a data collector fires too many administrative events—throttling is extremely important in having a reliable monitoring solution. The following scenarios include workarounds that can reduce the number of notifications that are fired.

Scenario 1:

An action is mapped to an event-based data collector, and there are large numbers of events coming in.

The workaround is to set the collection interval and have Health Monitor count the number of events that are received. Specify that the action is triggered only when the number of events exceeds the count threshold you establish for the collection period. For example, 10 failed log-on attempts in 2 minutes.
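The counting idea in Scenario 1 can be sketched in a few lines of Python. This is an illustration only: the class name and parameters are hypothetical, it uses a sliding time window rather than Health Monitor's fixed collection interval, and it assumes the action should fire once the count reaches the configured number.

```python
from collections import deque


class EventCountThrottle:
    """Fire only when enough events arrive within a time window."""

    def __init__(self, trigger_count: int, window_seconds: float) -> None:
        self.trigger_count = trigger_count
        self.window_seconds = window_seconds
        self._times = deque()  # timestamps of events still inside the window

    def record(self, timestamp: float) -> bool:
        """Record one event; return True when the action should fire."""
        self._times.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Discard events that have aged out of the window.
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times) >= self.trigger_count


if __name__ == "__main__":
    # 10 failed log-on attempts in 2 minutes, one attempt every 5 seconds.
    throttle = EventCountThrottle(trigger_count=10, window_seconds=120)
    print([throttle.record(i * 5) for i in range(10)])
```

With this arrangement a slow trickle of events never triggers the action, while a burst does.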

Scenario 2:

A data collector flips back and forth repeatedly between a good and bad state.

The first workaround is to require that a bad state stay that way for n collection periods (where n is a configurable value). For nonbinary thresholds, such as processor utilization, you can key off an average over multiple intervals, thus blunting the impact of oscillations right around the threshold value.
The second workaround is to set up two thresholds on a data collector, a go to good threshold and a go to bad threshold, with an established delta between them. For example, send an alert when disk free space goes below 5 percent, but only reset this data collector when free space returns to above 10 percent. In this example, you would both require a manual reset and establish Critical and go to good thresholds.
The third workaround is to set up multiple periods or averages. For example, collect the average processor utilization every 30 seconds, but go critical only if this value is over the threshold for 2 minutes (4 collection intervals).
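The workarounds for Scenario 2 can be sketched as follows. Both pieces are hypothetical Python illustrations of the logic, not Health Monitor features: the first keeps a collector Critical until the value passes a separate go to good threshold, and the second requires the value to stay over a threshold for a number of consecutive collection intervals.

```python
class TwoThresholdState:
    """Track Critical state with separate go-bad and go-good thresholds."""

    def __init__(self, go_bad_below: float, go_good_above: float) -> None:
        self.go_bad_below = go_bad_below
        self.go_good_above = go_good_above
        self.critical = False

    def update(self, free_percent: float) -> bool:
        """Feed one sample; return True while the state is Critical."""
        if not self.critical and free_percent < self.go_bad_below:
            self.critical = True
        elif self.critical and free_percent > self.go_good_above:
            self.critical = False
        return self.critical


def sustained_breach(samples, threshold: float, periods: int) -> bool:
    """True if the last `periods` samples are all over the threshold."""
    recent = list(samples)[-periods:]
    return len(recent) == periods and all(s > threshold for s in recent)


if __name__ == "__main__":
    disk = TwoThresholdState(go_bad_below=5, go_good_above=10)
    print([disk.update(v) for v in [8, 4, 6, 9, 11, 7]])
    # Processor utilization sampled every 30 seconds, 4 intervals = 2 minutes.
    print(sustained_breach([50, 95, 96, 97, 98], threshold=90, periods=4))
```

The gap between the two thresholds is what absorbs small oscillations: a value that hovers around 5 percent changes the state once instead of on every sample.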

Modifying and Creating Actions

Application Center installs several actions that can be associated with any monitors that you enable for the cluster or a cluster member. Figure 9.19 shows the default Actions in the console tree. The Actions view in the details pane displays the items that are associated with a selected action in the console tree; including the name of the monitor, its type, and the condition (threshold) that triggers the action. In the example illustrated in Figure 9.19, the action is Email Administrator, which is associated with several data collectors.

Figure 9.19 The monitors associated with the Email Administrator action

The Default Actions

Table 9.4 describes the default actions that Application Center provides.

Note All the default actions are scheduled as follows. Days to run actions: Every day, Times to run actions: All day.

Table 9.4 Application Center Default Actions

Name	Type of action	Description
Bring Server Online	Command Line	Uses the command-line tool with parameters to set the member back online when the monitors in the Offline/Online folder are good.
Email Administrator	E-mail	Sends an e-mail message to the administrator with the specified message.
Log to offline.log	Text Log	Logs a message to a text file each time the online and offline events are received.
Log to websitefailures.log	Text Log	Logs HTTP monitor failures to a text log with all of the properties that are returned from the request.
Take Server Offline	Command Line	Uses the command-line tool with parameters to set the member offline when a monitor in the Offline/Online folder becomes Critical.

You can modify any of the actions shown in Table 9.4 by opening the properties dialog box for the action (right-click the action name) and selecting the configuration information that you want to change.

Creating a New Action

Use the following procedure to create a new action:

Right-click the Actions node; on the pop-up menu, point to New; and then choose one of the following types of actions:

Command Line Action

E-mail Action

Text Log Action

Windows Event Log Action

Script Action

After you select the type of action that you want to use, the actionname Properties dialog box appears and you can configure the action.

Let's examine each of these action types, focusing on how the Details tab is configured for each action.

Note The General and Schedule tabs are common to all actions and are configured as follows:

General—Provide the action name and a descriptive comment.
Schedule—Set to run all day, every day, by default. Change this by clearing the check box for specific days, specifying a period to include or a period to exclude. A good use for include/exclude is to specify different e-mail addresses to send alerts to during the week and on weekends.

Command Line Action Properties Dialog Box: Details Tab

You use this tab to specify the file name, path, command line, and parameters of the program that you want to run. The program runs when the threshold for the associated data collector is crossed.

File Name—Enter the file name and path for the program to run. A Browse button is provided that enables you to browse to the program by using the Browse for File dialog box. After you select the program, the file name and path are inserted automatically.
Working Directory—Enter the name of the working directory that you want to use for the program. A Browse button is provided that enables you to identify the working directory by using the Browse for Folder dialog box.
Command Line—Enter the command. You can include insertion strings by clicking the right angle bracket (>) to the right of the Command Line box. See sidebar.

Insertion strings

The following set of insertion strings is supported and available for the actions.
- Error Code (from WMI)
- Error Description (from WMI)
- # of Instances Collected
- Embedded Collected Instance
- GUID
- Instance Name
- Formatted Local Time
- Message
- Name
- Parent GUID
- State
- Status GUID
- System Name
- GMT Time
You can also reference a property from the data collector by using the Embedded Collected Instance insertion string to specify the property to display.
Process Timeout—Specify the maximum amount of time (in seconds, minutes, or hours) that the program is allowed to run before it is automatically terminated.
Run In a Window—Select this check box to enable program visibility when it is running.

Note The program will not be displayed in a Terminal Services window.

E-mail Action Properties Dialog Box: Details Tab

Use this tab to identify the SMTP server, e-mail recipients, and the e-mail message to send when the threshold for the associated data collector is crossed.

SMTP Server—Enter the name of the SMTP server that will send the message. You can specify a fully qualified Domain Name System (DNS) name (for example, samples.microsoft.com) or an IP address.
From—Enter the name of the monitor or e-mail address of the person who is sending the message. For example, Health Monitor, administrator@samples.microsoft.com.
To—Provide the e-mail address of the message recipient. Optional buttons, Cc and Bcc, are provided that enable you to identify recipients who should receive a copy of the e-mail message. Click either Cc or Bcc to activate an input dialog box. You can specify multiple recipients if you separate their names with a semicolon (;).
Subject—Provide a subject line for the message or click the right angle bracket (>) to the right of the Subject box to select an insertion string for the subject line. The available insertion strings are the same as those provided for Command Line. (See the preceding sidebar.)
Message—Use the default text and insertion strings for the message or provide your own message and insertion strings. (See the preceding sidebar.) The default message is:

Health Monitor Alert on %EmbeddedStatusEvent.SystemName% at %EmbeddedStatusEvent.LocalTimeFormatted%

%EmbeddedStatusEvent.Message%

This message displays the server name, the time when the alert was fired, and the message property from the data collector or data group.
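To show how insertion strings behave in a message like the default one above, the following Python sketch replaces each %Name% token with a value from a dictionary. It is only an illustration of the substitution idea; the function name and the sample values (WEB01 and the time) are hypothetical, and Health Monitor performs its own expansion when the action runs.

```python
import re

# A token is a run of letters, digits, dots, or underscores between % signs.
_TOKEN = re.compile(r"%([A-Za-z0-9_.]+)%")


def expand_insertion_strings(template: str, values: dict) -> str:
    """Replace %Name% tokens with values; leave unknown tokens untouched."""
    def substitute(match):
        return str(values.get(match.group(1), match.group(0)))
    return _TOKEN.sub(substitute, template)


if __name__ == "__main__":
    template = ("Health Monitor Alert on %EmbeddedStatusEvent.SystemName% "
                "at %EmbeddedStatusEvent.LocalTimeFormatted%")
    sample = {
        "EmbeddedStatusEvent.SystemName": "WEB01",
        "EmbeddedStatusEvent.LocalTimeFormatted": "01:15:00",
    }
    print(expand_insertion_strings(template, sample))
```

Leaving unknown tokens in place, rather than deleting them, makes a mistyped insertion string easy to spot in the resulting message.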

Text Log Action Properties Dialog Box: Details Tab

Use this tab to specify the text to write to the selected log file. The text is written to this log file when the threshold for the associated data collector is crossed.

File—Enter the name of the file to which you want to write information or click Browse to activate the Browse for File dialog box.
Log size—Specify the maximum size allowed for the log file in bytes, KBs, or MBs.
Text—Provide the text that should be written to the log. You can provide text and insertion strings. (Right-click the right angle bracket (>) to the right of the Text box to select an insertion string; see the preceding sidebar.)
Use ASCII Text/Use Unicode Text—Select the text format that you want to use. ASCII is indicated by default.

Note Please note the following points regarding log files:

A log file is created automatically by using the name you specify if the file doesn't already exist.
By default, new information is appended to a log file.
After the log file reaches its size limit, a new file is created automatically.
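The log-file behavior described in these points (create if missing, append, start a new file at the size limit) is similar to what Python's standard RotatingFileHandler does, so we can use it as a rough analogy. This sketch is not how the Text Log action is implemented; in particular, the ".1" suffix on the older file is Python's naming convention, and the sample log text is made up.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def make_text_log(path: str, max_bytes: int) -> logging.Logger:
    """Return a logger that appends to `path` and rolls over at max_bytes."""
    logger = logging.getLogger(f"textlog:{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = RotatingFileHandler(
            path, maxBytes=max_bytes, backupCount=1, encoding="ascii")
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger


def demo_rollover() -> list:
    """Write past a tiny size limit and list the files that result."""
    with tempfile.TemporaryDirectory() as folder:
        log = make_text_log(os.path.join(folder, "offline.log"), max_bytes=200)
        for i in range(10):
            log.info("WEB01 01:%02d:00, member set offline (sample line)", i)
        for handler in list(log.handlers):
            handler.close()
            log.removeHandler(handler)
        return sorted(os.listdir(folder))


if __name__ == "__main__":
    print(demo_rollover())
```

The demo uses a 200-byte limit only so that the rollover happens after a few lines; the default action in this chapter uses 1048576 bytes.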

Windows Event Log Action Properties Dialog Box: Details Tab

Use this tab to specify a Windows event that will be generated and automatically written to the Windows Event Log when a data collector's threshold is crossed.

Text—Provide the message text and/or insertion string(s) that will be written to the Event Log. (See the preceding sidebar.)

Script Action Properties Dialog Box: Details Tab

Use this tab to specify the name and path of the script that you want to run when a collector's threshold is crossed.

Script type—Identify whether the script is written in VBScript or JScript. Enter the path for the file or, alternatively, click Browse to launch the Browse for File dialog box.

Note You can edit the file you select by clicking the Edit button. This action opens the file in the default text editor, Microsoft Notepad.
Process Timeout—Establish a maximum time length that the script can run before its execution is halted. You can specify the process timeout in seconds, minutes, or hours.

Details Tab Configuration Examples

Table 9.5 uses the default actions to illustrate how different types of actions are configured.

Table 9.5 Examples of an actionname Properties Dialog Box Configuration for the Details Tab

Action	Configuration
Bring Server Online	Filename: Ac.exe; Command line: ac.exe loadbalance /online /memberonly; Process timeout: 120 seconds
Email Administrator	SMTP server: Null; From: HealthMonitor; To: Null; Subject: %EmbeddedStatusEvent.Name% Alert on %EmbeddedStatusEvent.SystemName%; Message: Health Monitor Alert on %EmbeddedStatusEvent.SystemName% at %EmbeddedStatusEvent.LocalTimeFormatted% %EmbeddedStatusEvent.Message%
Log to offline.log	File: Offline.log; Log size: 1048576 (Bytes); Text: %EmbeddedStatusEvent.SystemName% %EmbeddedStatusEvent.LocalTimeFormatted%, %EmbeddedStatusEvent.Message%; Use ASCII text
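The Bring Server Online entry combines a command line with a process timeout. As a rough sketch of what a timeout means for a command-line action, the following Python function runs a program and gives up if it has not finished in time. The function name is our own, and the demo runs a harmless Python one-liner rather than ac.exe, which is available only on a computer where Application Center is installed.

```python
import subprocess
import sys


def run_command_action(argv, timeout_seconds):
    """Run a command; return its exit code, or None if it timed out.

    subprocess.run stops the child process when the timeout expires.
    """
    try:
        completed = subprocess.run(argv, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


if __name__ == "__main__":
    # On a cluster member, the command from Table 9.5 would be passed as:
    #   ["ac.exe", "loadbalance", "/online", "/memberonly"] with 120 seconds
    print(run_command_action([sys.executable, "-c", "print('action ran')"], 120))
```

Returning None on a timeout lets the caller distinguish a program that failed from one that never finished.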

Configuring Event Logging

To modify event logging for an Application Center cluster:

In the Application Center snap-in, expand the clustername node, right-click the Events node, and then on the pop-up menu, click Properties.

Figure 9.20 shows the Events Properties dialog box that you can use to tailor event logging for a cluster.

Figure 9.20 The Events Properties dialog box for a cluster

There are four types of events that you can log for each monitoring group:

All
Errors and Warnings
Errors Only
None

In addition to specifying the events that you want to log for Application Center, Windows, and Health Monitor, you can also set the time (in days) for retaining event log information and turn off performance counter logging.

Event Exclusion

The Events Properties dialog box also supports event-logging exclusion, which is to say, you can specify events that you don't want logged. This is a useful feature in cases where you have events that are specific to issues that you are already aware of, so there is no need to see and log these events continuously.

If you do decide to disable logging for a specific event, any information that's been logged up to the time that you exclude the event will still be available until Application Center cleans out the log at the end of its retention period, or until you clear the event log manually. Figure 9.21 shows the dialog box that you can use to exclude an event.

Figure 9.21 The Event Exclusions dialog box

Excluding an event is essentially a two-part process that involves specifying the product—for example, Application Center—and then identifying the event by severity and source.

Let's say, for example, that one of our members is generating the Windows 3101 event (Unable to read IO information from NBT device.) and that we want to exclude this event from logging. We'd take the following steps:

Right-click Events, and then on the pop-up menu, click Properties to open the Events Properties dialog box.
Click Exclusions to open the Event Exclusions dialog box.
In the Product list, click Windows, and then click Add.
The message we're seeing is an error, so under Error severity, click Error.
Because Windows requires an event source (a service or application), in the Source box, enter perfctrs.
Finally, in the Event ID box, enter 3101, and then click OK.

Any time that you want to see what events are excluded or remove an event from the exclusion list, open the Event Exclusions dialog box in the way that we did in the preceding example. Use the Product box to get a list of the excluded events. To remove an event, in the Product box, click the event, and then click Remove.

Note You should be aware of the following items related to event exclusions:

For Application Center events, it is not necessary to specify the source because every event identifier is unique.
Health Monitor data collector events require that the event source, which is the name of the data collector, be entered in the Source box. The event identifier is not required.
Event logging exclusions use a combination of the event properties Product, Event ID, and Severity (Type) to identify which event to exclude. For example, Information event 8022 is considered a different event than Warning event 8022. Therefore, to exclude all events with an identifier of 8022, you must use all three properties in the exclusion definition.
Using wild cards to exclude groups of events is not supported by the Application Center user interface.
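The way these properties combine can be pictured as a simple matching rule. The Python sketch below is our own model of the notes above, not Application Center code: an exclusion always names a product and a severity, and the event ID and source are compared only when the rule supplies them. All names in it are hypothetical.

```python
from typing import NamedTuple, Optional


class Exclusion(NamedTuple):
    """One entry in a hypothetical exclusion list."""
    product: str
    severity: str
    event_id: Optional[int] = None   # not required for Health Monitor events
    source: Optional[str] = None     # not required for Application Center events


def is_excluded(event: dict, exclusions) -> bool:
    """Return True if any exclusion matches the event."""
    for rule in exclusions:
        if rule.product != event["product"]:
            continue
        if rule.severity != event["severity"]:
            continue
        if rule.event_id is not None and rule.event_id != event["event_id"]:
            continue
        if rule.source is not None and rule.source != event["source"]:
            continue
        return True
    return False


if __name__ == "__main__":
    rules = [Exclusion("Windows", "Error", 3101, "perfctrs")]
    error_event = {"product": "Windows", "severity": "Error",
                   "event_id": 3101, "source": "perfctrs"}
    warning_event = dict(error_event, severity="Warning")
    print(is_excluded(error_event, rules), is_excluded(warning_event, rules))
```

Because severity is part of the match, the same event ID logged at a different severity is treated as a different event and is not excluded by this rule.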

The event logging configuration is replicated automatically to cluster members. This happens any time you change either the cluster settings or the controller settings.

The Application Center online Help provides additional information about monitoring and troubleshooting. The "Error Message Reference," shown in Figure 9.22, is located under "References" in the Application Center Help. This section enables you to browse the local collection of error messages or enter an error message number that is linked to the Microsoft Support Online Web site.

Figure 9.22 The Error Message Reference section in the Application Center online Help

The final aspect of cluster monitoring is related to cluster performance and performance counters, which we deal with in Chapter 10, "Working with Performance Counters."