Monitoring Reliability and Availability of Windows 2000-based Server Systems
This paper describes tools and metrics that you can use to monitor the reliability and availability of server computers running the Microsoft Windows 2000 operating system.
The Windows 2000 operating system contains tools to monitor various conditions of the operating system and the computer in general. This paper describes these tools, their metrics, and some of the commonly monitored conditions. This paper is not meant to be an in-depth study of all the capabilities of the tools, but is intended to be a source of reference for setting up and managing the most common measurement conditions.
Reliability and Availability Metrics
Operating System Stop Errors
As with all operating systems, Windows 2000 occasionally encounters a serious error condition and stops responding. Windows stop errors display text on the console video screen with a blue background, and hence are often called blue screens. These conditions are also referred to as bug checks. Fortunately, operating system stoppages are relatively rare events. However, customers should still monitor for them regularly.
A complete description of procedures for handling Windows stop errors is beyond the scope of this paper. However, customers can find additional information about these conditions in the Microsoft Support Knowledge Base at: http://support.microsoft.com/support. In particular, refer to these articles:
Q192463. Gathering Blue Screen Information After Memory Dump
Q129845. Blue Screen Preparation Before Calling Microsoft
Q103059. Descriptions of Bug Codes for Windows NT
The Windows Event Log service is a useful tool for historical monitoring of operating system crashes. Stop errors are recorded in the Event Log when the system restarts, and the crash dump is saved in a permanent file (usually called Memory.dmp). For details, see the subsection “Save Dump” in the “Using the Event Log as a Data Source” section of this document.
Operating System Reboots
Windows 2000 reboots occur for a variety of reasons, including operating system upgrades, software installation, and hardware maintenance. Reboots are recorded in the System Event Log. A system’s reboot frequency tends to drop when the system is stable. Thus, historical reboot frequencies are a long-term indicator of system and data center health. For more information, see the subsection “Startup Event” in the “Using the Event Log as a Data Source” section of this document.
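As a minimal sketch of tracking reboot frequency over time, the following Python example counts startup time stamps per calendar month. The function name and the sample time stamps are hypothetical; in practice the time stamps would come from the startup events in the System Event Log.

```python
from collections import Counter
from datetime import datetime

def reboots_per_month(startup_times):
    """Count startup events per calendar month (keys look like '2000-06')."""
    return Counter(t.strftime("%Y-%m") for t in startup_times)

# Hypothetical startup time stamps taken from a System Event Log.
startups = [
    datetime(2000, 5, 3, 8, 15),
    datetime(2000, 5, 17, 22, 40),
    datetime(2000, 5, 29, 6, 5),
    datetime(2000, 6, 12, 23, 30),
]
counts = reboots_per_month(startups)
for month in sorted(counts):
    print(month, counts[month])
```

A falling monthly count over a long period is consistent with a system that is becoming more stable.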
Application Failures
Windows 2000 uses the Dr. Watson utility to record application failures. This utility appends information to the Drwtsn32.log file in the system root for each application failure. It also creates a User.dmp file that contains a memory dump of the user mode program that failed.
As with operating system crashes, complete procedures for handling application crashes are beyond the scope of this paper. For more information, refer to the following support articles in the Microsoft Support Knowledge Base:
Q94924. Postmortem Debugging Under Windows NT
Q141465. How to Install Symbols for Dr Watson Error Debugging
Application failures are recorded in the Application Event Log; therefore, the historical frequencies of these events are usually available for analysis. For details, see the subsection “Dr. Watson Event” in the “Using the Event Log as a Data Source” section of this document.
Operating System Availability
Most customers are very interested in the availability of the application services provided by the Windows operating systems. Each application generally requires different instrumentation; therefore, rather than measuring the application’s availability directly, some customers find it useful to measure the operating system availability. The events needed to do this are contained in the Windows System Event Log.
There are several variations of availability, including planned availability and total availability. Total availability is defined as the percentage of up time over total run time, and can be computed easily using the information in the System Event Log.
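The total availability calculation can be sketched in a few lines of Python. This is an illustration only: the function name, the observation period, and the sample outage are hypothetical, and the outage intervals are assumed to have already been extracted from the System Event Log.

```python
from datetime import datetime, timedelta

def total_availability(period_start, period_end, outages):
    """Return up time as a percentage of the whole observation period.

    outages is a list of (down, up) datetime pairs inside the period.
    """
    total = period_end - period_start
    if total <= timedelta(0):
        raise ValueError("period_end must be after period_start")
    down_time = sum((up - down for down, up in outages), timedelta(0))
    return 100.0 * ((total - down_time) / total)

# Hypothetical example: one 36-minute outage in a 30-day period.
start = datetime(2000, 6, 1)
end = start + timedelta(days=30)
outages = [(datetime(2000, 6, 10, 2, 0), datetime(2000, 6, 10, 2, 36))]
availability = total_availability(start, end, outages)
print(f"{availability:.3f}%")  # prints 99.917%
```

Planned availability could be computed the same way by leaving scheduled outages out of the list, although how an outage is classified as planned is a local policy decision.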
Operating System Mean Time to Repair
There is a strong correlation between availability and recoverability of systems. System recoverability is measured as the length of time a system is unavailable following a system outage. Typically, this is reported as mean time to repair. It is easy to measure mean time to repair using the Windows 2000 operating system Event Log. An outage begins when a system is shut down and ends when the system is restarted. To understand how to capture the time stamps associated with these events, see the subsections “Startup Event,” “Clean Shutdown Event,” and “Dirty Shutdown Event,” in the “Using the Event Log as a Data Source” section of this document.
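The following Python sketch shows one way to compute mean time to repair from Event Log time stamps. It pairs each clean shutdown event (Event ID 6006) with the next startup event (Event ID 6005), as described later in this document. The function names and sample events are hypothetical, and dirty shutdowns are left out to keep the example short.

```python
from datetime import datetime, timedelta

STARTUP = 6005         # Event Log service started
CLEAN_SHUTDOWN = 6006  # Event Log service stopped

def outages_from_events(events):
    """Pair each clean shutdown with the next startup.

    events is a list of (timestamp, event_id) tuples in any order.
    """
    outages, down_at = [], None
    for when, event_id in sorted(events):
        if event_id == CLEAN_SHUTDOWN:
            down_at = when
        elif event_id == STARTUP and down_at is not None:
            outages.append((down_at, when))
            down_at = None
    return outages

def mean_time_to_repair(outages):
    """Return the average outage length, or None if there were no outages."""
    if not outages:
        return None
    total = sum((up - down for down, up in outages), timedelta(0))
    return total / len(outages)

# Hypothetical events: a 10-minute outage and a 30-minute outage.
events = [
    (datetime(2000, 6, 1, 8, 0), STARTUP),
    (datetime(2000, 6, 5, 22, 0), CLEAN_SHUTDOWN),
    (datetime(2000, 6, 5, 22, 10), STARTUP),
    (datetime(2000, 6, 20, 1, 0), CLEAN_SHUTDOWN),
    (datetime(2000, 6, 20, 1, 30), STARTUP),
]
mttr = mean_time_to_repair(outages_from_events(events))
print(mttr)  # prints 0:20:00
```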
Using the Event Log as a Data Source
Event Viewer Utility
You can use the Event Log service and Event Viewer to gather information about hardware, software, and system problems, and to monitor Windows security events.
Windows 2000 records events in three types of logs:
Application Log. The Application Log contains events logged by applications or programs. For example, a database program might record a file error in the Application Log. The program developer decides which events to record.
System Log. The System Log contains events logged by the Windows system components. For example, the failure of a driver or other system component to load during startup is recorded in the System Log. The event types logged by system components are predetermined for the operating system.
Security Log. The Security Log can record security events, such as valid and invalid logon attempts as well as events related to resource use, such as creating, opening, or deleting files. An administrator can specify what events are recorded in the Security Log. For example, if you have enabled logon auditing, attempts to log on to the system are recorded in the Security Log.
Event Viewer displays these types of events:
Error. A significant problem, such as loss of data or loss of functionality. For example, if a service fails to load during startup, an error is logged.
Warning. An event that is not necessarily significant, but may indicate a possible future problem. For example, when disk space is low, a warning is logged.
Information. An event that describes the successful operation of an application, driver, or service. For example, when a network driver loads successfully, an information event is logged.
Success Audit. An audited security access attempt that succeeds. For example, a user's successful attempt to log on to the system is logged as a Success Audit event.
Failure Audit. An audited security access attempt that fails. For example, if a user tries to access a network drive and fails, the attempt is logged as a Failure Audit event.
The Event Log service starts automatically when you start Windows. All users can view the Application and System Logs, but only administrators have access to Security Logs.
By default, security logging is turned off. You can use Windows 2000 Group Policy to enable security logging. The administrator can also set auditing policies in the registry that cause the system to halt when the Security Log is full. For more information about using Group Policy, refer to the Windows 2000 documentation and to the Group Policy white papers available at http://www.microsoft.com/windows2000/library.
To display the Event Viewer
On the Start menu, click Run.
In the Open box, type eventvwr.
Click OK. The Event Viewer displays as follows:
Figure 1: Event Viewer
Exporting the Event List
You may want to export the event list to Microsoft Excel so that you can save and analyze the data.
To export the list
On the Event Viewer Action menu, click Export List. A Save As window displays.
Save the file with an .xls extension.
Figure 2: Exporting the event list
The opened Excel file appears similar to the following:
Figure 3: Opened Excel file with exported event data
You can sort events to more easily review and analyze the data.
To specify sort order
On the View menu, click Newest First or Oldest First. The default is from newest to oldest.
(Optional) On the Options menu, check the Save Settings On Exit box to use the current sort order the next time you start Event Viewer.
Figure 4: Specifying sort order
Note: When a log is archived, the sort order affects files that you save in text format or comma-delimited text format. The sort order does not affect event records you save in log-file format.
You can filter events so that you can easily see only those events that you wish to review and analyze.
To filter events
On the View menu, click Filter Events.
In the Filter dialog box, specify the characteristics for displayed events. To return to the default criteria, click Clear.
Figure 5: Filtering events
To turn off event filtering, click All Events in the View menu.
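The same kind of filtering can be applied to an exported event list outside Event Viewer. The Python sketch below assumes the list was saved as comma-delimited text with a header row, and that the columns are named Type, Date, Time, Source, Category, Event, User, and Computer. Those column names, the function names, and the sample rows are assumptions made for illustration and may not match an actual export.

```python
import csv
import io

def load_events(text_file):
    """Read a comma-delimited event list that has a header row."""
    return list(csv.DictReader(text_file))

def filter_events(events, event_ids=None, source=None):
    """Keep rows that match the given Event IDs and/or source name."""
    selected = []
    for row in events:
        if event_ids is not None and int(row["Event"]) not in event_ids:
            continue
        if source is not None and row["Source"] != source:
            continue
        selected.append(row)
    return selected

# Hypothetical export contents; a real file would be opened with open().
sample = io.StringIO(
    "Type,Date,Time,Source,Category,Event,User,Computer\n"
    "Information,6/5/2000,10:10:00 PM,EventLog,None,6005,N/A,SERVER1\n"
    "Information,6/5/2000,10:00:00 PM,EventLog,None,6006,N/A,SERVER1\n"
    "Warning,6/4/2000,3:12:00 PM,Srv,None,2013,N/A,SERVER1\n"
)
events = load_events(sample)
restarts = filter_events(events, event_ids={6005, 6006}, source="EventLog")
for row in restarts:
    print(row["Date"], row["Time"], row["Event"])
```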
Startup Event
Windows 2000 records startup events in the System Event Log, as shown below. The Event Log service itself is the source of this event, and the Event ID is 6005. The time of this event is approximately the time the operating system becomes available to applications.
Clean Shutdown Event
Windows 2000 records a new event whenever an operating system shutdown is initiated. A clean shutdown can be initiated through several mechanisms.
Direct user interaction using a Shutdown screen as follows:
Shutdown or Restart using Ctrl+Alt+Delete
Shutdown or Restart using the Start menu
Shutdown or Restart using the Logon screen
Programmatically as follows:
InitiateSystemShutdown WIN32 API local
InitiateSystemShutdown WIN32 API remote
The Event Log service itself is the source of this event, and the Event ID is 6006. The time of this event is approximately the time the operating system becomes unavailable to applications.
Dirty Shutdown Event
Windows 2000 records a new event whenever the operating system is shut down by a mechanism other than a clean shutdown. The most common cause is the system being turned off. The Event Log service itself is the source of this event, and the Event ID is 6008. The event is recorded when the system restarts and Windows 2000 discovers that the previous shutdown was not clean.
While Windows 2000 Server is running, the system periodically writes a time stamp to disk. This last alive time stamp is saved in the Windows 2000 registry, always overwriting the last alive time stamp from the previous interval. Whenever the last alive time stamp is written, it is also flushed to disk. In this way, if the computer crashes, the boot stamp and the last alive stamp are the final two entries in the stream. If the computer shuts down normally, the normal shutdown time stamp overwrites the last alive time stamp.
The time in the description portion of this event is the last alive time and is therefore shortly before the time the operating system became unavailable to applications.
The last alive time stamp is written only on Windows 2000 server operating systems. The Windows 2000 Professional operating system does not maintain this time stamp, nor does it record dirty shutdown events.
The last alive time stamp is written to the registry at HKLM\Software\Microsoft\Windows\CurrentVersion\Reliability\LastAliveStamp.
The last alive time stamp interval defaults to 5 minutes. You can add the registry value TimeStampInterval to change the interval. This value is in units of minutes. Setting it to zero prevents any last alive time stamp logging; only the boot and normal shutdown stamps are written in that case.
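Because the last alive time stamp is written only once per interval, the true failure time after a dirty shutdown lies somewhere between the last alive time and one interval later. The Python sketch below turns that reasoning into lower and upper bounds on the outage length. The function name and the sample times are hypothetical, and the sketch assumes the time stamps were written on schedule until the failure.

```python
from datetime import datetime, timedelta

def dirty_outage_bounds(last_alive, restarted, interval_minutes=5):
    """Return (shortest, longest) possible outage after a dirty shutdown.

    The system failed at or after last_alive, and before the next
    last alive time stamp would have been written.
    """
    if interval_minutes <= 0:
        raise ValueError("last alive logging must be enabled (interval > 0)")
    longest = restarted - last_alive
    latest_failure = min(last_alive + timedelta(minutes=interval_minutes),
                         restarted)
    shortest = restarted - latest_failure
    return shortest, longest

# Hypothetical example: last alive at 14:00, restart logged at 14:20.
shortest, longest = dirty_outage_bounds(
    datetime(2000, 6, 8, 14, 0), datetime(2000, 6, 8, 14, 20))
print(shortest, longest)  # prints 0:15:00 0:20:00
```

A shorter interval narrows the bounds at the cost of more frequent writes to disk.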
System Version Event
Windows 2000 records a new event containing the operating system version information whenever the system is started. This makes it easier to post-process Windows 2000 Event Logs by operating system version. The Event Log service itself is the source of this event, and the Event ID is 6009.
Service Pack Installation
Windows 2000 now records service pack version details in the system Event Log. This makes it easier to post-process Windows 2000 system Event Logs by operating system version.
Save Dump
Save Dump events are always generated on Windows 2000 Server systems after an operating system stop error. They can still be disabled on Windows 2000 Professional systems.
Dr. Watson Event
Windows 2000 records application failures in Dr. Watson log files, and the Dr. Watson utility records application failure events in the Windows 2000 Application Event Log as shown below.
System Performance Monitor (PerfMon) is a tool that allows an administrator to monitor many types of conditions occurring within a local computer or a remote computer located across the globe.
PerfMon performs real-time and short-term historical monitoring of conditions called counters that are contained within categories of objects. One such counter, System Uptime, is described below.
System Uptime Counter
The System Uptime counter measures the time, in seconds, that the system has been “alive.” PerfMon graphs the results on the screen as they are gathered, and allows you to export the results to Excel for reporting purposes.
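Because the counter reports seconds, a small conversion makes exported values easier to read in reports. The Python function below is a hypothetical helper and not part of PerfMon.

```python
def format_uptime(seconds):
    """Convert an uptime in seconds to days, hours, minutes, and seconds."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours:02d}h {minutes:02d}m {secs:02d}s"

print(format_uptime(200000))  # prints 2d 07h 33m 20s
```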
You can configure PerfMon to alert you when thresholds have been exceeded. You choose the criteria you want reported and the manner in which you want it reported to you. Figure 13 shows PerfMon set up to report when CPU performance exceeds 80 percent.
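The threshold idea behind an alert can be illustrated with exported counter samples. The Python sketch below lists the samples that exceed 80 percent. The function name and sample values are hypothetical, and the sketch does not reproduce how PerfMon itself evaluates or delivers alerts.

```python
def samples_over_threshold(samples, threshold=80.0):
    """Return the (time_label, value) samples whose value exceeds threshold."""
    return [(label, value) for label, value in samples if value > threshold]

# Hypothetical processor utilization samples, in percent.
cpu_samples = [("10:00", 35.0), ("10:01", 82.5), ("10:02", 80.0), ("10:03", 97.1)]
alerts = samples_over_threshold(cpu_samples)
for label, value in alerts:
    print(f"{label}: {value:.1f}% exceeds threshold")
```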
For More Information
For the latest information on the Windows 2000 Server family, visit the World Wide Web site at http://www.microsoft.com/windows2000/default.mspx.