Monitoring Reliability and Availability of Windows NT Server Systems
Archived content. No warranty is made as to technical accuracy. Content may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist.
This paper provides information on what monitoring data to look at and how to look at it to provide reliability and availability for Windows NT® servers.
Windows 2000® includes several tools for monitoring the condition of the operating system and of the computer in general. This paper describes those tools and the conditions that are most commonly monitored. It is not an in-depth study of every capability of the tools presented here, but it is a useful reference for setting up and managing the most common measurements.
Reliability and Availability Metrics
Operating System Crashes
Like all operating systems, Windows 2000® occasionally encounters a serious error and crashes. When Windows 2000® crashes, it displays text on the console video screen with a blue background, so these crashes are often called "bluescreens". They are also called "bugchecks". Fortunately, operating system crashes are relatively rare events. However, customers should still monitor them regularly.
A complete description of procedures for handling Windows bluescreens is beyond the scope of this paper. Customers can find additional information about this in the Microsoft Support Knowledge Base at: http://support.microsoft.com/support. In particular, try these articles:
Q192463 - Gathering Blue Screen Information After Memory Dump
Q129845 - Blue Screen Preparation Before Calling Microsoft
Q103059 - Descriptions of Bug Codes for Windows NT
The Windows 2000® event log is a useful tool for historical monitoring of operating system crashes. Bluescreens are recorded in the event log when the system reboots, and the crash dump is saved to a permanent file (usually memory.dmp). See the SaveDump section in this document for details.
Operating System Reboots
Windows 2000® reboots occur for a variety of reasons, including operating system upgrades, software installation, and hardware maintenance. Reboots are recorded in the system event log. A system's reboot frequency tends to drop when the system is stable. Thus, historical reboot frequencies are a long-term indicator of system and datacenter health. See the Startup Event section in this document for details.
Application Crashes
Windows 2000® application crashes are recorded by the "Dr Watson" utility. For each application crash, this utility appends information to the drwtsn32.log file in the system root. It also creates a user.dmp file, which is a memory dump of the user-mode program that crashed.
As with operating system crashes, the complete procedures for handling application crashes are beyond the scope of this paper. The following Knowledge Base articles are a starting point:
Q94924 - Postmortem Debugging Under Windows NT
Q141465 - How to Install Symbols for Dr. Watson Error Debugging
Application crashes are recorded in the application event log, and thus the historical frequencies of these are usually available for analysis. See the DrWatson Event section in this document for details.
Operating System Availability
Most customers are very interested in the availability of the application services provided by their Windows 2000® systems. Each application generally requires different instrumentation. Rather than measure application availability directly, some customers find it useful to measure operating system availability. The events needed to do this are contained in the Windows 2000® system event log.
There are several variations of availability, including "planned availability" and "total availability". Total availability is defined as the percentage of uptime over total runtime and can be computed easily using the information in the system event log.
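The definition above can be sketched in a few lines. This is a minimal illustration, assuming that "total runtime" means uptime plus downtime over the observation period; the function name and the sample figures are hypothetical and do not come from any Windows tool.

```python
def total_availability(uptime_seconds, downtime_seconds):
    """Return uptime as a percentage of total observed time.

    Assumption: total runtime = uptime + downtime.
    """
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return 100.0 * uptime_seconds / total


# Hypothetical example: a 30-day period with 43.2 minutes of downtime.
period = 30 * 24 * 3600          # 2,592,000 seconds observed
downtime = 2592                  # 43.2 minutes, in seconds
availability = total_availability(period - downtime, downtime)
print(f"{availability:.2f}%")    # prints 99.90%
```

The uptime and downtime totals themselves would come from the startup and shutdown events described later in this paper.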
Operating System Mean Time to Repair
There is a strong correlation between the availability and the recoverability of systems. System recoverability is measured as the length of time a system is unavailable following a system outage. Typically, this is reported as "mean time to repair". It is easy to measure mean time to repair using the NT system event log. An outage begins when a system is shut down and ends when the system is rebooted. See the sections on startup events, clean shutdowns, and dirty shutdowns to understand how to capture the timestamps associated with these events.
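As a rough sketch of this calculation, the following assumes the outage start and end times have already been extracted from the event log as pairs of timestamps. The function name and the sample outages are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(outages):
    """Average outage length for a list of (shutdown, startup) pairs.

    Returns None when no outages were observed.
    """
    durations = [startup - shutdown for shutdown, startup in outages]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical outages: one of 10 minutes and one of 20 minutes.
outages = [
    (datetime(1999, 1, 2, 10, 0), datetime(1999, 1, 2, 10, 10)),
    (datetime(1999, 1, 9, 3, 0), datetime(1999, 1, 9, 3, 20)),
]
print(mean_time_to_repair(outages))  # prints 0:15:00
```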
Event Log as a Data Source
Event Viewer Utility
Using the event logs in Event Viewer, you can gather information about hardware, software, and system problems, and monitor Windows 2000® security events.
Windows 2000® records events in three kinds of logs:
Application Log - The application log contains events logged by applications or programs. For example, a database program might record a file error in the application log. The program developer decides which events to record.
System Log - The system log contains events logged by the Windows 2000® system components. For example, the failure of a driver or other system component to load during startup is recorded in the system log. The event types logged by system components are predetermined by Windows 2000®.
Security Log - The security log can record security events such as valid and invalid logon attempts as well as events related to resource use, such as creating, opening, or deleting files. An administrator can specify what events are recorded in the security log. For example, if you have enabled logon auditing, attempts to log on to the system are recorded in the security log.
Event Viewer displays these types of events:
Error - A significant problem, such as loss of data or loss of functionality. For example, if a service fails to load during startup, an error will be logged.
Warning - An event that is not necessarily significant, but may indicate a possible future problem. For example, when disk space is low, a warning will be logged.
Information - An event that describes the successful operation of an application, driver, or service. For example, when a network driver loads successfully, an Information event will be logged.
Success Audit - An audited security access attempt that succeeds. For example, a user's successful attempt to log on the system will be logged as a Success Audit event.
Failure Audit - An audited security access attempt that fails. For example, if a user tries to access a network drive and fails, the attempt will be logged as a Failure Audit event.
The EventLog service starts automatically when you start Windows 2000®. Application and system logs can be viewed by all users. Security logs are accessible only to administrators.
By default, security logging is turned off. You can use Group Policy to enable security logging. The administrator can also set auditing policies in the registry that cause the system to halt when the security log is full.
To display the Event Viewer, click the Start button, click Run, type "eventvwr" in the space provided, and then click the OK button.
Exporting the Event List
To export the event list to Excel, click Action in the Event Viewer menu, and then click Export List. A Save As window displays. Choose a Save as type and enter a file name that has an .xls extension.
To specify the sort order, on the View menu, click Newest First or Oldest First. The default is from newest to oldest.
If the "Save Settings On Exit" box is checked in the Options menu when you quit, the current sort order is used the next time you start Event Viewer.
Note: When a log is archived, the sort order affects files that you save in text format or comma-delimited text format. The sort order does not affect event records you save in log-file format.
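A log saved in comma-delimited text format can be post-processed with a short script. The sketch below is illustrative only: the column order in ASSUMED_COLUMNS is an assumption and should be checked against an actual saved file, and the two sample rows are made up.

```python
import csv
import io

# Assumption: column order of the comma-delimited file. Verify against a
# real export before relying on it.
ASSUMED_COLUMNS = ["Date", "Time", "Source", "Type",
                   "Category", "Event", "User", "Computer"]


def read_exported_events(text, columns=ASSUMED_COLUMNS):
    """Parse comma-delimited event text into a list of dictionaries."""
    events = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < len(columns):
            continue  # skip blank or truncated lines
        events.append(dict(zip(columns, row)))
    return events


# Made-up sample rows, newest first (the default sort order).
sample = (
    "1/4/99,8:02:11 AM,EventLog,Information,None,6005,N/A,SERVER1\n"
    "1/3/99,6:45:30 PM,EventLog,Information,None,6006,N/A,SERVER1\n"
)
for event in read_exported_events(sample):
    print(event["Date"], event["Time"], event["Event"])
```

Because the sort order affects text-format files, a script like this should not assume the rows are in chronological order.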
To filter events, on the View menu, click Filter Events. In the Filter dialog box, specify the characteristics for displayed events. To return to the default criteria, click Clear.
Tip: To turn off event filtering, click "All Events" in the View menu.
Startup Event
Prior to Service Pack 4, Windows NT® 4 recorded startup events in the system event log. The EventLog service itself is the source of this event, and the Event ID is 6005. The time of this event is approximately the time the operating system becomes available to applications.
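Because each startup is logged with Event ID 6005, historical reboot frequency can be derived by counting those events per month. The following is a minimal sketch, assuming the events have already been read into (timestamp, event ID) pairs; the function name and sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime

STARTUP_EVENT_ID = 6005  # EventLog service startup event


def reboots_per_month(events):
    """Count startup events per (year, month).

    `events` is an iterable of (timestamp, event_id) pairs.
    """
    counts = Counter()
    for timestamp, event_id in events:
        if event_id == STARTUP_EVENT_ID:
            counts[(timestamp.year, timestamp.month)] += 1
    return dict(counts)


# Hypothetical event list.
events = [
    (datetime(1999, 1, 4, 8, 2), 6005),
    (datetime(1999, 1, 20, 22, 15), 6005),
    (datetime(1999, 1, 20, 22, 10), 6006),
    (datetime(1999, 2, 11, 7, 30), 6005),
]
print(reboots_per_month(events))
```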
Clean Shutdown Event
Service Pack 4 records a new event whenever an operating system shutdown is initiated. A clean shutdown can be initiated through several mechanisms:
Direct user interaction via a Shutdown screen
Shutdown/Restart via Ctrl+Alt+Delete
Shutdown/Restart via Start Menu
Shutdown/Restart via Logon screen
InitiateSystemShutdown Win32 API (local)
InitiateSystemShutdown Win32 API (remote)
The EventLog service itself is the source of this event, and the Event ID is 6006. The time of this event is approximately the time the operating system becomes unavailable to applications.
Dirty Shutdown Event
Service Pack 4 records a new event whenever the operating system is shut down via a mechanism other than a clean shutdown. The most common cause is a power cycle, that is, Windows NT® is stopped by powering off the system. The EventLog service itself is the source of this event, and the Event ID is 6008. The event is recorded upon the subsequent system reboot, when Windows NT® discovers that the previous shutdown was not clean.
While Windows NT® Server is running, the system periodically writes a time stamp to disk. This "last alive" time stamp is saved in the Windows NT® registry, always overwriting the "last alive" time stamp from the previous interval. Whenever the "last alive" time stamp is written, it is also flushed to disk. If the machine crashes, the final two entries are a boot stamp and a "last alive" stamp. If the machine shuts down normally, the normal shutdown time stamp overwrites the "last alive" time stamp.
The time in the description portion of this event is the "last alive" time and is therefore shortly before the time the operating system became unavailable to applications.
The "last alive" time stamp is written only on Windows NT® Server systems. Windows NT® Workstation systems do not maintain this time stamp, nor do they record dirty shutdown events. The "last alive" time stamp is written to the registry at HKLM\Software\Microsoft\Windows\CurrentVersion\Reliability\LastAliveStamp. The "last alive" time stamp interval defaults to 5 minutes. Adding the registry value "TimeStampInterval" changes the interval. This value is in units of minutes. Setting it to zero prevents any "last alive" time stamp logging; only the boot and normal shutdown stamps are written in that case.
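The startup, clean shutdown, and dirty shutdown events can be combined to reconstruct outage intervals. The sketch below is one possible approach, not a description of any Microsoft tool. It assumes the events have already been read from the system event log, and that the "last alive" time has been parsed out of each 6008 event description into a hypothetical last_alive field. A clean outage starts at the time of the 6006 event; a dirty outage starts at the "last alive" time; either one ends at the next 6005 startup event.

```python
import bisect
from datetime import datetime
from typing import NamedTuple, Optional

STARTUP, CLEAN_SHUTDOWN, DIRTY_SHUTDOWN = 6005, 6006, 6008


class Event(NamedTuple):
    time: datetime                         # time the event was logged
    event_id: int
    last_alive: Optional[datetime] = None  # hypothetical field, 6008 only


def find_outages(events):
    """Return sorted (began, ended, kind) tuples, one per outage."""
    startups = sorted(e.time for e in events if e.event_id == STARTUP)
    outages = []
    for e in events:
        if e.event_id == CLEAN_SHUTDOWN:
            began, kind = e.time, "clean"
        elif e.event_id == DIRTY_SHUTDOWN and e.last_alive is not None:
            began, kind = e.last_alive, "dirty"
        else:
            continue
        # The outage ends at the first startup after it began.
        i = bisect.bisect_right(startups, began)
        if i < len(startups):
            outages.append((began, startups[i], kind))
    return sorted(outages)


# Hypothetical events: one clean shutdown and one power-off.
events = [
    Event(datetime(1999, 1, 1, 8, 0), STARTUP),
    Event(datetime(1999, 1, 2, 10, 0), CLEAN_SHUTDOWN),
    Event(datetime(1999, 1, 2, 10, 5), STARTUP),
    Event(datetime(1999, 1, 3, 9, 0), DIRTY_SHUTDOWN,
          last_alive=datetime(1999, 1, 3, 8, 40)),
    Event(datetime(1999, 1, 3, 9, 0), STARTUP),
]
for began, ended, kind in find_outages(events):
    print(kind, ended - began)
```

Because the "last alive" time stamp is written at intervals (5 minutes by default), the start of a dirty outage is known only to within one interval.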
System Version Event
Service Pack 4 records a new event containing the operating system version information whenever the system is booted. This makes it easier to post-process NT system event logs by operating system version. The EventLog service itself is the source of this event, and the Event ID is 6009.
Service Pack Installation
NT Service Pack installation now records service pack version details in the system event log. This makes it easier to post-process NT system event logs by operating system version.
Save Dump Event
Prior to Service Pack 4, Windows NT® 4 recorded operating system crash events in the system event log. The Save Dump event is written after the subsequent system reboot. On Windows NT® Server systems prior to Service Pack 4, the system administrator could disable the recording of these events via Start / Settings / Control Panel / System / "Startup/Shutdown" / "Write an event to the system log when a STOP error occurs". The practical consequence is that Save Dump events may not always be recorded in the event log. Starting with Service Pack 4, Save Dump events are always generated on Windows NT® Server systems after an operating system crash. They can still be disabled on Windows NT® Workstation systems.
DrWatson Event
Windows NT® application crashes are recorded in DrWatson log files, and DrWatson also records an event in the Windows NT® application event log.
System Performance Monitor (PerfMon) is a tool that allows an administrator to monitor many types of conditions on a computer, whether that computer is local or across the globe.
It supports real-time and short-term historical monitoring of conditions called counters, which are grouped into categories called objects. One such counter, System Uptime, is described below.
System Uptime Counter
The system uptime counter measures the time, in seconds, that the system has been "alive". PerfMon graphs the results on the screen in real time, and you can export the results to Excel for reporting.
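Exported uptime samples are easier to read and analyze with a little post-processing. The sketch below assumes a list of System Uptime values in seconds, taken at successive sample times; both helper names are hypothetical. A sample that is lower than the previous one indicates that the counter restarted from zero, that is, the system was rebooted between the two samples.

```python
def format_uptime(seconds):
    """Render an uptime in seconds as days, hours, minutes, seconds."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours:02d}h {minutes:02d}m {secs:02d}s"


def count_restarts(samples):
    """Count how often the uptime value dropped between samples."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


print(format_uptime(200000))                   # prints 2d 07h 33m 20s
print(count_restarts([100, 160, 220, 30, 90])) # prints 1
```

Note that two reboots between consecutive samples would be counted as one, so the sampling interval limits the accuracy of this count.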
PerfMon may also be set to alert you when thresholds have been exceeded. You choose the criterion you want reported, and even the manner in which you want it reported to you. For example, an alert can be set up to report when CPU utilization exceeds 80%.
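The same threshold test can be applied after the fact to exported counter data. This is a small sketch of the comparison only, not of PerfMon's alert mechanism; the function name and the sample values are hypothetical.

```python
def samples_over_threshold(samples, threshold=80.0):
    """Return (index, value) for each sample above the threshold."""
    return [(i, v) for i, v in enumerate(samples) if v > threshold]


# Hypothetical CPU utilization samples, in percent.
cpu = [35.0, 82.5, 79.9, 91.0]
print(samples_over_threshold(cpu))  # prints [(1, 82.5), (3, 91.0)]
```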
For More Information
For the latest information on Windows 2000, check out our World Wide Web site at http://www.microsoft.com/ntserver/, the Windows NT Server Forum on MSN™, and The Microsoft Network online service (GO WORD: MSNTS).