Microsoft Commerce Server 2000: Problem Management

This chapter describes how to manage problems with your Microsoft Commerce Server 2000 site. Problem management is the process of reducing business losses resulting from a service outage. It is also the process you use to deal with a situation that causes or threatens to cause a break in service. You should design your problem management process to minimize the impact of incidents and problems, as well as to collect information that can help correct the underlying causes.

The following terms are key to understanding the components of problem management:

  • Incident. An event that is not part of the normal operation of a system. 

  • Problem. A significant incident or group of incidents that exhibit common symptoms for which the cause is unknown. 

  • Known error. A known fault in a configuration item (CI). For more information about CIs, see Chapter 8, "Developing Your Site." 

Technical problems are a part of every site, so it is important to allocate sufficient resources to handle them efficiently. An effective problem management process will save you both time and money.

Problem management focuses on the following:

  • Incident control 

  • Problem control 

  • System health 

Incident Control


Incident control is the process of identifying, recording, classifying, and tracking incidents until the affected services are corrected and return to normal operation. The incident management team, which should include the help desk or product support personnel and the initial response team, owns the incident control process. Incidents should be escalated to bring in additional expertise when team members can't restore normal service rapidly or when they can't identify the cause of the problem. Figure 18.1 shows a typical incident control process.


Figure 18.1 Incident control process

The incident management team resolves issues identified by user requests to the help desk, issues raised by event logs, and issues discovered through other internal processes. The incident management team should record all incidents in an incident database, which should contain information about incidents, changes, configuration management, and problem management. If you use different databases for managing site configuration, tracking changes, and tracking problems, be sure that the records for each incident are linked to the other databases so you can easily track the different aspects of an incident. Ideally, you should store incident information in the Configuration Management database, in which you also store information about requests for changes, and so on. For more information about the Configuration Management database, see Chapter 8, "Developing Your Site."

You should classify incidents by severity and priority. Figure 18.2 shows a sample classification grid, in which 1 is the highest priority and 7 is the lowest.


Figure 18.2 Sample incident-classification grid 

A classification grid can help you to set priorities and expectations, and to facilitate cooperation between technical teams. Classification systems do not need to be complex. The following table shows a typical classification system.

| Classification | Impact | Commitment level |
|---|---|---|
| Critical | Server down, loss of business | 24 hours a day, seven days a week; total resource and commitment level to problem resolution |
| Urgent | Production severely impaired, but system still operational | High resource and commitment level to problem resolution |
| Important | Important problem | Commitment to resolution |
| Monitor | Under surveillance | No action items |

When you have set up a classification system, technical and operational teams can use it to agree on action paths, response times, and the ways in which each type of issue can take priority in their schedules.

Classification systems promote cooperation between teams because each team understands the potential impact and expected commitment level. Classification also creates a metric that can be used to track the potential impact of problems, and that can be useful for future planning and quality control.
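As a rough illustration, the four-level sample classification could be encoded so that tools and teams share one definition. This is only a sketch: the `classify` function and its three impact flags are hypothetical names introduced here, and the commitment strings paraphrase the sample table.

```python
from enum import IntEnum


class Classification(IntEnum):
    """Levels from the sample table; a lower value means a more severe issue."""
    CRITICAL = 1
    URGENT = 2
    IMPORTANT = 3
    MONITOR = 4


# Commitment levels paraphrased from the sample classification table.
COMMITMENT = {
    Classification.CRITICAL: "24x7, total resource commitment",
    Classification.URGENT: "High resource commitment",
    Classification.IMPORTANT: "Commitment to resolution",
    Classification.MONITOR: "No action items",
}


def classify(server_down: bool, production_impaired: bool, needs_action: bool) -> Classification:
    """Map hypothetical impact flags to a classification level."""
    if server_down:
        return Classification.CRITICAL
    if production_impaired:
        return Classification.URGENT
    if needs_action:
        return Classification.IMPORTANT
    return Classification.MONITOR
```

Because the levels are ordered integers, they can also be counted and sorted, which supports the tracking metric described above.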

Problem Control


Problem control is the process of identifying, recording, classifying, investigating, and tracking problems until a problem is solved or until it is assigned a "known error" status. The problem management team's mission is to identify the cause of problems and to help restore normal service as quickly as possible. After the cause of a problem is identified, the problem can be classified as a known error. The problem management team then works with the change management team to coordinate fixes for known errors. For more information about the change management process, see Chapter 8, "Developing Your Site."

The problem management team identifies problems in two ways:

  • Through immediate investigation of a significant first incident

  • Through incident logs for multiple incidents with similar symptoms 

When a problem is identified, the problem management team should create a record in the Problem Management database that contains a list of incidents relating to the problem, as well as updates. The problem management team should then refer the problem to the appropriate technical specialists for analysis. After the specialists determine the cause of the problem, they can reclassify it as a known error.
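The two identification paths can be sketched as a small grouping routine over incident records. The `Incident` fields, the normalized `symptom` label, and the `min_similar` threshold are assumptions made for illustration; a real incident database will have its own schema and rules.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Incident:
    incident_id: int
    symptom: str               # hypothetical normalized symptom label
    significant: bool = False  # set by the incident management team


def find_problems(incidents, min_similar=3):
    """Return {symptom: [incident ids]} for groups that qualify as problems."""
    by_symptom = defaultdict(list)
    for inc in incidents:
        by_symptom[inc.symptom].append(inc)
    problems = {}
    for symptom, group in by_symptom.items():
        # Path 1: a single significant incident.
        # Path 2: several incidents with similar symptoms.
        if any(i.significant for i in group) or len(group) >= min_similar:
            problems[symptom] = [i.incident_id for i in group]
    return problems
```

The returned incident lists correspond to the list of related incidents that the problem record should contain.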

Isolating the Problem

The first step in solving a problem is to create a reproducible scenario to isolate the problem. Few problems are difficult to solve once you know exactly what is going wrong. The difficulty with isolating a problem is that isolation is more of an art than a science. The approach you use to identify a problem can largely depend on the specifics of your system or application.

When you try to isolate a problem, you might have to stress your system or alter configuration settings to duplicate the conditions that caused the problem. You should do this in your test environment, not in your production environment. If a problem occurs only in your production environment, however, there are techniques and tools that you can use to make troubleshooting easier.

One of the most common approaches to troubleshooting a problem in a production environment is to use load balancing. In a typical load-balanced system, a number of servers are placed online and accessed through software that evenly distributes client requests across the server farm. In a load-balanced environment, you can usually take a failing server offline without affecting the other servers in the server farm. The remaining servers can then absorb the production load so that there is no interruption in service. After a server is offline, you can troubleshoot and fix it without impacting the other servers.
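The idea of taking a failing server out of rotation can be illustrated with a toy round-robin pool. This is a simplified model, not how Network Load Balancing or any particular product works; the `ServerPool` class and the server names are hypothetical.

```python
class ServerPool:
    """Toy round-robin pool that routes requests only to online servers."""

    def __init__(self, servers):
        self.online = list(servers)
        self.offline = []
        self._next = 0

    def take_offline(self, server):
        """Remove a failing server from rotation so it can be examined."""
        self.online.remove(server)
        self.offline.append(server)

    def route(self):
        """Return the server that should handle the next request."""
        if not self.online:
            raise RuntimeError("no servers online")
        server = self.online[self._next % len(self.online)]
        self._next += 1
        return server
```

After `take_offline("web2")`, requests continue to alternate between the remaining servers, which is why the remaining capacity must be large enough to absorb the full production load.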

The following table lists some effective troubleshooting approaches.

| Troubleshooting approach | Description |
|---|---|
| Simplify | The simpler your application, the less it will cost to support and maintain it. If a complex system is failing and you don't have enough diagnostic information to isolate the problem, try eliminating areas until you end up with a small, reproducible failure. Use the process of elimination to rule out as many technologies and dependencies as possible. |
| Add troubleshooting flags | If a failure is occurring somewhere in your code, add troubleshooting flags to trace and record successes, failures, starts, finishes, and so on. |
| Be systematic | Successful troubleshooting is mostly a systematic process of elimination. Create a list of the most probable causes of a problem, and then systematically eliminate each one until you find the cause of the problem. Keep lists of theories, isolation steps, and results so that you can share what you have already done with others working on the problem. |
| Use action plans | Use a written action plan to define the problem, detail the problem status, and designate action items. |
| Keep a clear perspective | If you seem to hit a dead end, take a break. Sometimes it's helpful to bring in someone new to get a fresh perspective on the problem. |
| Use a consistent development methodology | The best way to minimize the need for troubleshooting and increase support efficiency is to develop consistent practices as part of your development cycle. Time and effort spent improving your testing process, architectural design, monitoring, and diagnostic systems will help minimize troubleshooting efforts. |
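The "troubleshooting flags" approach can be sketched as a wrapper that records starts, finishes, and failures only when a flag is switched on. The example uses Python's standard `logging` module for illustration; the `TRACE_ENABLED` flag, the `traced` decorator, and `compute_order_total` are hypothetical names, and a Commerce Server component would use whatever tracing facility its own language provides.

```python
import functools
import logging

log = logging.getLogger("trace")
TRACE_ENABLED = True  # hypothetical flag; in practice read it from configuration


def traced(func):
    """Log start, finish, and failure of a call when tracing is enabled."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not TRACE_ENABLED:
            return func(*args, **kwargs)
        log.info("start %s", func.__name__)
        try:
            result = func(*args, **kwargs)
        except Exception:
            log.exception("failure %s", func.__name__)
            raise
        log.info("finish %s", func.__name__)
        return result
    return wrapper


@traced
def compute_order_total(prices):
    """Hypothetical application function used only to demonstrate tracing."""
    return sum(prices)
```

Keeping the flag check inside the wrapper means tracing can stay in production code and be switched on only while a failure is being investigated.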

Using Historical Performance Data

Records of historical performance that you capture in event logs can help you troubleshoot some types of problems. It is also useful to have the baseline logs from any quality control or stress tests. You should archive System Monitor and event logs on a regular basis. If a production server starts having problems, you can then study the previous performance and event data of the server for possible clues to the cause of the problems. The cumulative production data that this type of regular archive provides is also useful for identifying growth trends that might indicate the need for additional hardware or resources.
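One way to use archived data is to compare current counter averages with a baseline captured during stress testing. The sketch below assumes the logs have already been exported to plain lists of numeric samples keyed by counter name; the function name and the 25 percent tolerance are illustrative choices, not recommended values.

```python
from statistics import mean


def deviations_from_baseline(baseline, current, tolerance=0.25):
    """Return counters whose current mean exceeds the baseline mean by more than tolerance.

    Both arguments map a counter name to a list of numeric samples.
    The result maps each flagged counter to its relative increase.
    """
    flagged = {}
    for counter, base_samples in baseline.items():
        if counter not in current:
            continue
        base = mean(base_samples)
        now = mean(current[counter])
        if base > 0 and (now - base) / base > tolerance:
            flagged[counter] = round((now - base) / base, 2)
    return flagged
```

The same comparison, run against successive archives, can also surface the gradual growth trends that indicate a need for additional hardware.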

System Health


There are a number of software products that you can use to monitor server performance to be sure that your system is performing as expected. These products perform many tasks, from measuring the total response time of a Web page request to monitoring user-defined performance counters for alert thresholds. Most of these products can be configured to notify administrators by pager or e-mail when a failure occurs.

When you have monitoring processes in place and decide which counters to use to track system performance, you should set performance goals for each operation so that you have benchmarks of satisfactory system performance to use as a basis for comparison. For information about setting and tracking performance goals for your Commerce Server site, see Chapter 19, "Maximizing Performance."

There are a number of different types of system failures:

  • Response failures 

  • Errors and access violations 

  • Memory and resource leaks 

  • Security failures 

  • Queuing 

  • CPU saturation 

  • Corruption 

Response Failures

When a program fails to respond, it is often because the CPU is at 100 percent utilization or because a process has become unresponsive. If the CPU stays at 100 percent utilization for any length of time, a process is often executing what is known as a spinning thread: an endless loop in the code that leaves a thread executing and consuming CPU cycles. If a process becomes unresponsive, a thread has usually entered a state in which it is waiting for a resource that never becomes available.

In both cases, you should focus on isolating the path of code execution leading up to the failure. If this path points to an obvious section of code, you can identify and fix the problem. If no code path is apparent or the failure occurs in third-party components, you can use a tool like UserDump (contained in the Microsoft Windows 2000 Driver Development Kit) to create a process snapshot of all executing threads. You can use UserDump with the appropriate symbols and performance monitor log to identify what each thread was executing when it failed.
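As a first triage step, the two cases can be told apart roughly by looking at CPU utilization while the process is unresponsive. The sketch below is only a heuristic; the function name and the 95 percent cutoff are assumptions, and the result is a hint about where to look, not a diagnosis.

```python
def likely_cause(responding: bool, cpu_percent: float, high_cpu: float = 95.0) -> str:
    """Suggest which response-failure pattern to investigate first."""
    if responding:
        return "responsive"
    if cpu_percent >= high_cpu:
        # Sustained maximum CPU usually points at an endless loop.
        return "possible spinning thread"
    # Low CPU with no response usually points at a thread waiting on a resource.
    return "possible blocked thread"
```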

Errors and Access Violations

One characteristic of errors and access violations is that the executing process often just stops. The type of notification you receive when an error occurs depends on how errors are handled within the application. You might see a very detailed "Access Violation" message with addresses and numbers, or you might not see anything. When access violations occur inside COM+ server packages, the system restarts the process and logs an event in the event log. If an error or access violation occurs in Internet Information Services (IIS) 5.0, then the system often stops serving Active Server Pages (ASP) and reports various ASP failures.

You should focus on change control and process isolation to correct errors and access violations. These types of failures often occur after system updates when unreliable components have been introduced. You might separate process components into individual COM+ packages to isolate them. You can also use tools like IIS Exception Monitor and UserDump to monitor for and detect both of these types of failures.

Memory and Resource Leaks

You can often identify memory and resource leaks through standard System Monitor counters, such as Private Bytes. If necessary, you can also use COM+ packages to separate components, to help you identify which component is leaking resources. There are also numerous development and troubleshooting tools that can help you track and identify memory and resource leaks. For more information, see the "Additional Resources" section at the end of the book.
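A steady upward trend in a counter such as Private Bytes can be checked numerically by fitting a straight line to the logged samples. This sketch assumes evenly spaced samples exported as a list of numbers; the function names and the `min_growth` threshold are hypothetical, and a positive slope during a warm-up period does not by itself prove a leak.

```python
def growth_per_sample(samples):
    """Least-squares slope of the samples plotted against their index."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(samples, min_growth):
    """True if the counter grows by at least min_growth units per sample on average."""
    return growth_per_sample(samples) >= min_growth
```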

Security Failures

A security failure is usually indicated when an operation consistently fails for a specific user but works for users who are logged on with administrative rights. Keep in mind that some environments will access resources under different security contexts, depending on the configuration. This is especially true for IIS and COM+. You can configure operations to run under a static user account or under the account of the active user. You can change how IIS accesses resources, depending on how a client is authenticated. Take the time to understand security implications for these environments and always test a consistently failing operation as a system administrator to rule out a security configuration issue.

Queuing

Queuing is commonly associated with IIS, but the concept applies to many different environments. Most high-transaction systems employ some kind of thread pool that services client requests. When all threads are busy serving clients, requests are typically queued until a free thread becomes available or requests are serialized on a single thread. The most common characteristic of a problem with a queue is a slow or unresponsive server. For ASP, you might see a "Server Too Busy" message if there is a problem, but for an Internet Server Application Programming Interface (ISAPI), the server might fail to respond with any error message. However, in both cases, the system will recover if you remove client load from the server.

You can easily identify issues with IIS queuing through System Monitor counters, such as the Requests Queued and Requests Executing counters of the Active Server Pages object. Keep in mind that the default ASP thread pool is 25 threads per processor on IIS 5.0. When all 25 threads become busy, requests start queuing, and if the request queue fills, IIS sends the "Server Too Busy" message. ISAPI does not use the ASP thread pool directly and, unless you implement a custom pool, it will reach its maximum number of requests at 256 concurrent requests.

Protect your thread pool by writing efficient code for common requests. It takes only one slow page to exhaust your thread pool in a high-traffic environment.
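The effect of a slow page can be estimated with simple arithmetic. The sketch below uses the 25-threads-per-processor figure cited for IIS 5.0 and Little's law (average busy threads equals arrival rate multiplied by time per request); the function names and the sample numbers are illustrative assumptions, and real workloads are burstier than this average-case model.

```python
ASP_THREADS_PER_PROCESSOR = 25  # IIS 5.0 default cited in the text


def max_requests_per_second(processors, avg_seconds_per_request):
    """Upper bound on throughput before requests start to queue."""
    threads = ASP_THREADS_PER_PROCESSOR * processors
    return threads / avg_seconds_per_request


def threads_busy(arrival_rate, avg_seconds_per_request):
    """Average number of busy threads (Little's law)."""
    return arrival_rate * avg_seconds_per_request
```

For example, on a two-processor server (50 threads), a single page that takes 5 seconds and is requested 10 times per second keeps `threads_busy(10, 5)` = 50 threads occupied on average, which leaves nothing for the rest of the site.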

CPU Saturation

You typically identify CPU saturation through System Monitor or Task Manager. You can use either tool to find out when a system is under high CPU stress. When CPU levels consistently exceed 80 percent, you should increase the number of CPUs or the amount of processing power. When you troubleshoot CPU utilization, you should focus on the process that is consuming the most CPU time. On a well-optimized multiprocessor system, CPU load should be evenly distributed.

Watch for processes that are using excessive CPU time or that have a high number of threads, which could cause excessive context switching. One common misconception is that more processors and more threads mean better performance. In some situations, more CPUs and threads can actually decrease performance. Also remember that threads can become blocked on shared resources (shared data, default heap allocations, and so on) and that this type of software bottleneck limits the effectiveness of additional hardware. Stress testing, effective architecture, and code optimization are the best approaches to keeping CPU utilization down.
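To separate sustained saturation from short spikes, you can check whether logged CPU samples stay above the 80 percent guideline for several consecutive readings. The function name and the default of six consecutive samples are assumptions for illustration; choose a window that matches your logging interval.

```python
def sustained_saturation(cpu_samples, threshold=80.0, min_consecutive=6):
    """True if utilization exceeds threshold for min_consecutive samples in a row."""
    run = 0
    for value in cpu_samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```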

Corruption

Corruption is an extremely difficult problem to isolate. Corruption typically results from a boundary overwrite in memory. In many cases, it is the process heap that becomes contaminated. This type of problem often occurs without warning. As a result, you can end up chasing the effect of the problem, rather than the problem itself. The characteristics of corruption are generally random access violations or corrupt data.

If your memory is corrupted, you should focus on isolating components through COM+ or, in some cases, the use of a boundary-checking tool, heap checking API functions, or a small reproducible scenario that you can examine in a test environment.

Troubleshooting Your Commerce Server System


You should monitor your system logs on a regular basis to help avoid catastrophic failures and to enable you to respond to unexpected increases in server activity in a timely manner. Use a separate server that is not part of the run-time services to monitor your production servers. You don't want the process of monitoring to affect the performance or operation of your production server. Also, if you use a separate server, you will get a more accurate measurement of the true performance of your production servers.

The following list provides some tips for effectively monitoring and troubleshooting your Commerce Server system:

  • At least twice a day, you should use Event Viewer to review event activity and then inform the appropriate personnel about any anomalies. 

  • At least once a week, you should analyze your IIS logs for changes in user-access trends and to be sure that site traffic is within your system specifications. You might also want to reconfigure your Web pages or product offerings, based on the trend analysis. 

  • Back up and truncate your database logs on a regular schedule. You should copy database logs to an external backup device, such as a tape drive, before deleting them. Retain logs over extended periods of time. 

  • Enlist help from an external analysis agent to monitor the page latency of your site from many remote locations. 

  • Configure Performance Logs and Alerts to report data for the recommended counters at regular intervals, such as every 10 to 15 minutes. 
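The weekly IIS log review can start from a simple count of requests per day. The sketch below assumes the logs use the W3C Extended Log File Format with the `date` field enabled, so that a `#Fields:` directive names the columns; the function name and the sample log lines in the usage note are made up for illustration.

```python
from collections import Counter


def requests_per_day(log_lines):
    """Count requests per date from W3C Extended Log File Format lines."""
    fields = []
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The directive lists the column names for the rows that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or "date" not in fields:
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed rows
        counts[values[fields.index("date")]] += 1
    return counts
```

Passing the lines of one or more log files to `requests_per_day` gives per-day totals that you can compare week over week against your system specifications.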

When a problem occurs in your production environment, it is important to get as much diagnostic information as possible during the failure. At a minimum, always capture the following information:

  • Complete System Monitor logs (all counters) 

  • UserDump snapshot or Exception Monitor log of the failing process 

  • IIS log files 

  • Event logs 

  • Any custom diagnostic output from your application components 

Note that for a UserDump snapshot to provide useful information about your custom components, matching program database files (source debugging information) must be available. They do not have to be on the failing server, but they should be accessible to anyone who is analyzing the resulting UserDump logs. In order to obtain meaningful results from Exception Monitor, you must have Exception Monitor symbols installed on the production server prior to the failure.

The following table lists tools that are available with Windows 2000 Server to help you monitor and troubleshoot your system.

| Tool | Description |
|---|---|
| ClusterSentinel | Integrates with Network Load Balancing (NLB) clustering technology to monitor the health of servers and determine whether the server is available. |
| HTTP Monitoring tool | Gathers large amounts of data about a Web site, enabling immediate monitoring of Web server(s). |
| Microsoft Cluster Tool | Backs up and restores a cluster configuration, and moves resources to a cluster. |
| Web Application Stress (WAS) tool | Simulates Web activity to enable you to evaluate the performance of your Web application, server, or network. With this tool, you can simulate a wide variety of workload scenarios to help determine the optimal configuration for your server. |
| Web Capacity Analysis Tool (WCAT) | Evaluates how Internet servers running Windows 2000 Server and IIS respond to various client workload simulations. |
| WinRep Deployment Software Development Kit | Collects information about client computers that you can use to diagnose and troubleshoot problems. |
