Avoiding Operational Pitfalls

Applies To: Windows Server 2003, Windows Server 2003 R2, Windows Server 2003 with SP1, Windows Server 2003 with SP2

The following sections describe operational practices that can limit the availability of applications and computers, in both clustered and nonclustered environments.

Supporting multiple versions of the operating system, service packs, and out-of-date applications

Supporting a highly available system becomes much more difficult when many combinations of software (and hardware) versions are used together in one system or in systems that interact on the network. Older software, protocols, and drivers (and the associated hardware) become impractical when they do not support new technologies.
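One way to see how many combinations you are supporting is to count the distinct configurations in a server inventory. The following Python sketch is illustrative only: the host names, the inventory list, and the count_configurations function are hypothetical, and a real inventory would come from your own asset records.

```python
from collections import Counter

# Hypothetical inventory: (host, operating system, service pack level).
inventory = [
    ("web01", "Windows Server 2003", "SP2"),
    ("web02", "Windows Server 2003", "SP2"),
    ("sql01", "Windows Server 2003 R2", "SP2"),
    ("file01", "Windows Server 2003", "SP1"),
]


def count_configurations(hosts):
    """Return a Counter mapping (os, service_pack) to the number of hosts."""
    return Counter((os_version, sp) for _host, os_version, sp in hosts)


configs = count_configurations(inventory)
for (os_version, sp), n in configs.most_common():
    print(f"{os_version} {sp}: {n} host(s)")
print(f"Distinct configurations to support: {len(configs)}")
```

Each distinct configuration is a separate combination to test and support, so a falling count over time suggests that standardization is working.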

Set aside resources and time for planning, testing, and installing new operating systems, applications, and (where appropriate) hardware. When planning software upgrades, work with users to identify the features they require. Provide training to ease users through software transitions. In your budget for software and support, provide funds for upgrading applications and operating systems in the future.

Installing incompatible hardware

Maintain and follow a hardware standard for new systems, spare parts, and replacement parts.

Failing to plan for future capacity requirements

Capacity planning is critical to the success of highly available systems. Study and monitor your system during peak loads to understand how much extra capacity currently exists in the system.
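As a rough illustration of measuring extra capacity at peak load, the following Python sketch computes headroom from a set of utilization samples. The headroom_percent function and the sample values are assumptions made for this example, not measurements from any real system or part of any Windows tool.

```python
def headroom_percent(peak_samples, capacity):
    """Return spare capacity at the observed peak, as a percentage of capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if not peak_samples:
        raise ValueError("need at least one sample")
    peak = max(peak_samples)
    return 100.0 * (capacity - peak) / capacity


# Hypothetical CPU utilization samples (percent) taken during peak hours.
cpu_samples = [62.0, 71.5, 80.0, 76.0]
spare = headroom_percent(cpu_samples, capacity=100.0)
print(f"Headroom at observed peak: {spare:.1f}%")
```

The sketch uses the highest observed sample as the peak. Tracking this figure over several months shows whether headroom is shrinking and roughly when more capacity will be needed.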

Performing outdated procedures

When the root cause of a system problem is fixed, remove any procedures that the fix makes obsolete from operation and support schedules. For example, when software is replaced or upgraded, certain procedures might become unnecessary or no longer be valid. Pay special attention to procedures that have become routine. Make sure that every procedure is necessary and is not simply a temporary fix for an issue whose root cause has not been found.

Failing to monitor the system

If you do not use adequate monitoring, you might not catch problems before they become critical and cause system failures. Without monitoring, an application or server failure might be the first and only notification you receive of a problem.
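A minimal sketch of the idea, in Python, is to compare current readings against warning thresholds so that a problem is flagged before it becomes a failure. The metric names, threshold values, and find_warnings function below are hypothetical; a real deployment would take its readings from whatever monitoring tools you already use.

```python
# Hypothetical warning thresholds, in percent of capacity.
THRESHOLDS = {"cpu": 85.0, "disk": 90.0, "memory": 80.0}


def find_warnings(readings, thresholds=THRESHOLDS):
    """Return the sorted names of metrics at or above their warning threshold."""
    return sorted(
        name
        for name, value in readings.items()
        if name in thresholds and value >= thresholds[name]
    )


# Hypothetical readings collected from one server.
readings = {"cpu": 40.0, "disk": 93.5, "memory": 78.0}
alerts = find_warnings(readings)
for name in alerts:
    print(f"WARNING: {name} at {readings[name]}% (threshold {THRESHOLDS[name]}%)")
```

Setting the thresholds below the point of failure is what gives operations staff time to respond before users are affected.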

Failing to determine the nature of the problem before reacting

If operations staff are not trained and directed to analyze problems carefully before reacting, they can spend large amounts of time responding inappropriately to a problem. They also might not use monitoring tools effectively in the crucial time between the first signs of a problem and an actual failure.

Treating symptoms instead of root cause

Symptom treatment is an effective strategy for restoring service when an unexpected failure occurs or when performing short-term preventative maintenance. However, symptom treatments that are added to standard operating procedures can become unmanageable. Support personnel can be overwhelmed with symptom treatment and might not be able to react properly to new failures.

Stopping and restarting to end error conditions

Stopping and restarting a computer may be necessary at times. However, if this process temporarily fixes a problem but leaves the root cause untouched, it can create more problems than it solves.