Preventing Site Failures

There are three classes of events that can cause site failures: human error, hardware failure, and software failure. Without proper planning, any of these can ruin a site's target availability.

Human error is the hardest category to manage because the individuals responsible for site maintenance can be its worst enemy. Whenever they interact with the run-time site, they might perform an operation that has an unintended effect. Thus, it is highly recommended that any administrative operation be tried out in a dedicated test environment first and then scripted. When the new administrative operation is rolled into the live site for the first time, it should be carefully monitored for its effect on the overall system. This careful planning helps a site achieve the highest level of availability.

This topic lists some of the more common types of problems that cause site failure, and the steps you can take to prevent the failure.

  • Application Software
  • Climate Control
  • Data
  • Electrical Power
  • Hardware
  • Network
  • Security
  • Server

Application Software

Potential problem: Inferior code quality, vulnerability to service attacks, and platform dependencies that are not met.

Preventive steps:

  • Create a robust architecture based on redundant, load-balanced servers. Note that load-balanced clusters are different from Windows application clusters. Commerce Server run-time components were designed for load balancing with Network Load Balancing/WLBS, rather than Windows Clustering.
  • Review code to avoid potential buffer overflows, infinite loops, code crashes, and openings for security attacks.
  • Install stable, tested application software that operates correctly, integrates with existing software, and performs at targeted levels. It is also important to control software versions and manage changes. For more information about managing changes, see Chapter 8, "Developing Your Site", in the Microsoft Commerce Server 2000 Resource Kit, available from Microsoft Press.
  • Use production software only. Avoid using development or evaluation-mode software.

Climate Control

Potential problem: Air-conditioning units or heating units malfunction.

Preventive steps:

  • Maintain the temperature of your hardware within the manufacturer specifications. Excessive heat can cause CPU meltdown and excessive cold can cause failure of moving parts, such as fans or disk drives.
  • Maintain humidity control. Excessive humidity can cause electrical short circuits from water condensing on circuit boards. Excessive dryness can cause static electricity discharges that damage components when you handle them.

Data

Potential problem: Data corruption.

Preventive steps:

  • Conduct regular backups and archive backups offsite. For example, you can archive every fourth regular backup offsite, to save space.

    If your data becomes corrupted, you can restore the data from backups to the last point before the corruption occurred. If you also back up transaction log files, you can then apply the transaction log files to the restored database to bring it up to date.

  • Replay transaction log files against a known valid database to maintain data. This technique is also known as "log shipping to a warm backup server." This technique is useful for maintaining a business-recovery site (also known as a "hot site").

  • Deploy Windows Clustering. Commerce Server uses data stores such as SQL Server and Microsoft Active Directory. SQL Server provides access to data and services such as catalog search. SQL Server uses Windows Clustering to provide redundancy. Active Directory provides access to profile data and can provide authentication services. Active Directory uses data replication to provide redundancy.

  • In general, clustering is more effective for dynamic (read/write) data and data replication is more effective for static (read-only) data.

  • Minimize the probability and impact of a SQL Server failure by clustering SQL Server servers or by replicating data among SQL Server servers. SQL Server 2000 is fully supported for high availability configurations.

  • If you use Active Directory, back up Active Directory stores. (You can do this while Active Directory is online.)

    Use at least two Active Directory domain controllers, with a replication schedule appropriate to your requirements. Restoring a domain controller can be time-consuming and requires that the domain controller be offline. Having peer domain controllers enables you to minimize downtime if you must restore your site from backups.
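
The backup steps at the start of this list (archive every fourth backup offsite, restore the last backup taken before the corruption, then apply the later transaction logs) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Backup record, its "full" and "log" kinds, and the function names are hypothetical and are not part of SQL Server or Commerce Server.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    """Hypothetical record describing one backup file."""
    taken_at: datetime
    kind: str   # "full" for a regular backup, "log" for a transaction log backup
    label: str


def offsite_candidates(backups, every=4):
    """Return every Nth regular backup (oldest first) to archive offsite."""
    fulls = sorted((b for b in backups if b.kind == "full"),
                   key=lambda b: b.taken_at)
    return fulls[every - 1::every]


def plan_restore(backups, corrupted_at):
    """Return the newest regular backup taken before the corruption,
    followed by the log backups taken after it, oldest first."""
    usable = [b for b in backups if b.taken_at < corrupted_at]
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no regular backup predates the corruption")
    base = max(fulls, key=lambda b: b.taken_at)
    logs = sorted((b for b in usable
                   if b.kind == "log" and b.taken_at > base.taken_at),
                  key=lambda b: b.taken_at)
    return [base] + logs


if __name__ == "__main__":
    history = [
        Backup(datetime(2000, 1, 1, 0), "full", "full-1"),
        Backup(datetime(2000, 1, 1, 12), "log", "log-1"),
        Backup(datetime(2000, 1, 2, 0), "full", "full-2"),
        Backup(datetime(2000, 1, 2, 6), "log", "log-2"),
        Backup(datetime(2000, 1, 2, 12), "log", "log-3"),
        Backup(datetime(2000, 1, 2, 18), "log", "log-4"),
    ]
    for step in plan_restore(history, datetime(2000, 1, 2, 15)):
        print(step.label)
```

The plan stops at the last log backup taken before the corruption, so the restored database does not replay the damaging changes.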

Electrical Power

Potential problem: Power-conditioning units, Uninterruptible Power Supplies (UPSs), or generator sets malfunction.

Preventive steps:

  • Use UPSs. Because UPSs are typically battery-powered, they are useful only for short outages. Be sure to use a UPS whose power rating meets or exceeds the total load of the equipment it supports.
  • Use power generators as secondary backups to the UPSs. You can use generators for an indefinite period of time because they are fuel-powered (diesel or gasoline) and you can refuel them if necessary.
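
As a rough illustration of checking a UPS rating against the equipment it protects, the following Python sketch adds up the equipment load and compares it with the UPS rating. The wattage figures and the 20 percent headroom are assumptions chosen for the example, not manufacturer guidance; consult the UPS and equipment specifications for real sizing.

```python
def required_ups_watts(equipment_watts, headroom=0.2):
    """Total equipment load plus a safety margin (headroom is an assumed value)."""
    if headroom < 0:
        raise ValueError("headroom must not be negative")
    return sum(equipment_watts) * (1 + headroom)


def ups_is_adequate(ups_rating_watts, equipment_watts, headroom=0.2):
    """True if the UPS rating covers the load plus the chosen headroom."""
    return ups_rating_watts >= required_ups_watts(equipment_watts, headroom)


if __name__ == "__main__":
    # Hypothetical loads, in watts, for three servers on one UPS.
    loads = [400, 350, 250]
    print(ups_is_adequate(1500, loads))
    print(ups_is_adequate(1100, loads))
```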

Hardware

Potential problem: Memory chips or CPUs degrade, or disk hardware, disk controllers, or power supplies malfunction.

Preventive steps: Deploy redundant hardware components, such as the following:

  • Use Redundant Array of Inexpensive Disks (RAID) arrays, disk mirroring, and dual disk controllers to minimize the impact of disk failures. There are also a number of excellent third-party solutions for reducing downtime related to disk failure.
  • Use a redundant disk controller.
  • Use redundant fiber-channel host-bus adapters and switches (for SAN configuration). In the event of an adapter or switch failure, the backup adapter or switch provides an alternate path to the SAN.
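
To see why redundant components raise availability, consider the standard estimate for identical components in parallel: if each is available a fraction a of the time and failures are independent, at least one of n components is available 1 - (1 - a)^n of the time. The Python sketch below applies that formula. The independence assumption is a simplification, because shared power, cabling, or software faults can take out both components at once, and the 99 percent figure is purely illustrative.

```python
def combined_availability(single, copies):
    """Availability of `copies` identical, independently failing components,
    where the group is up as long as at least one component is up."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one component")
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Hypothetical disk controller that is available 99% of the time.
    for n in (1, 2, 3):
        print(n, combined_availability(0.99, n))
```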

Network

Potential problem: Switches, routers, firewalls, cable media, or network adapters malfunction, or ISPs do not comply with their Service Level Agreements.

Preventive steps: Implement network redundancy.

  • Use multiple network adapters, routers, switches, Local Area Networks (LANs), and firewalls.
  • Contract with multiple ISPs or set up identical equipment in geographically dispersed locations.
  • Enable routing and management protocols, such as RIP2, OSPF, and ICMP. This might require firewall policy configuration.

Security

Potential problem: Firewalls, networks, or Web applications do not work properly, or hackers on the Internet attack your site.

Preventive steps:

  • Contract an independent security audit firm to evaluate your environment.
  • Deploy intrusion-detection tools.
  • Deploy multiple firewalls.
  • Check server and client certificates to ensure the identities of the systems with which your site interacts.
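
The certificate check in the last step can be illustrated from the client side with Python's standard ssl module. This is a generic sketch, not a Commerce Server feature: the function names are hypothetical, and the default context is assumed to be acceptable for the systems involved. The default context refuses connections to servers whose certificate does not validate or does not match the requested host name.

```python
import socket
import ssl


def make_verifying_context():
    """Build a client context that requires a valid server certificate
    and checks that it matches the requested host name."""
    return ssl.create_default_context()


def peer_certificate(host, port=443, timeout=5.0):
    """Connect to host:port and return the validated server certificate.
    Raises ssl.SSLCertVerificationError if validation fails."""
    context = make_verifying_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()
```

Calling peer_certificate with the name of a partner system returns a dictionary describing the certificate (subject, issuer, validity dates) only when the identity check succeeds.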

For the latest strategies and techniques for handling security issues, see the following resources:

Server

Potential problem: Server is overloaded.

Preventive steps: Deploy redundant, load-balanced servers. Single-IP solutions increase site capacity by distributing HTTP requests proportionally, according to the capacity of each server for handling the required load. In addition, when you use a single-IP solution, you make sure that users are referred only to operating servers. There are many single-IP solutions available to help you load-balance your servers, such as the following:

  • Microsoft Windows 2000 Advanced Server and Datacenter Server editions both provide a Network Load Balancing (NLB) service.
  • Microsoft Application Center 2000 provides Network Load Balancing enhancements (Request Forwarder) to support many users sharing a single IP address.
  • Hardware-based load-balancing solutions are also available and worth investigating.
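
The two behaviors described above, distributing requests in proportion to server capacity and referring users only to operating servers, can be sketched as follows. This is a minimal Python illustration using a smooth weighted round-robin scheme. It is not the algorithm used by Network Load Balancing or Application Center, and the Server record and capacity values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical description of one Web server behind the shared IP address."""
    name: str
    capacity: int          # relative capacity; larger means more requests
    healthy: bool = True   # set to False when the server stops responding
    current: int = 0       # running score used by the scheduler


def pick_server(servers):
    """Choose the next server with smooth weighted round-robin,
    considering only servers that are currently operating."""
    live = [s for s in servers if s.healthy and s.capacity > 0]
    if not live:
        raise RuntimeError("no operating servers available")
    total = sum(s.capacity for s in live)
    for s in live:
        s.current += s.capacity
    chosen = max(live, key=lambda s: s.current)
    chosen.current -= total
    return chosen


if __name__ == "__main__":
    farm = [Server("web1", 3), Server("web2", 1), Server("web3", 2, healthy=False)]
    print([pick_server(farm).name for _ in range(8)])
```

In this sketch, a server with three times the capacity of another receives three times as many requests, and a server marked as not healthy receives none until it is marked healthy again.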

Copyright © 2005 Microsoft Corporation.
All rights reserved.