Estimating the Likelihood of a Site Failure

You can use the following formula to estimate the relative probability number (RPN) of a site failure:

  • RPN = O x D x S

In this formula:

  • O (likelihood of occurrence) is a rating of how often the failure is expected to occur (from 1 to 10; the higher the number, the more likely the failure is to occur).
  • D (detectability) is a rating of how difficult the failure is to detect (from 1 to 10; the higher the number, the harder the failure is to detect).
  • S (severity) is the degree to which the failure will affect the site (from 1 to 10; the higher the number, the more serious the failure and the more severe the outage).
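
As a minimal sketch of the formula above, the following Python function multiplies the three ratings and rejects any value outside the 1 to 10 scale. The function name `rpn` and the validation behavior are illustrative assumptions, not part of the original guidance.

```python
def rpn(occurrence: int, detectability: int, severity: int) -> int:
    """Return O x D x S after checking that each rating is on the 1-10 scale."""
    for name, value in (("occurrence", occurrence),
                        ("detectability", detectability),
                        ("severity", severity)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return occurrence * detectability * severity


# Ratings for a dual-CPU failure before prevention: O=2, D=4, S=7.
print(rpn(2, 4, 7))  # 56
```

Because each rating ranges from 1 to 10, the RPN always falls between 1 and 1000.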

The following table lists the items that can fail and the effect of each failure, followed by the O, D, and S ratings and the resulting RPN. It also lists the preventive steps you can take to reduce the risk, along with the ratings and the RPN after you have implemented them.

| Item | Effect of the failure | O before | D before | S before | RPN before prevention | Preventive steps | O after | D after | S after | RPN after prevention |
|---|---|---|---|---|---|---|---|---|---|---|
| CPU (dual) | Server might go offline | 2 | 4 | 7 | 56 | Monitoring software; remote-access software; disable CPU; reboot | 2 | 2 | 7 | 28 |
| CPU (single) | Server offline | 2 | 4 | 10 | 80 | Monitoring software; second CPU | 2 | 2 | 8 | 32 |
| Drives | Server offline | 5 | 4 | 10 | 200 | RAID 5 controller with hot spare; monitoring software | 2 | 2 | 5 | 20 |
| Firewall | Theft, site altered, or inaccessible | 4 | 4 | 8 | 128 | Monitoring software; additional firewall | 2 | 2 | 4 | 16 |
| Load balancing | All traffic goes to one server; site inaccessible | 4 | 4 | 8 | 128 | Monitoring software; additional load balancing | 2 | 2 | 4 | 16 |
| Memory | Server offline | 2 | 4 | 7 | 56 | Monitoring software; additional server | 2 | 2 | 7 | 28 |
| NIC | Server offline | 4 | 4 | 8 | 128 | Dual NICs; monitoring software | 1 | 2 | 4 | 8 |
| Power supply | Site offline | 4 | 4 | 10 | 160 | UPS; monitoring software | 2 | 2 | 2 | 8 |
| RAID controller | Server offline | 2 | 4 | 10 | 80 | Monitoring software; additional server | 2 | 2 | 8 | 32 |
| Router/Channel Service Unit (CSU) | Loss of connections; inability to process orders; inability to manage site | 4 | 4 | 8 | 128 | Redundant routers; monitoring software | 2 | 2 | 4 | 16 |
| SQL Server cluster | Single-server failure, resulting in slower service | 5 | 4 | 2 | 40 | Redundant servers designed into site architecture; monitoring software | 5 | 2 | 2 | 20 |
| Switch | Some or all devices offline | 4 | 4 | 10 | 160 | Redundant power supplies; redundant connection cards; redundant management card; monitoring software | 1 | 2 | 4 | 8 |
| Web server | Site offline | 5 | 4 | 10 | 200 | Additional Web servers; load balancing | 2 | 2 | 5 | 20 |
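
To compare where prevention pays off most, the ratings can be tabulated and sorted by the drop in RPN. The sketch below is illustrative only: it uses four rows copied from the table, and the names `ROWS`, `rpn`, and `reductions` are hypothetical helpers rather than part of the original guidance.

```python
# Each entry: (item, (O, D, S) before prevention, (O, D, S) after prevention).
# The ratings are copied from four rows of the table.
ROWS = [
    ("Drives", (5, 4, 10), (2, 2, 5)),
    ("Power supply", (4, 4, 10), (2, 2, 2)),
    ("Memory", (2, 4, 7), (2, 2, 7)),
    ("SQL Server cluster", (5, 4, 2), (5, 2, 2)),
]


def rpn(ratings):
    """Multiply the O, D, and S ratings."""
    o, d, s = ratings
    return o * d * s


def reductions(rows):
    """Return (item, rpn_before, rpn_after, drop) tuples, largest drop first."""
    result = [(item, rpn(before), rpn(after), rpn(before) - rpn(after))
              for item, before, after in rows]
    return sorted(result, key=lambda r: r[3], reverse=True)


for item, before, after, drop in reductions(ROWS):
    print(f"{item:20} {before:4} -> {after:3}  (drop {drop})")
```

For these four rows, the drives show the largest drop (200 to 20), while the SQL Server cluster shows the smallest (40 to 20) because only its detectability rating improves.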

Copyright © 2005 Microsoft Corporation.
All rights reserved.