Estimating the Likelihood of a Site Failure

11/12/2009

You can use the following formula to estimate the relative probability number (RPN) of a site failure:

RPN = O x D x S

In which:

O (likelihood of occurrence) is a rating of how often a failure is expected to occur (from 1 to 10; the higher the number, the more likely the failure is to occur).
D (detectability) is the ease with which a failure can be found (from 1 to 10; the higher the number, the harder the failure is to detect).
S (severity) is the degree to which the failure will affect the site (from 1 to 10; the higher the number, the more serious the failure and the more severe the outage).
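
As a minimal sketch of the formula, the following Python function multiplies the three ratings and rejects any value outside the 1 to 10 scale. The function name `rpn` and the choice to raise `ValueError` are assumptions made for illustration; the ratings in the example come from the Drives row of this article's table.

```python
def rpn(occurrence: int, detectability: int, severity: int) -> int:
    """Return O x D x S after checking that each rating is an integer from 1 to 10."""
    for name, value in (("occurrence", occurrence),
                        ("detectability", detectability),
                        ("severity", severity)):
        if not isinstance(value, int) or not 1 <= value <= 10:
            raise ValueError(f"{name} must be an integer from 1 to 10, got {value!r}")
    return occurrence * detectability * severity


# Drives before prevention: O=5, D=4, S=10.
print(rpn(5, 4, 10))  # 200
# Drives after adding RAID 5 with a hot spare and monitoring: O=2, D=2, S=5.
print(rpn(2, 2, 5))   # 20
```

Because the three ratings are multiplied, halving any one of them halves the RPN, which is why steps that improve detectability alone (such as monitoring software) still lower the result.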

The following table lists the items that can fail and the effect of each failure, together with the O, D, and S ratings and the resulting RPN. It also lists the preventive steps you can take to avoid the failure, and the ratings and RPN after you have implemented them.

Item	Effect of the failure	O (before)	D (before)	S (before)	RPN before prevention	Preventive steps	O (after)	D (after)	S (after)	RPN after prevention
CPU (dual)	Server might go offline	2	4	7	56	Monitoring software; Remote-access software; Disable CPU; Reboot	2	2	7	28
CPU (single)	Server offline	2	4	10	80	Monitoring software; Second CPU	2	2	8	32
Drives	Server offline	5	4	10	200	RAID 5 controller with hot spare; Monitoring software	2	2	5	20
Firewall	Theft, site altered, or inaccessible	4	4	8	128	Monitoring software; Additional firewall	2	2	4	16
Load balancing	All traffic goes to one server; Site inaccessible	4	4	8	128	Monitoring software; Additional load-balancing	2	2	4	16
Memory	Server offline	2	4	7	56	Monitoring software; Additional server	2	2	7	28
NIC	Server offline	4	4	8	128	Dual NIC card; Monitoring software	1	2	4	8
Power supply	Site offline	4	4	10	160	UPS; Monitoring software	2	2	2	8
RAID controller	Server offline	2	4	10	80	Monitoring software; Additional server	2	2	8	32
Router/Channel Service Unit (CSU)	Loss of connections; Inability to process orders; Inability to manage site	4	4	8	128	Redundant routers; Monitoring software	2	2	4	16
SQL Server cluster	Single-server failure, resulting in slower service	5	4	2	40	Redundant servers designed into site architecture; Monitoring software	5	2	2	20
Switch	Some or all devices offline	4	4	10	160	Redundant power supplies; Redundant connection cards; Redundant management card; Monitoring software	1	2	4	8
Web server	Site offline	5	4	10	200	Additional Web servers; Load balancing	2	2	5	20
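
One way to use the table is to rank items by how much the preventive steps reduce the RPN. The sketch below does this for five rows copied from the table above; the names `ROWS`, `product`, and `rank_by_reduction` are hypothetical, and ranking by absolute reduction is an assumption rather than a method this article prescribes.

```python
# (item, (O, D, S) before prevention, (O, D, S) after prevention),
# copied from five rows of the table above.
ROWS = [
    ("Drives", (5, 4, 10), (2, 2, 5)),
    ("Power supply", (4, 4, 10), (2, 2, 2)),
    ("SQL Server cluster", (5, 4, 2), (5, 2, 2)),
    ("Web server", (5, 4, 10), (2, 2, 5)),
    ("NIC", (4, 4, 8), (1, 2, 4)),
]


def product(ratings):
    """Multiply the O, D, and S ratings into an RPN."""
    o, d, s = ratings
    return o * d * s


def rank_by_reduction(rows):
    """Return (item, rpn_before, rpn_after, reduction) tuples, largest reduction first."""
    scored = [(item, product(before), product(after)) for item, before, after in rows]
    return sorted(
        ((item, b, a, b - a) for item, b, a in scored),
        key=lambda row: row[3],
        reverse=True,
    )


for item, before, after, reduction in rank_by_reduction(ROWS):
    print(f"{item:20} {before:>4} -> {after:>3}  (reduction {reduction})")
```

In this subset, the drives and Web server rows show the largest drop, while the SQL Server cluster row changes least because only its detectability rating improves.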
