Quantifying Availability and Scalability for Your Organization

Applies To: Windows Server 2003, Windows Server 2003 R2, Windows Server 2003 with SP1, Windows Server 2003 with SP2

Your goal in quantifying availability is to compare the costs of your current IT environment, including the actual costs of outages, with the cost of implementing high availability solutions. That cost includes training for your staff as well as facilities costs, such as the cost of new hardware. After you have calculated these costs, IT managers can use the numbers to make business decisions, not just technical decisions, about your high availability solution. For information about monitoring tools that can help you measure the availability of your services and systems, see "Implementing Software Monitoring and Error-Detection Tools" later in this chapter.

Scalability is more difficult to quantify because it is based on future needs and therefore requires a certain amount of estimation and prediction. Remember, though, that scalability is tied to availability because if your system cannot grow to meet increased demand, certain services will become less available to your users.

Determining Availability Requirements

Availability can be expressed numerically as the percentage of the time that a service is available for use. The exact level of availability must be determined in the context of the service and the organization that uses the service. Table 6.1 displays common availability levels that many organizations try to achieve. The following formula is used to calculate these levels:

Percentage of availability = (total elapsed time – sum of downtime)/total elapsed time

Table 6.1   Availability Measurements and Yearly Downtime

Availability        Yearly Downtime

99.999%             5 minutes
99.99%              53 minutes
99.9%               8 hours, 45 minutes
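The downtime figures in Table 6.1 follow directly from the availability formula, rearranged so that downtime = total elapsed time × (1 – availability). The following is a minimal sketch in Python; the use of a 365-day year is an assumption for illustration:

```python
# Yearly downtime implied by an availability percentage, derived from:
#   availability = (total elapsed time - downtime) / total elapsed time
# rearranged to:
#   downtime = total elapsed time * (1 - availability)
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

for availability in (0.99999, 0.9999, 0.999):
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.3%} available: {downtime:.1f} minutes of downtime per year")
```

The 525.6 minutes computed for 99.9 percent availability corresponds to the 8 hours, 45 minutes shown in Table 6.1.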

Availability requirements can vary depending on the server role. Your users can probably continue to work if a print server is down, for example, but if a server hosting a mission-critical database fails, your business might feel the effects immediately.

Determining Reliability Requirements

Reliability is related to availability, and it is generally measured by computing the time between failures. Mean time between failures (MTBF) is calculated by using the following equation:

MTBF = (total elapsed time – sum of downtime)/number of failures

A related measurement is mean time to repair (MTTR), which is the average amount of time that it takes to bring an IT service or component back to full functionality after a failure.
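Both measurements can be computed from the same outage records. The following sketch applies the MTBF formula above, along with MTTR as total downtime divided by the number of failures; the observation window and outage figures are hypothetical, not taken from this chapter:

```python
# MTBF from the formula above:
#   MTBF = (total elapsed time - sum of downtime) / number of failures
# MTTR is the average repair time per failure:
#   MTTR = sum of downtime / number of failures
# Hypothetical observation window: 30 days with two outages totaling 4 hours.
elapsed_hours = 30 * 24      # 720 hours observed
downtime_hours = 4.0         # total downtime across both outages
failures = 2

mtbf = (elapsed_hours - downtime_hours) / failures  # hours of uptime between failures
mttr = downtime_hours / failures                    # average hours to restore service

print(f"MTBF: {mtbf} hours, MTTR: {mttr} hours")
```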

A system is more reliable if it is fault tolerant. Fault tolerance is the ability of a system to continue functioning when part of the system fails. This is achieved by designing the system with a high degree of hardware redundancy. If any single component fails, the redundant component takes its place with no appreciable downtime. For more information about fault-tolerant components, see "Planning and Designing Fault-Tolerant Hardware Solutions" later in this chapter.

Determining Scalability Requirements

You need to consider scalability now to give your organization flexibility in the future. If you expect your hardware budget to be sufficient, you can plan to purchase hardware at regular intervals to add to your existing deployment; the amount of hardware you purchase depends on the anticipated increase in demand. If your budget is limited, purchase servers that you can scale up later by adding RAM or CPUs to meet a rise in the number of users or client requests.

Looking at past growth can help you determine how demand on your IT system might grow. However, because business technology is becoming increasingly complex, and reliance on that technology grows every year, you must consider other factors as well. If you anticipate growth, realize that different aspects of your deployment may grow at different rates; over a given period, for example, you might need to add many more Web servers than print servers. For some types of servers, it might be sufficient to add CPU power when network traffic increases, while in other cases, such as with a Network Load Balancing cluster, the most practical scaling solution might be to add more servers.

Recreate your Windows deployment as accurately as possible in a test environment and, either manually or through a simulation program, put as much workload as possible on different areas of your deployment. Observing your system under such circumstances can help you formulate scaling priorities and anticipate where you might need to scale first.

After your system is deployed, software-monitoring tools can alert you when certain components of your system are near or at capacity. Use these tools to monitor performance levels and system capacity so that you know when a scaling solution is needed. For more information about monitoring performance levels, see "Implementing Software Monitoring and Error-Detection Tools" later in this chapter.