Planning for fault tolerance and availability in Project Server 2007
Updated: October 16, 2008
"Fault tolerance" and "availability" refer to the ability of a multiple-server environment to accept connections and operate normally even when one or more of the components in the farm are not operational. Availability implies redundancy, and may additionally include a failover mechanism and several other possible characteristics.
You can use the following strategies to improve the fault tolerance of your Microsoft Office Project Server 2007 deployment:
Server role redundancy
This article provides more information about each of these strategies. You can apply these strategies individually or in combination. Because each strategy has a cost associated with it, it is important to examine the cost/benefit ratio for each before applying it in your organization.
We recommend that you consider availability requirements as part of the core design of the Office Project Server 2007 solution. You can also provide enhanced availability after the solution is deployed. Operationally, we recommend that you deploy and tune the core solution within a farm, and then test the availability solutions.
What is availability?
Availability is the degree to which a system such as Office Project Server 2007 is perceived by users to be available. To ensure availability means to ensure that a system is resilient — that is, that service-affecting incidents occur infrequently, and that timely and effective action is taken when they do. Availability strategies minimize the user perception of planned and unplanned downtime.
One of the most common measures of availability is percentage of uptime expressed as a number of nines, that is, the percentage of time that a given system is active and working. For example, a system with 99.999 percent uptime is said to have five nines of availability.
The following table correlates the number of nines to calendar time equivalents.
|Acceptable uptime percentage|Downtime per day|Downtime per month|Downtime per year|
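The calendar equivalents for any uptime percentage can be worked out directly. The following is a minimal sketch, not part of the product: the function name, the 30-day month, the 365-day year, and the sample percentages are all assumptions chosen for illustration.

```python
# Hours in each period. Assumptions: a 30-day month and a 365-day year.
PERIOD_HOURS = {"day": 24, "month": 30 * 24, "year": 365 * 24}


def downtime_seconds(uptime_percent: float, period_hours: float) -> float:
    """Return the downtime, in seconds, that an uptime percentage allows in a period."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    # The unavailable fraction of the period, converted from hours to seconds.
    return (100.0 - uptime_percent) / 100.0 * period_hours * 3600.0


if __name__ == "__main__":
    # Illustrative availability levels, from two nines to five nines.
    for pct in (99.0, 99.9, 99.99, 99.999):
        cells = ", ".join(
            f"{name}: {downtime_seconds(pct, hours):.2f} s"
            for name, hours in PERIOD_HOURS.items()
        )
        print(f"{pct}% -> {cells}")
```

Reporting seconds keeps the arithmetic in one unit; convert to minutes or hours when presenting the figures to stakeholders.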
If you can make an educated guess as to the number of total hours downtime you are likely to have, you can use the following formulas to calculate the uptime percentage for a year, a month, or a week:
% Uptime/year = 100 * (8760 – number of total hours down per year)/8760
% Uptime/month = 100 * ((24 * number of days in the month) – number of total hours down in that calendar month)/(24 * number of days in the month)
% Uptime/week = 100 * (168 – number of total hours down in that week)/168
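As a worked illustration, the uptime percentage is the fraction of the period during which the system was up, multiplied by 100. The sketch below assumes that reading; the function name and the sample figures are hypothetical.

```python
def uptime_percent(hours_down: float, period_hours: float) -> float:
    """Return the uptime percentage for a period, given the total hours of downtime."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= hours_down <= period_hours:
        raise ValueError("hours_down must be between 0 and period_hours")
    # Hours up divided by hours in the period, expressed as a percentage.
    return 100.0 * (period_hours - hours_down) / period_hours


HOURS_PER_YEAR = 8760   # 365 days * 24 hours
HOURS_PER_WEEK = 168    # 7 days * 24 hours

if __name__ == "__main__":
    # Hypothetical estimates: 8.76 hours down in a year, 2 hours down in a week.
    print(uptime_percent(8.76, HOURS_PER_YEAR))
    print(uptime_percent(2, HOURS_PER_WEEK))
    # For a month, pass 24 * the number of days in that month.
    print(uptime_percent(3, 24 * 31))
```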
What availability is not
Availability is not data protection and recovery, nor is it disaster recovery. You should have separate data protection and disaster recovery plans in any highly available system.
Moreover, availability is not business continuation management (BCM). BCM consists of the business decisions, processes, and tools you put in place in advance to handle crises. A crisis can be a local, regional, or national event, or a crisis can relate to only your business.
Cost of availability
Availability is one of the more expensive requirements for a system. The higher the level of availability and the more systems you protect, the more complex and costly an availability solution is likely to be. When you invest in availability, costs include:
Additional hardware and software, often involving complex operations between software, such as custom scripts for failover and recovery.
Additional operational complexity.
The costs of attaining availability should be evaluated based on your business needs — not all solutions within an organization are likely to require the same level of availability. You can offer different levels of availability for different sites, different services — for example, search and business intelligence — or different farms.
Availability is a key area in which information technology (IT) groups offer service-level agreements (SLAs) to set expectations with customer groups. Many IT organizations offer a variety of SLAs that are associated with different chargeback levels.
Redundancy is a key part of availability. Redundancy includes the use of multiple servers in a load-balanced environment to improve farm performance or to scale out to accommodate additional users. Redundancy also includes the use of identical backup components, such as power supplies or networking equipment, to provide continued functionality in the event of the failure of the primary component.
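The effect of redundancy on availability can be estimated with basic probability. The following sketch assumes that components fail independently of one another, which real farms only approximate; the function names and the 99 percent figure are illustrative assumptions.

```python
from math import prod
from typing import Iterable


def parallel_availability(availabilities: Iterable[float]) -> float:
    """Availability of redundant components: the group is up if at least one member is up."""
    # The group is down only when every member is down at the same time.
    return 1.0 - prod(1.0 - a for a in availabilities)


def series_availability(availabilities: Iterable[float]) -> float:
    """Availability of dependent components: the chain is up only if every member is up."""
    return prod(availabilities)


if __name__ == "__main__":
    one_server = 0.99  # hypothetical availability of a single server
    print(parallel_availability([one_server, one_server]))  # two redundant servers
    print(series_availability([one_server, one_server]))    # two servers that both must be up
```

Under these assumptions, adding a redundant server raises availability, while adding a component that the system depends on lowers it.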
This article describes how to implement redundant servers in an Office Project Server 2007 farm.
Office Project Server 2007 supports scalable server farms for capacity, performance and availability. Typically, capacity is the first consideration in determining the number of server computers to start with. After factoring in performance, availability also plays a role in determining both the number of servers and the size or capacity of the server computers in a server farm.
Determining availability requirements
To gauge the organization's tolerance of downtime for a site, service, or farm, answer the following questions:
If Office Project Server 2007 becomes unavailable, will employees of the organization be unable to perform their expected job responsibilities?
If Office Project Server 2007 becomes unavailable, will business and customer transactions be halted, leading to loss of business and customers?
If you answered yes to any of these questions, you should invest in an availability solution.
Although this article primarily discusses the availability of Office Project Server 2007, the system uptime will also be affected by the other components in the system. In particular, consider the following:
You should ensure that infrastructure dependencies such as power, cooling, network, directory, and SMTP are fully redundant.
Choose a switching mechanism for the system, whether DNS or hardware load balancing, that meets your needs.
Clustering can protect your system against an application or operating system failure. You can also perform many tasks on clustered computers without taking them offline, including upgrading an application or operating system or installing a service pack or update.
Server clusters are designed to keep applications available, rather than to protect data. To protect against viruses, corruption, and other threats to data, you need solid data protection and recovery plans. Cluster technology cannot protect against failures caused by viruses, software corruption, or human error.
SQL Server failover clustering
Failover clusters are designed for stateful applications. Stateful applications have long-running in-memory state, or they have large, frequently updated data states.
Failover clusters provide high availability by allowing the failover of resources. Failover clusters also maintain client connections to applications and services.
In failover clusters, nodes share access to data. Nodes can be either active or passive, and the configuration of each node depends on the operating mode (active or passive) and how you configure failover in the cluster. A server that is designated to handle failover must be sized to handle the workload of the failed node in addition to its own workload.
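The sizing rule for a failover node can be expressed as a simple capacity check. This is a sketch only: the function names are hypothetical, and the load and capacity values are in whatever unit you measure (for example, percent CPU or transactions per second).

```python
def failover_headroom(capacity: float, own_load: float, failed_node_load: float) -> float:
    """Return the spare capacity left on a node after it absorbs a failed node's workload.

    A negative result means the node is undersized for failover.
    """
    return capacity - (own_load + failed_node_load)


def is_sized_for_failover(capacity: float, own_load: float, failed_node_load: float) -> bool:
    """Return True if the node can carry its own workload plus the failed node's workload."""
    return failover_headroom(capacity, own_load, failed_node_load) >= 0


if __name__ == "__main__":
    # Hypothetical figures: a node with capacity 100 already carries a load of 40.
    print(is_sized_for_failover(100, 40, 50))  # the failed node carried 50
    print(is_sized_for_failover(100, 60, 50))  # the surviving node is busier
```

A passive node has an own load of zero, so it only needs capacity equal to the workload of the node it replaces.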
In Office Project Server 2007 deployments, you can use failover clustering with SQL Server.
Load-balanced clusters are groups of identical, typically cloned computers that are used to enhance the availability of Web servers, Microsoft Internet Security and Acceleration (ISA) servers (for proxy and firewall servers), and other applications that receive Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) traffic. Because cluster nodes are usually identical clones of each other and can therefore operate independently, all nodes in a cluster are active.
Office Project Server 2007 supports two methods of load balancing:
Software, such as Network Load Balancing (NLB) services in the Microsoft Windows Server 2003 operating system. NLB runs on the front-end Web servers, and uses TCP/IP to route requests. Because NLB (and other software load balancing solutions) runs on the front-end Web servers, it uses the front-end Web system resources, and thereby reduces the resources you can use for serving Web pages. However, the impact on system resources is not great, and a software solution can handle up to 32 front-end Web servers.
Hardware, such as a router or a switch box. Load-balancing hardware uses the network to direct Web site traffic between the front-end Web servers. Load-balancing hardware is more expensive to set up than software, but it does not affect resources on the front-end Web servers. Office Project Server 2007 can be used with any load-balancing hardware.
A third method, round-robin load balancing with Domain Name System (DNS), is not recommended for use with Office Project Server 2007. Round-robin DNS load balancing can use significant resources on the front-end Web servers and is slower than either load-balancing software or hardware. It also does not take session load into account when routing a user to a server, which can lead to a server being overloaded.
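The difference between routing that ignores session load and routing that accounts for it can be shown with a small simulation. This sketch does not describe how NLB or any hardware load balancer works internally; the least-loaded policy, the server names, and the session costs are assumptions used only for comparison.

```python
from itertools import cycle
from typing import Dict, List


def round_robin(servers: List[str], session_costs: List[int]) -> Dict[str, int]:
    """Assign sessions to servers in strict rotation, ignoring how heavy each session is."""
    load = {server: 0 for server in servers}
    for server, cost in zip(cycle(servers), session_costs):
        load[server] += cost
    return load


def least_loaded(servers: List[str], session_costs: List[int]) -> Dict[str, int]:
    """Assign each session to the server that currently carries the least load."""
    load = {server: 0 for server in servers}
    for cost in session_costs:
        target = min(servers, key=lambda server: load[server])
        load[target] += cost
    return load


if __name__ == "__main__":
    servers = ["web1", "web2"]        # hypothetical front-end Web servers
    sessions = [10, 1, 10, 1, 10, 1]  # hypothetical session costs, alternating heavy and light
    print(round_robin(servers, sessions))
    print(least_loaded(servers, sessions))
```

With this input, strict rotation sends every heavy session to the same server, while the load-aware policy spreads the work more evenly.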
You can provide some fault tolerance for your Office Project Server 2007 deployment by deploying additional hardware configurations that duplicate the hardware configuration of your organization. In this way, if one path of data input/output (I/O), or the physical hardware components of a server (such as computer, network, and storage area network components) fail, the system is not affected. The hardware that you use to minimize the single points of failure varies according to what components you want to make redundant. Hardware vendors typically include duplicate hardware as part of their storage solution.
By using a redundant array of independent disks (RAID), you can increase the fault tolerance of your Office Project Server 2007 deployment. RAID stores identical data on multiple disks for redundancy, improved performance, and increased mean time between failures (MTBF). In a RAID configuration, part of the physical storage capacity contains redundant information about data stored on the hard disks. The redundant information is either parity information (in the case of a RAID-5 volume), or a complete, separate copy of the data (in the case of a mirrored volume). The redundant information enables data regeneration if one of the disks or the access path fails, or if a sector on the disk cannot be read.
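Parity-based regeneration can be illustrated with the XOR operation, which is the basis of RAID-5 parity. The following sketch is a simplification: it treats three short byte strings as the data blocks of one stripe, keeps parity in a single block, and does not rotate parity across disks as real RAID-5 volumes do. The names and sample data are hypothetical.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    data = [b"PROJ", b"SERV", b"2007"]  # one stripe spread over three data disks
    parity = xor_blocks(data)           # redundant information stored on a fourth disk

    # Simulate losing the second disk, then regenerate its block from the survivors.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print(rebuilt == data[1])
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block, regardless of which single disk failed.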
To ensure that computers running Office Project Server 2007 continue to function properly in the event of a single-disk failure, you can use RAID disk mirroring or disk striping with parity on the hard disks within your Office Project Server 2007 deployment. Disk mirroring and disk striping with parity create redundant data for the data on your hard disks.
The Office Project Server 2007 databases are very I/O intensive. For this reason, we recommend RAID 10 for optimal performance and redundancy of drives containing Office Project Server 2007 databases.
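Redundancy costs storage capacity, and the cost differs by RAID level. The sketch below estimates usable capacity for a two-disk mirror, a RAID-5 volume, and a RAID 10 volume, assuming identical disks; the function name, level labels, and disk sizes are hypothetical.

```python
def usable_capacity_gb(level: str, disk_count: int, disk_size_gb: float) -> float:
    """Return the usable capacity of an array of identical disks for a given RAID level."""
    if level == "mirror":
        # A two-disk mirror stores a complete copy of the data on each disk.
        if disk_count != 2:
            raise ValueError("this sketch models a mirror as exactly 2 disks")
        return disk_size_gb
    if level == "raid5":
        # Striping with parity uses the equivalent of one disk for parity information.
        if disk_count < 3:
            raise ValueError("RAID-5 needs at least 3 disks")
        return (disk_count - 1) * disk_size_gb
    if level == "raid10":
        # Striped mirrors: half of the disks hold mirror copies.
        if disk_count < 4 or disk_count % 2:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        return disk_count // 2 * disk_size_gb
    raise ValueError(f"unsupported level: {level}")


if __name__ == "__main__":
    for level, count in (("mirror", 2), ("raid5", 4), ("raid10", 4)):
        print(level, count, usable_capacity_gb(level, count, 500))
```

With the same number of disks, RAID 10 yields less usable space than RAID-5, which is part of the cost of choosing it for performance and redundancy.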
Using RAID configurations does not prevent damaged files or other file errors. For this reason, do not use RAID configurations as a substitute for keeping current backups of important data on your servers.
Because transaction log files and database files are critical to the operation of computers running Office Project Server 2007, you can keep the transaction log files and database files on separate physical disks. You can also use RAID disk mirroring or disk striping with parity to prevent the loss of a single physical hard disk from causing a failure in your Office Project Server 2007 database.
If your environment contains a Storage Area Network (SAN), you may already have the needed disk redundancy for your deployment. In a SAN environment, we recommend not placing your Office Project Server 2007 deployment and its associated components on the same disk spindle as other I/O intensive applications, as this may cause performance degradation. Office Project Server 2007 data is optimized for sequential reads, making it ideal for a SAN environment.
Server role redundancy
Which baseline server topology you choose depends on the requirements for redundancy of application server roles. This section describes the application server roles relative to their redundancy options.
Roles that can be redundant
These application server roles can be deployed to multiple servers. The code that is deployed to each server is identical and the application server roles do not store any data. In other words, each instance of these server roles remains identical. If one of the server computers fails, no saved data is lost. The Web servers automatically load balance requests to these server roles across the available application server computers.
The Office Project Server 2007 Project Application Service can be deployed redundantly. This allows for greater throughput for Project Web Access (PWA) data requests and can increase the capacity of your deployment. However, deploying the Project Application Service on multiple servers does not increase the availability of the farm. If one of the servers fails, the farm will not automatically detect the failure and will continue to send requests to the failed Project Application Service server until it is manually removed from the farm.
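Because a failed Project Application Service server must be removed manually, external monitoring can help shorten the time to detect the failure. The following is a sketch of the idea only: it is not a Project Server feature, the server names are hypothetical, and the health probe is passed in as a function so that you can supply whatever check suits your environment.

```python
from typing import Callable, Iterable, List, Tuple


def split_by_health(
    servers: Iterable[str], is_healthy: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Partition servers into (healthy, failed) lists using the supplied probe."""
    healthy: List[str] = []
    failed: List[str] = []
    for name in servers:
        (healthy if is_healthy(name) else failed).append(name)
    return healthy, failed


if __name__ == "__main__":
    # Hypothetical farm and a stub probe that reports "app2" as down.
    farm = ["app1", "app2", "app3"]
    healthy, failed = split_by_health(farm, lambda name: name != "app2")
    for name in failed:
        print(f"{name} is not responding; remove it from the farm manually")
```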
Roles that cannot be redundant
Some application server roles you can enable in Office Project Server 2007 cannot be redundant, such as Windows SharePoint Services 3.0 search. This application server role can be deployed to multiple servers; however, the multiple servers are not redundant. This server role is configured to crawl content and generate content indexes. If you deploy this role to multiple servers, each server crawls different content.
Database server redundancy
The database server role affects the availability of your solution more than any other role. If a Web server or an application server fails, these roles can quickly be restored or redeployed. However, if a database server fails, your solution depends on restoring the database server. This can potentially include rebuilding the database server and then restoring data from the backup media. In this case, you can potentially lose any new or changed data dating back to the last backup job, depending on how SQL Server is configured. Additionally, the solution will be completely unavailable for the time it takes to restore the database server role.
In any system, we recommend that you work with hardware vendors to procure fault-tolerant hardware that is appropriate for the system, including RAID arrays.
When planning for component fault tolerance, consider the following:
Complete redundancy of every component within a server may not be possible or practical. Use additional servers for additional redundancy.
Ensure that servers have multiple power supplies connected to different power sources for maximum redundancy.
With Microsoft SQL Server, you can use log shipping to feed transaction logs from one database to another continuously. Continually backing up the transaction logs from a source database and then copying and restoring the logs to a destination database keeps the destination database synchronized with the source database. Log shipping provides an automated method of maintaining a standby server.
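The back up, copy, and restore cycle can be pictured with a toy model. This sketch does not use SQL Server or its log format: the class, its methods, and the sample data are hypothetical stand-ins that only show how replaying shipped log entries keeps a standby in step with the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyDatabase:
    """A stand-in for a database: current rows plus log entries not yet backed up."""
    rows: Dict[str, str] = field(default_factory=dict)
    log: List[Tuple[str, str]] = field(default_factory=list)

    def write(self, key: str, value: str) -> None:
        """Apply a change and record it in the transaction log."""
        self.rows[key] = value
        self.log.append((key, value))

    def backup_log(self) -> List[Tuple[str, str]]:
        """Return the pending log entries and clear them, like a log backup."""
        shipped, self.log = self.log, []
        return shipped

    def restore_log(self, shipped: List[Tuple[str, str]]) -> None:
        """Replay shipped log entries in order on a standby copy."""
        for key, value in shipped:
            self.rows[key] = value


if __name__ == "__main__":
    primary, standby = ToyDatabase(), ToyDatabase()
    primary.write("task-1", "Design")
    primary.write("task-1", "Design review")
    standby.restore_log(primary.backup_log())  # back up, copy, restore
    print(standby.rows == primary.rows)

    primary.write("task-2", "Build")           # not yet shipped
    print(standby.rows == primary.rows)
```

The second comparison shows that the standby lags the source by whatever has changed since the last shipped log, so the shipping interval bounds the potential data loss.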
A standby server is a second server that can be brought online if a primary production server fails. The same software components that are installed on the primary server are installed on the standby server. Using a standby server allows users to continue working with Office Project Server 2007 data if the primary server becomes unavailable.
A standby server can also be used when a primary server is unavailable due to scheduled maintenance. For example, if you must take the primary server offline for a hardware or software upgrade, you can use the standby server until the primary server is brought back online.
The most important factor to consider when using standby servers is that the hardware, software updates, and firmware updates on a standby server must be identical to those of the primary server that the standby server is designed to replace.
If the standby server is a database server, it must contain a copy of the databases on the primary server. If the primary server goes offline and the standby server is brought online, when the primary server becomes available again, any changes to the copies of the database that are located on the standby server must be copied back to the primary server. Otherwise, those changes are lost. When users start using the primary server again, the databases on the primary server should be backed up and copied to the standby server.
Log shipping is best used to ensure that the standby server remains synchronized with the primary server. If the primary server fails, or even if just a single database fails, the databases on the standby server can be made available to user processes. Any user processes that cannot access the primary server must use the standby server instead.
If you have separate front-end Web servers as part of your deployment, you can install the Project Application Service on the front-end Web servers and leave the service turned off. Then, if one of your Office Project Server 2007 servers fails, you can activate the Project Application Service on a front-end Web server to quickly bring a standby server online.