Step 3: Determine Role Placement and Fault Tolerance
Published: November 12, 2007 | Updated: February 25, 2008
If the SoftGrid instance is supporting critical applications, it can be deployed using several methods to increase fault tolerance. There are different strategies for different server roles in the SoftGrid environment. SQL Server, for example, is made highly available by deploying it in an active-passive MSCS cluster. The VAS’s high availability is based on creating a load-balanced array of VASs. Active Directory has built-in high availability, and the Management Web Service can also be installed on two or more separate servers to provide redundant management points.
A number of services are important to the functionality of the SoftGrid infrastructure. To increase the reliability and availability of these services, additional technology can be used to increase each component system’s fault tolerance. Although there are a number of fault tolerance strategies and technologies available, not all are applicable to a given service. Additionally, if SoftGrid roles are combined, certain fault tolerance options may no longer apply due to incompatibilities.
SoftGrid does not currently support the full range of fault-tolerant solutions in the market. For example, there are several additional methods for making Microsoft SQL Server fault tolerant beyond what is given below; however, these additional methods are not applicable to the SoftGrid system. Finally, component-level fault tolerance, such as RAID systems, is not discussed at a SoftGrid role level. Below is a list of the major options for fault tolerance in SoftGrid:
Ignoring scaling and fault tolerance requirements, the minimum number of servers needed for a location with connectivity to Active Directory is one. This server will host the Content Storage System, Management Web Service, Microsoft SQL Server, and the Virtual Application Server roles. Server roles, therefore, can be arranged in any desired combination since they do not conflict with one another.
Ignoring scaling requirements, the minimum number of servers necessary to provide a fault-tolerant implementation is four when Active Directory is already present in the environment with multiple domain controllers. The Content Storage System, SQL Server, and Virtual Application Server are all capable of being placed in fault-tolerant configurations. The Management Web Service can be combined with any of the roles, but remains a single point of failure.
Table 3. Compatible Fault Tolerant Role Combinations
Task 1: Active Directory
Active Directory provides group security and access control to applications. When the SoftGrid client connects to a VAS to request applications that can be accessed by the user currently logged on, it passes group membership information to the VAS in the form of the user’s Windows security token. The VAS in turn uses Active Directory to track permissions on applications, and provides access only to the applications for which the user has been granted permissions. The client then streams those applications down to the client’s computer. If a user’s permissions are modified to remove them from a group that is associated with a particular application, the next time he or she attempts to launch the application, access will be denied.
A domain controller should be located close to the SoftGrid servers, ideally within the same location, in order to efficiently handle the requests made by the SoftGrid system.
If Active Directory is unavailable while the VAS is still running, clients will be unable to launch applications. If both the VAS and Active Directory are not working, clients will enter disconnected mode and can launch any applications that they had previously launched successfully. If a VAS service is unable to contact Active Directory during startup, the service will fail to run.
Fault tolerance for domain controllers can be accomplished by adding additional domain controllers to the infrastructure. Guidance for adding additional domain controllers is outside the scope of this guide.
Decision 2: Content Storage System
If the content storage system is unavailable, clients will be unable to install new applications or update existing applications. The content storage system can be placed locally with the VAS or on a shared storage device such as Network Attached Storage (NAS) or a file server. If the content is stored in a shared location, then ensuring the availability of the data is critical.
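The failure behavior described above for Active Directory, the VAS, and the content storage system can be summarized as a small decision function. The following Python sketch is illustrative only: the names `ClientOutcome`, `client_outcome`, and `can_install_or_update` are hypothetical and are not part of SoftGrid, and the case where the VAS is down while Active Directory is still available is an assumption, because the guide does not describe it explicitly.

```python
from enum import Enum


class ClientOutcome(Enum):
    CONNECTED = "launches are checked against Active Directory"
    LAUNCH_DENIED = "clients cannot launch applications"
    DISCONNECTED = "disconnected mode: previously launched applications only"


def client_outcome(ad_up: bool, vas_up: bool) -> ClientOutcome:
    """Map Active Directory and VAS availability to the client experience."""
    if vas_up and ad_up:
        return ClientOutcome.CONNECTED
    if vas_up and not ad_up:
        # Stated in the guide: VAS running without Active Directory
        # means clients are unable to launch applications.
        return ClientOutcome.LAUNCH_DENIED
    if not vas_up and not ad_up:
        # Stated in the guide: with both down, clients enter disconnected mode.
        return ClientOutcome.DISCONNECTED
    # ASSUMPTION (not stated in the guide): a VAS outage with Active Directory
    # still available is treated like any other loss of the VAS.
    return ClientOutcome.DISCONNECTED


def can_install_or_update(content_store_up: bool, outcome: ClientOutcome) -> bool:
    """New applications and updates need a connected client and the content store."""
    return content_store_up and outcome is ClientOutcome.CONNECTED


if __name__ == "__main__":
    for ad_up in (True, False):
        for vas_up in (True, False):
            outcome = client_outcome(ad_up, vas_up)
            print(f"AD up={ad_up}, VAS up={vas_up}: {outcome.value}")
```

Writing the cases out this way makes it easier to see which component outages block all launches and which only block new installations and updates.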
Care must be taken to ensure that the network path between the content storage system and the VASs has sufficient bandwidth. The disk I/O subsystem and NIC in the file server/SAN/NAS that hosts the content must have sufficient I/O throughput to handle several VASs reading concurrently from the content share. If the content is stored locally to the VAS or if the virtual application packages need to be shared across locations, file replication solutions are recommended for keeping the content shares synchronized. The directory or share that is used to store the SoftGrid-enabled application packages is referred to as the SoftGrid Content directory.
Option 1: Built-In
If the content storage system is a Network Attached Storage (NAS), then typically these systems provide multiple methods for ensuring data availability, reliability, and scalability. They can provide multiple hardware paths to the data, low-level RAID capabilities, and fault-tolerant hardware. From a client system point of view, the storage appears as a remote share in the case of NAS. In order for Storage Area Networks to be fault tolerant, the file server hosting the share needs to be made fault tolerant as well.
Option 2: Fault-Tolerant File Replication
If each VAS hosts a local Content Storage System within a single location, then file replication can be used to keep the multiple storage systems synchronized. Likewise, if the information in the content storage system needs to remain the same across locations, file replication can be used as well. If DFS is being used, then the availability of the data is increased because DFS is able to redirect requests to another copy of the share if the targeted location is offline. It is important to note that FRS and DFS-R are not supported on a Server Cluster, although DFS is supported.
Option 3: Server Clustering
Server clustering can be used to increase the fault tolerance of a single content storage system file share.
The file share becomes a clustered resource running on a cluster with two or more computers. If the computer hosting the file share fails, the file share will move to a remaining active node. Although a share hosted in a server cluster can become part of a DFS namespace, the content of the share cannot be replicated using FRS or DFS-R.
Evaluating the Characteristics
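One characteristic worth checking when comparing the content storage options is the compatibility rule stated above: a share on a server cluster may join a DFS namespace, but its content cannot be replicated with FRS or DFS-R. The following Python sketch encodes only that rule. The `ContentShare` type and `validate` function are hypothetical names introduced for illustration and do not correspond to any SoftGrid or Windows API.

```python
from dataclasses import dataclass
from typing import List, Optional

# Replication engines named in the guide.
REPLICATION_ENGINES = {"FRS", "DFS-R"}


@dataclass(frozen=True)
class ContentShare:
    name: str
    on_server_cluster: bool = False
    in_dfs_namespace: bool = False
    replication: Optional[str] = None  # "FRS", "DFS-R", or None


def validate(share: ContentShare) -> List[str]:
    """Return a list of problems; an empty list means no rule was violated."""
    problems = []
    if share.replication is not None and share.replication not in REPLICATION_ENGINES:
        problems.append(f"{share.name}: unknown replication engine {share.replication!r}")
    if share.on_server_cluster and share.replication in REPLICATION_ENGINES:
        # Stated in the guide: FRS and DFS-R are not supported on a server cluster.
        problems.append(
            f"{share.name}: {share.replication} is not supported on a server cluster"
        )
    # A clustered share that is only a DFS namespace target is allowed.
    return problems


if __name__ == "__main__":
    shares = [
        ContentShare("content-a", on_server_cluster=True, in_dfs_namespace=True),
        ContentShare("content-b", on_server_cluster=True, replication="DFS-R"),
        ContentShare("content-c", in_dfs_namespace=True, replication="DFS-R"),
    ]
    for share in shares:
        print(share.name, validate(share) or "ok")
```

In this sketch only `content-b` is flagged, because it combines server clustering with DFS-R replication.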
Task 3: Management Web Service
Any server may host the Management Web Service as long as it can communicate with the SoftGrid database and Active Directory. The Management Web Service reads and writes configuration data to the SoftGrid database and queries Active Directory for group membership information. In smaller installations, this Web service is typically installed on the VAS. The Management Console can be placed on the same management server or on an administrator’s workstation.
An important consideration when the Management Web Service is placed on the VAS is the negative performance impact that report generation can have in large environments. For this reason, it is recommended to have a dedicated server to host the SoftGrid Management Web Service in large environments that will be running reports.
The Management Web Service is only used to configure the SoftGrid environment. If the Web service fails, the SoftGrid system will continue to function normally with the exception of SoftGrid management changes and reporting. In the event of a Management Web Service failure, the Management Console can be used to redirect the system to use another instance of the Management Web Service in the environment. Although multiple instances of the Management Web Service can be run in a single SoftGrid instance, no testing has been done with providing fault tolerance to the Management Web Service, and therefore it is not officially supported at this time.
Task 4: Microsoft SQL Server
SoftGrid requires SQL Server 2000 or SQL Server 2005. Data about the application, license management, and report data are kept in the SQL Server database. In locations where fault tolerance is not required, the SQL Server-based server can be installed on the same server as the Management Web Service and VAS.
SQL Server provides a number of mechanisms for fault tolerance.
These include Database Mirroring, Log Shipping, Server Clustering, and Peer-to-Peer Replication. Although all of these provide some form of increased fault tolerance for a database, the only supported method for SoftGrid today is server clustering.
If the SoftGrid database is unavailable, no configuration changes can be made to the SoftGrid system. VASs that are currently running will continue to service clients. However, the VAS service will fail to run if the database is unavailable during startup.
Clustering SQL Server-based servers will increase the complexity of the environment. Creating a new cluster using Microsoft Cluster Services will require additional servers with the appropriate hardware to support the cluster service. MSCS will also require a shared storage device that can be locally attached to the servers running SQL Server, thus increasing the costs of deploying SoftGrid. Because the load introduced by SoftGrid is extremely low, an existing SQL Server-based server cluster can be used to host the SoftGrid configuration database to provide fault tolerance at minimal cost.
Decision 5: Virtual Application Server
VASs perform a critical role in the SoftGrid infrastructure. They are the servers that have direct connectivity to the client workstations; they are also responsible for streaming applications to the clients. The VAS role must be deployed in the same location and, if possible, on the same fast LAN as the SQL Server role in order to ensure good connectivity between the VAS and the SoftGrid configuration information that is stored in the SQL Server database. In locations where fault tolerance is not required, the VAS can be deployed to the same server as SQL Server and the Management Web Service.
Fault tolerance is achieved by load balancing VASs. Some load balancing solutions will provide fault tolerance at the machine level. That is, if the entire machine fails, the load balancing cluster will no longer send requests to that system.
However, application failures are not recognized, so client requests will still be sent to the affected server. Other load balancing solutions will provide higher levels of fault tolerance by recognizing when the application layer has stopped responding. This will ensure that the remaining servers will continue to handle the streaming functionality in the event that a server fails.
N+1 or greater VAS redundancy is required to provide fault tolerance. The VAS service can use a Network Load Balancing system to provide additional fault tolerance to the system. The VAS is not cluster aware, nor has it been tested on a server cluster, so running the VAS on a server cluster is not supported at this time. There are two network load balancing options available: software-based NLB and hardware load balancer.
Option 1: Software-based NLB
NLB is a cost-effective method for providing load balancing as well as a basic level of fault tolerance and scalability. NLB does not query the health of the Real Time Streaming Protocol (RTSP) service on the VAS. This can lead to a situation where the VAS appears healthy because the NLB heartbeat is detected; however, the VAS service is down and will not answer client requests. Although up to 32 systems can be placed in a single software-based NLB cluster using Microsoft NLB, it has been observed in production that the effective performance of the system drops for cluster groups containing more than six members, so independent verification testing should be conducted if needed.
Option 2: Hardware Load Balancer
To provide access to the VAS array of servers and to automatically recognize when a VAS has stopped responding to requests, a hardware load-balancing solution that supports Hypertext Transfer Protocol (HTTP) and RTSP is required. This level of configuration adds complexity to the overall deployment of the SoftGrid servers. Hardware load balancers also add costs to the SoftGrid solution. Two or more hardware load balancers are necessary.
If only one is implemented, then the hardware load balancer becomes the single point of failure. Because client connections are handled by specialized hardware, hardware load balancers tend to scale to handle more concurrent client sessions than software-based load balancers.
Evaluating the Characteristics
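A key characteristic separating the two load balancing options described above is whether a failed VAS service, as opposed to a failed machine, is detected. The following Python sketch simulates that difference. It is a simplified model, not an interface to Microsoft NLB or to any hardware load balancer: `VasNode` and the helper functions are hypothetical names, and `meets_n_plus_one` assumes that N means the number of VASs needed to carry the expected load, which the guide does not define.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VasNode:
    name: str
    machine_up: bool = True  # host still answers the load balancer heartbeat
    service_up: bool = True  # VAS service still answers streaming requests


def machine_level_targets(nodes: List[VasNode]) -> List[str]:
    """Machine-level detection: any host with a heartbeat keeps receiving requests."""
    return [n.name for n in nodes if n.machine_up]


def application_level_targets(nodes: List[VasNode]) -> List[str]:
    """Application-level detection: the service itself must also respond."""
    return [n.name for n in nodes if n.machine_up and n.service_up]


def meets_n_plus_one(deployed: int, needed_for_load: int) -> bool:
    """ASSUMPTION: N is the number of VASs required to carry the load."""
    return deployed >= needed_for_load + 1


if __name__ == "__main__":
    nodes = [
        VasNode("vas1"),
        VasNode("vas2", service_up=False),                    # service down, host up
        VasNode("vas3", machine_up=False, service_up=False),  # whole machine down
    ]
    print("machine-level targets:", machine_level_targets(nodes))
    print("application-level targets:", application_level_targets(nodes))
    print("N+1 met with 3 deployed, 2 needed:", meets_n_plus_one(3, 2))
```

In this model, machine-level detection still routes clients to `vas2` even though its service is down, while application-level detection routes only to `vas1`.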
Validating with the Business
Decision Summary
This step should be repeated for each SoftGrid instance required. At this point, the requirements around fault tolerance will have been identified as well as the implementation to meet those requirements for a given SoftGrid instance. Fault tolerance for SoftGrid in Connected Mode provides a system that is able to service client requests for new applications or updates. Applications that have previously been cached will run in a disconnected mode in the event of an infrastructure failure.
Additional Reading