Choosing applications to run on a server cluster

Article
10/08/2009

Applies To: Windows Server 2003, Windows Server 2003 R2, Windows Server 2003 with SP1, Windows Server 2003 with SP2

Many, but not all, applications can be adapted to run on a server cluster. Of those that can, not all need to be set up as cluster resources. This section offers guidelines for making these decisions.

Important

All applications running in a server cluster must be from a trusted source and all files, registry checkpoints, and other resources needed for those applications must be in a secure location. For more information, see Best practices for securing server clusters.

Three criteria determine whether an application can adapt to server clustering failover mechanisms:

The application must use an IP-based protocol.

Client/server applications must use an IP-based protocol (TCP, UDP, DCOM, Named Pipes, or RPC over TCP/IP) for their network communications to run on a server cluster. Any application that uses only NetBEUI or IPX protocols cannot take advantage of cluster failover.
The application must be able to specify where the application data is stored.

Any application you run on a server cluster must be able to store its data in a configurable location, that is, on the disks attached to shared buses. Some applications that cannot store their data in a configurable location can still be configured to fail over. However, in such cases access to the application data is lost at failover because the data is available only on the disk of the failed node.
Client applications that connect to the server application must retry and recover from temporary network failures.

During failover, client applications will experience a temporary loss of network connectivity. If the client application is configured to recover from temporary network connection problems, it will be able to continue operating after a server failover.

Applications that can be failed over can be further divided into two groups: those that support the Cluster API and those that do not.

Applications that support the Cluster API (for example, Microsoft SQL Server 2000) are defined as cluster aware. These applications can register with the Cluster service to receive status and notification information, and they can use the Cluster API to administer clusters.

Important

Only Microsoft SQL Server 2000 Failover Clusters and subsequent versions are supported in a server cluster running Windows Server 2003, Enterprise Edition or Windows Server 2003, Datacenter Edition.

Applications that do not support the Cluster API are defined as cluster unaware. If cluster-unaware applications meet the IP-based protocol and configurable data location criteria described earlier, they can still be used in a cluster and often can be configured to fail over.

In either case, applications that keep significant state information in memory are not the best applications for clustering because information that is not stored on disk is lost at failover.

Finally, it is important to note that in order for an application to be certified by Microsoft as cluster compliant, the application must meet certain requirements. Those requirements are as follows:

Summary of Cluster service requirements

Note

Applications that do not meet these requirements are eligible for the Windows Server 2003, Standard Edition certification only.

Rationale

A server cluster is a group of independent servers managed as a single system for higher availability. Cluster service is a set of system services in Windows Server 2003, Enterprise Edition and Windows Server 2003, Datacenter Edition that enable you to form server clusters by connecting multiple servers together, making them appear to network clients as a single, highly available system.

Cluster service can automatically detect the failure of an application or server, and restart the application, either on the same server if it is still alive, or on another surviving server.

These requirements help ensure that your application will run properly with Cluster service enabled, so that:

Your server application can fail over to other servers.
The client side of your application properly handles failure of the server application.

Customer benefits

Customers who run your application in a clustered environment can achieve higher availability because your application can continue to provide service during both planned downtime (such as hardware and software upgrades) and unplanned outages (such as hardware or software failure).

When one of the systems, or nodes, in the cluster fails or becomes unavailable, Cluster service transfers its workload to another system in the cluster. Users experience only a momentary pause in service. You can also configure Cluster service to provide failback so that when the failed server comes back online, the workload is rebalanced across the server cluster.

Requirements

Applications must be able to be installed on up to eight nodes for certification on Windows Server 2003, Enterprise Edition or Windows Server 2003, Datacenter Edition.
Applications must support failover to all cluster members.
Clients must survive failure of the server application without failing or affecting the stability of the system.

How to comply with Cluster service requirements

Applications must be able to be installed on up to eight nodes for certification on Windows Server 2003, Enterprise Edition or Windows Server 2003, Datacenter Edition.

Note
- Make certain your application setup does not make any assumptions about the number of nodes in the cluster. Verify that it enumerates all nodes in the cluster and allows installing your application on any node, even if the disks where your application data is stored are not physically located on that node.
Applications must support failover to all cluster nodes.

When a node in the cluster fails, Cluster service will move the resource group from that node to a new node. A resource group is a collection of resources that provide services to clients and can depend on each other.

Resources that represent the primary functionality of your application must be able to start up (come online) on any other node in the cluster. After a failover is complete, verify that clients are able to access all data exposed by primary functions.

Note
- Cluster service operates under a "shared nothing" architecture in which each server owns its own disk resources. In the event of a server failure, ownership of the clustered disk is transferred from one server to another. For applications to properly support failover, the application's data must be stored on the clustered disk.
Clients must survive failure of the server application without failing or affecting the stability of the system.

Any client that ships with your server application must gracefully handle both cluster node failures and application failures. Cluster and application failures may cause clients to temporarily lose their connection to the server application (see the next item). Your client must survive both the failure of the server application and node failure as follows:
- When a connection to the server application is lost, your client application must not stop responding or compromise the stability of the client operating system.
- After the failover is complete and the application is restarted on a cluster node, your client must reconnect to the cluster in one of the following ways:

Reestablish the lost connection without user intervention and with no loss of data

--or--

Offer the user a chance to reconnect and retry the operation that failed; for example, prompt the user to refresh the data in the client

If the server application is not able to restart, the client must inform the user that the connection could not be reestablished.

Connections to the server application can be lost for any of the following reasons:

The application fails and is then restarted on the same node.
The application fails and is restarted on a new node.
The node fails, and all resources fail over to a new node.
The administrator moves the resource group containing the application to a new node.
The administrator shuts down the server application.
All nodes in the cluster fail.
The client's network connection to the cluster is interrupted, even though the cluster and the server application are still running.

These failures might be exposed to the client application as application time-outs, invalid handles, network failures, and connection time-outs.
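
The client behavior described above (retry automatically, then report to the user if the server application cannot be reached) can be sketched in C as a bounded retry loop. This is a minimal illustration under stated assumptions, not part of the Cluster API: the names `reconnect_with_retry`, `connect_fn`, `wait_fn`, and the `fake_server` stub are hypothetical, and a real client would attempt its connection to the virtual server name inside the callback.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome the client acts on after a lost connection. */
typedef enum {
    RECONNECTED,        /* connection reestablished without user action */
    SERVER_UNAVAILABLE  /* retries exhausted; inform the user */
} reconnect_result;

/* Hypothetical callback: returns nonzero when one connection attempt to the
   virtual server succeeds. */
typedef int (*connect_fn)(void *context);

/* Hypothetical callback: waits before the next attempt (for example, a sleep). */
typedef void (*wait_fn)(unsigned delay_ms, void *context);

reconnect_result reconnect_with_retry(connect_fn try_connect, wait_fn wait,
                                      void *context, unsigned max_attempts,
                                      unsigned initial_delay_ms,
                                      unsigned max_delay_ms)
{
    unsigned delay_ms = initial_delay_ms;
    unsigned attempt;

    assert(try_connect != NULL);
    if (delay_ms > max_delay_ms) {
        delay_ms = max_delay_ms;
    }

    for (attempt = 0; attempt < max_attempts; ++attempt) {
        if (try_connect(context)) {
            return RECONNECTED;
        }
        /* Wait between attempts, but not after the final one. */
        if (wait != NULL && attempt + 1 < max_attempts) {
            wait(delay_ms, context);
        }
        /* Double the delay for the next attempt, capped at max_delay_ms. */
        if (delay_ms > max_delay_ms / 2) {
            delay_ms = max_delay_ms;
        } else {
            delay_ms *= 2;
        }
    }
    return SERVER_UNAVAILABLE;
}

/* Stub used only to exercise the loop: fails a fixed number of times,
   as a server application might while it restarts on another node. */
struct fake_server {
    int      failures_remaining;
    int      attempts;
    unsigned total_wait_ms;
};

int fake_connect(void *context)
{
    struct fake_server *s = context;
    s->attempts++;
    if (s->failures_remaining > 0) {
        s->failures_remaining--;
        return 0;
    }
    return 1;
}

void fake_wait(unsigned delay_ms, void *context)
{
    struct fake_server *s = context;
    s->total_wait_ms += delay_ms; /* record the delay instead of sleeping */
}
```

A client would typically call a routine like this when it sees a time-out or an invalid handle. On `SERVER_UNAVAILABLE` it would tell the user that the connection could not be reestablished and let the user preserve any unsaved data.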

Development guidelines

The guidelines in this section are not requirements that will be tested individually for certification. However, following these guidelines will help you meet the requirements described previously:

Use TCP/IP protocol

Services that communicate with clients (as well as their clients) must use TCP/IP to be able to take advantage of IP address failover provided by Cluster service. Services that do not communicate with clients do not need to use TCP/IP.
Make sure that applications use a virtual server name and IP address to connect to the node hosting the server application.

Clients communicating with the cluster resources must use the virtual server IP address or virtual server network name to support failover.

If the server application publishes a network name or IP address to clients, it must publish the IP address or network name of the virtual server. A server application that depends on a computer name or an IP address should use the network name or IP address of the virtual server that clients use to access the application. Make sure that your server application does not fail to restart on another node because the computer name on that node is different.

The following code example illustrates how to set the server application environment as part of your resource DLL online routine.
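```
//
// Set up the service environment so that the service sees the simulated
// (virtual server) network name when it calls GetComputerName.
//
// This is a fragment of a resource DLL online routine: the break below
// exits the enclosing loop or switch of that routine on failure.
//
if ( ! ClusWorkerCheckTerminate( pWorker ) )
{
    nStatus = ResUtilSetResourceServiceEnvironment(
                  YOUR_SERVICE_NAME,
                  pResourceEntry->hResource,
                  g_pfnLogEvent,
                  pResourceEntry->hResourceHandle
                  );
    if ( nStatus != ERROR_SUCCESS )
    {
        break;
    } // if: error setting the environment for the service
} // if: the worker thread has not been asked to terminate
```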
About IP address failover

Client applications use a virtual server IP address to access services running on a server cluster. A virtual server is a cluster resource group containing an IP address and a network name. A virtual server can be brought online on any node in the cluster. However, it appears to clients accessing it as the same physical computer.

The IP address of the virtual server has to be configured as a cluster resource in the same resource group where the server application was created. In case of a node failure, all resource groups running on this node are moved to another node in the cluster. The IP address of the virtual server is now available on another node and all connections with the clients can be reestablished.
Upon failure, clients must preserve user data.

The client application must be able to reconnect and resume an operation in the event of cluster node failure. It must either offer the user a chance to retry the connection, or it must retry the connection automatically until it succeeds or can determine that the server application could not be brought online.

In case of a node failure, all resource groups running on the failed node are moved to another node in the cluster. Cluster service requires some time to bring all resources online and restart services on the other node. The time needed to fail over a server application depends on many factors. The most significant is the time required to restart the application.
The location of application data must be configurable.

Cluster service can fail over only disks managed by the cluster that are on the storage bus shared among all nodes in the cluster. Make sure that your application setup allows for selecting the drive and installing application data on any drive. Cluster-aware setup should allow installing the application data only on a shared drive managed by the cluster.
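
As a small illustration of a configurable data location, the following C sketch builds every data file path from a root directory supplied by setup or configuration, rather than from a hard-coded local drive. The function name `build_data_path` and the example paths are hypothetical; the only assumption is that the configured root points at a drive managed by the cluster.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "<data_root>\<file_name>" into out.
   data_root is expected to come from setup or configuration (for example,
   a folder on a shared cluster disk), never from a compiled-in constant.
   Returns 0 on success, -1 if the root is empty or the buffer is too small. */
int build_data_path(const char *data_root, const char *file_name,
                    char *out, size_t out_size)
{
    size_t root_len;
    const char *separator;
    int written;

    assert(data_root != NULL && file_name != NULL && out != NULL);

    root_len = strlen(data_root);
    if (root_len == 0 || out_size == 0) {
        return -1;
    }

    /* Do not double the separator when the root already ends with one. */
    if (data_root[root_len - 1] == '\\' || data_root[root_len - 1] == '/') {
        separator = "";
    } else {
        separator = "\\";
    }

    written = snprintf(out, out_size, "%s%s%s", data_root, separator, file_name);
    if (written < 0 || (size_t)written >= out_size) {
        return -1; /* output was truncated or formatting failed */
    }
    return 0;
}
```

Routing every file access through one helper of this kind makes it easier to confirm that no data is written to a disk that stays behind on a failed node.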
Checkpoint either automatically or manually the state information required for a clean restart.

If a server application maintains any state information required for a clean restart, make sure that it checkpoints this state information frequently to a disk managed by the cluster. The application should use this data to recover quickly after a failure.
Upon failure, an application can be restarted and, if applicable, recover to the last checkpoint.

The server application must recover from a node failure. Make sure that a sudden node failure (for example, a power blackout) will not leave your application in a state in which it cannot restart.

After a node failure, Cluster service moves the server application running on the node to another node along with any other resources it may depend on. The server application must restart, recover, and resume operation in the time you specified in your product literature.
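
One way to keep a checkpoint usable after a sudden node failure is to write the new state to a temporary file and then replace the previous checkpoint in a single step, so that a restart finds either the old checkpoint or the new one. The following C sketch illustrates that idea only; it is not the Cluster service checkpoint mechanism, and the `app_state` structure, the function names, and the file layout are hypothetical. It assumes the checkpoint path is on a disk managed by the cluster.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical state the application needs for a clean restart. */
struct app_state {
    unsigned long last_committed_id;
    char          phase[16];
};

/* Replaces 'to' with 'from'. The C runtime rename() on Windows fails when
   the target exists, so MoveFileEx is used there instead. */
static int replace_file(const char *from, const char *to)
{
#ifdef _WIN32
    return MoveFileExA(from, to,
                       MOVEFILE_REPLACE_EXISTING | MOVEFILE_WRITE_THROUGH)
               ? 0 : -1;
#else
    return rename(from, to) == 0 ? 0 : -1;
#endif
}

/* Writes the state to "<path>.tmp", then swaps it into place.
   Returns 0 on success, -1 on failure (the old checkpoint is left intact). */
int save_checkpoint(const char *path, const struct app_state *state)
{
    char tmp_path[512];
    FILE *f;
    int n, ok;

    assert(path != NULL && state != NULL);

    n = snprintf(tmp_path, sizeof tmp_path, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp_path) {
        return -1;
    }

    f = fopen(tmp_path, "wb");
    if (f == NULL) {
        return -1;
    }
    /* The raw structure is written as-is; this assumes the same build of
       the application runs on every node. */
    ok = fwrite(state, sizeof *state, 1, f) == 1;
    ok = (fflush(f) == 0) && ok;
    ok = (fclose(f) == 0) && ok;
    if (!ok) {
        remove(tmp_path);
        return -1;
    }
    return replace_file(tmp_path, path);
}

/* Loads the last checkpoint. Returns 0 on success, -1 if none is readable,
   in which case the application starts from a clean state. */
int load_checkpoint(const char *path, struct app_state *state)
{
    FILE *f;
    int ok;

    assert(path != NULL && state != NULL);

    f = fopen(path, "rb");
    if (f == NULL) {
        return -1;
    }
    ok = fread(state, sizeof *state, 1, f) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

On startup after a failover, the application would call `load_checkpoint` first and resume from the recovered state. Note that `fflush` only hands the data to the operating system; an application with strict durability needs would also force the write through to disk.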
At least one instance of the application can run as a cluster resource.

Cluster service manages applications as cluster resources. A cluster resource is a physical or logical entity that can be owned by a node, brought online and taken offline, moved between nodes, and managed as a server cluster object. A resource can be owned only by a single node at any point in time. A resource is associated with, and managed by, a resource type.

If the resource supports it, Cluster service can manage multiple instances of the same resource, but it is acceptable to support only one instance.

To take advantage of clustering, your application has to be configured as a cluster resource. Make sure you can create at least one instance of your application. Your application must function properly as a cluster resource. Cluster service must be used to start your application (bring it online) and stop it (take it offline).
Can be configured at least as a generic service or application.

The monitoring and failover capabilities of Cluster service can be extended to support any application. Cluster service uses resource DLLs to extend its failover support to other resource types.

Applications that do not offer an application-specific resource DLL can still take advantage of clustering by using a generic application or generic service resource type. These resource types offer failover protection against most failures, notably node failure. However, they cannot detect your application failures. If your application stops responding, Cluster service will not be able to detect this failure and either restart or fail over your application.
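
To illustrate the kind of check the generic resource types cannot perform, the following C sketch shows a simple heartbeat test that an application-specific resource DLL could run from its health-check entry point (the Resource API calls these LooksAlive and IsAlive). The sketch is not Cluster API code: the `heartbeat` structure and both function names are hypothetical, and how the heartbeat is shared between the application and the resource DLL is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical health record that the server application updates each
   time it completes a unit of work. */
struct heartbeat {
    time_t last_beat;
};

/* Called by the application whenever it makes progress. */
void heartbeat_record(struct heartbeat *hb, time_t now)
{
    assert(hb != NULL);
    hb->last_beat = now;
}

/* Returns 1 if the last heartbeat is recent enough, 0 if the application
   has been silent for longer than max_silence_seconds and may be hung. */
int heartbeat_is_alive(const struct heartbeat *hb, time_t now,
                       double max_silence_seconds)
{
    assert(hb != NULL);
    return difftime(now, hb->last_beat) <= max_silence_seconds;
}
```

A process that is still running but has stopped responding keeps its service or process handle valid, which is why a liveness signal produced by the application itself is needed to detect this case.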

It is acceptable to use a generic application or a generic service type to manage your application as a cluster resource.

How to pretest applications for Cluster service requirements

How to pretest whether your application is cluster ready

If your application's setup is cluster aware, use setup to configure all nodes.

If your application's setup is not cluster aware, install your server application on at least two nodes in the cluster. Use the Cluster Administrator console to create a virtual server and configure your server application as a generic service or application. Use the Cluster Administrator console to move your resource to either node in the cluster. If your application is cluster ready, it should come online on any node in the cluster. Clients should be able to access the service provided by your application no matter which node hosts it.

For certification on Windows Server 2003, Enterprise Edition or Windows Server 2003, Datacenter Edition, repeat this procedure for three-node, four-node, five-node, six-node, seven-node, and eight-node configurations.

How to pretest whether your application supports failover

1. After you have installed the application on all nodes in the cluster, run functionality tests to verify that the application is fully functional and stable.
2. Cause the node running your application to fail so that failover of the application is triggered.

   Following are suggestions for triggering failure:
   - Hardware failure. Simulate by doing a hard reset.
   - Operating system failure. Simulate by breaking in with CTRL+C and then issuing a ".reboot" command from a remote kernel debugger.
   - Application failure. Simulate using the End Process feature in Task Manager or Process Viewer (Pview.exe in the Platform SDK).

   Note that normal shutdown of the computer is not a valid test for failover because the application will have the opportunity to shut down gracefully.
3. Verify that the application restarts on a new node in the cluster.
4. Run functionality tests to verify that all functionality is again available on the new node. The application must have access to all data to which it previously had access.
5. For testing on Windows Server 2003, Enterprise Edition or Windows Server 2003, Datacenter Edition, repeat steps 2 through 4 to verify that the application subsequently fails over to each of the remaining nodes.

How to pretest whether clients you provide survive failure and subsequent restart of the server application

Cause the server application to fail using each of the following scenarios:

Shut down the server application using the normal shutdown sequence and leave all nodes in the cluster running.
Terminate the application process (do not use the normal shutdown sequence), but leave the node running.
Stop the node.

Do not use normal shutdown. The node and application must not have time to exit gracefully. Following are suggestions for various failure modes:
- Hardware failure. Simulate by doing a hard reset.
- Operating system failure. Simulate by breaking in with CTRL+C and then issuing a ".reboot" command from a remote kernel debugger.
- Application failure. Simulate using the End Process feature in Task Manager or Process Viewer (Pview.exe in the Platform SDK).

For each case:

Verify that the client does not stop responding or lose stability when the server application fails.
After the server application is restarted, either on the same node or a different node, verify that the client either:
- Reestablishes the connection with no user intervention and no loss of user data.
  
  --or--
- Prompts the user to retry the connection and the application then establishes the connection.

Note

If you need to manually configure the client to access the server application, configure it to use the virtual server, not the node name.

How to pretest whether clients you provide survive failure without subsequent restart of the server application

Cause the application to fail in a way that does not allow it to restart on the cluster. To do this, you can take the resource offline, or stop all nodes.
Verify that the clients do not stop responding or lose stability.
Verify that the clients notify the user in a reasonable time that the connection to the server application was lost.
Verify that the client can be closed without failing or affecting the stability of the client's workstation, and that the user can preserve data, if appropriate.
