Scenarios for Restoring a Windows HPC Server 2008 R2 Cluster

Updated: June 2012

Applies To: Windows HPC Server 2008 R2

This section provides an overview of the recommended steps to restore a Windows® HPC Server 2008 R2 cluster. Because of the variety of cluster deployment options, including options to configure the head node for high availability in a failover cluster and to use remote servers running Microsoft SQL Server to store the HPC databases, you should use the restoration steps that are appropriate for your cluster configuration. The restoration steps include links to separate topics in this guide that contain detailed recovery procedures.

This topic assumes that you have backed up the data for your Windows HPC cluster, or the full system for the head node or a remote database server, by using one of the methods described in Guidelines for Backing Up a Windows HPC Server 2008 R2 Cluster in this guide.

Important
Some recovery scenarios allow you to restore only cluster configuration data, not the data that is stored in the HPC databases. Recovered configuration data can often allow critical cluster operations to resume, but the backed up HPC database data cannot be used in the recovered cluster in those scenarios.

The following is a list of the high-level scenarios that are described in this section:

  • Single head node scenarios

    • Recover the entire Windows HPC Server cluster

    • Recover a failed head node computer

    • Perform a full-system restore of the head node

    • Recover a failed remote database server

    • Restore SQL Server databases on a remote server

  • High availability head node scenarios

    • Recover the entire Windows HPC Server cluster

    • Replace a single failed head node in the failover cluster

    • Recover a failed remote database server that is part of a SQL Server failover cluster

Single head node scenarios

The following are general recovery steps for several failure scenarios in a Windows HPC Server 2008 R2 cluster that contains a single head node.

Recover the entire Windows HPC Server cluster

Follow these general steps to recover an entire Windows HPC cluster, in the case where an entire site becomes unavailable.

  1. In another site, on a computer that meets the system requirements for the head node of the cluster, perform a clean installation of HPC Pack 2008 R2. For more information, see Deploy the Head Node in the Design and Deployment Guide for Windows HPC Server 2008 R2.

  2. Recover the cluster configuration settings. The exact settings depend on the settings that were previously backed up and the backup method, but they can include cluster data that is stored in shared folders, node templates, and job templates, in addition to custom application programs and service DLLs. For more information, see Recover the Windows HPC Cluster Configuration Settings in this guide.

  3. Redeploy the compute nodes and broker nodes in your cluster by using an appropriate deployment method.

Recover a failed head node computer

Follow these general steps in the case of an unrecoverable hardware failure on the head node computer. The existing HPC databases can be on the head node or on a remote server running SQL Server.

  1. On a new computer that meets the system requirements for the head node of the cluster, perform a clean installation of HPC Pack 2008 R2. For more information, see Deploy the Head Node in the Design and Deployment Guide for Windows HPC Server 2008 R2.

  2. Recover the cluster configuration settings. The exact settings depend on the settings that were previously backed up and the backup method, but they can include cluster data that is stored in shared folders, node templates, and job templates, in addition to custom application programs and service DLLs. For more information, see Recover the Windows HPC Cluster Configuration Settings in this guide.

Perform a full-system restore of the head node

If Windows HPC Server 2008 R2 files or SQL Server databases are corrupt on the head node, use Windows Server Backup to perform a full-system restore.

Important
You can perform a full-system restore only by using backups that you created with Windows Server Backup on the same computer. For more information, see Windows Server Backup.

After performing the full-system restore, synchronize the HPC databases by starting the HPC Job Scheduler service in restore mode and performing additional steps. For more information, see Start the job scheduler in restore mode during a full-system restore.

Recover a failed remote database server

Follow these general steps in the case of a hardware failure on a remote server running SQL Server, where the HPC databases are installed. In this scenario, the head node of the cluster is assumed to be functioning properly.

  1. On a computer that meets the system requirements for SQL Server, install SQL Server 2008 SP1 or later. For more information, consult the documentation for your version of SQL Server.

  2. Restore the HPC databases in SQL Server. The exact steps for restoring the databases depend on the backup method that you used, and the location where you saved the backups. For more information, consult the documentation for the backup solution that you used.

    For example, if you used SQL Server Management Studio to create a backup, you can right-click each database in SQL Server Management Studio, then click Restore to start the database restore process.

  3. Stop the following services on the head node of the Windows HPC Server 2008 R2 cluster: hpcscheduler, hpcmanagement, hpcreporting, hpcsdm, hpcdiagnostics, and hpcdsc.

    Note
    The hpcdsc service is installed only in Windows HPC Server 2008 R2 or later.

    At an elevated command prompt, type the following commands:

    sc config hpcscheduler start= disabled
    sc config hpcmanagement start= disabled
    sc config hpcreporting start= disabled
    sc config hpcsdm start= disabled
    sc config hpcdiagnostics start= disabled
    sc config hpcdsc start= disabled
    net stop hpcscheduler
    net stop hpcmanagement
    net stop hpcreporting
    net stop hpcsdm
    net stop hpcdiagnostics
    net stop hpcdsc
    
  4. If you previously deployed Windows Azure nodes and the state of the nodes has changed after you backed up the HPC databases, use the Windows Azure Management Portal to stop the deployment in the Windows Azure hosted service.

  5. Configure the head node to point to the restored server that is running SQL Server. To do this, you must modify the registry settings that specify the HPC databases. For more information, see Configure the Windows HPC Cluster for New Database Locations.

  6. Synchronize the HPC databases by starting the HPC Job Scheduler service in restore mode and performing additional steps. For a detailed procedure, see Perform an HPC Database Synchronization.

  7. Enable and start the HPC services.

    Note
    The hpcdsc service is installed only in Windows HPC Server 2008 R2 or later.

    At an elevated command prompt, type the following commands:

    sc config hpcscheduler start= auto
    sc config hpcmanagement start= auto
    sc config hpcreporting start= auto
    sc config hpcsdm start= auto
    sc config hpcdiagnostics start= auto
    sc config hpcdsc start= auto
    net start hpcsdm
    net start hpcscheduler
    net start hpcmanagement
    net start hpcreporting
    net start hpcdiagnostics
    net start hpcdsc
    
  8. Restart or, if necessary, redeploy the compute nodes and broker nodes in your cluster by using an appropriate deployment method.

  9. If there are Windows Azure nodes that are online and have a health state of Error, in HPC Cluster Manager, manually stop the Windows Azure nodes.

    Warning
    Ensure that you have already stopped the deployment in the Windows Azure hosted service, as outlined in a previous step. If you do not stop the deployment first in the Windows Azure Management Portal, you will be unable to stop the Windows Azure nodes by using HPC Cluster Manager.

    Note
    If the nodes are deployed by using a node template that includes a policy to start and stop the nodes automatically, you should first edit the node template to configure a policy to start and stop the Windows Azure nodes manually. Then stop the Windows Azure nodes.

    After the Windows HPC cluster reaches a stable state, you can restart the Windows Azure nodes.
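The disable/stop and enable/start command sequences in steps 3 and 7 follow a fixed pattern, so they can be generated instead of typed by hand. The following Python sketch is an illustration only and is not part of HPC Pack. It builds the same commands as text so that you can review them before you run them at an elevated command prompt. The function names and the include_dsc parameter are hypothetical; the service names and their order are taken from the procedure above.

```python
# Service names in the order in which the procedure disables and stops them.
HPC_SERVICES = ["hpcscheduler", "hpcmanagement", "hpcreporting",
                "hpcsdm", "hpcdiagnostics", "hpcdsc"]

# Order in which the procedure starts the services (hpcsdm first).
START_ORDER = ["hpcsdm", "hpcscheduler", "hpcmanagement",
               "hpcreporting", "hpcdiagnostics", "hpcdsc"]


def _select(services, include_dsc):
    """Drop hpcdsc when it is not installed on the head node."""
    return [s for s in services if include_dsc or s != "hpcdsc"]


def stop_commands(include_dsc=True):
    """Return the commands that disable every service, then stop each one."""
    services = _select(HPC_SERVICES, include_dsc)
    disable = [f"sc config {s} start= disabled" for s in services]
    stop = [f"net stop {s}" for s in services]
    return disable + stop


def start_commands(include_dsc=True):
    """Return the commands that re-enable every service, then start each one."""
    enable = [f"sc config {s} start= auto"
              for s in _select(HPC_SERVICES, include_dsc)]
    start = [f"net start {s}" for s in _select(START_ORDER, include_dsc)]
    return enable + start


if __name__ == "__main__":
    # Print the commands for review; nothing is executed.
    print("\n".join(stop_commands()))
```

If the hpcdsc service is not installed on your head node, call the functions with include_dsc=False so that the generated list omits it.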

Restore SQL Server databases on a remote server

Follow these general steps in the case where an HPC database fails or becomes corrupt on a remote server running SQL Server.

  1. Restore the HPC databases in SQL Server. The exact steps for restoring the databases depend on the backup method that you used, and the location where you saved the backups. For more information, consult the documentation for the backup solution that you used.

    For example, if you used SQL Server Management Studio to create a backup, you can right-click each database in SQL Server Management Studio, then click Restore to start the database restore process.

  2. Synchronize the HPC databases by starting the HPC Job Scheduler service in restore mode and performing additional steps. For more information, see Perform an HPC Database Synchronization.
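If your backups are native SQL Server backup files, the restore in step 1 can also be scripted instead of performed in SQL Server Management Studio. The following Python sketch only generates Transact-SQL RESTORE DATABASE statements as text for review. It makes several assumptions that are not stated in this topic: the database names in HPC_DATABASES, one .bak file per database named after the database, and the example backup folder. Verify each of these against your own deployment, and adjust the restore options (for example, adding REPLACE) according to the SQL Server documentation.

```python
# Assumed database names; confirm the names used in your SQL Server instance.
HPC_DATABASES = ["HPCManagement", "HPCScheduler",
                 "HPCReporting", "HPCDiagnostics"]


def restore_statement(database, backup_dir):
    """Build a T-SQL RESTORE statement for one database.

    Assumes the backup file is <backup_dir>\\<database>.bak (hypothetical
    naming convention).
    """
    backup_file = backup_dir.rstrip("\\") + "\\" + database + ".bak"
    # Escape characters that are special inside T-SQL identifiers and literals.
    safe_name = database.replace("]", "]]")
    safe_path = backup_file.replace("'", "''")
    return (f"RESTORE DATABASE [{safe_name}] "
            f"FROM DISK = N'{safe_path}' WITH RECOVERY;")


def restore_script(backup_dir, databases=HPC_DATABASES):
    """Return one RESTORE statement per database, in list order."""
    return [restore_statement(db, backup_dir) for db in databases]


if __name__ == "__main__":
    # Hypothetical backup folder; nothing is executed against SQL Server.
    print("\n".join(restore_script(r"D:\Backups")))
```

The script does not replace step 2. After the databases are restored, you must still synchronize them by starting the HPC Job Scheduler service in restore mode.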

High availability head node scenarios

The following are general recovery steps for several failure scenarios in a Windows HPC Server 2008 R2 cluster that contains a head node configured for high availability in the context of a failover cluster.

Recover the entire Windows HPC Server cluster

Follow these general steps to recover an entire Windows HPC cluster that is configured for high availability of the head node in a failover cluster, in the case where an entire site becomes unavailable. You should also follow these steps if you need to recover the resource groups for the failover cluster.

  1. On computers that meet the system requirements for a high availability configuration of the head node, perform a clean installation of Windows HPC Server 2008 R2 where the head node is configured in a failover cluster. Depending on your requirements, you can choose to install SQL Server for the HPC cluster on the same servers as the head node or on one or more remote servers that are running SQL Server. For more information, see Configuring Windows HPC Server 2008 R2 for High Availability of the Head Node.

    Note
    Configuring SQL Server as a failover cluster is recommended to help ensure the availability of the Windows HPC cluster during scheduled and nonscheduled outages. For more information, see How to: Create a New SQL Server Failover Cluster (Setup).

  2. Recover the cluster configuration settings on the active head node. The exact settings depend on the settings that were previously backed up, but they can include cluster data that is stored in shared folders, in addition to custom application programs and service DLLs. For more information, see Recover the Windows HPC Cluster Configuration Settings in this guide.

  3. Perform the following steps to restore the databases. For a detailed procedure, see To start the job scheduler in restore mode during a database restore on a head node that is configured for high availability in this guide.

    1. Close all instances of HPC Cluster Manager.

    2. Stop and disable the hpcmanagement and hpcreporting services on both head nodes and take offline the four HPC services that are in the resource group for the failover cluster.

      Important
      If you do not stop the HPC services on both head nodes before restoring the databases, database inconsistencies will be reported during the restore operation. You will then need to begin the restoration steps again.

    3. If you previously deployed Windows Azure nodes and the state of the nodes has changed after you backed up the HPC databases, use the Windows Azure Management Portal to stop the deployment in the Windows Azure hosted service.

    4. Restore the HPC databases.

    5. On the first head node on which you will enable and start the HPC services, configure the HPC Job Scheduler service for restore mode.

    6. Enable and start the HPC services on the head node.

  4. Restart or, if necessary, redeploy the compute nodes and broker nodes in your cluster by using an appropriate deployment method.

  5. If there are Windows Azure nodes that are online and have a health state of Error, in HPC Cluster Manager, manually stop the Windows Azure nodes.

    Warning
    Ensure that you have already stopped the deployment in the Windows Azure hosted service, as outlined in a previous step. If you do not stop the deployment first in the Windows Azure Management Portal, you will be unable to stop the Windows Azure nodes by using HPC Cluster Manager.

    Note
    If the nodes are deployed by using a node template that includes a policy to start and stop the nodes automatically, you should first edit the node template to configure a policy to start and stop the Windows Azure nodes manually. Then stop the Windows Azure nodes.

    After the Windows HPC cluster reaches a stable state, you can restart the Windows Azure nodes.
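In the database restore steps above, the hpcmanagement and hpcreporting services must be stopped and disabled on both head nodes before the databases are restored. The following Python sketch shows one possible way to prepare those commands from a single computer by using the remote form of the sc command (sc \\ServerName). It only generates the command text for review. The head node names are hypothetical placeholders, and the sketch does not cover taking the four clustered HPC services offline, which is done in the failover cluster as described in the detailed procedure.

```python
# Hypothetical head node names; replace them with the names in your cluster.
HEAD_NODES = ["HEADNODE1", "HEADNODE2"]

# Services that the procedure stops and disables on both head nodes.
LOCAL_SERVICES = ["hpcmanagement", "hpcreporting"]


def stop_on_head_nodes(nodes=HEAD_NODES, services=LOCAL_SERVICES):
    """Return sc commands that disable and stop each service on each node."""
    commands = []
    for node in nodes:
        for service in services:
            # Disable first so that the service is not restarted automatically.
            commands.append(f"sc \\\\{node} config {service} start= disabled")
            commands.append(f"sc \\\\{node} stop {service}")
    return commands


if __name__ == "__main__":
    # Print the commands for review; nothing is executed.
    print("\n".join(stop_on_head_nodes()))
```

Generating the commands for both nodes in one pass makes it less likely that a service is left running on one head node, which the procedure warns would cause database inconsistencies during the restore.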

Replace a single failed head node in the failover cluster

If a single head node that is configured in a failover cluster no longer functions properly because of a hardware or software failure, the cluster still functions properly by using the remaining head node. However, the failed server needs to be replaced to restore the high availability configuration of the head node. For procedures to evict the failed head node server from the failover cluster, prepare and add the new server to the failover cluster, and install HPC Pack 2008 R2 on the new server, see Replacing a Head Node Configured in a Failover Cluster in Windows HPC Server 2008 R2.

Recover a failed remote database server that is part of a SQL Server failover cluster

If a hardware or database failure occurs on a remote server running SQL Server that is configured as a failover cluster, you can recover the failed server. In this scenario, the high availability head nodes of the cluster are otherwise assumed to be functioning properly. To recover a server that is running SQL Server, consult the documentation for your edition of SQL Server.

Additional references