Perform an HPC Database Synchronization

Updated: July 2011

Applies To: Windows HPC Server 2008 R2

Windows® HPC Server 2008 R2 includes a restore mode for the HPC Job Scheduler service, which can be configured by setting the Restore registry key to 1. When you restore the HPC databases, you must configure the cluster to enter restore mode before you restart the HPC Job Scheduler service, and follow several other steps to help synchronize the HPC databases and to return the system to a stable state.

Tip
In a cluster running at least Windows HPC Server 2008 R2 with Service Pack 2 (SP2), you can configure the cluster to enter restore mode by running the Set-HPCClusterProperty HPC PowerShell command with the -RestoreMode parameter, instead of manually setting a registry key.
Important
Because of the variety of cluster deployment options, including options to configure the head node for high availability in a failover cluster, and the different restoration scenarios that are possible, you should use the steps to start the HPC Job Scheduler service in restore mode that are appropriate for your cluster configuration and situation.

In this section:

  • Overview: Bringing the cluster to a consistent state during a database restore

  • What happens when the HPC Job Scheduler service starts in restore mode?

  • Start the job scheduler in restore mode

    • To start the job scheduler in restore mode during a full-system restore

    • To start the job scheduler in restore mode during a database restore on a single head node

    • To start the job scheduler in restore mode during a database restore on a head node that is configured for high availability in a failover cluster

  • Verify the restore operations and bring the cluster to a stable state

  • Filter and sort the job list to see the jobs that were canceled during restore mode

  • Delete the message queue on WCF broker nodes

Overview: Bringing the cluster to a consistent state during a database restore

After you restore HPC databases from a backup, the job queue in the restored databases will not be consistent with what is running on the cluster. The databases will contain jobs in the state they were in when the backup was made. Many of those jobs may already have finished. Additionally, the compute nodes may be running jobs that were submitted after the backup was made. The restored databases will have no records for these jobs.

When you restore the HPC databases, you need to perform additional steps to help return the cluster to a consistent state. The following procedures describe these additional steps. The exact steps for restoring your system or your databases depend on your backup and restore solution (for example, Windows Server Backup, SQL Server Backup, Data Protection Manager, or non-Microsoft solutions).

To restore the HPC databases, you need to:

  • Have backups of the HPC databases.

  • Understand what the HPC Job Scheduler service does in restore mode.

  • Know the steps for restoring the databases according to the backup method that you used, and know the location where you saved the backups. For more information, consult the documentation for the backup solution that you used.

  • Start the HPC Job Scheduler service in restore mode.

  • Verify the restore operations and bring the cluster to a stable state after a database restore.

  • Decide how to handle jobs that were canceled by the HPC Job Scheduler service during restore mode.

  • Delete the message queue on the Windows Communication Foundation (WCF) broker nodes.

What happens when the HPC Job Scheduler service starts in restore mode?

Every time the HPC Job Scheduler service restarts, it checks the Restore registry key. If the key has a value of 1, the HPC Job Scheduler service starts in restore mode. To help bring the system to a consistent state, the service then cancels all jobs in the database that are in the Submitted, Validating, Queued, or Running states. The service also stops all tasks that are actually running on the compute nodes. Because nodes periodically send status information to the head node about the jobs and tasks that they are running, even tasks that do not have records in the database are stopped.

In Event Viewer, you can see warning events from the SchedulerService that indicate that the service has entered restore mode, how many jobs were canceled in each state, and that the restore is complete. You will also see a warning event for each unrecognized task that the HPC Job Scheduler service stopped (tasks that were running on the cluster, which the restored database does not have records for).

After the HPC Job Scheduler service completes the restore mode steps, it clears the Restore key in the registry, writes a warning event to the system event log to indicate that the restore is complete, and then starts scheduling jobs again. This means that if users submit jobs right after the restore, the HPC Job Scheduler service will attempt to run them.

At this point, the HPC job scheduling database contains three categories of jobs:

  • Jobs that were Finished, Canceled, or Configuring when the backup was made. These jobs have not been changed.

  • Jobs that were Submitted, Validating, Queued, or Running when the backup was made. These jobs are now Canceled.

  • New jobs, in any state, that users submitted after the HPC Job Scheduler service completed the restore mode steps.
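
The cancellation rule described in this section can be modeled in a few lines. The following Python sketch is illustrative only and is not the HPC Job Scheduler service's implementation. The function name apply_restore_mode and the input shapes (a dictionary of job IDs to states, and sets of task IDs) are assumptions made for the example.

```python
from collections import Counter

# States that are canceled when the service starts in restore mode.
CANCELED_ON_RESTORE = {"Submitted", "Validating", "Queued", "Running"}


def apply_restore_mode(db_jobs, db_tasks, reported_tasks):
    """Model the restore-mode pass (hypothetical helper, not a product API).

    db_jobs:        {job_id: state} as stored in the restored database
    db_tasks:       set of task IDs that the restored database has records for
    reported_tasks: set of task IDs that nodes report as running
    Returns (new_job_states, canceled_counts, unrecognized_tasks).
    """
    new_states = {}
    canceled = Counter()
    for job_id, state in db_jobs.items():
        if state in CANCELED_ON_RESTORE:
            canceled[state] += 1
            new_states[job_id] = "Canceled"
        else:
            # Finished, Canceled, and Configuring jobs are left unchanged.
            new_states[job_id] = state
    # Every reported task is stopped; the ones without a database record
    # are the "unrecognized" tasks that each get their own warning event.
    unrecognized = sorted(reported_tasks - db_tasks)
    return new_states, dict(canceled), unrecognized


if __name__ == "__main__":
    jobs = {1: "Finished", 2: "Running", 3: "Queued", 4: "Configuring"}
    print(apply_restore_mode(jobs, {"2.1"}, {"2.1", "27.137"}))
```

The per-state counts correspond to the "jobs were canceled during restore" warning events, and the unrecognized list corresponds to the per-task warning events.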

Start the job scheduler in restore mode

When you restore the HPC databases, you must set the Restore registry key before restarting the HPC Job Scheduler service. The following procedures are available:

  • To start the job scheduler in restore mode during a full-system restore

  • To start the job scheduler in restore mode during a database restore on a single head node

  • To start the job scheduler in restore mode during a database restore on a head node that is configured for high availability in a failover cluster

To start the job scheduler in restore mode during a full-system restore

  1. After you perform the full-system restore on the head node, start the head node in safe mode.

  2. Set the Restore registry key to indicate to the HPC Job Scheduler service that it should enter restore mode when it restarts by doing one of the following:

    • In a cluster running at least Windows HPC Server 2008 R2 with SP2, run the Set-HPCClusterProperty PowerShell command with the -RestoreMode parameter:

      1. Start HPC PowerShell. Click Start, point to All Programs, click Microsoft HPC Pack 2008 R2, right-click HPC PowerShell, and then click Run as administrator.

      2. Type the following command:

        Set-HPCClusterProperty -RestoreMode:$true
        
    • Otherwise, manually set the registry key by typing the following command at an elevated command prompt:

      reg add HKLM\Software\Microsoft\HPC /v Restore /t REG_DWORD /d 1 /f
      
      Caution
      Incorrectly editing the registry can cause serious damage to your system. Before you make changes to the registry, back up any important data on the computer.
  3. Restart the head node in normal mode.

  4. Continue to Verify the restore operations and bring the cluster to a stable state.

To start the job scheduler in restore mode during a database restore on a single head node

  1. Close all instances of HPC Cluster Manager.

    Caution
    Do not continue if HPC Cluster Manager is running. If you restore the HPC databases while HPC Cluster Manager is open, you may not be able to perform node operations in HPC Cluster Manager after you restore the databases.
  2. On the head node, stop and disable the HPC services as follows:

    • Open an elevated Command Prompt window.

      To open an elevated Command Prompt window, click Start, click All Programs, click Accessories, right-click Command Prompt, and then click Run as administrator.

    • At the elevated command prompt, type the following commands to stop and disable the HPC services:

      sc config hpcscheduler start= disabled
      sc config hpcmanagement start= disabled
      sc config hpcreporting start= disabled
      sc config hpcsdm start= disabled
      sc config hpcdiagnostics start= disabled
      net stop hpcscheduler
      net stop hpcmanagement
      net stop hpcreporting
      net stop hpcsdm
      net stop hpcdiagnostics
      
  3. If you previously deployed Windows Azure nodes and the state of the nodes changed after you backed up the HPC databases, use the Windows Azure Management Portal to stop the deployment in the Windows Azure hosted service.

    To stop the deployment in Windows Azure

    1. In the navigation pane of the Management Portal, click Hosted Services, Storage Accounts & CDN.

    2. Click Hosted Services.

    3. In the group for the hosted service that you used to deploy the Windows Azure nodes, click the deployment. This has the name Deployment for <HostedServiceName>, where <HostedServiceName> is the name of the hosted service.

    4. On the ribbon, in the Deployments group, click Stop. The status of the service deployment changes to Stopped.

  4. If you have not already done so, restore and replace (overwrite) the HPC databases. The exact steps for restoring the databases depend on the backup method that you used, and the location where you saved the backups. For more information, consult the documentation for your backup solution.

    For example, if you used SQL Server Management Studio to create a backup, you can right-click each database in SQL Server Management Studio, then click Restore to start the database restore process.

  5. Set the Restore registry key to indicate to the HPC Job Scheduler service that it should enter restore mode when it restarts by doing one of the following:

    • In a cluster running at least Windows HPC Server 2008 R2 with SP2, run the Set-HPCClusterProperty PowerShell command with the -RestoreMode parameter:

      1. Start HPC PowerShell. Click Start, point to All Programs, click Microsoft HPC Pack 2008 R2, right-click HPC PowerShell, and then click Run as administrator.

      2. Type the following command:

        Set-HPCClusterProperty -RestoreMode $true
        
    • Otherwise, manually set the registry key by typing the following command at an elevated command prompt:

      reg add HKLM\Software\Microsoft\HPC /v Restore /t REG_DWORD /d 1 /f
      
      Caution
      Incorrectly editing the registry can cause serious damage to your system. Before you make changes to the registry, back up any important data on the computer.
  6. Enable and start the HPC services by typing the following commands at an elevated command prompt:

    sc config hpcscheduler start= auto
    sc config hpcmanagement start= auto
    sc config hpcreporting start= auto
    sc config hpcsdm start= auto
    sc config hpcdiagnostics start= auto
    net start hpcsdm
    net start hpcscheduler
    net start hpcmanagement
    net start hpcreporting
    net start hpcdiagnostics
    
    Note
    • It may take more than 30 seconds to start the hpcscheduler service. If this happens, you may see a timeout error message. This message is only informational, and it can be safely ignored.

    • After the hpcscheduler service starts, it clears the Restore key in the registry.

  7. Continue to Verify the restore operations and bring the cluster to a stable state.
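
The ordering in this procedure matters: the Restore key must be set after the services are stopped and before the hpcscheduler service starts again. As a rough aid, the following Python sketch assembles the commands from steps 2, 5, and 6 into one ordered list without running anything. The function name restore_command_plan is an assumption made for the example, and the registry method is used for step 5.

```python
# Service order for "sc config" and "net stop", copied from the procedure.
SERVICES = ["hpcscheduler", "hpcmanagement", "hpcreporting", "hpcsdm", "hpcdiagnostics"]
# "net start" order from step 6 (hpcsdm is started first).
START_ORDER = ["hpcsdm", "hpcscheduler", "hpcmanagement", "hpcreporting", "hpcdiagnostics"]
REG_ADD = r"reg add HKLM\Software\Microsoft\HPC /v Restore /t REG_DWORD /d 1 /f"


def restore_command_plan():
    """Return the commands for steps 2, 5, and 6 as a flat, ordered list.

    This is a dry-run helper (hypothetical): it only builds strings.
    """
    plan = [f"sc config {svc} start= disabled" for svc in SERVICES]
    plan += [f"net stop {svc}" for svc in SERVICES]
    # Steps 3 and 4 (stopping Azure deployments, restoring the databases)
    # happen here and depend on your environment and backup solution.
    plan.append(REG_ADD)
    plan += [f"sc config {svc} start= auto" for svc in SERVICES]
    plan += [f"net start {svc}" for svc in START_ORDER]
    return plan


if __name__ == "__main__":
    for number, command in enumerate(restore_command_plan(), start=1):
        print(f"{number:2d}. {command}")
```

Printing the plan first makes it easy to review the sequence before typing the commands at an elevated command prompt.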

To start the job scheduler in restore mode during a database restore on a head node that is configured for high availability in a failover cluster

  1. Close all instances of HPC Cluster Manager.

    Caution
    Do not continue if HPC Cluster Manager is running. If you restore the HPC databases while HPC Cluster Manager is open, you may not be able to perform node operations in HPC Cluster Manager after you restore the databases.
  2. Stop and disable the HPC services by doing the following:

    • In Failover Cluster Manager, in the resource group for the failover cluster, take the following resources offline: hpcscheduler, hpcsdm, hpcdiagnostics, and hpcsession.

    • On each head node computer, open an elevated Command Prompt window. To open an elevated Command Prompt window, click Start, click All Programs, click Accessories, right-click Command Prompt, and then click Run as administrator.

  3. At the elevated command prompt, type the following commands:

    sc config hpcmanagement start= disabled
    sc config hpcreporting start= disabled
    net stop hpcmanagement
    net stop hpcreporting
    
  4. If you previously deployed Windows Azure nodes and the state of the nodes changed after you backed up the HPC databases, use the Windows Azure Management Portal to stop the deployment in the Windows Azure hosted service.

    To stop the deployment in Windows Azure

    1. In the navigation pane of the Management Portal, click Hosted Services, Storage Accounts & CDN.

    2. Click Hosted Services.

    3. In the group for the hosted service that you used to deploy the Windows Azure nodes, click the deployment. This has the name Deployment for <HostedServiceName>, where <HostedServiceName> is the name of the hosted service.

    4. On the ribbon, in the Deployments group, click Stop. The status of the service deployment changes to Stopped.

  5. If you have not already done so, restore and replace (overwrite) the HPC databases. The exact steps for restoring the databases depend on the backup method that you used, and the location where you saved the backups. For more information, consult the documentation for the backup solution that you used.

    For example, if you used SQL Server Management Studio to create a backup, you can right-click each database in SQL Server Management Studio, then click Restore to start the database restore process.

  6. On the first head node on which you will enable and start the HPC services, set the Restore registry key to indicate to the HPC Job Scheduler service that it should enter restore mode when it restarts by doing one of the following:

    • If your cluster is running at least Windows HPC Server 2008 R2 with SP2, run the Set-HPCClusterProperty PowerShell command with the -RestoreMode parameter:

      1. Start HPC PowerShell. Click Start, point to All Programs, click Microsoft HPC Pack 2008 R2, right-click HPC PowerShell, and then click Run as administrator.

      2. Type the following command:

        Set-HPCClusterProperty -RestoreMode:$true
        
    • Otherwise, manually set the registry key by typing the following command at an elevated command prompt:

      reg add HKLM\Software\Microsoft\HPC /v Restore /t REG_DWORD /d 1 /f
      
      Caution
      Incorrectly editing the registry can cause serious damage to your system. Before you make changes to the registry, back up any important data on the computer.
  7. Make the head node on which you set the Restore registry key the active head node in the failover cluster.

  8. Enable and start the HPC services on the active head node by doing the following:

    1. In Failover Cluster Manager, in the resource group for the failover cluster, bring the following resources online: hpcscheduler, hpcsdm, hpcdiagnostics, and hpcsession.

    2. At an elevated command prompt on each head node computer (starting first on the active head node), type the following commands:

      sc config hpcmanagement start= auto
      sc config hpcreporting start= auto
      net start hpcmanagement
      net start hpcreporting
      
    Note
    • It may take more than 30 seconds to start the hpcscheduler service. If this happens, you may see a timeout error message. This message is only informational, and it can be safely ignored.

    • After the hpcscheduler service starts, it clears the Restore key in the registry.

  9. Continue to Verify the restore operations and bring the cluster to a stable state.
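
In the high-availability case, step 8 splits the work between resources that are brought online in Failover Cluster Manager and services that are enabled locally on each head node, starting with the active one. The following Python sketch lays out that split as an ordered plan. It runs nothing, and the function name ha_start_plan and the node names HN1 and HN2 are hypothetical.

```python
# Resources managed through the failover cluster resource group (step 8.1).
CLUSTERED_RESOURCES = ["hpcscheduler", "hpcsdm", "hpcdiagnostics", "hpcsession"]
# Services enabled and started locally on every head node (step 8.2).
LOCAL_SERVICES = ["hpcmanagement", "hpcreporting"]


def ha_start_plan(active_node, head_nodes):
    """Return (where, action) pairs for step 8, active head node first."""
    if active_node not in head_nodes:
        raise ValueError(f"{active_node!r} is not in the list of head nodes")
    ordered = [active_node] + [n for n in head_nodes if n != active_node]
    plan = [("Failover Cluster Manager", f"bring {res} online")
            for res in CLUSTERED_RESOURCES]
    for node in ordered:
        plan += [(node, f"sc config {svc} start= auto") for svc in LOCAL_SERVICES]
        plan += [(node, f"net start {svc}") for svc in LOCAL_SERVICES]
    return plan


if __name__ == "__main__":
    # HN1 and HN2 are placeholder names for the two head node computers.
    for where, action in ha_start_plan("HN2", ["HN1", "HN2"]):
        print(f"{where}: {action}")
```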

Verify the restore operations and bring the cluster to a stable state

The following procedure describes how to check the event log for restore operations and bring the cluster nodes to a stable state.

To verify the restore operations and bring the cluster to a stable state

  1. On the head node, open Event Viewer.

    Click Start, point to Administrative Tools, then click Event Viewer.

  2. In Event Viewer, check the following:

    • Verify that the HPC Job Scheduler service started in restore mode. You should see the following Warning event:

      {Warning} [SchedulerService] The scheduler has started in restore mode.

    • Review the Warning events from the SchedulerService indicating how many jobs were canceled in each state. You should see a list of events similar to the following:

      {Warning} [SchedulerService] 5 Running jobs were canceled during restore.

      {Warning} [SchedulerService] 5 Queued jobs were canceled during restore.

      {Warning} [SchedulerService] 1 Submitted jobs were canceled during restore.

      {Warning} [SchedulerService] 0 Validating jobs were canceled during restore.

    • Review the Warning events for each unrecognized task that the HPC Job Scheduler service stopped (tasks that were running on the cluster, which the restored database does not have records for). You should see a list of events similar to the following:

      {Warning} [RC] Task 27.137 is not running node R25-1234A1234 any more. Tries to cancel it

    • Verify that the HPC Job Scheduler service completed the restore mode steps. You should see the following Warning event:

      {Warning} [SchedulerService] Scheduler restore complete.

  3. Restart all the compute nodes, broker nodes, and workstation nodes. If you have not made any configuration changes since the backup, such as node deployment changes, then the restored database should still be aware of all the nodes that are joined to the cluster. If that is the case, you can use the clusrun command to restart all the nodes. If you made configuration changes since the last backup, you may need to manually restart the nodes. In a disaster recovery situation, you will need to redeploy the nodes by using an appropriate deployment method.

    To restart all the nodes using the clusrun command, at an elevated command prompt, type:

    clusrun /all shutdown -r
    

    The -r parameter of the shutdown command indicates that the computers should restart after shutting down.

    Important
    If the head node has the compute node role or the broker node role enabled, running this command also restarts the head node. If you do not want to do this, you should, at a minimum, restart the hpcmanagement and hpcnodemanager services on the other cluster nodes.
  4. If there are Windows Azure nodes that are online and have a health state of Error, in HPC Cluster Manager, manually stop the Windows Azure nodes.

    Warning
    Ensure that you have already stopped the deployment in the Windows Azure hosted service, as outlined in a previous step. If you do not stop the deployment first in the Windows Azure Management Portal, you will be unable to stop the Windows Azure nodes by using HPC Cluster Manager.
    Note
    If the nodes are deployed by using a node template that includes a policy to start and stop the nodes automatically, you should first edit the node template to configure a policy to start and stop the Windows Azure nodes manually. Then stop the Windows Azure nodes.

    When the Windows HPC cluster reaches a stable state, you can restart the Windows Azure nodes.

  5. Verify the health of your cluster.

    Open HPC Cluster Manager, and run all the diagnostic tests. Go to Node Management to check the state and health of your compute nodes. If you see any errors or warnings, use the test result messages and the operations log to help you troubleshoot and resolve issues.

  6. Use the information in the job queue to help you determine whether or not to requeue any of the canceled jobs. For examples of how to sort the job list in HPC Cluster Manager and use HPC PowerShell, see Filter and sort the job list to see the jobs that were canceled during restore mode.
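
The Event Viewer checks in step 2 of this procedure can also be done on exported event text. The following Python sketch assumes that the warning events have been saved one message per line, in the form shown in the examples above. The function name summarize_restore_events is hypothetical, and the exact message wording on your cluster may differ.

```python
import re

STARTED = "The scheduler has started in restore mode."
COMPLETE = "Scheduler restore complete."
# Matches, for example: "[SchedulerService] 5 Running jobs were canceled during restore."
CANCELED_RE = re.compile(
    r"\[SchedulerService\] (\d+) (\w+) jobs were canceled during restore\."
)


def summarize_restore_events(lines):
    """Summarize restore-mode warning events from exported message text."""
    summary = {"started": False, "complete": False, "canceled": {}}
    for line in lines:
        if STARTED in line:
            summary["started"] = True
        elif COMPLETE in line:
            summary["complete"] = True
        else:
            match = CANCELED_RE.search(line)
            if match:
                summary["canceled"][match.group(2)] = int(match.group(1))
    return summary


if __name__ == "__main__":
    sample = [
        "{Warning} [SchedulerService] The scheduler has started in restore mode.",
        "{Warning} [SchedulerService] 5 Running jobs were canceled during restore.",
        "{Warning} [SchedulerService] Scheduler restore complete.",
    ]
    print(summarize_restore_events(sample))
```

If "started" is true but "complete" is false, the service has probably not finished the restore mode steps yet.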

Filter and sort the job list to see the jobs that were canceled during restore mode

After restoring the HPC databases, you need to decide how to handle the jobs that were canceled while the HPC Job Scheduler service was in restore mode. You can use the information in the job queue to help you determine whether to requeue any of the canceled jobs. For example, the following scripts and procedure show how you can sort and filter the job list by using HPC PowerShell or HPC Cluster Manager.

Note
In Windows HPC Server 2008 R2, if jobs appear stuck in the Canceling state after you restore the database, you can force cancel the jobs. For example, to force cancel all jobs that are in the Canceling state, use the following HPC PowerShell command: Get-HpcJob -State Canceling | Stop-HpcJob -Force

To filter and sort the job list in HPC PowerShell

  • To list the jobs that were canceled during restore mode; view only the Owner, Priority, ID, SubmitTime, StartTime, and RunTime job properties; sort the output by Owner, Priority, and Submit time; and view the output in table format, use the following script:

    Get-HpcJob -State Canceled |
    where {$_.Error -like "*The scheduler is in restoration*"} |
    select -Property Owner, Priority, ID, SubmitTime, StartTime, RunTime |
    sort Owner, Priority, SubmitTime |
    ft
    
  • Alternatively, to group the output from the previous script by job owner, use the following script:

    Get-HpcJob -State Canceled |
    where {$_.Error -like "*The scheduler is in restoration*"} |
    select -Property Owner, Priority, ID, SubmitTime, StartTime, RunTime |
    sort Owner, Priority, SubmitTime |
    ft -GroupBy Owner
    
  • To requeue all the canceled jobs that were submitted within the last 24 hours and that have a priority of Highest, use the following script:

    $yesterday = [datetime]::Now.AddDays(-1)
    Get-HpcJob -State Canceled |
    where {($_.SubmitTime -gt $yesterday) -and ($_.Priority -eq "Highest")} |
    Submit-HpcJob
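
If you prefer to analyze the job list outside HPC PowerShell, the same filter, sort, and group logic can be applied to exported job records. The following Python sketch assumes that each job is a dictionary with State, Error, Owner, ID, and SubmitTime keys (for example, loaded from a CSV export). The export format, the helper names, and the sample values are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

# Substring used in the HPC PowerShell filter above.
RESTORE_MARKER = "The scheduler is in restoration"


def canceled_during_restore(jobs):
    """Keep Canceled jobs whose error text mentions restore mode.

    The result is sorted by Owner, then SubmitTime (ISO 8601 strings sort
    chronologically), so that it can be grouped by owner afterwards.
    """
    hits = [
        job for job in jobs
        if job["State"] == "Canceled" and RESTORE_MARKER in job["Error"]
    ]
    return sorted(hits, key=itemgetter("Owner", "SubmitTime"))


def group_ids_by_owner(sorted_jobs):
    """Group already-sorted jobs by owner, returning {owner: [job IDs]}."""
    return {
        owner: [job["ID"] for job in group]
        for owner, group in groupby(sorted_jobs, key=itemgetter("Owner"))
    }


if __name__ == "__main__":
    # Hypothetical sample records.
    sample = [
        {"ID": 11, "Owner": "CONTOSO\\bob", "State": "Canceled",
         "Error": "The scheduler is in restoration mode.",
         "SubmitTime": "2011-07-02T09:00"},
        {"ID": 13, "Owner": "CONTOSO\\alice", "State": "Canceled",
         "Error": "Canceled by user", "SubmitTime": "2011-07-02T11:00"},
    ]
    print(group_ids_by_owner(canceled_during_restore(sample)))
```

As with Format-Table -GroupBy, the grouping step relies on the input being sorted by owner first.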
    

To filter and sort the job list in HPC Cluster Manager

  1. In HPC Cluster Manager, click Job Management.

  2. In Job Management, in the navigation pane, under All Jobs, click Canceled.

  3. Right-click the column headings in the job list, then click Column Chooser.

  4. Use the Column Chooser dialog box to include the following job properties in the list of displayed columns:

    • Error Message

    • Owner

    • Priority

    • Submit Time

    • Start Time

    • Run Time

  5. Click the column headers to sort the job list according to the displayed property values.

  6. Optionally, you can requeue jobs. Select one or more jobs, and then click Requeue Job in the Actions pane.

Delete the message queue on WCF broker nodes

If your HPC cluster contains one or more WCF broker nodes, after you restore the HPC databases, you must delete Message Queuing (also known as MSMQ) on all of the WCF broker nodes. If you do not do this, you may be unable to run service-oriented architecture (SOA) jobs on your cluster.

To delete the message queue on a WCF broker node

  1. On a broker node, start Windows PowerShell Modules as an administrator. Click Start, point to Administrative Tools, right-click Windows PowerShell Modules, and then click Run as administrator.

  2. Type the following script:

    [System.Reflection.Assembly]::LoadWithPartialName("System.Messaging")
    [System.Messaging.MessageQueue]::GetPrivateQueuesByMachine("localhost") | ? {"$($_.FormatName)" -like "*hpc*re*"} | % {[System.Messaging.MessageQueue]::Delete($_.Path)}
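
Because the script above deletes every private queue whose format name matches the wildcard *hpc*re*, it can be useful to reason about what that pattern selects. The following Python sketch applies the same wildcard, case-insensitively as the PowerShell -like operator does, to a list of names. The queue names shown are hypothetical and are not taken from a real broker node.

```python
from fnmatch import fnmatchcase

# Same wildcard as in the PowerShell script above.
QUEUE_PATTERN = "*hpc*re*"


def queues_to_delete(format_names, pattern=QUEUE_PATTERN):
    """Return the names that the wildcard would select, ignoring case."""
    return [
        name for name in format_names
        if fnmatchcase(name.lower(), pattern.lower())
    ]


if __name__ == "__main__":
    # Hypothetical format names, for illustration only.
    names = [
        r"DIRECT=OS:broker1\private$\HPC_Requests_12",
        r"DIRECT=OS:broker1\private$\hpc_responses_12",
        r"DIRECT=OS:broker1\private$\orders",
    ]
    for name in queues_to_delete(names):
        print(name)
```

The pattern requires "hpc" to appear before "re" somewhere in the name, so unrelated private queues on the broker node are left alone unless their names happen to contain both fragments in that order.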
    
