Perform an HPC Database Synchronization

 

Updated: July 29, 2016

Applies To: Microsoft HPC Pack 2008 R2

Windows® HPC Server 2008 R2 includes a restore mode for the HPC Job Scheduler service, which can be configured by setting the Restore registry key to 1. When you restore the HPC databases, you must configure the cluster to enter restore mode before you restart the HPC Job Scheduler service, and follow several other steps to help synchronize the HPC databases and to return the system to a stable state.

Tip


In a cluster running at least Windows HPC Server 2008 R2 with Service Pack 2, you can configure the cluster to enter restore mode by running the Set-HPCClusterProperty HPC PowerShell command with the -RestoreMode parameter, instead of manually setting a registry key.

Important


Because of the variety of cluster deployment options, including options to configure the head node for high availability in a failover cluster, and the different restoration scenarios that are possible, you should use the steps to start the HPC Job Scheduler service in restore mode that are appropriate for your cluster configuration and situation.


After you restore HPC databases from a backup, the job queue in the restored databases will not be consistent with what is running on the cluster. The databases will contain jobs in the state they were in when the backup was made. Many of those jobs may already have finished. Additionally, the compute nodes may be running jobs that were submitted after the backup was made. The restored databases will have no records for these jobs.

When you restore the HPC databases, you need to perform additional steps to help return the cluster to a consistent state. The following procedures describe these additional steps. The exact steps for restoring your system or your databases depend on your backup and restore solution (for example, Windows Server Backup, SQL Server Backup, Data Protection Manager, or non-Microsoft solutions).

To restore the HPC databases, you need to:

  • Have backups of the HPC databases.

  • Understand what the HPC Job Scheduler service does in restore mode.

  • Know the steps for restoring the databases according to the backup method that you used, and know the location where you saved the backups. For more information, consult the documentation for the backup solution that you used.

  • Start the HPC Job Scheduler service in restore mode.

  • Verify the restore operations and bring the cluster to a stable state after a database restore.

  • Decide how to handle jobs that were canceled by the HPC Job Scheduler service during restore mode.

  • Delete the message queue on the Windows Communication Foundation (WCF) broker nodes.

Every time the HPC Job Scheduler service restarts, it checks the Restore registry key. If the key has a value of 1, the service starts in restore mode. To help bring the system to a consistent state, the service then cancels all jobs in the database that are in the Submitted, Validating, Queued, or Running states. It also stops all tasks that are actually running on the compute nodes. Because nodes periodically send status information to the head node about the jobs and tasks that they are running, even tasks that have no records in the database are stopped.

In Event Viewer, you can see warning events from the SchedulerService that indicate that the service has entered restore mode, how many jobs were canceled in each state, and that the restore is complete. You will also see a warning event for each unrecognized task that the HPC Job Scheduler service stopped (tasks that were running on the cluster, but for which the restored database has no records).

After the HPC Job Scheduler service completes the restore mode steps, it clears the Restore key in the registry, writes a warning event to the system event log to indicate that the restore is complete, and then starts scheduling jobs again. This means that if users submit jobs right after the restore, the HPC Job Scheduler service will attempt to run them.
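Because the service clears the Restore value when it finishes the restore mode steps, reading that value before and after the restart is one quick way to see where you are in the process. The following sketch shows one possible check. It assumes that the value is stored under HKLM\Software\Microsoft\HPC (the path used by the reg add command in this topic) and that a missing value means the same as 0. The function name Test-HpcRestoreFlag is hypothetical and is not part of HPC Pack.

```powershell
# Sketch only: Test-HpcRestoreFlag is a hypothetical helper, not an HPC Pack cmdlet.
function Test-HpcRestoreFlag {
    param(
        # Assumed location of the Restore value (same path as the reg add command).
        [string]$Path = 'HKLM:\Software\Microsoft\HPC'
    )
    # Read the value quietly; a missing key or value yields $null.
    $item = Get-ItemProperty -Path $Path -Name Restore -ErrorAction SilentlyContinue
    # Assumption: a missing value means that restore mode is not requested.
    if ($null -eq $item) { return $false }
    return ($item.Restore -eq 1)
}

# Read-only check; it does not change the registry.
Test-HpcRestoreFlag
```

If the assumptions hold, the check should report True after you set the key and before the HPC Job Scheduler service restarts, and False after the service has logged that the restore is complete.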

At this point, the HPC job scheduling database contains three categories of jobs:

  • Jobs that were Finished, Canceled, or Configuring when the backup was made. These jobs have not been changed.

  • Jobs that were Submitted, Validating, Queued, or Running when the backup was made. These jobs are now Canceled.

  • New jobs, in any state, that users submitted after the HPC Job Scheduler service completed the restore mode steps.
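The first two categories follow from a single rule: a job that was in the Submitted, Validating, Queued, or Running state at backup time ends up Canceled, and any other job keeps its state. The following sketch restates that rule as a small function. It is purely illustrative, does not query the cluster, and the function name Get-StateAfterRestore is hypothetical.

```powershell
# Sketch only: restates the restore-mode rule; it does not talk to the scheduler.
function Get-StateAfterRestore {
    param([string]$StateAtBackup)
    # States that the HPC Job Scheduler service cancels in restore mode.
    $canceledByRestore = 'Submitted', 'Validating', 'Queued', 'Running'
    if ($canceledByRestore -contains $StateAtBackup) {
        return 'Canceled'
    }
    # Any other state is left unchanged by restore mode.
    return $StateAtBackup
}

# Show the mapping for a few states named in this topic.
foreach ($state in 'Finished', 'Configuring', 'Queued', 'Running') {
    '{0} -> {1}' -f $state, (Get-StateAfterRestore $state)
}
```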

When you restore the HPC databases, you must set the Restore registry key before restarting the HPC Job Scheduler service. The following procedures are available:

To start the job scheduler in restore mode during a full-system restore

  1. After you perform the full-system restore on the head node, start the head node in safe mode.

  2. Set the Restore registry key to indicate to the HPC Job Scheduler service that it should enter restore mode when it restarts by doing one of the following:

    • In a cluster running at least Windows HPC Server 2008 R2 with SP2, run the Set-HPCClusterProperty PowerShell command with the -RestoreMode parameter:

      1. Start HPC PowerShell. Click Start, point to All Programs, click Microsoft HPC Pack 2008 R2, right-click HPC PowerShell, and then click Run as administrator.

      2. Type the following command:

        Set-HPCClusterProperty -RestoreMode:$true
        
        
    • Otherwise, manually set the registry key by typing the following command at an elevated command prompt:

      reg add HKLM\Software\Microsoft\HPC /v Restore /t REG_DWORD /d 1 /f  
      
      
      Caution


      Incorrectly editing the registry may severely damage your system. Before making changes to the registry, you should back up any valued data on the computer.

  3. Restart the head node in normal mode.

  4. Continue to Verify the restore operations and bring the cluster to a stable state.

To start the job scheduler in restore mode during a database restore on a single head node

  1. Close all instances of HPC Cluster Manager.

    Caution


    Do not continue if HPC Cluster Manager is running. If you restore the HPC databases while HPC Cluster Manager is open, you may not be able to perform node operations in HPC Cluster Manager after you restore the databases.

  2. On the head node, stop and disable the HPC services as follows:

    • Open an elevated Command Prompt window.

      To open an elevated Command Prompt window, click Start, point to All Programs, click Accessories, right-click Command Prompt, and then click Run as administrator.

    • At the elevated command prompt, type the following commands to stop and disable the HPC services:

      sc config hpcscheduler start= disabled  
      sc config hpcmanagement start= disabled  
      sc config hpcreporting start= disabled  
      sc config hpcsdm start= disabled  
      sc config hpcdiagnostics start= disabled  
      net stop hpcscheduler  
      net stop hpcmanagement  
      net stop hpcreporting  
      net stop hpcsdm  
      net stop hpcdiagnostics  
      
      
      
  3. If you previously deployed Windows Azure nodes and the state of the nodes changed after you backed up the HPC databases, use the Windows Azure Management Portal to stop the deployment in the Windows Azure hosted service.

    To stop the deployment in Windows Azure
    1. In the navigation pane of the Management Portal, click Hosted Services, Storage Accounts & CDN.

    2. Click Hosted Services.

    3. In the group for the hosted service that you used to deploy the Windows Azure nodes, click the deployment. This has the name Deployment for <HostedServiceName>, where <HostedServiceName> is the name of the hosted service.

    4. On the ribbon, in the Deployments group, click Stop. The status of the service deployment changes to Stopped.

  4. If you have not already done so, restore and replace (overwrite) the HPC databases. The exact steps for restoring the databases depend on the backup method that you used, and the location where you saved the backups. For more information, consult the documentation for your backup solution.

    For example, if you used SQL Server Management Studio to create a backup, you can right-click each database in SQL Server Management Studio, then click Restore to start the database restore process.

  5. Set the Restore registry key to indicate to the HPC Job Scheduler service that it should enter restore mode when it restarts by doing one of the following:

    • In a cluster running at least Windows HPC Server 2008 R2 with SP2, run the Set-HPCClusterProperty PowerShell command with the -RestoreMode parameter:

      1. Start HPC PowerShell. Click Start, point to All Programs, click Microsoft HPC Pack 2008 R2, right-click HPC PowerShell, and then click Run as administrator.

      2. Type the following command:

        Set-HPCClusterProperty -RestoreMode:$true
        
        
    • Otherwise, manually set the registry key by typing the following command at an elevated command prompt:

      reg add HKLM\Software\Microsoft\HPC /v Restore /t REG_DWORD /d 1 /f  
      
      
      Caution


      Incorrectly editing the registry may severely damage your system. Before making changes to the registry, you should back up any valued data on the computer.

  6. Enable and start the HPC services by typing the following commands at an elevated command prompt:

    sc config hpcscheduler start= auto  
    sc config hpcmanagement start= auto  
    sc config hpcreporting start= auto  
    sc config hpcsdm start= auto  
    sc config hpcdiagnostics start= auto  
    net start hpcsdm  
    net start hpcscheduler  
    net start hpcmanagement  
    net start hpcreporting  
    net start hpcdiagnostics  
    
    
    
    Note

    • It may take more than 30 seconds to start the hpcscheduler service. If this happens, you may see a timeout error message. This message is only informational, and it can be safely ignored.
    • After the hpcscheduler service starts, it clears the Restore key in the registry.
  7. Continue to Verify the restore operations and bring the cluster to a stable state.
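
Before you continue, it can be useful to confirm that the services you re-enabled in step 6 are actually running. The following sketch shows one way to do that from a PowerShell session on the head node. The helper name Get-ServiceNotRunning is hypothetical. The service names are the ones used in the sc and net commands in this procedure.

```powershell
# Sketch only: Get-ServiceNotRunning is a hypothetical helper, not an HPC Pack cmdlet.
function Get-ServiceNotRunning {
    param([object[]]$Service)
    foreach ($s in $Service) {
        # Compare the status as text so that any object with Name and Status works.
        if ("$($s.Status)" -ne 'Running') {
            $s.Name
        }
    }
}

# On the head node, pass in the real service objects, for example:
# $names = 'hpcsdm', 'hpcscheduler', 'hpcmanagement', 'hpcreporting', 'hpcdiagnostics'
# Get-ServiceNotRunning (Get-Service -Name $names)
```

An empty result means that every listed service reports Running. Keep in mind that the hpcscheduler service may take more than 30 seconds to start, so a service that was just started can still appear in the output for a short time.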

To start the job scheduler in restore mode during a database restore on a head node that is configured for high availability in a failover cluster

  1. Close all instances of HPC Cluster Manager.

    Caution


    Do not continue if HPC Cluster Manager is running. If you restore the HPC databases while HPC Cluster Manager is open, you may not be able to perform node operations in HPC Cluster Manager after you restore the databases.

  2. Stop and disable the HPC services by doing the following:

    • In Failover Cluster Manager, in the resource group for the failover cluster, take the following resources offline: hpcscheduler, hpcsdm, hpcdiagnostics, and hpcsession.

    • On each head node computer, open an elevated Command Prompt window. To open an elevated Command Prompt window, click Start, point to All Programs, click Accessories, right-click Command Prompt, and then click Run as administrator.

  3. At the elevated command prompt, type the following commands:

    sc config hpcmanagement start= disabled  
    sc config hpcreporting start= disabled  
    net stop hpcmanagement  
    net stop hpcreporting  
    
    
  4. If you previously deployed Windows Azure nodes and the state of the nodes changed after you backed up the HPC databases, use the Windows Azure Management Portal to stop the deployment in the Windows Azure hosted service.

    To stop the deployment in Windows Azure
    1. In the navigation pane of the Management Portal, click Hosted Services, Storage Accounts & CDN.

    2. Click Hosted Services.

    3. In the group for the hosted service that you used to deploy the Windows Azure nodes, click the deployment. This has the name Deployment for <HostedServiceName>, where <HostedServiceName> is the name of the hosted service.

    4. On the ribbon, in the Deployments group, click Stop. The status of the service deployment changes to Stopped.

  5. If you have not already done so, restore and replace (overwrite) the HPC databases. The exact steps for restoring the databases depend on the backup method that you used, and the location where you saved the backups. For more information, consult the documentation for the backup solution that you used.

    For example, if you used SQL Server Management Studio to create a backup, you can right-click each database in SQL Server Management Studio, then click Restore to start the database restore process.

  6. On the first head node on which you will enable and start the HPC services, set the Restore registry key to indicate to the HPC Job Scheduler service that it should enter restore mode when it restarts by doing one of the following:

    • If your cluster is running at least Windows HPC Server 2008 R2 with SP2, run the Set-HPCClusterProperty PowerShell command with the -RestoreMode parameter:

      1. Start HPC PowerShell. Click Start, point to All Programs, click Microsoft HPC Pack 2008 R2, right-click HPC PowerShell, and then click Run as administrator.

      2. Type the following command:

        Set-HPCClusterProperty -RestoreMode:$true
        
        
    • Otherwise, manually set the registry key by typing the following command at an elevated command prompt:

      reg add HKLM\Software\Microsoft\HPC /v Restore /t REG_DWORD /d 1 /f  
      
      
      Caution


      Incorrectly editing the registry may severely damage your system. Before making changes to the registry, you should back up any valued data on the computer.

  7. Make the head node on which you set the Restore registry key the active head node in the failover cluster.

  8. Enable and start the HPC services on the active head node by doing the following:

    1. In Failover Cluster Manager, in the resource group for the failover cluster, bring the following resources online: hpcscheduler, hpcsdm, hpcdiagnostics, and hpcsession.

    2. At an elevated command prompt on each head node computer (starting first on the active head node), type the following commands:

      
      sc config hpcmanagement start= auto  
      sc config hpcreporting start= auto  
      net start hpcmanagement  
      net start hpcreporting  
      
      
    Note

    • It may take more than 30 seconds to start the hpcscheduler service. If this happens, you may see a timeout error message. This message is only informational, and it can be safely ignored.
    • After the hpcscheduler service starts, it clears the Restore key in the registry.
  9. Continue to Verify the restore operations and bring the cluster to a stable state.

The following procedure describes how to check the event log for restore operations and bring the cluster nodes to a stable state.

To verify the restore operations and bring the cluster to a stable state

  1. On the head node, open Event Viewer.

    Click Start, point to Administrative Tools, then click Event Viewer.

  2. In Event Viewer, check the following:

    • Verify that the HPC Scheduler Service started in restore mode. You should see the following Warning event:

      {Warning} [SchedulerService] The scheduler has started in restore mode.

    • Review the Warning events from the SchedulerService indicating how many jobs were canceled in each state. You should see a list of events similar to the following:

      {Warning} [SchedulerService] 5 Running jobs were canceled during restore.

      {Warning} [SchedulerService] 5 Queued jobs were canceled during restore.

      {Warning} [SchedulerService] 1 Submitted jobs were canceled during restore.

      {Warning} [SchedulerService] 0 Validating jobs were canceled during restore.

    • Review the Warning events for each unrecognized task that the HPC Job Scheduler service stopped (tasks that were running on the cluster, which the restored database does not have records for). You should see a list of events similar to the following:

      {Warning} [RC] Task 27.137 is not running node R25-1234A1234 any more. Tries to cancel it

    • Verify that the HPC Job Scheduler service completed the restore mode steps. You should see the following Warning event:

      {Warning} [SchedulerService] Scheduler restore complete.

  3. Restart all the compute nodes, broker nodes, and workstation nodes. If you have not made any configuration changes since the backup, such as node deployment changes, then the restored database should still be aware of all the nodes that are joined to the cluster. If that is the case, you can use the clusrun command to restart all the nodes. If you made configuration changes since the last backup, you may need to manually restart the nodes. In a disaster recovery situation, you will need to redeploy the nodes by using an appropriate deployment method.

    To restart all the nodes using the clusrun command, at an elevated command prompt, type:

    clusrun /all shutdown -r  
    
    

    The -r parameter of the shutdown command indicates that the computers should restart after shutting down.

    Important


    If the head node has the compute node role or the broker node role enabled, running this command also restarts the head node. If you do not want to do this, you should, at a minimum, restart the hpcmanagement and hpcnodemanager services on the other cluster nodes.

  4. If there are Windows Azure nodes that are online and have a health state of Error, in HPC Cluster Manager, manually stop the Windows Azure nodes.

    Warning


    Ensure that you have already stopped the deployment in the Windows Azure hosted service, as outlined in a previous step. If you do not stop the deployment first in the Windows Azure Management Portal, you will be unable to stop the Windows Azure nodes by using HPC Cluster Manager.

    Note


    If the nodes are deployed by using a node template that includes a policy to start and stop the nodes automatically, you should first edit the node template to configure a policy to start and stop the Windows Azure nodes manually. Then stop the Windows Azure nodes.

    When the Windows HPC cluster reaches a stable state, you can restart the Windows Azure nodes.

  5. Verify the health of your cluster.

    Open HPC Cluster Manager, and run all the diagnostic tests. Go to Node Management to check the state and health of your compute nodes. If you see any errors or warnings, use the test result messages and the operations log to help you troubleshoot and resolve issues.

  6. Use the information in the job queue to help you determine whether or not to requeue any of the canceled jobs. For examples of how to sort the job list in HPC Cluster Manager and use HPC PowerShell, see Filter and sort the job list to see the jobs that were canceled during restore mode.
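
When you review the Warning events in step 2, it can help to total the per-state counts so that you know how many canceled jobs to expect in the job queue. The following sketch parses event text of the form shown in step 2 and adds up the counts. It assumes that the message wording matches the examples in this topic. The function name Get-RestoreCancelSummary is hypothetical, and the sample messages are copied from the examples above rather than read from a live event log.

```powershell
# Sketch only: Get-RestoreCancelSummary is a hypothetical helper.
function Get-RestoreCancelSummary {
    param([string[]]$Message)
    $summary = @{}
    foreach ($line in $Message) {
        # Assumed wording: "<count> <State> jobs were canceled during restore."
        if ($line -match '(\d+) (\w+) jobs were canceled during restore') {
            $summary[$Matches[2]] = [int]$Matches[1]
        }
    }
    $summary
}

# Sample text taken from the example events in this topic.
$sample = @(
    '[SchedulerService] 5 Running jobs were canceled during restore.',
    '[SchedulerService] 5 Queued jobs were canceled during restore.',
    '[SchedulerService] 1 Submitted jobs were canceled during restore.',
    '[SchedulerService] 0 Validating jobs were canceled during restore.'
)
$counts = Get-RestoreCancelSummary -Message $sample
$total = ($counts.Values | Measure-Object -Sum).Sum
"Jobs canceled by restore mode: $total"
```

To use the function with real data, pass in the message text of the SchedulerService warning events, for example text copied from Event Viewer.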

After restoring the HPC databases, you need to decide how to handle the jobs that were canceled while the HPC Job Scheduler service was in restore mode. You can use the information in the job queue to help you determine whether to requeue any of the canceled jobs. For example, the following scripts and procedure show how you can sort and filter the job list by using HPC PowerShell or HPC Cluster Manager.

Note


In Windows HPC Server 2008 R2, if jobs appear stuck in the Canceling state after you restore the database, you can force cancel the jobs. For example, to force cancel all jobs that are in the Canceling state, use the following HPC PowerShell command:

Get-HpcJob -State Canceling | Stop-HpcJob -Force

To filter and sort the job list in HPC PowerShell

  • To list the jobs that were canceled during restore mode; view only the Owner, Priority, ID, SubmitTime, StartTime, and RunTime job properties; sort the output by Owner, Priority, and Submit time; and view the output in table format, use the following script:

    Get-HpcJob -State Canceled |
    where {$_.ErrorMessage -like "*The scheduler is in restoration*"} |
    select -Property Owner, Priority, ID, SubmitTime, StartTime, RunTime |
    sort Owner, Priority, SubmitTime |
    ft
    
    
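  • Alternatively, to group the output from the previous script by job owner, use the following script. The Owner property must be kept in the selection and the output must be sorted by Owner first so that the table can be grouped by that property:

    Get-HpcJob -State Canceled |
    where {$_.ErrorMessage -like "*The scheduler is in restoration*"} |
    select -Property Owner, Priority, ID, SubmitTime, StartTime, RunTime |
    sort Owner, Priority, SubmitTime |
    ft -GroupBy Owner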
    
    
  • To requeue all the canceled jobs that were submitted within the last 24 hours and that have a priority of Highest, use the following script:

    $yesterday = [datetime]::Now.AddDays(-1)
    Get-HpcJob -State Canceled |
    where {($_.SubmitTime -gt $yesterday) -and ($_.Priority -eq "Highest")} |
    Submit-HpcJob
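
Before you requeue anything, a per-owner count of the jobs that restore mode canceled can help you decide whom to notify. The following sketch groups job objects by owner. It assumes that each job exposes the restore message in an ErrorMessage property and an Owner property, which is an assumption about the objects that Get-HpcJob returns. The function name Get-RestoreCanceledCountByOwner is hypothetical.

```powershell
# Sketch only: Get-RestoreCanceledCountByOwner is a hypothetical helper.
function Get-RestoreCanceledCountByOwner {
    param([object[]]$Job)
    $Job |
        # Assumption: the restore message is available in ErrorMessage.
        Where-Object { $_.ErrorMessage -like '*The scheduler is in restoration*' } |
        Group-Object -Property Owner |
        Select-Object Name, Count
}

# On the cluster, pass in the canceled jobs, for example:
# Get-RestoreCanceledCountByOwner (Get-HpcJob -State Canceled)
```

Each row of the result has the owner in the Name column and the number of that owner's jobs that were canceled during restore mode in the Count column.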
    
    

To filter and sort the job list in HPC Cluster Manager

  1. In HPC Cluster Manager, click Job Management.

  2. In Job Management, in the navigation pane, under All Jobs, click Canceled.

  3. Right-click the column headings in the job list, then click Column Chooser.

  4. Use the Column Chooser dialog box to include the following job properties in the list of displayed columns:

    • Error Message

    • Owner

    • Priority

    • Submit Time

    • Start Time

    • Run Time

  5. Click the column headers to sort the job list according to the displayed property values.

  6. Optionally, you can requeue jobs. Select one or more jobs, and then click Requeue Job in the Actions pane.

If your HPC cluster contains one or more WCF broker nodes, then after you restore the HPC databases you must delete the Message Queuing (also known as MSMQ) message queues on all of the WCF broker nodes. If you do not do this, you may be unable to run service-oriented architecture (SOA) jobs on your cluster.

To delete the message queue on a WCF broker node

  1. On a broker node, start Windows PowerShell Modules as an administrator. Click Start, point to Administrative Tools, right-click Windows PowerShell Modules, and then click Run as administrator.

  2. Type the following script:

    [System.Reflection.Assembly]::LoadWithPartialName("System.Messaging")  
    [System.Messaging.MessageQueue]::GetPrivateQueuesByMachine("localhost") | ? {"$($_.FormatName)" -like "*hpc*re*"} | % {[System.Messaging.MessageQueue]::Delete($_.Path)}  
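
Because deleting a queue cannot be undone, you may want to preview which queues the wildcard pattern selects before you run the script. The following sketch applies the same "*hpc*re*" pattern to a list of queue format names and returns the matches without deleting anything. The function name Select-HpcQueueName is hypothetical, and the commented usage lines reuse only the System.Messaging calls that already appear in the script above.

```powershell
# Sketch only: Select-HpcQueueName is a hypothetical helper; it deletes nothing.
function Select-HpcQueueName {
    param([string[]]$FormatName)
    # Same wildcard pattern as the deletion script in this topic.
    $FormatName | Where-Object { $_ -like '*hpc*re*' }
}

# On a broker node, preview the queues that the deletion script would remove:
# [void][System.Reflection.Assembly]::LoadWithPartialName('System.Messaging')
# $names = [System.Messaging.MessageQueue]::GetPrivateQueuesByMachine('localhost') |
#     ForEach-Object { $_.FormatName }
# Select-HpcQueueName $names
```

If the preview lists a queue that you did not expect, investigate it before you run the deletion script.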
    
    
    