Steps to Perform Before and After Restoring the HPC Databases from a Backup

Applies To: Windows HPC Server 2008

Windows® HPC Server 2008 SP1 includes a restore mode for the HPC Job Scheduler service. The restore mode helps return the cluster to a stable, consistent state after the HPC databases are restored from a backup. The databases maintain information about node properties, job templates, the job queue, and other cluster settings and states.

Every time that the HPC Job Scheduler service restarts, it checks the Restore registry key. If the key has a value of 1, then the HPC Job Scheduler service starts in restore mode. When performing a restore of the HPC databases, the administrator must set the Restore registry key before restarting the HPC Job Scheduler service, and follow several other steps to help return the system to a stable state.
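
As a quick check before you restart the service, you can read the flag back from the registry. The following is a minimal sketch, not part of the product: the helper name Test-HpcRestoreFlag is hypothetical, and the key path and value name are assumed to be the ones written by the reg add command shown later in this topic.

```powershell
# Hypothetical helper (not an HPC cmdlet): returns $true only when the
# Restore value exists under the given key and is set to 1.
function Test-HpcRestoreFlag {
    param(
        # Assumption: the same location that the reg add command in this topic writes to.
        [string]$KeyPath = 'HKLM:\Software\Microsoft\HPC'
    )
    # With SilentlyContinue, a missing key or value yields no output instead of an error.
    $item = Get-ItemProperty -Path $KeyPath -Name 'Restore' -ErrorAction SilentlyContinue
    $isSet = ($null -ne $item) -and ($item.Restore -eq 1)
    $isSet
}

Test-HpcRestoreFlag
```

A result of True suggests that the HPC Job Scheduler service will enter restore mode at its next restart. A result of False means that the value is absent or is not 1, which is also what you would expect to see after the service has cleared it.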

In this topic:

  • What happens when the HPC Job Scheduler service starts in restore mode

  • Bringing the cluster to a consistent state during a database restore

  • Filtering and sorting the job list to see the jobs that were canceled during restore mode

What happens when the HPC Job Scheduler service starts in restore mode

After you restore the HPC databases from a backup, the job queue in the restored databases will not be consistent with what is running on the cluster. The databases will contain jobs in the state they were in when the backup was made. Many of those jobs may already have finished. Additionally, the compute nodes may be running jobs that were submitted after the backup was made. The restored databases will have no records for these jobs.

After the HPC Job Scheduler service starts in restore mode, it cancels all jobs in the database that are in the Submitted, Validating, Queued, or Running state, which helps bring the system to a consistent state. The service also stops all tasks that are actually running on the compute nodes. Because the nodes periodically send status information to the head node about the jobs and tasks that they are running, even tasks that have no records in the database are stopped.

In Event Viewer, you can see warning events from the SchedulerService indicating that the service has entered restore mode, how many jobs were canceled in each state, and that the restore is complete. You will also see a warning event for each unrecognized task that the HPC Job Scheduler service stopped (that is, each task that was running on the cluster but has no record in the restored database).

After the HPC Job Scheduler service completes the restore mode steps, it clears the restore key in the registry, writes a warning event to the system event log indicating that the restore is complete, and then starts scheduling jobs again. This means that if users submit jobs right after the restore, the HPC Job Scheduler service will attempt to run them.

At this point, the HPC job scheduling database contains three categories of jobs:

  • Jobs that were Finished, Canceled, or Configuring when the backup was made. These jobs have not been changed.

  • Jobs that were Submitted, Validating, Queued, or Running when the backup was made. These jobs are now Canceled.

  • New jobs, in any state, that users submitted after the HPC Job Scheduler service completed the restore mode steps.
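
The first two categories above amount to a simple mapping from a job's state at backup time to its state after restore mode. The following sketch only models that mapping for illustration; the helper name Get-StateAfterRestore is hypothetical, and the code does not query or modify a real cluster.

```powershell
# Hypothetical helper: maps the state a job had when the backup was made
# to the state it has after the scheduler completes the restore mode steps.
function Get-StateAfterRestore {
    param(
        [Parameter(Mandatory=$true)][string]$StateAtBackup
    )
    # States that this topic says restore mode cancels.
    $canceledByRestore = 'Submitted', 'Validating', 'Queued', 'Running'
    if ($canceledByRestore -contains $StateAtBackup) {
        'Canceled'
    }
    else {
        # Finished, Canceled, and Configuring jobs are left unchanged.
        $StateAtBackup
    }
}

# Show the mapping for a few sample states.
'Finished', 'Configuring', 'Queued', 'Running' | ForEach-Object {
    '{0} -> {1}' -f $_, (Get-StateAfterRestore -StateAtBackup $_)
}
```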

Bringing the cluster to a consistent state during a database restore

When you restore the HPC databases, you must perform additional steps to help return the cluster to a consistent state. The following procedures describe these steps. The exact steps for restoring your system or your databases depend on the backup and restore solution that you chose (for example, Windows Server Backup, SQL Server Backup, Data Protection Manager, or another third-party solution).

To restore the HPC databases, you need to:

  • Have backups of the HPC databases.

  • Understand what the HPC Job Scheduler service does in restore mode.

  • Know the steps for restoring the databases according to the backup method that you used, and the location where you saved the backups. Consult the documentation for the backup method that you used for more information.

  • Start the HPC Job Scheduler service in restore mode.

  • Verify the restore operations and bring the cluster to a stable state after a database restore.

  • Decide how to handle jobs that were canceled by the HPC Job Scheduler service during restore mode.

Start the job scheduler in restore mode

When performing a restore of the HPC databases, the administrator must set the Restore registry key before restarting the HPC Job Scheduler service. The first procedure describes how to start the scheduler in restore mode if you are restoring the databases as part of a full-system restore. The second procedure describes how to start the scheduler in restore mode if you are restoring only the databases.

To start the job scheduler in restore mode during a full-system restore

  1. After you perform the full-system restore, start the head node in safe mode.

  2. Set the Restore registry key to indicate to the HPC Job Scheduler service that it should enter restore mode when it restarts. At an elevated command prompt, type:

    reg add HKLM\Software\Microsoft\HPC /v Restore /t REG_DWORD /d 1 /f
    

    Important

    Incorrectly editing the registry may severely damage your system. Before making changes to the registry, you should back up any valued data on the computer.

  3. Restart the head node in normal mode.

  4. Continue to "Verify the restore operations and bring the cluster to a stable state" in this topic.

To start the job scheduler in restore mode during a database restore

  1. Close all instances of HPC Cluster Manager.

  2. Open an elevated command prompt window.

    To open an elevated Command Prompt window, click Start, point to All Programs, click Accessories, right-click Command Prompt, and then click Run as administrator.

  3. Stop and disable the HPC services.

    At the elevated command prompt window, type the following commands:

    sc config hpcscheduler start= disabled
    sc config hpcmanagement start= disabled
    sc config hpcreporting start= disabled
    sc config hpcsdm start= disabled
    net stop hpcscheduler
    net stop hpcmanagement
    net stop hpcreporting
    net stop hpcsdm
    
  4. Restore and replace (overwrite) the HPC databases. The exact steps for restoring the databases depend on the backup method that you used, and the location where you saved the backups. Consult the documentation for the backup method that you used for more information.

    For example, if you used SQL Server Management Studio to create a backup, you can right-click each database in SQL Server Management Studio, then click Restore to start the database restore process.

  5. Set the restore registry key to indicate to the HPC Job Scheduler service that it should enter restore mode when it restarts. At the elevated command prompt, type:

    reg add HKLM\Software\Microsoft\HPC /v Restore /t REG_DWORD /d 1 /f
    

    Important

    Incorrectly editing the registry may severely damage your system. Before making changes to the registry, you should back up any valued data on the computer.

  6. Enable and start the HPC services. At the elevated command prompt, type:

    sc config hpcscheduler start= auto
    sc config hpcmanagement start= auto
    sc config hpcreporting start= auto
    sc config hpcsdm start= auto
    net start hpcsdm
    net start hpcscheduler
    net start hpcmanagement
    net start hpcreporting
    
  7. Continue to "Verify the restore operations and bring the cluster to a stable state" in this topic.
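
Before you overwrite the databases in step 4 of the preceding procedure, it can be worth confirming that the services from step 3 have actually stopped. The following is a sketch under stated assumptions: the helper name Get-ServiceNotStopped is hypothetical, the service names are the ones used in the sc and net commands above, and the helper only inspects the Name and Status properties of whatever objects it is given.

```powershell
# Service names taken from the sc config / net stop commands in this topic.
$hpcServices = 'hpcsdm', 'hpcscheduler', 'hpcmanagement', 'hpcreporting'

# Hypothetical helper: returns the names of the given services whose
# Status is anything other than Stopped.
function Get-ServiceNotStopped {
    param(
        [object[]]$Service
    )
    foreach ($s in $Service) {
        if ($s -and ([string]$s.Status -ne 'Stopped')) {
            $s.Name
        }
    }
}

# On the head node, from an elevated PowerShell window, you could call:
#   Get-ServiceNotStopped -Service (Get-Service -Name $hpcServices)
```

If the call returns no names, all four services report a status of Stopped. Any name that is returned identifies a service that you may want to stop before you restore the databases.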

Verify the restore operations and bring the cluster to a stable state

The following procedure describes how to check the event log for restore operations and bring the cluster to a stable state.

To verify the restore operations and bring the cluster to a stable state

  1. On the head node, open the Event Viewer.

    Click Start, point to Administrative Tools, then click Event Viewer.

  2. In Event Viewer, check the following:

    • Verify that the HPC Job Scheduler service started in restore mode. You should see the following Warning event:

      {Warning} [SchedulerService] The scheduler has started in restore mode.

    • Review the Warning events from the SchedulerService indicating how many jobs were canceled in each state. You should see a list of events similar to the following:

      {Warning} [SchedulerService] 5 Running jobs were canceled during restore.

      {Warning} [SchedulerService] 5 Queued jobs were canceled during restore.

      {Warning} [SchedulerService] 1 Submitted jobs were canceled during restore.

      {Warning} [SchedulerService] 0 Validating jobs were canceled during restore.

    • Review the Warning events for each unrecognized task that the HPC Job Scheduler service stopped (that is, each task that was running on the cluster but has no record in the restored database). You should see a list of events similar to the following:

      {Warning} [RC] Task 27.137 is not running node R25-1234A1234 any more. Tries to cancel it

    • Verify that the HPC Job Scheduler service completed the restore mode steps. You should see the following Warning event:

      {Warning} [SchedulerService] Scheduler restore complete.

  3. Reboot all the compute nodes. If you have not made any configuration changes since the backup (for example, node deployment changes), the restored database should still be aware of all the nodes that are joined to the cluster, and you can use the clusrun command to reboot all the nodes. If you made configuration changes since the last backup, you may need to restart the nodes manually. To reboot the nodes by using clusrun, at an elevated command prompt, type:

    clusrun /all shutdown -r
    

    The -r parameter in the shutdown command indicates that the computers should restart after shutting down.

  4. Verify the health of your cluster.

    Open HPC Cluster Manager, and run all diagnostic tests. Go to Node Management to check the state and health of your compute nodes. If you see any errors or warnings, use the test result messages and the operations log to help you troubleshoot and resolve issues.

  5. Use the information in the job queue to help you determine whether to requeue any of the canceled jobs. For examples of how to sort the job list by using HPC Cluster Manager or HPC PowerShell, see the next section in this topic.
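
The per-state counts that you review in step 2 can be totaled with a short script once you have the event messages as text. The following is a minimal sketch: the helper name Get-RestoreCancelSummary is hypothetical, the message format is assumed to match the sample events shown in this topic, and the sketch does not read the event log itself (copy or export the messages from Event Viewer first).

```powershell
# Hypothetical helper: extracts the per-state cancel counts from
# SchedulerService warning messages and ignores all other messages.
function Get-RestoreCancelSummary {
    param(
        [Parameter(Mandatory=$true)][string[]]$Message
    )
    foreach ($line in $Message) {
        if ($line -match '(\d+) (\w+) jobs were canceled during restore') {
            New-Object PSObject -Property @{
                State    = $Matches[2]
                Canceled = [int]$Matches[1]
            }
        }
    }
}

# Sample messages copied from the examples in this topic.
$sample = @(
    '{Warning} [SchedulerService] The scheduler has started in restore mode.',
    '{Warning} [SchedulerService] 5 Running jobs were canceled during restore.',
    '{Warning} [SchedulerService] 5 Queued jobs were canceled during restore.',
    '{Warning} [SchedulerService] 1 Submitted jobs were canceled during restore.',
    '{Warning} [SchedulerService] 0 Validating jobs were canceled during restore.',
    '{Warning} [SchedulerService] Scheduler restore complete.'
)

$summary = @(Get-RestoreCancelSummary -Message $sample)
$summary | Format-Table State, Canceled -AutoSize
($summary | Measure-Object -Property Canceled -Sum).Sum
```

The last line prints the total number of jobs that restore mode canceled, which you can compare against the number of jobs in the Canceled list that carry the restore error message.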

Filtering and sorting the job list to see the jobs that were canceled during restore mode

After restoring the HPC databases, you need to decide how to handle the jobs that were canceled while the HPC Job Scheduler service was in restore mode. You can use the information in the job queue to help you determine whether to requeue any of the canceled jobs. For example, the following scripts and procedure show how you can sort and filter the job list by using either HPC PowerShell or HPC Cluster Manager.

To filter and sort the job list in HPC PowerShell

  • To list the jobs that were canceled during restore mode in table format, showing only the Owner, Priority, ID, SubmitTime, StartTime, and RunTime job properties and sorted by Owner, Priority, and SubmitTime, use the following script:

    Get-HpcJob -State Canceled |
    where {$_.Error -like "*The scheduler is in restoration*"} |
    select -Property Owner, Priority, ID, SubmitTime, StartTime, RunTime |
    sort Owner, Priority, SubmitTime |
    ft
    
  • Alternatively, to group the output from the previous script by job owner, use the following script:

    # Sort by Owner first so that each owner's jobs form a single group.
    Get-HpcJob -State Canceled |
    where {$_.Error -like "*The scheduler is in restoration*"} |
    sort Owner, Priority, SubmitTime |
    ft -GroupBy Owner -Property Priority, ID, SubmitTime, StartTime, RunTime
    
  • To requeue all the canceled jobs that were submitted within the last 24 hours and that have a priority of Highest, regardless of why they were canceled, use the following script:

    $yesterday = [datetime]::Now.AddDays(-1)
    Get-HpcJob -State Canceled |
    where {($_.SubmitTime -gt $yesterday) -and ($_.Priority -eq "Highest")} |
    Submit-HpcJob
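
To preview only the jobs that restore mode canceled within a given time window before you requeue anything, the error filter and the time filter from the scripts above can be combined into one reusable filter. The following is a sketch, not a definitive implementation: the function name Select-RestoreCanceledJob is hypothetical, and the Error and SubmitTime property names are assumed to match the ones used in the scripts in this topic.

```powershell
# Hypothetical filter: passes through only jobs that were canceled by
# restore mode and that were submitted after the given point in time.
function Select-RestoreCanceledJob {
    param(
        [Parameter(Mandatory=$true, ValueFromPipeline=$true)]$Job,
        # Default window: the last 24 hours.
        [datetime]$Since = [datetime]::Now.AddDays(-1)
    )
    process {
        $canceledByRestore = $Job.Error -like '*The scheduler is in restoration*'
        if ($canceledByRestore -and ($Job.SubmitTime -gt $Since)) {
            $Job
        }
    }
}

# In HPC PowerShell, you could preview the matching jobs with:
#   Get-HpcJob -State Canceled | Select-RestoreCanceledJob | ft ID, Owner, Priority, SubmitTime
```

Reviewing the preview first lets you confirm the list before you pipe the same output to Submit-HpcJob.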
    

To filter and sort the job list in HPC Cluster Manager

  1. In HPC Cluster Manager, go to Job Management.

  2. In Job Management, in the navigation pane, under All Jobs, select Canceled.

  3. Right-click the column headings in the job list, then click Column Chooser.

  4. Use the Column Chooser dialog box to include the following job properties in the list of Displayed columns:

    • Error Message

    • Owner

    • Priority

    • Submit Time

    • Start Time

    • Run Time

  5. Click the column headers to sort the job list according to the displayed property values.

  6. Optionally, requeue jobs by selecting one or more jobs, and then clicking Requeue Job in the Actions pane.

Additional references