New Feature Evaluation Guide for Windows HPC Server 2008 R2

Updated: May 2011

Applies To: Windows HPC Server 2008 R2

This guide provides scenarios and steps to try new features in Windows® HPC Server 2008 R2. You can download the Windows HPC Server 2008 R2 Suite Evaluation on the Microsoft download center (https://go.microsoft.com/fwlink/?LinkId=198810).

Important
Read the Release Notes for Windows HPC Server 2008 R2 before following the steps in this guide.

This guide includes the following scenarios:

Cluster Management

  • Use the patching wizard to add software updates to a node template

  • Use workstations to run cluster jobs

  • Create customizable dashboards that allow you to monitor nodes at a glance

  • Save a command or script as a diagnostic test in HPC Cluster Manager

SOA scheduling and runtime

  • Optimize job scheduling for SOA jobs and interactive workloads

  • Manage SOA service configuration settings from a single location

  • Enable and collect trace logs to troubleshoot SOA sessions

Job scheduling and runtime

  • Provide accurate job prioritization for your cluster

  • Check for license availability before a job is started

  • Stop a running job or task immediately

  • Exclude particular nodes from running tasks in your job

  • Receive notification when your job is done

  • Provision or clean up the nodes that are allocated to your job

  • Provide custom job progress information

  • Allow canceled tasks time to save state information or clean up before exiting

Cluster Management

The scenarios in this section help you try new management features in Windows HPC Server 2008 R2.

Use the patching wizard to add software updates to a node template

Scenario

You have deployed cluster nodes, and now you want to use the node templates to manage and apply software updates (patches) to the nodes.

Goal

Use the Add Software Updates Wizard to add an Apply Updates task to a node template.

Requirements

  • A cluster with a Windows HPC Server 2008 R2 head node.

  • The head node must be able to access the Microsoft Update website or the WSUS server in your enterprise.

  • Administrative permissions on the cluster.

Steps

The Maintenance phase of a node template can include an Apply Updates task, with settings that you configure for which updates to apply. When you run the Maintain action on nodes, the Apply Updates task downloads updates to the compute nodes from the Microsoft Update website or the WSUS server in your enterprise, and then installs the updates.

Note
The Apply Updates task in the node template cannot install updates that include software license terms (also known as an End User License Agreement or EULA). This type of update requires the administrator of each node to accept the software license terms before the update can be installed. In this scenario, you have to install updates that include software license terms manually.

The following procedure describes how to add the Apply Updates task to a node template. The node template must have already been used to deploy one or more nodes.

  1. In HPC Cluster Manager, click Configuration, and then click Node Templates.

  2. Right-click a node template, and then click Add Software Updates.

  3. Follow the steps in the Add Software Updates Wizard to select the update level and the specific updates to add. In the final step, you can choose to install the updates now or later.

  4. To make changes to the Apply Updates task, you can run the wizard again, or right-click the node template and click Edit.

Expected results

An Apply Updates task is added to the Maintenance phase of the selected node template.

Related Resources

For more information about applying updates by using an enterprise WSUS server or by using a node template, see the Best Practices topic in the updating nodes step-by-step guide (https://go.microsoft.com/fwlink/?LinkId=194794).

Use workstations to run cluster jobs

Scenario

You have powerful workstation computers that are not utilized overnight and on weekends. You want to harvest this processing power to run cluster jobs.

Note
SP1 includes a new feature for workstation node scheduling that allows you to configure the availability policy so that nodes come online when there is no user activity detected. For more information, see Configure availability of workstation nodes based on detected user activity.

Goal

Add workstation nodes to your HPC cluster and define a weekly availability policy to control when these nodes are brought online.

Requirements

  • A cluster with a Windows HPC Server 2008 R2 head node.

  • One or more workstation computers running the Windows 7 operating system.

  • The workstation computers and the head node computer must be joined to the same domain.

  • Administrative permissions on the cluster.

Steps

  1. In HPC Cluster Manager, create a node template for the workstations:

    1. Use the Create Node Template Wizard to create a Workstation node template.

    2. On the Configure Availability Policy page, select how and when you want workstation nodes to be brought online and offline.

  2. Install HPC Pack 2008 R2 on the workstations and select the Join an existing HPC cluster by creating a new workstation node option.

  3. The nodes appear in Node Management as Unapproved.

  4. Assign the workstation template that you created to the nodes.

Expected results

Workstation nodes come online and go offline according to the configured availability policy.

Related Resources

Adding Workstation Nodes in Windows HPC Server 2008 R2 Step-by-Step Guide

Create customizable dashboards that allow you to monitor nodes at a glance

Scenario

When administering clusters of up to 1000 nodes, you need customizable dashboards that let you monitor several node metrics for the entire cluster at a glance. To more easily identify outliers and bottlenecks and quickly switch between views, you can create multiple node list or heat map tabs that focus on sets of information such as:

  • Network view

  • CPU or disk load

  • Application trends for large MPI jobs

Goal

Create one or more new tabs in Node Management.

Requirements

  • A head node with Windows HPC Server 2008 R2 installed.

  • Administrative permissions on the cluster.

Steps

  1. In HPC Cluster Manager, click Node Management.

  2. In the Navigation Pane, click Nodes.

  3. In the view pane, click the blank tab, and then click Customize Tab.

  4. Type a name for the tab, and then click List View or Heat Map View.

  5. Add one or more metrics.

  6. Click Apply at any time to see the current tab configuration.

  7. Click OK to save your changes to the tab.

  8. In the view pane, use the slide bar to adjust the heat map zoom, or click the Fit to window icon to automatically adjust the tile size for the best fit.

  9. You can organize the heat map view by location by clicking the Group by location icon. Remove the location grouping by clicking the Group by name icon. (You can specify the node location property in the node XML or by selecting a node and clicking Edit.)

To change the settings on a tab, right-click the tab name, then click Customize Tab.

If you are creating a Heat Map tab, you can customize the following display options:

  • Color scale: The minimum value for a metric is associated with a color, for example, white, and the maximum value for that metric is associated with another color, for example, blue. In this case, lower values for that metric appear as lighter shades of blue, and higher values appear as darker shades of blue. For each metric, you can customize the maximum and minimum values and associated colors. You can also flip the scale so that the minimum values are darker, and the maximum values are lighter.

  • Linear or logarithmic color banding: The color bands that are used to display metric values can be displayed in linear or logarithmic scales. In a linear scale, the color bands are equally sized across the defined value range. In a logarithmic scale, the color bands are logarithmically sized across the value range. Logarithmic scale is useful when you want to visually distinguish values at one end of the value range.

  • Stacking or overlaying metrics: You can view multiple metrics in Stacking or Overlaying view. Stacking displays a color bar for each metric. Overlaying displays only the most significant metric for each node. Significance is based on the order in which the metrics are defined in the Customize Tab dialog box. The first metric is displayed by default. If a metric value reaches the darkest color band, that is the metric that is displayed. If more than one metric reaches the darkest color band, the first one listed is the one that is displayed.

  • Metric aggregation period: Aggregate metrics over a short time period by increasing the number of seconds for the metric value display.
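The Overlaying rule described above can be sketched in a few lines of Python. This is only an illustration of the selection logic as the text states it, not the code that HPC Cluster Manager runs. The function names, the tuple layout, and the choice of five linear color bands are assumptions made for the example.

```python
from typing import Sequence, Tuple

# (name, current value, scale minimum, scale maximum), in the order the
# metrics are defined on the tab. This layout is hypothetical.
Metric = Tuple[str, float, float, float]


def band_index(value: float, lo: float, hi: float, bands: int) -> int:
    """Map a value onto one of `bands` equally sized (linear) color bands."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clamped = min(max(value, lo), hi)
    index = int((clamped - lo) / (hi - lo) * bands)
    # A value equal to the scale maximum belongs to the darkest band.
    return min(index, bands - 1)


def overlay_metric(metrics: Sequence[Metric], bands: int = 5) -> str:
    """Return the name of the metric that Overlaying view would display."""
    if not metrics:
        raise ValueError("at least one metric is required")
    for name, value, lo, hi in metrics:
        # The first listed metric that reaches the darkest band wins.
        if band_index(value, lo, hi, bands) == bands - 1:
            return name
    # Otherwise the first metric is displayed by default.
    return metrics[0][0]


if __name__ == "__main__":
    node = [("CPU Usage", 20.0, 0.0, 100.0), ("Disk Usage", 95.0, 0.0, 100.0)]
    print(overlay_metric(node))
```

Because the loop walks the metrics in their defined order, the tie-breaking rule (the first listed metric is shown when several reach the darkest band) needs no extra code.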

Expected results

  • Your new tab configurations are saved, and you can easily switch between different views of your nodes.

  • The node filters that you apply persist across all tabs.

  • The nodes that you select in one tab are selected when you go to a different tab.

Related Resources

HPC R2 Demo: New heat map and location-based node management features – video (7 min.)

Save a command or script as a diagnostic test in HPC Cluster Manager

Scenario

When managing your cluster, there are some commands or scripts that you run regularly to check the status of your nodes. You would like to be able to run your own tests and the built-in tests from a single location.

Goal

Save the fsutil volume diskfree command as a diagnostic test that checks for free disk space on your nodes and then run the test and view results from HPC Cluster Manager.

Requirements

  • A head node with Windows HPC Server 2008 R2 installed.

  • Administrative permissions on the cluster.

Steps

Step 1: Define the test

  1. On the head node, open a text editor such as Notepad, and paste in the following XML code. Optionally, change the Company name to your name.

  2. Save the file as C:\SampleTests\DiskSpaceTest.xml.

    Ensure that in Save as type, you select All files (*.*).

Step 2: Add the test to the cluster

  1. Run HPC PowerShell as an administrator, and then type the following cmdlet:

    Add-HpcTest -File C:\SampleTests\DiskSpaceTest.xml

  2. To see information about the test you added, type the following cmdlet:

    Get-HpcTestDetail -Alias diskspace

  3. To see all cmdlets that relate to HPC tests, type Get-help *hpctest*.

Step 3: Run the test and view results

  1. In HPC Cluster Manager, click Diagnostics.

  2. In the Navigation Pane, under Tests, select the new node named Beta Participant.

  3. In the view pane, right-click Free Disk Space, then click Run.

  4. In the Run Diagnostics Tests dialog box, click Run.

  5. In the Navigation Pane, click Test Results, and then select the new node named Beta Participant.

  6. In the Details Pane you can see the progress of the test. Click the Results tab to see the results.

Expected results

The results from the test should look similar to this:

NODE 1 - - > Finished

------------------------------------

Total # of free bytes : 33324670976

Total # of bytes : 41910938752

Total # of avail free bytes : 33324670976
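If you want to post-process these results, for example to flag nodes that are running low on disk space, the three lines are easy to parse. The following Python sketch assumes the output has exactly the "Total # of ... : <number>" shape shown above; the function names are hypothetical, and other Windows versions may format the fsutil output differently.

```python
import re

# Sample output copied from the expected results above.
SAMPLE = """\
Total # of free bytes : 33324670976
Total # of bytes : 41910938752
Total # of avail free bytes : 33324670976
"""


def parse_diskfree(text):
    """Return a dict such as {"free bytes": 33324670976, "bytes": ...}."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"\s*Total # of (.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def percent_free(values):
    """Free space as a percentage of the total volume size."""
    return 100.0 * values["free bytes"] / values["bytes"]


if __name__ == "__main__":
    parsed = parse_diskfree(SAMPLE)
    print(f"{percent_free(parsed):.1f}% free")
```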

Related Resources

SOA scheduling and runtime

The scenarios in this section help you try new SOA scheduling and runtime features in Windows HPC Server 2008 R2.

Optimize job scheduling for SOA jobs and interactive workloads

Scenario

Your cluster runs mostly interactive workloads, such as service-oriented architecture (SOA) jobs. One or two large jobs may be taking up most of the cluster, but there are many other interactive jobs that need to run. You want as many jobs to start as possible, rather than having most of the resources allocated to the top of the job queue.

To optimize job scheduling for interactive workloads, you can change the scheduling mode from Queued to Balanced.

In Balanced mode, the scheduler attempts to start all incoming jobs as soon as possible at their minimum resource requirements. After all the jobs in the queue have their minimum resources, additional cluster resources are allocated to jobs based on their load and priority. Resource allocation is periodically rebalanced to fill idle resources and accommodate new jobs.

Goal

Change the scheduling mode from Queued to Balanced.

Requirements

  • A Windows HPC Server 2008 R2 cluster with at least one compute node and one broker node.

  • Administrative permissions on the cluster.

Steps

  1. Use one of the following methods to change the Scheduling Mode from Queued to Balanced:

    • In HPC Cluster Manager, click Options, and then click Job Scheduler Configuration. Scheduling Mode and the associated settings can be configured on the Policy Configuration tab.

    • Run HPC PowerShell as an administrator, and then type:

      Set-HpcClusterProperty -schedulingMode Balanced

    • Open a Command Prompt window as an administrator, and then type:

      cluscfg setparams SchedulingMode=Balanced

  2. Submit several jobs to the cluster.

After you have set the Balanced mode, you can adjust how additional resources are allocated with the PriorityBias setting, and how often the scheduler rebalances with the ReBalancingInterval setting.

PriorityBias controls how additional resources are allocated to jobs. In Balanced mode, "additional resources" refers to cluster resources above the total minimum resources for all running jobs. Tasks that are running on additional resources can be canceled with immediate preemption to accommodate new jobs or to converge on the desired allocation pattern. You can choose from the following three options:

  • HighBias: All additional resources are allocated to higher priority jobs.

  • MediumBias (Default): Each priority band is given a higher proportion of additional resources than the band below it. The priority bands are Highest, Above Normal, Normal, Below Normal, and Lowest.

  • NoBias: Resources are allocated equally regardless of priority.
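To make the difference between the bias settings concrete, here is a deliberately simplified Python model of a single allocation pass. It is not the HPC Job Scheduler Service algorithm: the Job class, the allocate function, and the per-core round-robin used for NoBias are all assumptions made for illustration. MediumBias is left out because the text does not state the proportions it uses.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int   # higher number means higher priority
    min_cores: int
    max_cores: int
    cores: int = 0


def allocate(jobs, total_cores, bias="NoBias"):
    """Return {job name: cores} after one simplified Balanced-mode pass."""
    free = total_cores
    # sorted() is stable, so jobs with equal priority keep their list order.
    ordered = sorted(jobs, key=lambda job: -job.priority)

    # Pass 1: start every job that fits at its minimum resources.
    started = []
    for job in ordered:
        if job.min_cores <= free:
            job.cores = job.min_cores
            free -= job.min_cores
            started.append(job)

    # Pass 2: hand out the additional resources according to the bias.
    if bias == "HighBias":
        # Higher priority jobs are filled to their maximum first.
        for job in started:
            extra = min(free, job.max_cores - job.cores)
            job.cores += extra
            free -= extra
    elif bias == "NoBias":
        # One core at a time to each job, regardless of priority.
        progress = True
        while free > 0 and progress:
            progress = False
            for job in started:
                if free > 0 and job.cores < job.max_cores:
                    job.cores += 1
                    free -= 1
                    progress = True
    else:
        raise ValueError(f"bias {bias!r} is not modeled in this sketch")

    return {job.name: job.cores for job in started}


if __name__ == "__main__":
    for bias in ("HighBias", "NoBias"):
        jobs = [Job("A", 3000, 2, 10), Job("B", 2000, 2, 10)]
        print(bias, allocate(jobs, 12, bias))
```

With 12 cores and two jobs that each need at least 2, both jobs start in either mode; only the split of the remaining 8 cores changes.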

ReBalancingInterval represents the time, in seconds, between scheduler rebalancing passes.

You can use one of the following methods to change PriorityBias and ReBalancingInterval:

  • Run HPC PowerShell as an administrator, and then type:

    Set-HpcClusterProperty -PriorityBias HighBias -ReBalancingInterval 20

  • Open a Command Prompt window as an administrator, and then type:

    cluscfg setparams PriorityBias=HighBias ReBalancingInterval=20

Expected results

Jobs are started as soon as possible at their minimum resource requirements. If all jobs in the queue have started, all remaining resources in the cluster are added to jobs based on their priority and workload. As new jobs start, cluster resources are reallocated in proportion to each job's priority.

Related Resources

Queued mode is the priority-based, first-come, first-served scheduling that Windows HPC Server 2008 uses. For more information, see Understanding Job Scheduling Policies (https://go.microsoft.com/fwlink/?LinkId=177866).

Manage SOA service configuration settings from a single location

Scenario

You have multiple SOA services installed to a central location on your cluster, and you want the ability to see all of the deployed services, change settings to help diagnose and troubleshoot specific services, and modify the service configuration files from a centralized location.

In HPC Cluster Manager, in Configuration, the Services view lets you:

  • See a list of all of the SOA services that are centrally deployed on your cluster (services that are deployed locally on compute nodes are not included).

  • Run diagnostics to verify that the DLLs for the service can be loaded on the specified nodes, and that any detected dependencies for the DLL are present on the nodes.

  • Open the service configuration file in an editor.

  • Set event level tracing.

  • Configure error log output.

Goal

Add a service on the cluster and manage the service configuration settings from HPC Cluster Manager.

Requirements

  • A Windows HPC Server 2008 R2 cluster with at least one compute node and one WCF Broker node.

  • Administrative permissions on the cluster.

  • A SOA service assembly (DLL) and a service configuration file (file must be named servicename.config, where the servicename is the same as that passed into the SessionStartInfo constructor).

  • Write permissions on the configuration file to edit the file.

  • A client application that starts an HPC session for that service.

  • Optionally, a client computer with HPC Pack 2008 R2 installed.

Steps

  1. Copy your service .dll file to a folder named C:\ServicesR2 on each compute node.

  2. On the head node, copy the service configuration file to the C:\Program Files\Microsoft HPC Pack 2008 R2\ServiceRegistration folder.

  3. Click Start, point to All Programs, click Microsoft HPC Pack 2008 R2, and then click HPC Cluster Manager.

  4. In HPC Cluster Manager, click Configuration, and then click Services.

  5. The view pane displays a list of all services that have configuration files in the ServiceRegistration folder. Verify that the service that you just added appears in the list.

  6. Right-click your service, then click Edit Configuration File. The configuration file for your service opens in the default XML editor.

    Ensure that the assembly attribute of the service element points to the location of your service .dll (C:\ServicesR2\<yourServiceName>.dll). For example:

    Save the changes, if you made any, and then close the text editor.

  7. To verify that the service can be loaded, right-click the service, and then click Run SOA Service Loading Diagnostic Test.

  8. The Run Diagnostic Tests dialog box appears, and the service that you selected is automatically specified in the parameter for the test. Click Run.

  9. To view test results: In Diagnostics, in the Navigation Pane, click Test Results.

Expected results

  • Service configuration files that you put in the C:\Program Files\Microsoft HPC Pack 2008 R2\ServiceRegistration folder appear in the Configuration section in HPC Cluster Manager.

  • The service loading diagnostic test checks for detected DLL dependencies.

Related Resources

Enable and collect trace logs to troubleshoot SOA sessions

Scenario

You have a development cluster and you are testing SOA clients and services. Your service DLL includes code to generate trace information.

Goal

Enable tracing on the head node and collect the trace logs from each node that was used during the session.

Requirements

  • A Windows HPC Server 2008 R2 cluster with at least one compute node and one WCF Broker node.

  • Administrative permissions on the cluster.

  • A SOA service assembly (DLL) and a service configuration file deployed to the cluster. The service configuration file must be centrally deployed. For more information, see Manage SOA service configuration settings from a single location.

  • A client application that starts an HPC session for that service.

  • Optionally, a client computer with HPC Pack 2008 R2 installed.

Steps

When you enable tracing in the service configuration file, the trace information is logged to a file on the compute nodes. The log files trace the steps of the service call and the intermediate results on the cluster. You can collect and remove traces by using the Job Management view or the HPC PowerShell cmdlets. You can view the trace log files with the WCF Service Trace Viewer (SvcTraceViewer.exe).

  1. Click Start, point to All Programs, click Microsoft HPC Pack 2008 R2, and then click HPC Cluster Manager.

  2. In HPC Cluster Manager, click Configuration, and then click Services.

  3. Right-click the service that you want to troubleshoot, and then click Set Event Logging Level. In the dialog box, select the desired trace level and then click OK.

  4. Start a session with that service.

  5. Click Job Management, and then click All Jobs.

  6. In the job list, find the job that is associated with the session that you are debugging. The job ID is the same as the session ID.

  7. Right-click the job, and then click Collect Trace.

  8. In the Collect Trace dialog box, specify the shared folder where you would like to collect the trace logs. The folder must be accessible from the compute nodes.

  9. Verify that the trace logs appear in the specified folder.

  10. Right-click the job, and then click Delete Trace to delete the trace logs from the compute nodes.

Important
Event logging is not generally recommended for production environments. After collecting the trace logs, ensure that you delete them from the compute nodes to avoid consuming disk space.

Expected results

Easily enable and retrieve service tracing.

Related Resources

Job scheduling and runtime

The scenarios in this section help you try new job scheduling and runtime features in Windows HPC Server 2008 R2.

Provide accurate job prioritization for your cluster

Scenario

Your cluster serves many departments and user groups, and you need accurate job prioritization to meet business needs. Each department has a prioritized list of jobs, and you want the jobs from each department to run in the requested order. Occasionally, you need to make adjustments to the order of the job queue based on particular circumstances or needs.

Priority and submit time help determine when the job will run, and how many resources the job will get. When multiple jobs are submitted with the same priority level, the job scheduler attempts to start the jobs in each priority level on a first-come, first-served basis. To ensure that business need has a stronger impact on the order of the job queue than submit time, you ask cluster users to specify a granular priority level for each job.

Goal

Users submit jobs with numerical priority levels.

When necessary, manually adjust priority levels on submitted jobs.

Requirements

  • A head node with Windows HPC Server 2008 R2 installed

  • Administrative permissions on the cluster

Steps

In HPC Pack 2008 R2, the job priority can have a value from 0 through 4000. Users can specify priority in terms of a priority band, a priority number, or a combination of the two. The priority bands and their corresponding numerical values are as follows:

  • Lowest (0)

  • BelowNormal (1000)

  • Normal (2000)

  • AboveNormal (3000)

  • Highest (4000)

The numerical priority can have a value between 0 (Lowest) and 4000 (Highest). If you enter a value numerically, it is displayed as the corresponding priority band, or as a combination. For example, if you specify a value of 2500, the priority is displayed as Normal+500.
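The mapping from a number to its displayed form can be sketched as follows. The band names and values come from the list above. The rule that an offset is always shown relative to the nearest band at or below the value is an assumption generalized from the single Normal+500 example, and format_priority is a hypothetical helper, not part of HPC Pack.

```python
# Band values taken from the list above, highest first.
BANDS = [
    (4000, "Highest"),
    (3000, "AboveNormal"),
    (2000, "Normal"),
    (1000, "BelowNormal"),
    (0, "Lowest"),
]


def format_priority(value: int) -> str:
    """Render a numeric priority as a band name, or as band+offset."""
    if not 0 <= value <= 4000:
        raise ValueError("priority must be from 0 through 4000")
    for base, name in BANDS:
        if value >= base:
            offset = value - base
            return name if offset == 0 else f"{name}+{offset}"
    raise AssertionError("unreachable: 0 matches the Lowest band")


if __name__ == "__main__":
    for number in (0, 2000, 2500, 4000):
        print(number, format_priority(number))
```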

Monitor and adjust the job queue

Cluster administrators and job owners can modify the Priority job property for any active job (Queued or Running).

  1. Run HPC PowerShell.

  2. Type the following cmdlets to see a list of active jobs in job queue order (descending priority and ascending submit time):

    get-hpcjob|sort @{expression="priority"; descending=$true}, @{expression="submittime"; ascending=$true}

  3. Modify priority levels on submitted jobs as appropriate. You can use the Set-HpcJob cmdlet to modify the priority of a job. For example, the following cmdlet sets a priority level of 2550 for job 122:

    set-hpcjob -id 122 -priority 2550
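The queue order used in step 2 (descending priority, then ascending submit time) is a two-key sort. The Python sketch below shows the same ordering on hypothetical job records; the dictionary keys are assumptions for the example and are not the property names of the HPC job object.

```python
from datetime import datetime


def queue_order(jobs):
    """Sort jobs by descending priority, then by ascending submit time."""
    return sorted(jobs, key=lambda job: (-job["priority"], job["submit_time"]))


if __name__ == "__main__":
    jobs = [
        {"id": 120, "priority": 2000, "submit_time": datetime(2011, 5, 2, 9, 0)},
        {"id": 121, "priority": 2550, "submit_time": datetime(2011, 5, 2, 9, 5)},
        {"id": 122, "priority": 2000, "submit_time": datetime(2011, 5, 2, 8, 0)},
    ]
    print([job["id"] for job in queue_order(jobs)])
```

Raising the priority of a job by even one point moves it ahead of every job at the lower value, whatever their submit times, which is why granular priorities give business need more weight than arrival order.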

Expected results

  • Cluster administrators and job owners can modify the Priority job property for any active job (Queued or Running).

  • The job scheduler attempts to start jobs in priority order. Scheduling policies (such as backfilling) and activation filters can affect the order in which jobs start.

Related Resources

Check for license availability before a job is started

Check for license availability before a job is started

Scenario

Your cluster runs several applications that use licenses that are shared on a licensing server. You want to:

  • Schedule jobs efficiently and reduce the number of jobs that are failing due to unavailable licenses

  • Maintain first-come, first-served scheduling when jobs are waiting for licenses

  • Make efficient use of the cluster when jobs are waiting for licenses

The HPC Job Scheduler Service can run a custom activation filter on queued jobs that are about to start. A job activation filter is a custom application that you can write to provide additional checks and controls, such as checking for license availability. Depending on the return value from your filter, the HPC Job Scheduler Service takes the appropriate action on the job.

The HPC 2008 R2 SDK samples include an example of an activation filter that checks for license availability against a FlexLM license file.

Goal

Build and try the Activation Filter sample that is included in the HPC 2008 R2 SDK samples (HPC2008R2.SampleCode.zip). The sample is a Visual Studio 2008 project named FlexLM.sln that is in the HPC2008R2.SampleCode\Scheduler\Activation Filter folder.

Requirements

  • A head node with Windows HPC Server 2008 R2 installed

  • Administrative permissions on the cluster

The following must be installed on the head node:

Steps

FlexLM.sln includes a sample activation filter that checks for license availability and the FlexLM.exe.config file that you can use to specify the location of the FlexLM utilities and license file. In the FlexLM project's properties, there are custom pre-build event commands and post-build event commands. The pre-build commands are used to create the files needed to create event log entries. The post-build commands unregister any old version of the FlexLM activation filter and then register the new version so that it can create events and the event viewer can display them. The commands assume that Visual Studio is creating the files in c:\Program Files\Microsoft HPC Pack 2008 R2\Bin\. The cluscfg command tells the HPC Job Scheduler to use the new Activation Filter.

The following steps describe how to configure and build the solution:

  1. On the head node, run Visual Studio 2008.

  2. Open FlexLM.sln.

  3. Open the Jobs.cs file and update the reference to the job xml schema (from HPCS2008 to HPCS2008R2) in the ParseJobXml() method as follows:

    Change from:

    nsmgr.AddNamespace("ab", @"http://schemas.microsoft.com/HPCS2008/scheduler/");

    To:

    nsmgr.AddNamespace("ab", @"http://schemas.microsoft.com/HPCS2008R2/scheduler/");

    Important
    If you do not perform this step, the sample filter will not be able to parse the job XML file.

  4. In the FlexLM.exe.config file, specify the paths to the FlexLM utility (in PollCommandName) and the FlexLM license file (in PollCommandArguments). The following XML code example shows how these paths are specified:

  5. Build the solution. The program deploys the DLL and .config files on the head node, creates event log files, and adds the activation filter to the Job Scheduler parameters.

To test the filter, submit jobs to the cluster that require licenses.

Expected results

The following list describes the supported exit codes for an activation filter, and the corresponding Job scheduler action:

  • 0: The job is started.

  • 1: The job is not started and remains in the queue. The filter reevaluates the job periodically until either the job passes, or until the job is canceled. No other jobs of equal or lower priority are started until the job passes or is canceled.

  • 2: The job is not started, but available resources are reserved for it depending on the Scheduling Mode: In Queued, up to the job’s maximum resources are reserved; in Balanced, the minimum resources are reserved. Other jobs can be started on other resources. The filter reevaluates the job periodically until the job passes.

  • 3: The job is put on hold until the date and time specified by the Hold Until job property. After the hold period, the job is reevaluated by the filter program. If the filter returns with exit code 3 and no Hold Until value is specified for that job, the job is held for the amount of time specified by the Default Hold Duration cluster setting.

  • 4: The job is marked as Failed with an error message that the job was failed by the activation filter.

  • Any other exit code: Undefined.

  • Filter timeout: Same as exit code 2.

  • Filter not found: Same as exit code 2.
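The choice between exit codes 1 and 2 is what lets a filter trade strict first-come, first-served ordering against cluster utilization. The Python sketch below shows that decision for a license check. It is not the FlexLM sample from the SDK: the FilterExit names, the license_filter function, and its parameters are hypothetical, and a real activation filter is a separate program that reports its decision through its process exit code.

```python
from enum import IntEnum


class FilterExit(IntEnum):
    """Exit codes from the list above; the member names are made up here."""
    START = 0          # the job is started
    BLOCK_QUEUE = 1    # the job waits and blocks equal or lower priority jobs
    RESERVE = 2        # the job waits with resources reserved; others may start
    HOLD = 3           # the job is held until its Hold Until time
    FAIL = 4           # the job is marked as Failed


def license_filter(licenses_needed: int, licenses_free: int,
                   strict_fcfs: bool = False) -> FilterExit:
    """Decide what the scheduler should do with a job that needs licenses."""
    if licenses_needed <= licenses_free:
        return FilterExit.START
    # Not enough licenses: either preserve queue order or let other jobs run.
    return FilterExit.BLOCK_QUEUE if strict_fcfs else FilterExit.RESERVE


if __name__ == "__main__":
    decision = license_filter(licenses_needed=4, licenses_free=1)
    print(decision.name, int(decision))
    # A real filter would finish with: sys.exit(int(decision))
```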

Related Resources

None.

Stop a running job or task immediately

Scenario

You want to stop a running job or task immediately.

In HPC Server 2008 R2, a cluster administrator defines a Task Cancelation Grace Period that gives canceled tasks time to save state information and clean up before exiting. To use the grace period, the application must process the CTRL_BREAK event. If the application does not process the event, the task exits immediately. For a service to use the grace period, it must process the ServiceContext.OnExiting event. Job owners can define Node Release tasks for their jobs to collect data or log files, or to return nodes to their pre-job state. The Node Release task runs when a job is about to release a node, including when the job is canceled.

You can force cancel a job or task to skip grace periods and node release tasks.
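For reference, the following Python sketch shows what "processing the CTRL_BREAK event" can look like in an application. Python on Windows delivers CTRL_BREAK as signal.SIGBREAK; the fallback to SIGTERM only keeps the example importable on other platforms. The CancelFlag class, run_task function, and save_state callback are hypothetical names for this illustration.

```python
import signal

# CTRL_BREAK arrives in Python as SIGBREAK, which exists only on Windows.
# The SIGTERM fallback is an assumption that keeps the sketch portable.
CANCEL_SIGNAL = getattr(signal, "SIGBREAK", signal.SIGTERM)


class CancelFlag:
    """Remembers that a cancel request has been received."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        # Keep the handler tiny: record the request and return.
        self.requested = True

    def install(self):
        # Must be called from the main thread of the application.
        signal.signal(CANCEL_SIGNAL, self.handler)


def run_task(work_items, flag, save_state):
    """Process items until done or until a cancel request is seen.

    Returns True if all items were processed, False if the task stopped
    early after saving its state.
    """
    completed = []
    for item in work_items:
        if flag.requested:
            save_state(completed)   # use the grace period to save progress
            return False
        completed.append(item)      # stand-in for the real work
    return True


if __name__ == "__main__":
    flag = CancelFlag()
    flag.install()
    finished = run_task(range(5), flag, save_state=print)
    print("finished" if finished else "canceled")
```

A force cancel skips the grace period, so a handler like this one never gets the chance to run save_state.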

Goal

Force cancel a job or task.

Requirements

  • A head node with Windows HPC Server 2008 R2 installed.

  • Administrative or user permissions on the cluster.

Steps

  1. Submit a job with at least two tasks including:

    1. A task that runs an application that responds to the CTRL_BREAK event.

    2. A Node Release task.

  2. To force cancel a task, use one of the following methods. Include the -force parameter and specify the ID of your job and task, and optionally, the sub-task.

    1. In HPC PowerShell, use the following cmdlet: Stop-HpcTask -JobId <yourJobID> -TaskID <yourTaskID> [-subTaskID <yourSubTaskID>] -force

    2. At a command prompt, use the following command: task cancel <yourJobID>.<yourTaskID>[.<yourSubTask>] /force

  3. To force cancel a job, use one of the following methods. Include the -force parameter, and specify the ID of your job.

    1. In HPC PowerShell use the following cmdlet: Stop-HpcJob <yourJobID> -force

    2. At a command prompt use the following command: job cancel <yourJobID> /force

Expected results

Force cancelling a task: the task stops immediately and does not use the Task Cancel Grace period (the application must process the CTRL_BREAK event to make use of the grace period).

Force cancelling a job: the job stops immediately. The tasks in the job do not use the Task Cancel Grace period, and the Node Release task does not run.

Related Resources

Exclude particular nodes from running tasks in your job

Scenario

You notice that one particular node keeps failing tasks in your job. You want the job scheduler to stop scheduling your tasks on that node.

In Windows HPC Server 2008 R2, you can specify a list of nodes to exclude from your job.

Note
For SOA jobs, the broker node automatically updates and maintains the list of excluded nodes according to the EndPointNotFoundRetryPeriod setting (in the service configuration file). This setting specifies how long the service host should retry loading the service and how long the broker should wait for a connection. If this time elapses, the broker adds the node (service host) to the Excluded Nodes list. The service configuration also includes the maxExcludedNodes setting that specifies how many nodes can be excluded before the session fails.

Goal

Add one or more nodes to the Excluded Nodes job property.

See all excluded nodes on the cluster (Administrator).

Requirements

  • A head node with Windows HPC Server 2008 R2 installed.

  • Administrative or user permissions on the cluster.

Steps

Defining excluded nodes for a job

For any active job, you can add or remove nodes in the Excluded Nodes job property, or clear the list. The following commands modify and view the Excluded Nodes list by using HPC PowerShell or a command prompt.

In HPC PowerShell, use the following cmdlets:

  • Set-HpcJob -JobId <yourJobID> -AddExcludedNodes <nodeName>,<nodeName>

  • Set-HpcJob -JobId <yourJobID> -RemoveExcludedNodes <nodeName>,<nodeName>

  • Set-HpcJob -JobId <yourJobID> -ClearExcludedNodes

  • (Get-HpcJob –JobId <yourJobID>).ExcludedNodes

  • Or to view all job properties, Get-HpcJob –JobId <yourJobID>|fl

At a command prompt, use the following commands:

  • job modify <yourJobID> /addexcludednodes:<nodeName>,<nodeName>

  • job modify <yourJobID> /removeexcludednodes:<nodeName>,<nodeName>

  • job modify <yourJobID> /clearexcludednodes

  • job view <yourJobID> /detailed|find /i "excludednodes"

  • Or to view all job properties, job view <yourJobID> /detailed
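The scenario above (one node that keeps failing tasks) can also be scripted. The following Python sketch is illustrative only. The task records, the failure threshold, and both helper names are hypothetical, and the sketch assumes that job modify accepts a comma-separated node list after a colon (/addexcludednodes:<node>,<node>); confirm the syntax with job modify /? on your cluster. The sketch only builds the command line and does not run it.

```python
from collections import Counter


def nodes_to_exclude(task_records, threshold=3):
    """Return nodes whose failed-task count reaches the threshold, sorted by name."""
    failures = Counter(node for node, state in task_records if state == "Failed")
    return sorted(node for node, count in failures.items() if count >= threshold)


def build_exclude_command(job_id, nodes):
    """Build the 'job modify' argument list that adds nodes to Excluded Nodes."""
    if not nodes:
        return None  # nothing to exclude
    # Assumed option syntax: /addexcludednodes:<node>,<node>
    return ["job", "modify", str(job_id), "/addexcludednodes:" + ",".join(nodes)]


if __name__ == "__main__":
    # Hypothetical (node, task state) pairs collected from task results.
    records = [("NODE07", "Failed"), ("NODE02", "Finished"),
               ("NODE07", "Failed"), ("NODE07", "Failed")]
    print(build_exclude_command(42, nodes_to_exclude(records)))
```

The returned list can be passed to subprocess.run on a computer where the HPC client utilities are installed.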

Monitoring excluded nodes on the cluster

To see all excluded nodes on a cluster, use the Get-HpcJob PowerShell cmdlet. The following example shows how to list all of the excluded nodes for jobs that were submitted today. The script also lists the job template that was used for the job that excluded the node. In the following cmdlet, <today’s date> is specified in a date format such as mm/dd/yyyy:

Get-HpcJob -BeginSubmitDate <today’s date>|select ExcludedNodes, JobTemplate|sort

If the cluster administrator detects and resolves the issue on one or more nodes, the administrator can remove the fixed node from any node exclusion list in which it appears. The following cmdlet gets all active jobs and removes the fixed nodes from the node exclusion lists (this has no effect on jobs that do not list the specified nodes):

Get-HpcJob|Set-HpcJob –removeExcludedNodes <fixedNodeName>,<fixedNodeName>

Expected results

  • Tasks in the job that are running on a node that has been added to Excluded Nodes are canceled and marked as Failed.

  • No tasks in the job are started on nodes that are listed in Excluded Nodes.

Related Resources

Job Submission in Windows HPC Server 2008 Quick Reference


Receive notification when your job is done

Scenario

You submitted a long-running job to the cluster, and would like to be notified when the job is done.

Goal

Enable email notification on the cluster and submit a job that requests notification on job completion.

Requirements

  • A head node with Windows HPC Server 2008 R2 installed.

  • Administrative permissions on the cluster.

Steps

Enable email notification on the cluster:

  1. In HPC Cluster Manager, click Options, and then click Job Scheduler Configuration.

  2. On the E-Mail notifications tab, specify the SMTP server, authentication, and originating address. Click the “More about email notifications” link on the tab for important considerations.

Submit a job that requests notification on completion:

  1. In HPC Cluster Manager, in Job Management, click New Job.

  2. In Job run options, in Send a notification, select the Completes check box.

  3. Add one or more tasks to the job, and then click Submit.

Expected results

If notification is selected for a specific job, and email notification is enabled on the cluster, job owners receive the requested email messages at the email account that is associated with their domain credentials.

Related Resources

None.


Provision or clean up the nodes that are allocated to your job

Scenario

You want to perform some basic provisioning of the nodes that are allocated to your job. For example, you may want to copy files or verify the running environment before your primary tasks run. To prepare the nodes that are allocated to your job, you can add a Node Preparation task to your job.

After your tasks complete, you need to collect data or log files from the nodes that were allocated to your job or return the nodes to their pre-job state. To clean up nodes after running your primary tasks, you can add a Node Release task to your job.

Goal

Submit a job with Node Preparation and Node Release tasks.

Requirements

  • A Windows HPC Server 2008 R2 cluster with at least one compute node.

  • User permission on the cluster.

Steps

For detailed step-by-step instructions, see Submitting a Job with Node Preparation and Node Release Tasks in Windows HPC Server 2008 R2 Step-by-Step Guide.

  1. Create a new job.

  2. Add a Node Preparation task.

    Note
    If you are using HPC PowerShell or a Command Prompt window, use the Type property to designate a Node Preparation task, for example:
    HPC PowerShell: Add-HpcTask -JobId <ID> -Type NodePrep
    Command prompt: job add <ID> /type:NodePrep
  3. Add one or more primary tasks (Basic or Parametric Sweep) to the job.

  4. Add a Node Release task.

    Note
    If you are using HPC PowerShell or a Command Prompt window, specify a task with Type set to NodeRelease.
  5. Submit the job.
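The steps above can be sketched as a sequence of command lines. The following Python example is a minimal illustration, not a definitive implementation. It assumes that the job already exists (for example, created with job new), that job add accepts /type:NodePrep and /type:NodeRelease, and that job submit accepts /id:<jobID>. The helper name and the script and program names (prep.bat, myapp.exe, cleanup.bat) are hypothetical. The sketch only builds the commands and does not run them.

```python
def build_job_commands(job_id, prep_cmd, task_cmds, release_cmd):
    """Return the command lines, in order, that add the tasks and submit the job."""
    job = str(job_id)
    # Node Preparation task: runs on each node before the primary tasks.
    commands = [["job", "add", job, "/type:NodePrep"] + prep_cmd]
    # Primary tasks: no /type option, so the default (Basic) task type applies.
    for cmd in task_cmds:
        commands.append(["job", "add", job] + cmd)
    # Node Release task: runs on each node as it is released from the job.
    commands.append(["job", "add", job, "/type:NodeRelease"] + release_cmd)
    commands.append(["job", "submit", "/id:" + job])
    return commands


if __name__ == "__main__":
    # Hypothetical job ID and commands, for illustration only.
    for line in build_job_commands(42, ["prep.bat"],
                                   [["myapp.exe", "input1.dat"]],
                                   ["cleanup.bat"]):
        print(" ".join(line))
```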

Now try to cancel a Running job that includes a Node Release task.

Expected results

  • The Node Preparation task runs on each node before the Basic or Parametric Sweep tasks.

  • If a Node Preparation task fails to run on a node, that node is not added to the job.

  • The Node Release task runs on each node as it is released from the job.

  • The Node Release task runs if the job is canceled or preempted.

Related Resources


Provide custom job progress information

Scenario

Many of the applications that you run on your cluster run for a long time, and they consist of many internal stages. To better monitor job progress, you want to be able to see information about the percentage of completion or about the internal state of the application (such as data file loaded, running simulation, or writing data).

You can include commands in your application or script files to set and maintain custom job progress information with the Progress and Progress Message job properties.

  • Progress is an integer between 0-100 that represents the percentage of the job that is complete.

  • Progress Message is a string up to 80 characters that can display a custom status message.

Goal

Set and maintain values for job Progress and Progress Message from an application or script.

Requirements

  • A Windows HPC Server 2008 R2 cluster with at least one compute node.

  • User permissions on the cluster.

Steps

Include commands to set Progress and Progress Message in your scripts or applications. For example, if your application includes a loop that performs some work, you can update the progress properties at each iteration.

To set the Progress and Progress Message properties from a batch (.bat) file, an HPC PowerShell script (.ps1), or an application, use the %CCP_JOBID% environment variable to get the ID of the current job.
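As one possible illustration, the following Python sketch builds a job modify command that sets both properties for the current job. It is a sketch under stated assumptions: the /progress and /progressmsg options are not shown in this guide and should be confirmed with job modify /?, and the helper name and stage names are hypothetical. The helper reads CCP_JOBID from the environment, keeps Progress within 0-100, and truncates the message to 80 characters.

```python
import os


def build_progress_command(percent, message, job_id=None):
    """Build a 'job modify' argument list that sets Progress and Progress Message."""
    if job_id is None:
        # CCP_JOBID holds the ID of the current job on the compute node.
        job_id = os.environ["CCP_JOBID"]
    percent = max(0, min(100, int(percent)))  # Progress is an integer from 0 to 100
    message = message[:80]                    # Progress Message holds up to 80 characters
    # Assumed option names: /progress and /progressmsg
    return ["job", "modify", str(job_id),
            "/progress:%d" % percent, "/progressmsg:" + message]


if __name__ == "__main__":
    # Hypothetical application stages; job ID 42 is a placeholder.
    stages = ["data file loaded", "running simulation", "writing data"]
    for i, stage in enumerate(stages, start=1):
        print(build_progress_command(100 * i // len(stages), stage, job_id=42))
```

In an application loop, each returned list could be passed to subprocess.run at the end of an iteration.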

You can use one of the following methods to see the progress information for a running job:

  • In HPC Job Manager, double-click a running job to open the View Job dialog box.

  • At a command prompt, type the following command, where <jobID> is the ID of your job:

    job view <jobID>

  • In HPC PowerShell, type the following cmdlet, where <jobID> is the ID of your job:

    Get-HpcJob <jobID>|select id, Progress, ProgressMessage

Expected results

  • You can view custom progress information in HPC Job Manager, HPC PowerShell, or a Command Prompt window.

  • By default, the HPC Job Scheduler Service sets and maintains the value for the Progress job property. The service does not continue to update Progress for a job if you provide a value for Progress through the command-line interface, HPC PowerShell, or the APIs.

Related Resources


Allow canceled tasks time to save state information or clean up before exiting

Scenario

When a running task is stopped during execution, you want to allow time for the application to save state information, write a log message, create or delete files, or for services to finish computation of their current service call. You can configure the amount of time, in seconds, to allow applications to exit gracefully by setting the Task Cancelation Grace Period cluster property. The default Task Cancelation Grace Period is 15 seconds.

Important
In Windows HPC Server 2008 R2, the HPC Node Manager Service stops a running task by sending a CTRL_BREAK signal to the application. To use the grace period, the application must process the CTRL_BREAK event. If the application does not process the event, the task exits immediately. For a service to use the grace period, it must process the ServiceContext.OnExiting event.

Goal

Allow canceled tasks time to perform cleanup or completion steps before exiting.

Requirements

  • A Windows HPC Server 2008 R2 cluster with at least one compute node.

  • Administrative permissions on the cluster.

Steps

  1. Submit a job that runs an application that includes code to handle a CTRL_BREAK event.

  2. Cancel the job while it is running.

  3. Verify that the actions in the application’s CTRL_BREAK event handler were performed.
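For step 1, an application that handles CTRL_BREAK might look like the following Python sketch. It is an illustration under assumptions, not the guide's reference application: in Python on Windows the CTRL_BREAK event is delivered as signal.SIGBREAK, and the sketch falls back to SIGTERM on other platforms so that it can be tried anywhere. The function names and the process and save_state callbacks are hypothetical.

```python
import signal
import threading

# On Windows, CTRL_BREAK is delivered to Python as SIGBREAK.
# SIGTERM is only a stand-in so the sketch also runs on other platforms.
CANCEL_SIGNAL = getattr(signal, "SIGBREAK", signal.SIGTERM)

cancel_requested = threading.Event()


def on_cancel(signum, frame):
    """Signal handler: record the cancel request and return quickly."""
    cancel_requested.set()


def run_with_graceful_cancel(work_items, process, save_state):
    """Process items until done or canceled; save state before exiting on cancel."""
    signal.signal(CANCEL_SIGNAL, on_cancel)  # must be called from the main thread
    done = []
    for item in work_items:
        if cancel_requested.is_set():
            save_state(done)  # cleanup must finish within the grace period
            return False
        process(item)
        done.append(item)
    return True
```

The handler only sets a flag, and the work loop checks the flag between items. This keeps the cleanup work out of the signal handler, but each work item must be short enough for the check and the cleanup to complete within the grace period.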

You can use one of the following methods to change the Task Cancelation Grace Period to 10 seconds:

  • Run HPC PowerShell as an administrator, and then type:

    Set-HpcClusterProperty –TaskCancelGracePeriod 10

  • Open a Command Prompt window as an administrator, and then type:

    cluscfg setparams TaskCancelGracePeriod=10

Expected results

  • Canceled tasks that do not handle CTRL_BREAK events exit immediately.

  • Canceled tasks that include code to handle CTRL_BREAK events can exit gracefully.

Related Resources

None.
