Store and Access Data Files

Updated: June 6, 2006

Applies To: Windows Compute Cluster Server 2003

Compute Cluster Server 2003 is flexible about how task input, output, and error files are organized. A task can read and write files in any location directly, using a local or UNC path with the task properties Standard Input, Standard Output, and Standard Error. (The scheduler also supports file paths specified as command parameters.) Tasks can also operate on files stored in the default working directory, where they can be staged in and out relative to a central location. In either case, a central file store on a shared folder, either on the head node or on an external node, is recommended.

Note
Creating a file store for input, output, and error files is usually a coordinated effort between the cluster administrator and the user. It requires the administrator's permissions and oversight of shared resources, and the user's specific knowledge of the projects, jobs, and files involved.

Default behavior

By default, the Job Scheduler looks for standard input files and writes standard output and error files in the working directory of the compute node or nodes on which the executable file is run. (In the case of an MPI task, this is the designated node.) This is the path specified by the task property Working Directory (/workdir: in the CLI). Working Directory, in turn, takes as its default value the user's home directory on the node (%userprofile%, which typically points to C:\Documents and Settings\<user_alias>). The files themselves can then be specified by file name alone, using the properties Standard Input, Standard Output, and Standard Error (in the CLI, /stdin:, /stdout:, and /stderr:).

Because different tasks use different compute nodes in an unpredictable pattern, the default working directory does not provide a central or predictable location for data files. Instead, it should be used as a convenient staging area for data files transferred to and from a more central file store by a job script. For more information about staging using scripts, see Use Batch and Script Files with Compute Cluster Jobs. For large clusters with many users, it can be useful to change the default working directory so that the application reads and writes directly to a central file store using UNC paths.

Central file store

The preferred location for a central file store is an external file server. The head node and the client node may also be used. In using an external file store, Compute Cluster Server 2003 is like any other multi-user system that operates primarily in batch mode. Similarly, the steps by which files on remote servers are shared and accessed are no different than in any Windows environment. The following diagram shows an example file store.

Note
In setting up file stores, it is a good practice to avoid unnecessarily long file names. For Standard Input, Standard Output, and Standard Error files, the maximum path length is 160 characters.
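The path-length limit in the note above can be checked before a job is submitted. The following is a minimal Python sketch, not part of Compute Cluster Server 2003. The helper name `check_std_path` is hypothetical, and the sketch assumes the 160-character limit applies to the path string exactly as it is given in the Standard Input, Standard Output, or Standard Error property.

```python
# Limit stated in the note above for Standard Input, Standard Output,
# and Standard Error file paths.
MAX_STD_PATH_LENGTH = 160


def check_std_path(path: str) -> bool:
    """Return True if the path is within the documented 160-character limit.

    Assumption: the limit is measured on the path string as entered in the
    task property; how the scheduler counts a relative path combined with
    the working directory is not specified in this article.
    """
    return len(path) <= MAX_STD_PATH_LENGTH


if __name__ == "__main__":
    candidates = [
        r"\\CCSFileserver\Project1_run1_data\input\infile1.txt",
        "\\\\CCSFileserver\\" + "x" * 200 + ".txt",  # deliberately too long
    ]
    for candidate in candidates:
        status = "ok" if check_std_path(candidate) else "too long"
        print(f"{len(candidate):4d} chars: {status}")
```

Running a check like this over a planned file tree is one way to catch overly deep folder structures before any tasks fail.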

A file tree for data files

As the diagram shows, a different organization is called for depending on whether a job is a parametric sweep or an MPI job. In a parametric sweep, the input and output are usually a set of indexed files (for example, input1, input2, input3…, output1, output2, output3…) set up to reside in a single common folder or separate common folders. An MPI job typically has a single input file and produces a single output file, which can easily reside in the same folder.

The previous figure shows only one of many possible data stores. What is important to know is that almost any structure is possible, including data stores that already exist on your network.

Using Working Directory to access the file store

Using the Working Directory property can simplify task access to data files on a shared folder. If input and output files are stored in the same folder, that folder itself can be made a shared folder. If input and output files reside in separate folders, the parent folder can be made the shared folder, and the standard input and standard output files can be specified relative to the parent folder, as shown in the following example. In this example, the application, Myapp.exe, must also reside in the remote working directory.

job new

job add 34 /workdir:\\CCSFileserver\Project1_run1_data\ /stdin:input\infile1.txt /stdout:output\outfile1.txt myapp.exe

job add 34 /workdir:\\CCSFileserver\Project1_run1_data\ /stdin:input\infile2.txt /stdout:output\outfile2.txt myapp.exe

...

job add 34 /workdir:\\CCSFileserver\Project1_run1_data\ /stdin:input\infile50.txt /stdout:output\outfile50.txt myapp.exe
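Because the job add lines in this example differ only in the file index, they lend themselves to generation by a script. The following is a minimal Python sketch of that idea, not a tool shipped with Compute Cluster Server 2003. The helper name `sweep_commands` is hypothetical, the job ID 34 is the illustrative value from the example (in practice it is whatever job new returns), and the sketch only builds and prints command strings; it does not call the scheduler.

```python
def sweep_commands(job_id, workdir, app, count):
    """Build one 'job add' command string per indexed input/output file pair.

    Only flags that appear in the example above are used
    (/workdir:, /stdin:, /stdout:).
    """
    commands = []
    for i in range(1, count + 1):
        commands.append(
            f"job add {job_id} /workdir:{workdir} "
            f"/stdin:input\\infile{i}.txt "
            f"/stdout:output\\outfile{i}.txt {app}"
        )
    return commands


# Values taken from the example above; the job ID is illustrative.
commands = sweep_commands(
    job_id=34,
    workdir="\\\\CCSFileserver\\Project1_run1_data\\",
    app="myapp.exe",
    count=50,
)

if __name__ == "__main__":
    for command in commands:
        print(command)
```

The printed lines could be saved to a batch file and run after job new, once the real job ID is substituted.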

Note
For large parametric sweeps, it is generally more efficient to stage data files into and out of a local working directory using a script than to set the working directory to the file server. See Use Batch and Script Files with Compute Cluster Jobs.

Specifying data file paths by using command line arguments

Some applications use command-line arguments to specify input and output files instead of standard I/O:

myapp.exe -input infile.txt -output outfile.txt

Passing file names as command-line arguments is equivalent to specifying the data files with the Standard Input and Standard Output properties: the file paths are interpreted as relative to the working directory. This is because Working Directory becomes the effective current directory for the task command line. In the following example, the cd command on the submitting computer has no effect on the task; Myapp.exe is executed from the working directory exactly as if the cd had never been performed.

cd c:\CCS_data_files\user1\project1\run1

job submit myapp.exe -input input\infile.txt -output output\outfile.txt
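To illustrate what "relative to the working directory" means for a file argument, the following minimal Python sketch joins a working directory and a file argument using the standard `ntpath` module, which applies Windows path rules on any platform. It is an illustration only, not the scheduler's actual resolution logic. The helper name `resolve_against_workdir` is hypothetical, and the share name is borrowed from the procedure later in this topic.

```python
import ntpath  # Windows path semantics, usable on any platform


def resolve_against_workdir(workdir, file_arg):
    """Return the path a task would see for file_arg when run from workdir.

    A relative file_arg is appended to workdir; an absolute path (with its
    own drive or UNC share) is returned unchanged, as ntpath.join does.
    """
    return ntpath.normpath(ntpath.join(workdir, file_arg))


if __name__ == "__main__":
    workdir = r"\\CCSFileserver\Project1\datafiles"
    # Relative argument: interpreted under the working directory.
    print(resolve_against_workdir(workdir, r"input\infile.txt"))
    # Absolute argument: the working directory is ignored.
    print(resolve_against_workdir(workdir, r"C:\data\infile.txt"))
```

Note that the current directory of the submitting command prompt never enters the calculation, which matches the behavior described above.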

Using environment variables in file names

Compute Cluster Server 2003 supports the use of environment variables for output file names that cannot be known in advance. The most common instance is a file name that incorporates the job ID. For example:

job submit /numprocessors:4, 8 /stdin:infile.txt /stdout:^%CCP_JOBID^%.txt mpiexec myapp.exe

For more information, see Use Environment Variables.
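The effect of an environment variable in a file name can be pictured with a small substitution routine. The following minimal Python sketch is an illustration only and does not reproduce how the scheduler or the command shell performs expansion. The helper name `expand_percent_vars` is hypothetical, and the job ID value 57 is made up for the example.

```python
import re


def expand_percent_vars(template, env):
    """Replace each %NAME% in template with env[NAME].

    Names that are not present in env are left untouched.
    """
    def substitute(match):
        name = match.group(1)
        return env.get(name, match.group(0))

    return re.sub(r"%([A-Za-z_][A-Za-z0-9_]*)%", substitute, template)


if __name__ == "__main__":
    # Hypothetical environment for a task that belongs to job 57.
    task_env = {"CCP_JOBID": "57"}
    print(expand_percent_vars("%CCP_JOBID%.txt", task_env))
```

With this hypothetical environment, the standard output file name becomes 57.txt, so each job writes to its own file.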

Setting up data files

The following procedure shows how to store and access data files.

To store and access data files

  1. Create the data file tree.

  2. Select the folder you want to act as a working directory.

    Note
    This can be anywhere in the tree as long as the files reside under it. For an MS MPI job, it can be the folder that contains the files. For a parametric sweep, it can be the parent folder of input and output file subfolders.

  3. Set up the selected folder as a shared folder. If you are using a Windows operating system, right-click the folder, click Sharing and Security, click the Sharing tab, and then click Share this folder.

    Note
    The shared folder you create is accessed as \\<node_name>\<share_name>, where the share name can be any name you specify and does not have to match the actual name of the folder.

  4. Specify each file by its UNC pathname when defining the task in Job Manager or the command-line interface. For example:

    job submit /workdir:\\CCSFileserver\Project1\datafiles /stdin:infile /stdout:outfile1 myMPIapp.exe

    -OR-

    job new

    job add 39 /workdir:\\CCSFileserver\Project2\datafiles /stdin:input\infile1 /stdout:output\outfile1 myapp.exe

    job add 39 /workdir:\\CCSFileserver\Project2\datafiles /stdin:input\infile2 /stdout:output\outfile2 myapp.exe

    ...

    job add 39 /workdir:\\CCSFileserver\Project2\datafiles /stdin:input\infile50 /stdout:output\outfile50 myapp.exe
