Array jobs
In many cases, a computational analysis job contains a number of similar independent subtasks. A user may have several datasets that are analyzed in the same way, or the same simulation code may be executed with a number of different parameters. These kinds of tasks are often called embarrassingly parallel jobs or, collectively, task farming, since they can, in principle, be distributed to as many processors as there are tasks to run.
Array jobs may be a suitable approach if:
- The runtime of each independent job is long enough for the Slurm batch system overhead to be irrelevant; in practice, individual runtimes should be longer than about 30 minutes.
- The total number of independent jobs is not excessively large: a user can have at most 400 jobs either running or queuing in the batch system at the same time.
Other options
When the runtimes are very short or the number of individual jobs is very large, there are more suitable options for running high-throughput calculations. The recommended tool for these use cases is HyperQueue. Alternatives include the Linux xargs utility (see this batch script for a usage example) and the GNU Parallel shell tool.
Defining an array job
In Slurm, an array job is defined using the option --array or -a, e.g.
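```bash
#SBATCH --array=1-100
```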
will launch not just one batch job, but 100 batch jobs in which the subjob-specific environment variable $SLURM_ARRAY_TASK_ID has a value ranging from 1 to 100. This variable can then be utilized in the actual job launching commands so that each subtask gets processed. All subjobs are launched in the batch job system at once, and they are executed using as many processors as are available.
In addition to defining a job range, you can also provide a list of job index values, e.g.
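```bash
#SBATCH --array=4,7,22
```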
would launch three jobs with $SLURM_ARRAY_TASK_ID values 4, 7 and 22.
You can also include a step size in the job range definition. For example, the array job definition
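```bash
#SBATCH --array=1-100:20
```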
would run five jobs with $SLURM_ARRAY_TASK_ID values 1, 21, 41, 61 and 81.
In some cases, it may be reasonable to limit the number of simultaneously running subjobs. This is done by appending the notation %max_number_of_jobs to the array definition. For example, if you have 100 jobs but a license for only five simultaneous processes, you can ensure that you do not run out of licenses with the definition
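```bash
#SBATCH --array=1-100%5
```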
A simple array job example
As a first array job example, let's assume that we have 50 datasets (data_1.inp, data_2.inp, ..., data_50.inp) that we would like to analyze using the program my_prog, which uses the syntax
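```bash
my_prog input_file output_file
```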
Each of the subtasks requires less than two hours of computing time and less than 4 GB of memory. We can perform all 50 analysis tasks with the following batch job script:
#!/bin/bash -l
#SBATCH --job-name=array_job
#SBATCH --output=array_job_out_%A_%a.txt
#SBATCH --error=array_job_err_%A_%a.txt
#SBATCH --account=<project>
#SBATCH --partition=small
#SBATCH --time=02:00:00
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=4000
#SBATCH --array=1-50
# run the analysis command
my_prog data_${SLURM_ARRAY_TASK_ID}.inp data_${SLURM_ARRAY_TASK_ID}.out
In the batch job script, the line #SBATCH --array=1-50 defines that 50 subjobs will be submitted. The other #SBATCH lines refer to the individual subjobs. In this case, one subjob uses at most one processor (--ntasks=1), 4 GB of memory (--mem-per-cpu=4000) and can last up to two hours (--time=02:00:00). However, the total wall-clock time needed to process all 50 tasks is not limited.
In the job execution commands, the script utilizes the $SLURM_ARRAY_TASK_ID variable in the definition of the input and output files, so that the first subjob will run the command
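```bash
my_prog data_1.inp data_1.out
```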
the second one will run the command
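```bash
my_prog data_2.inp data_2.out
```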
and so forth.
The job can now be launched using the command below (assuming the batch job script above was saved to the file array_job.sh):
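```bash
sbatch array_job.sh
```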
Typically, not all jobs are executed at once. However, after a while, a large number of jobs may be running simultaneously. When the batch job is finished, the data_dir directory contains the 50 output files.
After submitting an array job, the command
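```bash
squeue -u $USER   # list your jobs in the queue
```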
reveals that you have one pending job and possibly several other jobs running in the batch job system. All these jobs have a job ID that consists of two parts: the job ID number of the array job and the subjob number. Directing the output of each subjob into a separate file is recommended, as the file system may fail if several dozen processes try to write to the same file at the same time. If the output files need to be merged into one file, this can often easily be done after the array job has finished. For example, in the case above, we could collect the results into one file with a command like the following (the name of the merged file is arbitrary):
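```bash
cat data_*.out > all_results.out
```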
In the case of the standard output and error files defined on the #SBATCH lines, you can use the definitions %A and %a to give unique names to the output files of each subjob. In the file names, %A will be replaced by the ID of the array job and %a by the $SLURM_ARRAY_TASK_ID.
Using a file name list in an array job
In the example above, we were able to use $SLURM_ARRAY_TASK_ID to refer to the order numbers in the input files. If this type of approach is not possible, a list of files or commands created before the submission of the batch job can be used instead. Let's assume that we have a task similar to the one above, but the file names do not contain numbers and are instead of the format data_aa.inp, data_ab.inp, data_ac.inp and so forth. Now, we first need to make a list of the files to be analyzed. In this case, we could collect the file names into the file namelist using the command
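```bash
ls data_*.inp > namelist
```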
The command
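```bash
sed -n "${SLURM_ARRAY_TASK_ID}p" namelist
```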
reads a given line from the name list file. In this case, the actual batch job script could be:
#!/bin/bash -l
#SBATCH --job-name=array_job
#SBATCH --output=array_job_out_%A_%a.txt
#SBATCH --error=array_job_err_%A_%a.txt
#SBATCH --account=<project>
#SBATCH --partition=small
#SBATCH --time=02:00:00
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=4000
#SBATCH --array=1-50
# set the input file to process
name=$(sed -n ${SLURM_ARRAY_TASK_ID}p namelist)
# run the analysis
my_prog ${name} ${name}.out
This example is otherwise similar to the first one, except that it reads the name of the file to analyze from the file namelist. This value is stored in the variable ${name}, which is used in the job execution command. As the row number to read is defined by $SLURM_ARRAY_TASK_ID, each data file listed in the file namelist is processed as a separate subjob. Note that as we now also use ${name} in the output definition, the output file names will be of the format data_aa.inp.out, data_ab.inp.out, data_ac.inp.out and so forth.
Using array jobs in workflows with sbatch_commandlist
sbatch-hq
A more efficient alternative to sbatch_commandlist is the CSC utility tool sbatch-hq, which is essentially a wrapper for HyperQueue. sbatch-hq allows you to submit an ensemble of similar independent non-MPI parallel tasks from a command list, i.e. a file in which each line corresponds to an individual subtask to be executed. See the HyperQueue page for more details.
In Puhti, you can use the command sbatch_commandlist to execute a list of commands as an array job. This command takes a text file as input. The command list is split into several pieces, each of which is executed as an independent subtask of an array batch job that is automatically generated by sbatch_commandlist. Thus, one subtask of the array job may process several commands from the command list.
The syntax of this command is:
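```bash
sbatch_commandlist -commands <command_list_file> [-t <time>] [-mem <memory>] [-project <project>] [--max_jobs <number>]
```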
Options -t and -mem can be used to modify the time and memory reservation of the subjobs (defaults: 12 h and 8 GB). By default, the billing project is set based on the name of the scratch directory where the command is executed, but if needed, it can be assigned with the option -project. The command list is split into at most 200 subtasks. If the individual tasks are very short, you can use the option --max_jobs to decrease the splitting so that each array job task takes at least around half an hour to process.
After submitting the array job, sbatch_commandlist starts monitoring its progress. If you use sbatch_commandlist interactively on the login nodes, you normally don't want to keep the monitor running for hours. In that case, you can simply close the monitoring process by pressing Ctrl-C. The actual array job is not deleted; it stays active in the batch job system and you can manage it with normal Slurm commands, for example:
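```bash
squeue -u $USER   # check the state of your jobs, including the array job
scancel 1234567   # cancel the whole array job (1234567 is an illustrative job ID)
```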
In addition to interactive usage, sbatch_commandlist can be utilized in batch jobs and automatic workflows where only certain steps of the workflow can utilize array-job-based parallel computing. As an example, let's assume we have a gzip-compressed tar archive my_data.tgz containing a directory with a large number of files. To create a new compressed archive that also includes an md5 checksum file for each file, we would need to:
- uncompress and unpack my_data.tgz,
- execute md5sum for each file, and finally
- pack and compress the my_data directory again.
The second step of the workflow could be executed using a for loop, but we can also use the loop just to generate a list of md5sum commands that is then processed with sbatch_commandlist.
#!/bin/bash -l
#SBATCH --job-name=workflow
#SBATCH --output=workflow_out_%j.txt
#SBATCH --error=workflow_err_%j.txt
#SBATCH --account=<project>
#SBATCH --time=12:00:00
#SBATCH --mem=4000
#SBATCH --ntasks=1
#SBATCH --partition=small
# uncompress and unpack the tgz file
tar zxf my_data.tgz
cd my_data
# generate a list of md5sum commands
for my_file in *
do
echo "md5sum $my_file > $my_file.md5" >> md5commands.txt
done
# execute the commands in md5commands.txt as an array job
sbatch_commandlist -commands md5commands.txt
# remove the command file and compress the directory
rm -f md5commands.txt
cd ..
tar zcf my_data_with_md5.tgz my_data
rm -rf my_data
Note that the batch job script above is not an array job, but it launches another batch job that is an array job.