Creating a batch job script for Puhti
A batch job script contains the definitions for the resources to be reserved for the job and the commands the user wants to run.
A basic batch job script
An example of a simple batch job script:
#!/bin/bash
#SBATCH --job-name=myTest
#SBATCH --account=<project>
#SBATCH --time=02:00:00
#SBATCH --mem-per-cpu=2G
#SBATCH --partition=small
##SBATCH --mail-type=BEGIN #uncomment to enable mail

module load myprog/1.2.3

srun myprog -i input -o output
The first line, #!/bin/bash, specifies that the file should be interpreted as a bash script.
The lines starting with #SBATCH are arguments for the batch system. These examples use only a small subset of the options. For a list of all possible options, see the Slurm documentation.
The general syntax of a #SBATCH option is:
#SBATCH option_name argument
In our example,
#SBATCH --job-name=myTest
sets the name of the job. It can be used to identify a job in the queue and other listings.
#SBATCH --account=<project>
sets the billing project for the job. This argument is mandatory. Failing to set it will cause the job to be held with the reason AssocMaxJobsLimit. Replace <project> with the Unix group of the project. You can find it in My CSC under the My projects tab. For more information, see the billing documentation.
The time reservation is set with the option --time, which the example sets as
#SBATCH --time=02:00:00
Time is provided using the format hh:mm:ss (optionally d-hh:mm:ss, where d is days). The maximum time depends on the selected queue. When the time reservation ends, the job is terminated regardless of whether it is finished or not, so the time reservations should be sufficiently long. A job consumes billing units according to its actual runtime.
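To make the time format concrete, the following minimal sketch converts a reservation string into seconds. The function name slurm_time_to_seconds is hypothetical and not part of Slurm, and the sketch handles only the two formats mentioned above.

```shell
#!/bin/bash
# Convert a time reservation given as hh:mm:ss or d-hh:mm:ss into seconds.
# slurm_time_to_seconds is a hypothetical helper, not a Slurm command.
slurm_time_to_seconds() {
    local spec="$1" days=0 h m s
    # Split off the optional "d-" day prefix.
    if [[ "$spec" == *-* ]]; then
        days="${spec%%-*}"
        spec="${spec#*-}"
    fi
    IFS=: read -r h m s <<< "$spec"
    # The 10# prefix forces base 10, so "08" is not read as an octal number.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

slurm_time_to_seconds 02:00:00     # prints 7200, the two hours of the example script
slurm_time_to_seconds 1-12:00:00   # prints 129600, one and a half days
```

A helper like this can be handy when comparing a planned reservation against the measured runtime of an earlier job.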
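#SBATCH --mem-per-cpu=2G
sets the required memory per requested CPU core. If the requested memory is exceeded, the job is terminated.
#SBATCH --partition=small
sets the partition (queue) in which the job runs. The partition needs to be set according to the job requirements.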
The user can be notified by email when the job starts by using the --mail-type option:
##SBATCH --mail-type=BEGIN #uncomment to enable mail
Other useful arguments (multiple arguments are separated by a comma) are END and FAIL.
By default, the email will be sent to the email address of your CSC account. This can be overridden with the --mail-user option.
After defining all required resources in the batch job script, set up the environment. Note that for modules to be available for batch jobs, they need to be loaded in the batch job script.
module load myprog/1.2.3
Finally, we launch our program using the srun command:
srun myprog -i input -o output
Serial and shared memory batch jobs
Serial and shared memory jobs need to be run within one computing node. Thus, the jobs are limited by the hardware specifications available in the nodes. In Puhti, each node has two processors with 20 cores each, i.e. 40 cores in total.
The sbatch option --cpus-per-task is used to define the number of computing cores that the batch job task uses. The option --nodes=1 ensures that all the reserved cores are located in the same node, and --ntasks=1 assigns all reserved computing cores to the same task.
In thread-based jobs, the --mem option is recommended for memory reservation. This option defines the amount of memory required per node. Note that if you use the --mem-per-cpu option instead, the total memory request of the job will be the memory request multiplied by the number of reserved cores (--cpus-per-task). Thus, if you modify the number of cores, also check the memory reservation.
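The multiplication can be checked with a few lines of shell. This is only a sketch: total_mem_gb is a hypothetical helper, it assumes a single task, and it expects whole gigabytes.

```shell
#!/bin/bash
# Total memory requested by a single-task job that uses --mem-per-cpu.
# total_mem_gb is a hypothetical helper; both arguments are whole numbers.
total_mem_gb() {
    local mem_per_cpu_gb="$1" cpus_per_task="$2"
    echo $(( mem_per_cpu_gb * cpus_per_task ))
}

total_mem_gb 2 4    # prints 8:  --mem-per-cpu=2G with --cpus-per-task=4
total_mem_gb 2 40   # prints 80: the same per-core request on a full 40-core node
```

With --mem, the request stays the same when the core count changes, which is why it is the simpler choice for thread-based jobs.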
In most cases, it is most efficient to match the number of reserved cores to the number of threads or processes the application uses. Check the documentation for application-specific details.
If the application has a command line option to set the number of threads/processes/cores, it should always be used to make sure the software behaves as expected. Some applications use only one core by default, even if more are reserved.
Some other applications may try to use all cores in the node even if only some are reserved. The environment variable $SLURM_CPUS_PER_TASK can be used instead of a number. This way, the command does not need to be edited if the value of --cpus-per-task is changed. Use the environment variable OMP_NUM_THREADS to set the number of threads the program uses.
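Putting the options of this section together, a shared memory job could look roughly like the sketch below. The file name omp_job.sh, the job name, the core count and the memory request are placeholders chosen for illustration, and myprog is assumed to be an OpenMP program that reads OMP_NUM_THREADS. The script is written to a file with a here-document so that its syntax can be checked before it is submitted with sbatch.

```shell
#!/bin/bash
# Write an example shared memory batch script to omp_job.sh (hypothetical name).
cat > omp_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=ompTest
#SBATCH --account=<project>
#SBATCH --partition=small
#SBATCH --time=02:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

module load myprog/1.2.3

# Use as many threads as there are reserved cores.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun myprog -i input -o output
EOF

# Check the syntax without running anything; submit later with: sbatch omp_job.sh
bash -n omp_job.sh
```

Because the thread count is taken from $SLURM_CPUS_PER_TASK, only the --cpus-per-task line needs to change when the job is rescaled.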
MPI-based batch jobs
In MPI jobs, each task has its own memory allocation. Thus, the tasks can be distributed between nodes.
Set the number of MPI tasks with the --ntasks option.
If more fine-tuned control is required, the exact number of nodes and number of tasks per node can be specified with --nodes and --ntasks-per-node, respectively.
It is recommended to request memory using the --mem-per-cpu option.
- MPI should not be started with mpirun or mpiexec; use srun instead.
- An MPI module has to be loaded in the batch job script for the submission to work properly.
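When the node count is specified explicitly, it helps to know how many nodes a given number of tasks occupies. The sketch below assumes one core per MPI task and the 40 cores per node stated earlier; nodes_needed is a hypothetical helper, not a Slurm command.

```shell
#!/bin/bash
# Number of nodes needed for a given number of single-core MPI tasks.
# nodes_needed is a hypothetical helper; 40 is the core count of a Puhti node.
nodes_needed() {
    local ntasks="$1" cores_per_node="${2:-40}"
    # Integer ceiling division.
    echo $(( (ntasks + cores_per_node - 1) / cores_per_node ))
}

nodes_needed 80    # prints 2, e.g. --nodes=2 --ntasks-per-node=40
nodes_needed 100   # prints 3, the last node is only partly filled
```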
Hybrid batch jobs
In hybrid jobs, each task is allocated several cores. Each task then uses some other parallelization than MPI to do work.
The most common strategy is for every MPI task to launch multiple threads using OpenMP.
To request more cores per MPI task, use the argument --cpus-per-task. The default value is one core per task.
The optimal ratio between the number of tasks and cores per task varies between programs, so testing is required to find the right combination for your application.
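As a starting point for such testing, the sketch below lists every combination of tasks per node and cores per task that fills a 40-core node exactly. The helper name list_layouts is hypothetical, and filling the node exactly is an assumption made here for illustration, not a requirement.

```shell
#!/bin/bash
# List every way to split one node into MPI tasks and cores per task.
# list_layouts is a hypothetical helper; the default of 40 cores matches a Puhti node.
list_layouts() {
    local cores_per_node="${1:-40}" tasks
    for (( tasks = 1; tasks <= cores_per_node; tasks++ )); do
        # Keep only the task counts that divide the node evenly.
        if (( cores_per_node % tasks == 0 )); then
            echo "--ntasks-per-node=${tasks} --cpus-per-task=$(( cores_per_node / tasks ))"
        fi
    done
}

list_layouts
```

Each printed line is a candidate pair of options to benchmark with a short test run.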
By default, running a single task per node with multiple threads using hpcx-mpi will bind all threads to a single core, and no speedup will be gained. This can be fixed by setting export OMP_PROC_BIND=true in your job script, which binds the threads to different cores. Another possibility is to turn off Slurm's core binding with the srun flag --cpu-bind=none.
Additional resources in batch jobs
Some nodes in Puhti have fast local storage available for jobs. The local storage is good for I/O-intensive programs.
The local storage is available on:
- GPU nodes in the gpu and gputest partitions
- I/O nodes shared by the
Request local storage using the --gres flag in the job submission:
--gres=nvme:<local_storage_space_per_node>
The amount of space is given in GB (with a maximum of 3600 GB per node). For example, to request 100 GB of storage, use the option --gres=nvme:100. The local storage reservation is on a per node basis.
Use the environment variable $LOCAL_SCRATCH in your batch job scripts to access the local storage on each node.
The local storage is emptied after the job has finished, so please move any data you want to keep to the shared disk area.
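One way to move the data is a small function called at the end of the batch job script. This is a minimal sketch: save_results is a hypothetical helper, the target path in the usage comment is a placeholder, and the sketch copies everything found in $LOCAL_SCRATCH.

```shell
#!/bin/bash
# Copy everything from the node-local storage to a directory on the shared disk.
# save_results is a hypothetical helper; LOCAL_SCRATCH is set by the batch system.
save_results() {
    local dest="$1"
    mkdir -p "$dest"
    # The trailing "/." copies the contents of the directory, hidden files included.
    cp -r "${LOCAL_SCRATCH:?LOCAL_SCRATCH is not set}"/. "$dest"/
}

# Typical use at the end of a batch job script (the target path is a placeholder):
# save_results "/scratch/<project>/results_${SLURM_JOB_ID}"
```

Since the reservation is per node, a multi-node job would need to run the copy on every node, for example through srun.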
Puhti has 320 NVIDIA Tesla V100 GPUs. The GPUs are available on the gpu and gputest partitions using the option:
--gres=gpu:v100:<number_of_gpus_per_node>
The --gres reservation is on a per node basis. There are 4 GPUs per GPU node.
Multiple resources can be requested with a comma-separated list.
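Request both GPU and local storage:
--gres=gpu:v100:<number_of_gpus_per_node>,nvme:<local_storage_space_per_node>
For example, to request 1 GPU and 10 GB of NVMe storage, the option would be --gres=gpu:v100:1,nvme:10.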
Last edited Thu Nov 26 2020