-
Create Roihu batch jobs
Creating a batch job script for Roihu
A batch job script contains the definitions of the resources to be reserved for a job and the commands the user wants to run.
The structure of a batch job script
An example of a batch job script using a share of resources on a single node:
#!/bin/bash
#SBATCH --job-name=my-test # Job name
#SBATCH --account=<project> # Billing project, has to be defined!
#SBATCH --partition=small # Job partition (queue)
#SBATCH --time=00:30:00 # Max. duration of the job
#SBATCH --nodes=1 # Number of nodes used for the job
#SBATCH --ntasks=1 # Number of tasks allocated
#SBATCH --cpus-per-task=1 # Number of CPU cores allocated per task
#SBATCH --mem-per-cpu=1000M # Memory to reserve per CPU core
#SBATCH --output=slurm-%j.out # Standard output of the job script
##SBATCH --mail-type=BEGIN # Uncomment to enable mail
module load myprog/1.2.3 # Load required modules
srun myprog -i input -o output # Run program using requested resources
The first line #!/bin/bash tells that the file should be interpreted as a
Bash script.
The lines starting with #SBATCH are arguments (directives) for the batch job
system. These examples only use a small subset of the options. For a list of
all possible options, see the
Slurm documentation.
The general syntax of an #SBATCH option is:
The first line in our example, sets the name of the job:
The name of the job will be my-test. It can be used to identify a job in the queue and other listings.
The billing project for the job is set with option --account:
Please replace <project> with the Unix
group of your project. You can find it in MyCSC under the
Projects tab. More information about billing.
Remember to specify the billing project
The billing project argument is mandatory. Failing to set it will cause an error:
The partition (queue) needs to be set according to the job requirements. For example:
Available partitions
The runtime reservation is set with option --time:
Time is provided using the format hours:minutes:seconds (optionally days-hours:minutes:seconds)
The maximum runtime depends on the selected queue. When the
time reservation ends, the job is terminated regardless of whether it has
finished or not, so the time reservations should be sufficiently long. Note
that a job consumes Billing Units according to its actual runtime.
The number of nodes used for the job can be set with option --nodes:
This does not mean that the all the resources of the node(s) would be reserved, but that
all the tasks and CPU cores will be allocated from a single node in this case. In general,
here one could give a range --nodes=<minnodes>-<maxnodes> to define the spread of nodes
across which the resources would be allocated.
The number of tasks allocated for the job can be set with option --ntasks:
The allocated tasks can be used in different ways in the job script, most typically as MPI processes.
The number of CPU cores per task allocated for the job can be set with option --cpus-per-task:
The product of --ntasks and --cpus-per-task defines the total amount of CPU cores
allocated for the job.
The amount of memory reserved for each CPU core is set with option --mem-per-cpu:
If the program exceeds the reserved amount of memory, the job is terminated.
The standard output file of the job script is set with option --output:
The standard output means all the prints that would be visible in the shell if
the commands listed in the script were executed in an interactive shell.
Here %j is a replacement symbol for jobid, so the output will go to the file slurm-<slurm-jobid>.out.
By default, this file collects also the standard error, but it is possible
to specify a different file for standard error with --error=<filename_pattern>.
The user can be notified by email when the job starts by using the
--mail-type option
Other useful arguments (multiple arguments are separated by a comma) are END
and FAIL. By default, the email will be sent to the email address linked to
your CSC account. This can be overridden with the --mail-user= option.
Notifications by email not enabled yet
The notifications by email are not enabled yet and
the --mail-type option does not do anything at the moment.
After defining all required resources in the batch job script, set up the required environment by loading suitable modules. Note that for modules to be available for batch jobs, they need to be loaded in the batch job script. More information about environment modules.
Finally, we launch our application using the requested resources with the
srun command:
Serial and shared memory batch jobs
Serial and shared memory jobs need to be run within one compute node. Thus, the jobs are limited by the hardware specifications available in the nodes. See the available node types and the number of cores available per node on this page.
The #SBATCH option --cpus-per-task is used to define the number of
computing cores that the batch job task uses. The option --nodes=1 ensures
that all the reserved cores are located in the same node, and --ntasks=1
assigns all reserved computing cores for the same task.
In thread-based jobs, the --mem option is recommended for memory reservation.
This option defines the amount of memory required per node. Note that if you
use --mem-per-cpu option instead, the total memory request of the job will be
the memory requested per CPU core (--mem-per-cpu) multiplied by the number of
reserved cores (--cpus-per-task). Thus, if you modify the number of cores,
also check that the memory reservation is appropriate.
Typically, the most efficient practice is to match the number of reserved cores
(--cpus-per-task) to the number of threads or processes the application uses.
However, always check the application-specific details.
If the application has a command-line option to set the number of threads/processes/cores, it should always be used to ensure that the software behaves as expected. Some applications use only one core by default, even if more are reserved.
Other applications may try to use all cores in the node, even if only some
are reserved. The environment variable $SLURM_CPUS_PER_TASK, which stores the
value of --cpus-per-task, can be used instead of a number when specifying the
amount of cores to use. This is useful as the command does not need to be
modified if the --cpus-per-task is changed later.
Finally, use the environment variable OMP_NUM_THREADS to set the number of
threads the application uses. For example,
(note the :-1 syntax, which sets the number of threads to 1
if --cpus-per-task was not set).
MPI-based batch jobs
In MPI jobs, each task has its own memory allocation. Thus, the tasks can be distributed over multiple nodes.
When running jobs on a partial node (small partition), set the number of MPI tasks with:
When running on full nodes (medium and large partitions),
it is recommended to not use --ntasks option,
but instead set --nodes, --ntasks-per-node, and --cpus-per-task instead:
#SBATCH --partition=medium
#SBATCH --nodes=<number_of_full_nodes>
#SBATCH --ntasks-per-node=384 --cpus-per-task=1 # The product should be 384
This ensures predictable distribution and CPU binding of processes within the node, see Performance checklist.
Set both --ntasks-per-node and --cpus-per-task for full nodes
It is advisable to set both --ntasks-per-node and --cpus-per-task
to keep their product as 384 for best performance.
See notes on undersubscribing full nodes.
Running MPI programs
- MPI programs should not be started with
mpirunormpiexec. Usesruninstead. - An MPI module has to be loaded in the batch job script for the program to work properly.
Hybrid batch jobs (e.g. MPI+OpenMP)
In hybrid jobs, each task is allocated several cores. Each task then uses some
parallelization, other than MPI, to do the work. The most common strategy is
for every MPI task to launch multiple threads using OpenMP. To request more
cores per MPI task, use the argument --cpus-per-task. The default value is
one core per task.
When running on full nodes, it is recommended to write
--ntasks-per-node and --cpus-per-task options on
the same #SBATCH line for clarity:
#SBATCH --partition=medium
#SBATCH --nodes=<number_of_full_nodes>
#SBATCH --ntasks-per-node=192 --cpus-per-task=2 # The product should be 384
#SBATCH --ntasks-per-node=96 --cpus-per-task=4 # The product should be 384
The reason is that these options go hand-in-hand in a sense that their product should always be 384 in order to use all CPU cores available on the node. You can comment out one of the lines to test the optimal run configuration for your application, see Performance checklist.
The optimal ratio between the number of tasks and cores per tasks varies for each program and job input. Testing is required to find the right combination for your application. You can find some examples for CP2K and NAMD.
Threads per task in hybrid MPI+OpenMP jobs
Set the number of OpenMP threads per MPI task in your batch script using
the OMP_NUM_THREADS and SLURM_CPUS_PER_TASK environment variables:
High-memory jobs
Roihu-CPU has high-memory CPU nodes with 6 TiB memory.
These nodes are available in the hugemem and hugemem_longrun partitions.
See the technical details of the nodes on this page.
The use of these nodes is similar to other Roihu-CPU nodes, but note that
these nodes have a different processor.
In particular, these nodes have 128 CPU cores per node in total.
This means that if running on a full node in hugemem partitions,
the product of --ntasks-per-node and --cpus-per-task should be 128
for best performance:
#SBATCH --partition=hugemem
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128 --cpus-per-task=1 # The product should be 128
#SBATCH --ntasks-per-node=64 --cpus-per-task=2 # The product should be 128
#SBATCH --ntasks-per-node=32 --cpus-per-task=4 # The product should be 128
GPU jobs
Each Roihu GPU node has four Nvidia GH200 superchips. The GPUs are available in the gpu* partitions.
See the technical details of the nodes on this page.
The resource allocation is based on full GH200 GPUs in gputest, gpumedium, and gpularge partitions
and the GPUs can be requested with:
Note that the --gres reservation is on a per-node basis. There are 4 GPUs per GPU node.
About gpuinteractive partition
The MIGs are not configured yet.
In gpuinteractive partition, the GH200 GPUs are sliced into smaller Multi-Instance GPUs (MIG).
Each MIG here has one XXXth of the compute and memory capacity of a full GH200 GPU.
For each GPU slice you can reserve at most XXX CPU cores and for each GPU slice the job is allocated XXX GiB of CPU memory.
Also note that you can reserve at most one GPU slice per job. The GPU slices are available using the options:
GPU visualization jobs
Roihu has visualization nodes with Nvidia L40 GPUs. These nodes are available in the vizinteractive partition.
See the technical details of the nodes on this page.
These nodes can be requested with:
Note that the --gres reservation is on a per-node basis. There are 2 GPUs per GPU node.
Additional resources in batch jobs
Local temporary storage
All nodes on Roihu have local storage space (NVMe) available for jobs. Using local storage is recommended for I/O-intensive applications, i.e. jobs that, for example, read and write a lot of small files. See more details.
Local temporary storage is available for every job without extra billing. Quota is set per user, so available space on a node is independent of job count or reserved resources:
- Roihu-CPU shared nodes (
small,interactive, andtestpartitions) have 20 GiB quota - Roihu-CPU full nodes (
mediumandlargepartitions) have 600 GiB quota - Roihu-GPU nodes have 150 GiB quota
Use the environment variable $TMPDIR in your batch job scripts to
access the local temporary storage space on each node. For example, to extract a large
dataset package to the local storage:
Remember to recover your data
The local storage space reserved for your job is emptied after the job has finished. Thus, if you write data to the local disk during your job, please remember to move anything you want to preserve to the shared disk area at the end of your job. Particularly, the commands to move the data must be given in the batch job script as you cannot access the local storage space anymore after the batch job has completed. For example, to copy some output data back to the directory from where the batch job was submitted:
Fast local scratch storage
As a new feature on Roihu, it is possible to request local disk mounts from a centralized pool of fast storage resources. This fast storage capacity is provided over the network and appears as local scratch from within a Slurm job.
You must request these resources in conjunction with --exclusive
At the present you can only request this storage for jobs that are making use of full nodes,
i.e. that are submitted with the --exclusive flag. Presently if you do not specify this flag
your job will fail, but will be marked "CANCELLED by 350" and you will lack any stdout or stderr
logs. This should be resolved once support for shared node jobs arrives in Q3 2026.
Request this local storage using the following flag in the batch script:
#SBATCH --exclusive
#SBATCH --bb="#BB_LUA SBF storagesize=<local_storage_space> path=/run/sbb/<username>"
For example, requesting 100 GiB storage
(remember to update <username> to your username in the sbatch header):
Then, this storage is available in path /run/sbb/$USER during the job script.
Undersubscribing full nodes on Roihu-CPU
If an application requires more memory per core than there is available with full node (2 GB / core) it is possible to use also a subset of cores within a node. Also, if the application is memory bound, memory bandwidth and the application performance can be improved by using only a single core per NUMA domain or L3 cache (look at Roihu technical description for details. Note that billing is, however, always based on full nodes.
When undersubscribing nodes, one should always set
--ntasks-per-node=X and --cpus-per-task=Y so that X * Y = 384,
even with pure MPI jobs. By default, Slurm scatters MPI tasks
--cpus-per-task apart, i.e. with --cpus-per-task=16 the MPI task
0 is bound to CPU core 0, the MPI task 1 is bound to CPU
core 15 etc.. Memory bandwidth (and application performance) is
the best when the tasks are executing on maximally scattered cores. As
an example, in order to use 32 GB / core, one can run only with 24
tasks per node as
...
#SBATCH --ntasks-per-node=24 --cpus-per-task=16 # The product should be 384
module load myprog/1.2.3
export OMP_NUM_THREADS=1
srun myprog -i input -o output
For hybrid applications, one should use
OMP_PROC_BIND OpenMP runtime environment variable for
placing the OpenMP threads. As an example, in order to run
one MPI task per NUMA domain and one OpenMP thread per L3 cache one
can set
...
#SBATCH --ntasks-per-node=8 --cpus-per-task=48 # The product should be 384
export OMP_NUM_THREADS=3
export OMP_PROC_BIND=spread
module load myprog/1.2.3
srun myprog -i input -o output
Using the small partition for non-parallel pre- or post-processing
In many cases, large computing tasks include pre- or post-processing steps that are not able to utilize parallel computing. In these cases it is recommended that, if possible, the task is split into several, chained, batch jobs and that the non-parallel processing is executed in the small partition of Roihu. In the small partition the jobs can reserve just few cores so that the non-parallel tasks can be executed without wasting resources.
For example, say that we would like to post-process an output file of a previous job. The post processing command:
python post-proc.py output uses only serial computing and requires about 40 minutes and 3 GB of memory. Instead of including the post-processing
to the main job it is reasonable to execute it as separate job in the small partition as in the example below.
Further, by defining --dependency=afterok:<slurm-jobid>, the job is allowed to start only when the previously sent job has successfully finished.
Here the <slurm-jobid> is replaced with ID number of the batch job that produces the output file (you'll get the ID number when you submit the job).
#!/bin/bash
#SBATCH --job-name=post-process-my-test
#SBATCH --account=<project>
#SBATCH --time=00:50:00
#SBATCH --partition=small
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --dependency=afterok:<slurm-jobid>
python post-proc.py output
Executing large amounts of small non-MPI jobs
In Roihu, HyperQueue meta-scheduler can be used to process large amounts of small non-MPI jobs.