Creating a batch job script for Mahti

Please have a look at the Puhti documentation for the general introduction to batch scripts in the CSC supercomputing environment. On this page we focus on Mahti-specific topics.

Note

Full nodes are allocated for jobs, with the exception of jobs in small, interactive and GPU partitions, see also below. Many options also work differently on Mahti compared to Puhti, so it is not advisable to copy scripts from Puhti to Mahti without appropriate modifications.

Creating a batch job script for Mahti

Basic MPI batch jobs

An example of a simple MPI batch job script:

#!/bin/bash
#SBATCH --job-name=myTest
#SBATCH --account=<project>
#SBATCH --time=02:00:00
#SBATCH --partition=medium
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128

module load myprog/1.2.3

srun myprog -i input -o output

Specify the exact number of nodes and number of tasks per node with --nodes and --ntasks-per-node, respectively. Use all 128 cores in the node.

Note

MPI processes should not be started with mpirun or mpiexec. Use srun instead.
appropriate software module has to be loaded in the batch job script for the submission to work properly.

Hybrid batch jobs

As explained for Puhti, hybrid parallelization can run multiple OpenMP threads per MPI task. In addition to the --ntasks-per-node=X one needs to set --cpus-per-task=Y. The default is one cpu (thread) per task. To use all physical cores in a Mahti node choose X * Y = 128, like in this example. If you are using simultaneous multithreading (see section below) your should use X * Y = 256

The optimal ratio between the number of tasks and cores per tasks varies for each program and job input. Testing is required to find the right combination for your application. You can find some examples for CP2K and NAMD.

Hybrid batch jobs with simultaneous multithreading (SMT)

Mahti is configured so that it doesn't place any theads to the logical cores by default. SMT support can be enabled with --hint=multithread option. When this option is used, it is important to use the --ntasks-per-node=X and --cpus-per-task=Y so that X * Y = 256. Failing to do so will leave some of the actual physical cores unallocated and performance will be suboptimal. Example batch job script for SMT.

Local storage

Mahti nodes in interactive, small and GPU partitions also have fast local storage, which is good for I/O-intensive applications. Request local storage using the --gres flag in the job submission:

#SBATCH --gres=nvme:<local_storage_space_per_node>

The amount of space is given in GB (with a maximum of 3800 GB per node). For example, to request 100 GB of storage, use option --gres=nvme:100. The local storage reservation is on a per node basis. Use the environment variable $LOCAL_SCRATCH in your batch job scripts to access the local storage on each node.

GPU batch jobs

Mahti has 24 GPU nodes and each of them has four Nvidia Ampere A100 GPUs and a local 3.8 TB Nvme drive. The GPUs are available on the gputest, gpusmall and gpumedium partitions using the option:

#SBATCH --gres=gpu:a100:<number_of_gpus_per_node>

Mahti's gpusmall partition supports only one or two GPU jobs. So the maximum is --gres=gpu:a100:2

#SBATCH --partition=gpusmall
#SBATCH --gres=gpu:a100:1

In Mahti's gpusmall partition there are also A100 GPUs that have been sliced into smaller a100_1g.5gb GPUs with one seventh of the compute and memory capacity of a full A100 GPU. For each GPU slice you can reserve at most 4 CPU cores and for each GPU slice the job is allocated 17.5 GiB of memory. Also note that you can reserve at most one GPU slice per job. The GPU slices are available on gpusmall using the options:

#SBATCH --partition=gpusmall
#SBATCH --gres=gpu:a100_1g.5gb:1

Mahti's gpumedium partition supports multi-GPU jobs with four GPUs per compute node. The example below will allocate four GPUs per compute node, so eight GPUs altogether:

#SBATCH --nodes=2
#SBATCH --partition=gpumedium
#SBATCH --gres=gpu:a100:4

The gpumedium is the only gpu partition where more than one compute node is available (maximum number for the --nodes flag is six).

The gputest partition is for short test runs. Maximum for the --time flag is 15 minutes and one job per account can be run in a RUNNING state. Maximum for the --nodes flag is one but all four GPUs on a node can be allocated for a test job.

Multiple resources can be requested with a comma-separated list. Request both GPU and local storage:

#SBATCH --gres=gpu:a100:<number_of_gpus_per_node>,nvme:<local_storage_space_per_node>

Many GPU applications also support cpu multithreading but not all. If cpu threading is supported cpu cores for the application threading operations can be enabled using --cpus-per-task flag. The example below will use one GPU and 32 cores are available for cpu threading (32 is 1/4 of the CPU cores of a single node) also 950 GB local fast disk storage (1/4 of the total amount of local disk on a node). Ampere A100 GPU also has its own 40GB memory (and that memory will not need any reservation flag). Default amount of main memory allocated per GPU is 122.5GB

#SBATCH --partition=gpusmall
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:a100:1,nvme:950

# If multithreading is OpenMP implementation then define also OMP_NUM_THREADS environment variable
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

So above example will use 1/4 of all resources on a GPU node and therefore four similar batch jobs could run on a GPU node.

Undersubscribing nodes

If an application requires more memory per core than there is available with full node (2 GB / core) it is possible to use also a subset of cores within a node. Also, if the application is memory bound, memory bandwidth and the application performance can be improved by using only a single core per NUMA domain or L3 cache (look at Mahti technical description for details. Note that billing is, however, always based on full nodes.

When undersubscribing nodes, one should always set --ntasks-per-node=X and --cpus-per-task=Y so that X * Y = 128, even with pure MPI jobs. By default, Slurm scatters MPI tasks --cpus-per-task apart, i.e. with --cpus-per-task=8 the MPI task 0 is bound to CPU core 0, the MPI task 1 is bound to CPU core 7 etc.. Memory bandwidth (and application performance) is the best when the tasks are executing on maximally scattered cores. As an example, in order to use 32 GB / core, one can run only with 8 tasks per node as

#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16

module load myprog/1.2.3
export OMP_NUM_THREADS=1

srun myprog -i input -o output

For hybrid applications, one should use OMP_PROC_BIND OpenMP runtime environment variable for placing the OpenMP threads. As an example, in order to run one MPI task per NUMA domain and one OpenMP thread per L3cache one can set

#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16

export OMP_NUM_THREADS=4
export OMP_PROC_BIND=spread

module load myprog/1.2.3

srun myprog -i input -o output

Please check also our Mahti batch script examples page.

Using interactive partition for non-parallel pre- or post-processing

In many cases the computing tasks include pre- or post-processing steps that are not able to utilize parallel computing. In these cases it is recommended that, if possible, the task is split into several, chained, batch jobs and that the non-parallel processing is executed in the interactive partition of Mahti.

In the interactive partition the jobs can reserve just few cores so that the non-parallel tasks can be executed without wasting resources.
Note that you can use interactive partition also for non-interactive jobs and that you can link two batch jobs so that the second job starts only when the first one has finished.

For example, say that we would like to post-process the output file, produced with the very first MPI example job in this page. The post processing command: python post-proc.py output uses only serial computing and requires about 40 minutes and 3 GB of memory. Instead of including the post-processing to the main job it is reasonable to execute it as separate job in the interactive partition.

Jobs in interactive partition can reserve 1-8 cores and each core reserves 1.875 GB of memory. Thus in this case we will reserve 2 cores --cpus-per-task=2 to have enough memory (3.75 GB) available. Further, --dependency=afterok:<slurm-jobid> defines that the job can start only when the previously sent job has successfully finished. Here the <slurm-jobid> is replaced with ID number of the batch job that produces the output file (you'll get the ID number when you submit the job).

#!/bin/bash
#SBATCH --job-name=post-process-myTest
#SBATCH --account=<project>
#SBATCH --time=00:50:00
#SBATCH --partition=interactive
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --dependency=afterok:<slurm-jobid>

python post-proc.py output

Executing large amounts of small non-MPI jobs

In Mahti, HyperQueue meta-scheduler can be used to process large amounts of small non-MPI jobs.