Running Julia batch jobs on CSC clusters
This tutorial contains examples for running various Julia batch jobs on Puhti, Mahti and LUMI clusters.
Examples
These examples demonstrate the usage of the Julia environment for various batch jobs.
They are adapted from the general instructions for running jobs on Puhti and Mahti, and on LUMI.
Note that we do not use srun
to start processes in the batch script.
Instead we use Julia for process management or call srun
inside the Julia code.
Serial program
We use the following directory structure and assume it is our working directory.
.
├── Project.toml # Julia environment
├── batch.sh # Slurm batch script
└── script.jl # Julia script
An example of a script.jl
code.
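The script can be any serial Julia code; a minimal sketch (the computation itself is only illustrative):
# A simple serial computation: sum one million random numbers.
x = sum(rand(1_000_000))
println("Sum: ", x)
println("Running on host $(gethostname()) with process id $(getpid()).")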
An example of a batch.sh
batch script.
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=small
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1000
module use /appl/local/csc/modulefiles
module load julia
julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. script.jl
Multi-threading on single node
We use the following directory structure and assume it is our working directory.
.
├── Project.toml # Julia environment
├── batch.sh # Slurm batch script
└── script.jl # Julia script
An example of a script.jl
code.
# Number of threads
n = Threads.nthreads()
println(n)
# Let's fill the ids array with the id of each thread.
ids = zeros(Int, n)
Threads.@threads for i in eachindex(ids)
    ids[i] = Threads.threadid()
end
println(ids)
# Alternatively, we can use the @spawn macro to run tasks on threads.
ids = zeros(Int, n)
@sync for i in eachindex(ids)
    Threads.@spawn ids[i] = Threads.threadid()
end
println(ids)
An example of a batch.sh
batch script.
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=small
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=3
#SBATCH --mem-per-cpu=1000
module use /appl/local/csc/modulefiles
module load julia
julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. script.jl
Multi-processing on single node
We use the following directory structure and assume it is our working directory.
.
├── Project.toml # Julia environment
├── batch.sh # Slurm batch script
└── script.jl # Julia script
An example of a Project.toml
project file.
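A minimal sketch of the project file, assuming the Distributed standard library is the only dependency (its UUID is the same one used in the multi-node example below):
[deps]
Distributed = "8ba89e20-285c-5b6f-9357-94700520ee1b"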
An example of a script.jl
code.
using Distributed
# We set one worker process per core.
proc_num = Sys.CPU_THREADS
# Environment variables that we pass to the worker processes.
# We set the thread count to one since each process uses one core.
proc_env = [
    "JULIA_NUM_THREADS"=>"1",
    "JULIA_CPU_THREADS"=>"1",
    "OPENBLAS_NUM_THREADS"=>"1",
]
# We add worker processes to the local node using LocalManager.
addprocs(proc_num; env=proc_env, exeflags="--project=.")
# We use the `@everywhere` macro to include the task function in the worker processes.
# We must call `@everywhere` after adding worker processes; otherwise the code won't be included in the new processes.
@everywhere function task()
    return (worker=myid(), hostname=gethostname(), pid=getpid())
end
# We run the task function in each worker process.
futures = [@spawnat worker task() for worker in workers()]
# Then, we fetch the output from the processes.
outputs = fetch.(futures)
# Remove processes after we are done.
rmprocs.(workers())
# Print the outputs of master and worker processes.
println(task())
println.(outputs)
An example of a batch.sh
batch script.
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=small
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=3
#SBATCH --mem-per-cpu=1000
module use /appl/local/csc/modulefiles
module load julia
julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. script.jl
Multi-processing on multiple nodes
We use the following directory structure and assume it is our working directory.
.
├── Project.toml # Julia environment
├── batch.sh # Slurm batch script
└── script.jl # Julia script
An example of a Project.toml
project file.
[deps]
ClusterManagers = "34f1f09b-3a8b-5176-ab39-66d58a4d544e"
Distributed = "8ba89e20-285c-5b6f-9357-94700520ee1b"
An example of a script.jl
code.
using Distributed
using ClusterManagers
# We set one worker process per Slurm task.
proc_num = parse(Int, ENV["SLURM_NTASKS"])
# Environment variables that we pass to the worker processes.
# We set the thread counts of the workers to match the master process.
n = Threads.nthreads()
proc_env = [
    "JULIA_NUM_THREADS"=>"$n",
    "JULIA_CPU_THREADS"=>"$n",
    "OPENBLAS_NUM_THREADS"=>"$n",
]
# We add worker processes across the allocated nodes using SlurmManager.
addprocs(SlurmManager(proc_num); env=proc_env, exeflags="--project=.")
# We use the `@everywhere` macro to include the task function in the worker processes.
# We must call `@everywhere` after adding worker processes; otherwise the code won't be included in the new processes.
@everywhere function task()
    return (worker=myid(), hostname=gethostname(), pid=getpid())
end
# We run the task function in each worker process.
futures = [@spawnat worker task() for worker in workers()]
# Then, we fetch the output from the processes.
outputs = fetch.(futures)
# Remove processes after we are done.
rmprocs.(workers())
# Print the outputs of master and worker processes.
println(task())
println.(outputs)
An example of a batch.sh
batch script.
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=standard
#SBATCH --time=00:15:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=0
module use /appl/local/csc/modulefiles
module load julia
julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. script.jl
MPI program
We launch the MPI program using Julia's mpiexec
wrapper function.
The wrapper function substitutes the launcher command defined in the local MPI preferences into the mpirun
variable used to run the MPI program.
On Puhti, Mahti, and LUMI, this command is srun.
The wrapper allows us to write more flexible code, such as mixing MPI and non-MPI code, and more portable code because the command to run MPI programs can vary across platforms.
We note that for large-scale Julia MPI jobs with thousands of ranks, we have to distribute the depot directory to local node storage or memory and modify the depot paths accordingly.
Otherwise, package loading will become extremely slow.
We use the following directory structure and assume it is our working directory.
.
├── Project.toml # Julia environment
├── batch.sh # Slurm batch script
├── prog.jl # Julia MPI program
└── script.jl # Julia script
An example of a Project.toml
project file.
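A minimal sketch of the project file, assuming MPI.jl is the only dependency; the UUID below should match MPI.jl's registered UUID, and running Pkg.add("MPI") in the project generates the same entry:
[deps]
MPI = "da04e1cc-30fd-572f-bb4f-1f8673147195"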
An example of a script.jl
code.
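A minimal sketch of the launcher script using the MPI.mpiexec wrapper described above; the do-block receives the launcher command (srun on the CSC clusters) as the mpirun variable:
using MPI

# Launch the MPI program with the launcher command taken from the local MPI preferences.
MPI.mpiexec() do mpirun
    run(`$mpirun julia --project=. prog.jl`)
end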
An example of a prog.jl
Julia MPI code.
using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
println("Hello from rank $(rank) out of $(size) from host $(gethostname()) and process $(getpid()).")
MPI.Barrier(comm)
An example of a batch.sh
batch script for Puhti.
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=large
#SBATCH --time=00:15:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1000
module load julia
module load julia-mpi
julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. script.jl
An example of a batch.sh
batch script for Mahti.
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=medium
#SBATCH --time=00:15:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=0
module load julia
module load julia-mpi
julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. script.jl
An example of a batch.sh
batch script for LUMI.
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=standard
#SBATCH --time=00:15:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=0
module use /appl/local/csc/modulefiles
module load julia
module load julia-mpi
julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. script.jl
Single GPU
We use the following directory structure and assume it is our working directory.
.
├── Project.toml # Julia environment
├── batch.sh # Slurm batch script
└── script.jl # Julia script
An example of a Project.toml
project file.
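A minimal sketch, assuming CUDA.jl is the only dependency; the UUID should match CUDA.jl's registered UUID and is generated automatically by Pkg.add("CUDA"):
[deps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"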
An example of a script.jl
code.
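A minimal sketch of a single-GPU script; the computation is only illustrative and assumes CUDA.jl:
using CUDA

# Print the name of the GPU allocated to the job.
println(CUDA.name(CUDA.device()))

# Perform a small matrix multiplication on the GPU.
A = CUDA.rand(1000, 1000)
B = CUDA.rand(1000, 1000)
C = A * B
println(sum(C))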
An example of a batch.sh
batch script for Puhti.
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=gpu
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --gres=gpu:v100:1
#SBATCH --mem-per-cpu=8000
module load julia
module load julia-cuda
julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. script.jl
An example of a Project.toml
project file.
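On Mahti the project file can be the same CUDA-based one as in the Puhti example above:
[deps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"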
An example of a script.jl
code.
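Likewise, the script can be the same illustrative CUDA sketch as in the Puhti example:
using CUDA

# Print the name of the GPU allocated to the job and run a small computation on it.
println(CUDA.name(CUDA.device()))
A = CUDA.rand(1000, 1000)
B = CUDA.rand(1000, 1000)
println(sum(A * B))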
An example of a batch.sh
batch script for Mahti.
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=gpusmall
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:a100:1
module load julia
module load julia-cuda
julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. script.jl
An example of a Project.toml
project file.
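A minimal sketch, assuming AMDGPU.jl is the only dependency; the UUID should match AMDGPU.jl's registered UUID and is generated automatically by Pkg.add("AMDGPU"):
[deps]
AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e"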
An example of a script.jl
code.
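A minimal sketch of a single-GPU script for the AMD GPUs; the computation is only illustrative and assumes AMDGPU.jl:
using AMDGPU

# Print the GPU device allocated to the job.
println(AMDGPU.device())

# Perform a small matrix multiplication on the GPU.
A = ROCArray(rand(Float32, 1000, 1000))
B = ROCArray(rand(Float32, 1000, 1000))
C = A * B
println(sum(C))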
An example of a batch.sh
batch script for LUMI.
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=small-g
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1750
module use /appl/local/csc/modulefiles
module load julia
module load julia-amdgpu
julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. script.jl
Multi GPU with MPI
We use the following directory structure and assume it is our working directory.
.
├── Project.toml # Julia environment
├── batch.sh # Slurm batch script
├── prog.jl # Julia GPU-aware MPI program
└── script.jl # Julia script
An example of a script.jl
code.
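A minimal sketch of the launcher script, following the same MPI.mpiexec pattern as in the MPI example above; the mpirun variable receives the launcher command (srun on LUMI):
using MPI

# Launch the GPU-aware MPI program with the launcher command from the local MPI preferences.
MPI.mpiexec() do mpirun
    run(`$mpirun julia --project=. prog.jl`)
end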
An example of a Project.toml
project file.
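A minimal sketch, assuming AMDGPU.jl and MPI.jl as the dependencies; the UUIDs should match the registered ones and are generated automatically when the packages are added with Pkg:
[deps]
AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e"
MPI = "da04e1cc-30fd-572f-bb4f-1f8673147195"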
An example of a prog.jl
code. (source)
using MPI
using AMDGPU
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
# select device
comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_l = MPI.Comm_rank(comm_l)
device = AMDGPU.device_id!(rank_l+1)
gpu_id = AMDGPU.device_id(AMDGPU.device())
# Determine the neighbouring ranks for the ring exchange.
size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id - $device), size=$size, dst=$dst, src=$src")
N = 4
send_mesg = ROCArray{Float64}(undef, N)
recv_mesg = ROCArray{Float64}(undef, N)
fill!(send_mesg, Float64(rank))
AMDGPU.synchronize()
rank==0 && println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")
rank==0 && println("done.")
An example of a batch.sh
batch script for LUMI.
#!/bin/bash
#SBATCH --account=<project>
#SBATCH --partition=small-g
#SBATCH --time=00:15:00
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=0
module use /appl/local/csc/modulefiles
module load julia
module load julia-mpi
module load julia-amdgpu
julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. script.jl
Notes
Multi-threading in linear algebra
Julia uses OpenBLAS as the default LinearAlgebra
backend.
External linear algebra backends such as OpenBLAS use internal threading.
We can set their thread counts using environment variables.
The julia
module sets them to the number of CPU threads.
We must be careful not to oversubscribe cores when using BLAS operations within Julia threads or processes.
We can change the number of BLAS threads at runtime using the BLAS.set_num_threads
function.
using LinearAlgebra
# Number of threads
n = Threads.nthreads()
# Define a matrix
X = rand(1000, 1000)
# Set the number of BLAS threads to one before performing BLAS operations on multiple Julia threads.
BLAS.set_num_threads(1)
Y = zeros(n)
Threads.@threads for i in 1:n  # uses n Julia threads
    Y[i] = sum(X * X)  # uses one BLAS thread
end
# Set the number of BLAS threads back to the default when performing BLAS operations on a single Julia thread.
BLAS.set_num_threads(n)
Z = zeros(n)
for i in 1:n  # uses one Julia thread
    Z[i] = sum(X * X)  # uses n BLAS threads
end
Alternatively, we can use MKL as the linear algebra backend via the MKL.jl package. MKL is often faster than OpenBLAS when using multiple threads on Intel CPUs, such as those on Puhti. We can set the MKL thread count as follows.
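A minimal sketch of loading MKL.jl and setting the thread count, assuming MKL.jl has been added to the environment:
using MKL  # Load MKL.jl first so that it replaces OpenBLAS as the BLAS backend.
using LinearAlgebra

# With MKL loaded, BLAS.set_num_threads controls the number of MKL threads.
BLAS.set_num_threads(Threads.nthreads())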
If we use MKL, we should load it before other linear algebra libraries.
Note that both OpenBLAS and MKL have caveats when the BLAS thread count is set to a value other than one or the total number of cores.