Running Snakemake workflow on Puhti
Snakemake is one of the most popular scientific workflow managers in the bioinformatics community, although it can be used in other scientific disciplines as well. Snakemake enables scalable and reproducible scientific pipelines by chaining a series of rules in a fully specified software environment.
If you are still wondering about workflows at a more general level, or which workflow tool to use, see also the High-throughput computing and workflows page.
Installation
Snakemake is available as a module on the Puhti supercomputer. This option suits well if the workflow uses command-line tools from other modules or Apptainer containers. If the workflow includes Python scripts that require custom Python packages, make your own Snakemake installation with Tykky.
Snakemake module
The Snakemake module is the easiest option. The available versions are listed on the Snakemake app page.
Note
Please pay attention to the version of Snakemake you are using. If you are using an earlier version of Snakemake (e.g., v7.xx.x), the syntax might be different.
Installation of tools used in the workflow
The tools used in the workflow can be installed in the following ways:
- Tools available in other Puhti modules or in your own custom module:
    - If all Snakemake rules use the same module(s), load them before running the `snakemake` command.
    - If different Snakemake rules use different modules, include the module information in the Snakefile (see the sketch after the container example below).
- Your own custom installations as Apptainer containers:
    - An Apptainer container can be downloaded from a repository or built locally. For building custom Apptainer containers, see the Creating containers page.
    - See Snakemake's Running jobs in containers documentation for the changes required in the Snakefile and the `snakemake` command.
    - For binding folders or using other Apptainer flags, use the `--apptainer-args` option of the `snakemake` command.
    - Sometimes it might be necessary to define the shell inside the container.
# If your Apptainer tutorial.sif image is stored on Puhti in the folder "image":
container: "image/tutorial.sif"
# If you would like to convert a Docker image to an Apptainer container image on the fly:
container: "docker://<repository>/<image_name>"
Snakemake Tykky installation for Python
To install Snakemake with custom Python packages, use the Tykky container wrapper tool with conda. Follow the guidelines on the Tykky page; the conda environment should include the package `snakemake`. If you plan to use Snakemake with SLURM or HyperQueue integration (explained below), also install `snakemake-executor-plugin-slurm` for SLURM or `snakemake-executor-plugin-cluster-generic` for HyperQueue. These packages are part of the `bioconda` repository, so add it to the channels list in the conda environment file.
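A conda environment file for such an installation could look roughly like the sketch below. The file name `snakemake-env.yml` is just an example, and the package list should be adjusted to what your workflow actually needs.

# snakemake-env.yml - illustrative example of a Tykky conda environment file
channels:
  - conda-forge
  - bioconda
dependencies:
  - snakemake
  - snakemake-executor-plugin-slurm            # only needed for the SLURM executor
  - snakemake-executor-plugin-cluster-generic  # only needed for HyperQueue
  # add the custom Python packages your workflow scripts need here

The environment is then installed with Tykky's `conda-containerize new --prefix <install_dir> snakemake-env.yml` command, as described on the Tykky page.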
For SLURM integration, you also have to fix the Python path of the Snakemake executable:
- Find out your Tykky installation's Python path. You can check it with the `which python` command after you have run the `export PATH ...` command from the Tykky printout.
- Create a file `post.sh` with the content below. Change `/projappl/project_200xxx/tykky_installation_folder/bin/python` to your own Tykky installation's Python path.

sed -i 's@#!.*@#!/projappl/project_200xxx/tykky_installation_folder/bin/python@g' $env_root/bin/snakemake

- Update the installation:
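The update uses Tykky's `update` subcommand together with the `post.sh` file created above; a rough sketch, with the installation folder adjusted to your own:

conda-containerize update /projappl/project_200xxx/tykky_installation_folder --post-install post.sh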
If you use your own Tykky installation, then in the examples below replace `module load snakemake` with the export command printed out by Tykky, something like: `export PATH="/projappl/project_xxxx/$USER/snakemake_tykky/bin:$PATH"`
Note
Please create one Tykky installation for the whole workflow, not separate installations for each Snakemake rule.
Usage
Snakemake can be run in 4 different ways on supercomputers:
- In interactive mode with the local executor, with limited resources. Useful mainly for debugging or very small workflows.
- With a batch job and the local executor. Resource usage is limited to one full node. Useful for small and medium-sized workflows; simpler than the next options. Start with this if unsure.
- With a batch job and the SLURM executor. Can use multiple nodes and different SLURM partitions (CPU and GPU), but may create significant overhead if there are many small jobs. Use this only if each job step for each file takes at least 30 minutes.
- With a batch job and HyperQueue as a sub-job scheduler. Can use multiple nodes in the same batch job allocation; this is the most complex setup. Suits cases where the workflow includes a lot of small job steps with many input files (high-throughput computing).
Note
Please do not launch heavy Snakemake workflows on **login nodes**.
The following toy example illustrates how a Snakemake workflow can be deployed at CSC.
Snakefile
The Snakefile describes the contents of the workflow. Further information is available in the Snakemake Snakefile documentation.
Let's use the following toy Snakemake script, Snakefile
(with a capital S and no file extension), for the illustration:
rule all:
input: "CAPITAL_CASE.txt"
rule say_hello:
output: "smaller_case.txt"
shell:
"""
echo "hello-world greetings from csc to snakemake community !" > smaller_case.txt
"""
rule capitalise:
input: "smaller_case.txt"
output: "CAPITAL_CASE.txt"
shell:
"""
tr '[:lower:]' '[:upper:]' < {input} > {output}
"""
Running Snakemake workflow with local executor interactively
The resources are reserved in advance for both Snakemake and the workflow jobs as one interactive session. In an interactive session, the workflow can be started several times for debugging, as long as the reserved resources are available. See the resource limits of the interactive partition.
sinteractive --cores 4 --mem 10000 # start an interactive session with 4 CPU cores and 10 GB of memory
module load snakemake
cd <to_folder_with_snakefile>
snakemake -s Snakefile --jobs 4
`--jobs` - maximum number of jobs run in parallel
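Before launching the actual run, you can check what Snakemake would do with a dry run (a standard Snakemake option):

snakemake -s Snakefile --dry-run   # only prints the planned jobs, nothing is executed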
Running Snakemake workflow with local executor and batch job
The resources are reserved in advance for both Snakemake and the workflow as one batch job. The job runs as long as the `snakemake` command is running and stops automatically when it finishes. The local executor is limited to one node of the supercomputer. The number of cores can be increased up to one full node: 40 on Puhti and 128 on Mahti.
#!/bin/bash
#SBATCH --job-name=myTest
#SBATCH --account=project_xxxxx
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=2G
#SBATCH --partition=small
#SBATCH --cpus-per-task=4
module load snakemake
snakemake -s Snakefile --jobs 4
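Save the script, for example as `snakemake_local.sh` (the file name is just an example), and submit it from a login node:

sbatch snakemake_local.sh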
Running Snakemake workflow with SLURM executor
The batch job file reserves resources only for Snakemake itself. Snakemake then creates further SLURM jobs for the workflow's rules. The SLURM jobs created by Snakemake may be distributed to several nodes of a supercomputer and can use different partitions for different workflow rules, for example CPU and GPU. The SLURM executor should be used only if the job steps are at least 20-30 minutes long; otherwise it could overload SLURM.
Here is a bash script for running the above toy example with SLURM executor:
#!/bin/bash
#SBATCH --job-name=snakemake_slurm
#SBATCH --account=project_xxxxx
#SBATCH --time=00:20:00
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2GB
#SBATCH --partition=small
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
module load snakemake
snakemake --jobs 4 -s Snakefile --executor slurm --default-resources slurm_account=project_xxxx slurm_partition=small
Note
Make sure that Snakemake's own reservation is long enough to also cover the time the submitted jobs spend queueing and running. It is better to reserve too much time for Snakemake's own batch job than too little.
The default resources for each SLURM job are rather limited. To increase (or change) them, define the resource needs for each rule in the Snakefile:
rule say_hello:
output: "smaller_case.txt"
resources:
runtime = 5, # minutes
cpus_per_task = 1,
mem_mb = 20000
shell:
"""
echo "hello-world greetings from csc to snakemake community !" > smaller_case.txt
"""
Finally, you can submit the batch job from the login node:
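For example, if the batch script above was saved as `snakemake_slurm.sh` (the file name is just an example):

sbatch snakemake_slurm.sh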
Further information about Snakemake SLURM executor
Note
Scaling up your jobs using SLURM should be done carefully to avoid unnecessarily overloading the SLURM accounting database with a large number of small jobs. Consider using grouping, localrules, or the HyperQueue executor.
Running Snakemake with HyperQueue executor
The resources are reserved in advance for both Snakemake and the workflow as one batch job. It is possible to use several nodes of a supercomputer, but not to use different partitions for different workflow rules, for example CPU and GPU. The HyperQueue executor fits well for workflows that have a lot of short job steps, because it "hides" them from SLURM. Job step resources can be defined in the Snakefile as for SLURM jobs.
#!/bin/bash
#SBATCH --job-name=snakemake_hq
#SBATCH --account=project_xxxxx
#SBATCH --time=00:20:00
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=40GB
#SBATCH --partition=small
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
module load hyperqueue
export HQ_SERVER_DIR="$PWD/hq-server/$SLURM_JOB_ID"
mkdir -p "$HQ_SERVER_DIR"
hq server start & until hq job list &> /dev/null ; do sleep 1 ; done
srun --overlap --cpu-bind=none --mpi=none hq worker start \
--manager slurm \
--on-server-lost finish-running \
--cpus="$SLURM_CPUS_PER_TASK" & hq worker wait 1
# snakemake version 8.x.x
snakemake --keep-going -s Snakefile --jobs 4 --executor cluster-generic --cluster-generic-submit-cmd "hq submit --cpus 1"
# snakemake version 7.xx.x
# snakemake --cluster "hq submit ..."
See the CSC HyperQueue page for more options and details about HyperQueue.
Note
HyperQueue creates task-specific folders (job-<n>
) in the same directory
from where you submitted the batch script. These are sometimes useful for
debugging. However, if your code is working fine, the creation of many folders
may be annoying besides causing some load on the Lustre parallel file system.
You can prevent the creation of such task-specific folders by setting stdout
and stderr
HyperQueue flags to none
( i.e., hq submit --stdout=none --stderr=none ...
)
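For example, applied to the cluster-generic submit command used in the batch script above, the Snakemake call would become:

snakemake --keep-going -s Snakefile --jobs 4 --executor cluster-generic --cluster-generic-submit-cmd "hq submit --stdout=none --stderr=none --cpus 1"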
If you have any questions or problems regarding Snakemake, contact the CSC Service Desk.