FireWorks workflow tool
FireWorks is a free, open-source tool for defining, managing and executing workflows with multiple steps and potentially complex dependencies. Workflows are flexibly defined using YAML, JSON or through a Python API and stored in a MongoDB database. This page describes how to define and execute FireWorks workflows in CSC's computing environment using a MongoDB running in the Rahti container cloud.
Strengths of FireWorks
- Easy installation
- Can handle parallel (MPI/OpenMP) subtasks
- Supports complicated workflows with several dependent steps
Disadvantages of FireWorks
- Requires setting up a MongoDB database
- Steep learning curve
- May produce a lot of log files
- May create a lot of job steps
- Integrates with Slurm, but all subtasks must use identical resources
Installing FireWorks and setting up MongoDB in Rahti
FireWorks is easy to install. We recommend using Tykky to install FireWorks within a Singularity container. A plain pip installation with pip-containerize
is enough, just add the line fireworks
to the req.txt
file containing the requirements of your environment. For further instructions, see the Tykky documentation.
Note that the Python version used by pip-containerize
is the first Python executable found in the path, so it's affected by loading modules. FireWorks requires at least Python 3.7, so make sure you're using at least this version. To this end, you can use the --slim
flag of pip-containerize
to utilize a pre-built minimal Python container with a much newer version of Python than the system default 3.6.8.
The process of setting up and connecting to a MongoDB database in Rahti is detailed in a separate tutorial, see Accessing databases on Rahti from CSC supercomputers. Note that the OpenShift template in Rahti sets up MongoDB version 3.2, requiring that the PyMongo version used with FireWorks cannot be newer than 3.12. Thus, you may need to separately specify the PyMongo version in the req.txt
file when installing FireWorks. For example,
Note
Please do not install FireWorks in a Conda environment that is sitting directly on the shared Lustre file system. CSC has deprecated the direct usage of Conda installations on our supercomputers to avoid performance issues due to the large number of files brought by Conda. For reference, a Conda installation of FireWorks contains more than 24000 files, most of which are read each time the application is run. This causes startup delays and degrades the performance of Lustre for all users. With this said, you can still continue to use Conda environments, but only in case they are containerized. To achieve this easily, please see the Tykky container wrapper tool.
Defining and executing workflows with FireWorks
The basic components of FireWorks are the
- LaunchPad (manages workflows and metadata)
- FireTask (computing job to be performed)
- Firework (list of multiple FireTasks)
- Workflow (set of Fireworks including their dependencies and metadata)
A FireWorker (e.g. your laptop or in this case either of CSC's supercomputers) fetches a workflow from the LaunchPad and executes it. To appropriately run FireWorks in CSC's computing environment, you need to additionally configure a QueueAdapter for running jobs through the queueing system. The content of the files used to configure these is described below.
Step 1. Setting up the LaunchPad
Note
This page focuses on the usage of YAML files and the FireWorks command-line interface to define and execute workflows. For instructions on using the FireWorks Python API, see the official FireWorks documentation.
Before configuring the LaunchPad, make sure that you have opened a connection to your MongoDB database in Rahti using WebSocat as outlined in Accessing databases on Rahti from CSC supercomputers. Note that websocat
should be launched in an interactive session to avoid stressing the login nodes. With the obtained target port, database username and password, run lpad init
to interactively configure the LaunchPad:
$ lpad init
Please supply the following configuration values
(press Enter if you want to accept the defaults)
Enter host parameter. (default: localhost). Example: 'localhost' or 'mongodb+srv://CLUSTERNAME.mongodb.net': localhost
Enter port parameter. (default: 27017). : <target port>
Enter name parameter. (default: fireworks). Database under which to store the fireworks collections: <database name>
Enter username parameter. (default: None). Username for MongoDB authentication: <username>
Enter password parameter. (default: None). Password for MongoDB authentication: <password>
Enter ssl_ca_file parameter. (default: None). Path to any client certificate to be used for Mongodb connection: None
Enter authsource parameter. (default: None). Database used for authentication, if not connection db. e.g., for MongoDB Atlas this is sometimes 'admin'.: None
Configuration written to my_launchpad.yaml!
Note
Upon configuration, WebSocat may complain websocat: Connection reset by peer (os error 104)
. This warning is due to minor timing issues based on the Python global interpreter lock (GIL) and can be safely ignored.
Step 2. Setting up the QueueAdapter for submission through SLURM
To run FireWorks through the batch queue system a file my_qadapter.yaml
is required where the queue parameters and any commands to be run prior to or after the workflow (e.g. module loads, export environment variables) are written. An example my_qadapater.yaml
file compatible with Puhti is provided below (edit paths and content marked with <>
as needed).
_fw_name: CommonAdapter
_fw_q_type: SLURM
rocket_launch: rlaunch multi 1
nodes: 1
cpus_per_task: 1
ntasks_per_node: 40
mem_per_cpu: 1000
walltime: '00:05:00'
queue: small
account: <billing project>
job_name: example
pre_rocket: |
module load <my module>
export PATH=$PATH:/path/to/websocat
websocat -b tcp-l:127.0.0.1:<port> wss://websocat-<database name>.rahtiapp.fi -E &
post_rocket: null
In addition to queue parameters (resource requests, billing project), the QueueAdapter contains the rocket_launch
key which specifies how the workflow should be launched within the batch job. This detail is discussed further in Step 3. Additionally, the batch queue system (SLURM) is specified with the _fw_q_type
key, and any commands to be run before and/or after the workflow are provided using the pre_rocket
and post_rocket
keys.
Note
To open a TCP tunnel to your MongoDB in Rahti from the compute side, websocat
should in addition to the interactive session also be launched in the pre_rocket
. Here, the previously obtained target port can be used. See Accessing databases on Rahti from CSC supercomputers for further details.
For all possible SLURM flags that can be specified in the QueueAdapter, see the SLURM template file distributed with FireWorks. Note the usage of underscores instead of dashes compared to the common SLURM options, e.g. cpus_per_task
vs. --cpus-per-task
, as well as the keys walltime
and queue
in contrast to time
and partition
used by SLURM. If the existing SLURM template does not suit your needs, please consult the official FireWorks documentation on how to program custom QueueAdapters.
Step 3. Defining and executing a simple FireWorks workflow
Note
If not using the default names my_launchpad.yaml
and my_qadapter.yaml
for the LaunchPad and QueueAdapter files, you need to specify the filenames using the -l
and -q
flags of the qlaunch
and rlaunch
commands (only -l
for rlaunch
). If the files are neither in the current working directory, the full paths should be given or a configuration directory specified with the -c
option. A good idea is also to leverage a FW_config.yaml
configuration file in which several default parameters can be set. The official FireWorks documentation gives further instructions on how to use the FW config file.
A FireTask describes a subtask to be performed and combining multiple FireTasks yields Fireworks and Workflows. Similar to the LaunchPad and QueueAdapter, these can be specified using YAML files. A simple hello_wf.yaml
example is shown below.
fws:
- fw_id: 1
spec:
_tasks:
- _fw_name: ScriptTask
script: srun /path/to/hello_mpi.x >> first-hello.out
- _fw_name: FileTransferTask
files:
- src: first-hello.out
dest: $HOME/first-hello.out
mode: move
- fw_id: 2
spec:
_tasks:
- _fw_name: ScriptTask
script: srun /path/to/hello_mpi.x >> second-hello.out
- _fw_name: FileTransferTask
files:
- src: second-hello.out
dest: $HOME/second-hello.out
mode: move
links:
1:
- 2
metadata: {}
This toy workflow demonstrates the usage of the built-in ScriptTask
FireTask to execute an MPI parallelized hello_mpi.x
program. Upon completion, the output is moved to the user's home directory using a FileTransferTask
. To illustrate dependencies, the workflow is composed of two identical Fireworks of which the second is launched only after the first one has completed. This connection is enforced using the links
section. See the official documentation for a more in depth description on how to design FireWorks workflows.
The steps to execute this example workflow through the batch queue system consist of resetting the LaunchPad, adding the workflow YAML file to the database and, finally, submitting the workflow using the qlaunch
command.
$ lpad reset
Are you sure? This will RESET 1 workflows and all data. (Y/N)
2022-02-14 11:42:59,323 INFO Performing db tune-up
2022-02-14 11:42:59,496 INFO LaunchPad was RESET.
$ lpad add hello_wf.yaml
2022-02-14 11:43:32,144 INFO Added a workflow. id_map: {1: 1, 2: 2}
$ qlaunch singleshot
2022-02-14 11:44:09,835 INFO moving to launch_dir /path/to/launch_dir
2022-02-14 11:44:09,847 INFO submitting queue script
Based on the content of the my_qadapter.yaml
file, FireWorks creates a submission script FW_submit.script
and automatically submits it when qlaunch
is run. The singleshot
option is used above to launch a single batch job. If you have multiple workflows in your LaunchPad ready to be executed, you can use the rapidfire
option to execute them all as separate batch jobs.
Note
The rapidfire
mode is designed such that it will continuously pull jobs from the LaunchPad that are marked as READY
. This state applies also for jobs that have already been submitted, but are still queueing. Consequently, too many workflows can accidentally be submitted to the queue! To avoid duplicates, the -m
and --nlaunches
options can be used to limit the number of simultaneous jobs in the queue and the total number of jobs to be submitted, respectively. See the official FireWorks documentation on how to launch jobs through a queue for further details.
Although qlaunch
is run instead of the basic rlaunch
command that is normally used in the absence of a batch queue system, rlaunch
is still used inside the my_qadapter.yaml
file to instruct FireWorks how the workflow should be run within the batch job. In the above case, the multi 1
option is used to launch a single parallel job using all the requested resources. The multi
launcher is designed to spawn a specified number of workers that run FireTasks with identical resources requirements in parallel. If more than one worker is specified, srun
commands issued within any ScriptTask
must be modified to use the appropriate amount of tasks/threads together with the --exclusive
option so that the jobs are actually able to run concurrently within the same resource allocation. For example if one full Puhti node (40 cores) is requested for running two concurrent jobs (multi 2
) with the same number of MPI tasks, the FireTasks should read srun -n 20 --exclusive <my program>
. However, beware of idling resources if the FireTasks complete asynchronously! For further details on running parallel jobs using FireWorks, see the official documentation on the multi job launcher.
Note
Each time srun
is issued, a SLURM job step is created. If your workflow is composed of a large number of FireTasks in which srun
is used, the SLURM log will get bloated, risking degrading the performance of the batch queue system. If unavoidable, consider using orterun
instead of srun
to launch your parallel jobs through the queue, or use another workflow tool that packs your tasks within a single large job step. Note also that serial jobs do not require the usage of srun
. Don't hesitate to contact our Service Desk if you're unsure about the efficiency of your workflow.
Step 4. Monitoring the state of your workflow
After running qlaunch
you'll see that your job has been submitted to the queue, and using the lpad get_fws
command you're able to query the current state of your workflow. Before the job starts running, the command will show that the first Firework is marked as READY
to be run, while the second one is WAITING
because we told FireWorks not to launch it before the first one has completed.
$ lpad get_fws
[
{
"fw_id": 1,
"created_on": "2022-02-14T09:43:31.941934",
"updated_on": "2022-02-14T09:45:27.533414",
"state": "READY",
"name": "Unnamed FW"
},
{
"fw_id": 2,
"created_on": "2022-02-14T09:43:31.942114",
"updated_on": "2022-02-14T09:43:31.942114",
"name": "Unnamed FW",
"state": "WAITING"
}
]
When the batch job starts, a launch directory is created for each Firework within which the particular job will be executed and the state of the Firework is updated accordingly as RUNNING
. So if your workflow consists of two Fireworks such as in this example, two launcher_*
directories will be created by default. This behavior can, however, be altered and controlled as described in the official FireWorks documentation. Finally, upon successful completion, the state of the Firework is marked as COMPLETED
, allowing any dependent Fireworks to launch. You can verify that the first Firework was indeed completed before the second one by inspecting the timestamps of the *-hello.out
files in your home directory.
Note
Errors during a run will result in a FIZZLED
Firework, and any jobs depending on the crashed one will not be able to start. The official FireWorks documentation has an in-depth description on how to deal with failures and crashes.