Data storage for machine learning
This guide explains how to store your data efficiently for machine learning applications on CSC's supercomputers. It is part of our Machine learning guide.
Where to store data?
Puhti and Mahti have three types of shared disk areas: home, projappl and scratch. You can read more about the disk areas here. In general, keep your code and software in projappl and datasets, logs and calculation outputs in scratch. The home directory is not intended for data analysis and computing, and you should only store small personal files there.
It is recommended to store big datasets in the Allas object store, and download them to your project's scratch directory prior to starting your computation. For example:
    module load allas
    allas-conf
    cd /scratch/<your-project>
    swift download <bucket-name> your-dataset.tar
Anything that needs to be stored for a longer time (project life-time) should be copied back to Allas. CSC may at some point start cleaning the scratch drives so that files older than 90 days will be automatically removed. However, this clean-up process has not yet been activated.
Some CPU nodes and all GPU nodes also have fast local NVME drives with at least 3.6 TB disk space. This space is available only during the execution of the Slurm job, and is cleaned up afterwards. For data intensive jobs it is often worthwhile to copy the data to the NVME at the start of the job and then to store the final results on the scratch drive at the end of the job. See below for more information on how to use the fast local NVME drive.
Using the shared file system efficiently
The training data for machine learning models often consists of a huge number of
files. A typical example is training a neural network with tens of thousands of
relatively small JPEG image files. Unfortunately the Lustre file system used in
/scratch, /projappl and users' home directories does not perform
well with random access of a lot of files or when performing many
small reads. In addition to slowing down the computation it may also
in extreme cases cause noticeable slowdowns for all users of the
supercomputer, sometimes making the entire supercomputer unusable.
Please do not read a huge number of files from the shared file system. Use the fast local drives or package your data into larger files for sequential access instead!
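As a rough illustration of sequential access, the sketch below streams samples from a single tar archive with Python's standard tarfile module, without unpacking the archive into individual files. The function name iter_tar_members is hypothetical, and decoding the raw bytes into images is left to your own code.

```python
import tarfile


def iter_tar_members(tar_path):
    """Yield (name, raw bytes) for each regular file in the archive.

    Hypothetical helper: mode "r|*" opens the archive as a forward-only
    stream, so members are read in storage order without seeking.
    """
    with tarfile.open(tar_path, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                # In stream mode each member must be read before moving on.
                yield member.name, tar.extractfile(member).read()
```

Because the archive is read front to back, samples arrive in the order they were stored. If your training needs shuffled data, one option is to shuffle the file list when the archive is created.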
More efficient data format
Many machine learning frameworks support formats for packaging your data more efficiently. Common formats include TensorFlow's TFRecord and WebDataset for PyTorch. Other options include HDF5, LMDB, or even humble ZIP files, e.g., via Python's zipfile library. See also an example of creating TFRecord files from an image dataset.
The main point with all of these formats is that instead of many thousands of small files you have one or a few bigger files, which are much more efficient to access and read sequentially. Don't hesitate to contact our service desk if you need advice about how to access your data more efficiently.
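To make the ZIP option concrete, here is a minimal sketch that packs a directory of small files into one archive and reads the samples back with Python's zipfile library. The names pack_directory and iter_samples are hypothetical, and the sketch assumes the archive is written outside the source directory.

```python
import zipfile
from pathlib import Path


def pack_directory(src_dir, zip_path):
    """Pack every file under src_dir into a single ZIP archive (hypothetical helper)."""
    src_dir = Path(src_dir)
    # ZIP_STORED skips compression; formats such as JPEG are already compressed.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir so the archive is relocatable.
                zf.write(path, arcname=path.relative_to(src_dir).as_posix())


def iter_samples(zip_path):
    """Yield (member name, raw bytes) pairs from the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            yield name, zf.read(name)
```

Packing would typically be done once, after which the training code opens only the single archive file instead of thousands of small ones.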
Fast local drive
If you really need to access the individual small files, you can use the fast NVME
local drive that is present in every GPU node. In brief, you just need to add
nvme:<number-of-GB> to the
--gres flag in your submission script, and then
the fast local storage will be available in the location specified by the
$LOCAL_SCRATCH environment variable. Here is an example run that reserves 100
GB of the fast local drive and extracts the dataset tar-package on that drive
before launching the computation:
    #!/bin/bash
    #SBATCH --account=<project>
    #SBATCH --partition=gpu
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=10
    #SBATCH --mem=64G
    #SBATCH --time=1:00:00
    #SBATCH --gres=gpu:v100:1,nvme:100

    tar xf /scratch/<your-project>/your-dataset.tar -C $LOCAL_SCRATCH

    srun python3 myprog.py --input_data=$LOCAL_SCRATCH <options>
Note that you need to communicate somehow to your own program where to find the dataset, for example with a command line argument. Also see our general instructions on how to take the fast local storage into use.
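As one possible way to receive the dataset location, the sketch below shows how a program such as myprog.py might read the --input_data argument with Python's argparse module. The functions parse_args, list_files and main are hypothetical; any further options of your own would be registered with additional add_argument calls.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; --input_data matches the flag in the batch script."""
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    parser.add_argument("--input_data", type=Path, required=True,
                        help="directory that holds the extracted dataset")
    return parser.parse_args(argv)


def list_files(data_dir):
    """Return all regular files below data_dir in sorted order."""
    return sorted(p for p in Path(data_dir).rglob("*") if p.is_file())


def main(argv=None):
    args = parse_args(argv)
    files = list_files(args.input_data)
    print(f"Found {len(files)} files under {args.input_data}")


# In a real script, call main() at the end,
# for example under an `if __name__ == "__main__":` guard.
```

With this in place, the program reads from whatever directory the batch script passes in, so the same code works with the local drive or any other location.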
If you are running a multi-node job, you need to modify the
tar line so that it is performed on each node separately:
    srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 \
        tar xf /scratch/<your-project>/your-dataset.tar -C $LOCAL_SCRATCH