Managing data on Puhti and Mahti scratch disks

An important task for all users on Puhti and Mahti is to manage what data resides in project folders in scratch. These are only intended as temporary storage space for data that is in active use. All other data should be removed, or stored in other more suitable storage systems. Users are not expected to use all of their quota, the maximum quota is only meant for short-term bursts.

Also note that:

A Lustre parallel file system starts to lose performance when more than approximately 70% of disk space is used, and the more the disks fill up, the slower the performance will get. CSC has allocated more quota than there is space, hence it is not even possible for all users to use their scratch folders for longer term storage.
There are no backups of scratch disk area. Do not trust it to store all of your research data.
Removing files may decrease the BU consumption of your project, since you are billed for excess disk usage beyond 1 TiB.

We kindly ask all users to help to keep disk usage manageable, and performance reasonable. Please do the following tasks:

Remove files that are not needed anymore in your project's scratch folder. Note that we cannot bring back files that you delete by mistake so do these operations carefully!
Compress files if it reduces file size. Ascii text files usually compress very well. Test with one file first. If the file size drops by 50%, go ahead and compress all similar files. See here for available compression tools.
Move files not in active use now, but that need to be available later during the project. The typical model is to move the files to Allas. We recommend to use a-tools for small to medium sized data transfers, in particular when you have a large amount of small files. These tools make the usage of Allas safer, and can make your data management easier. For very large data transfers we recommend using rclone. A tutorial for data transfer is available at allas-examples.
Archive files that should be available longer than the lifetime of compute projects. Options for this can be for example your organizations own storage systems, or IDA safe storage for research data.

Identifying where you have data

If you have a large amount of files, analyzing how much data you have in different folders can be time consuming and also heavy on the file system. Our recommendations for tools that can show the amount of data in folders:

Avoid using find options like -size or similar
Avoid using du
Do use lue or lfs find --lazy

CSC has developed an approximate tool called LUE (Lustre usage explorer) for reporting amount of data in folders. Read the documentation at LUE before using it. lfs find --lazy has some edge-case where it can be as bad as du or silently fail to get correct size information. Run man lfs-find for further instructions and information on its limitations.

Note

No matter what tool you use you should never try to list or process all files in your project or scratch folder with a single command. Instead you should run commands on specific subdirectories with limited amount of files and data. The total amount of used data is available from the csc-workspaces command.

Automatic removal of files

There is a policy of removing files older than 180 days from scratch (not projappl) to ensure that only actively used data resides on the disk (currently implemented only on Puhti).

Files that will be deleted in the next clean up are listed in so called "purge lists" files. These are split up by project, and can be found on Lustre at one of the locations below. Only members of the project groups can access the project directories. If your project is newly created, your project might not yet have its own subdirectory in the purge_lists directory, in which case it won't participate in the automatic cleaning.

/scratch/purge_lists/<PROJECT NAME>/path_summary.txt
/fmi/scratch/purge_lists/<PROJECT NAME>/path_summary.txt (only on Puhti, for FMI projects)

In case the path_summary.txt file does not exist, your project did not have any files that matched the clean-up criteria, and thus nothing will be deleted from it. To indicate that the file is intentionally missing, CSC will place a file named nothing-to-remove-for-your-project in your project's purge_lists subdirectory, so check for the existence of this file as well.

As part of the automated cleaning process, the files will change names. Before the cleaning has begun, each project that is part of the clean-up will have a file named path_summary.txt. In special cases where a project is exempt from the upcoming cleaning, or requires more time to transfer files, the administrators will rename the file to something else, usually path_summary.txt-later-delete. Once a project has been processed by the automated cleaning, the file will be renamed to path_summary.txt-stashed. These files are still readable to projects, so that it is possible to refer to the list also after the cleaning is performed. The previous round's files will be archived when the next round of cleaning is about to begin. You can check whether your project's purge list has been updated recently by checking its last modification date. In the example below, the file is a few months old, so it is clearly from the prior round of cleaning:

$ stat -c %y /scratch/purge_lists/project_2001659/path_summary.txt-stashed
2023-05-23 00:35:28.000000000 +0300
$ date +%F
2023-08-04

Another file which is put into each project's purge_lists directory is the total_size.txt file. This file contains a precalculated size estimate based on the numbers inside the path_summary.txt files. This file exists for every project, and is created automatically when the purge lists are generated. The file might look like this:

$ cat /scratch/purge_lists/project_2001659/total_size.txt
Total size: 798343125192 bytes = 743.515 GiB = 0.726 TiB

With this information, you are able to estimate how much time might be required to back up the data elsewhere, if you want to keep everything on the purge list outside of Puhti's scratch file system. The file system tools which CSC uses to generate the list of files to remove will output files which are quite verbose and difficult to read. By using the LCleaner tool described in the next section, users can get the relevant information in a more user-friendly format.

Using LCleaner to check which files will be automatically removed

LCleaner is a tool developed by CSC, which is intended to help you to discover what files your project has that have been targeted for automatic removal.

Run lcleaner --help on the login nodes to see what options LCleaner supports.

LCleaner examples

Check if your project has a path_summary.txt file

The first thing to check, is whether your project indeed has a path_summary.txt file. All projects don't automatically have one, only the ones which have something to clean up.

# Check if your project has a path_summary.txt file
my_project="project_2001659" # Replace with your own project name
ls "/scratch/purge_lists/${my_project:?}/"
# Or if you are in an FMI project on Puhti:
ls "/fmi/scratch/purge_lists/${my_project:?}/"

If you see a path_summary.txt file in the directory, read ahead to discover what files are on the list. However, if you find a file named nothing-to-remove-for-your-project, your project doesn't have anything that will be automatically removed.

If you want a quick, copy-pasteable solution, use the small script below:

# Check all of the projects you belong to in one go:

for g in $(/usr/bin/groups) ; do
  if [ -d "/scratch/$g" -a ! -L "/scratch/$g" ]; then
    dir="/scratch/purge_lists/$g" ;
  elif [ -d "/fmi/scratch/$g" ]; then
    dir="/fmi/scratch/purge_lists/$g";
  else
    continue;
  fi ;
  echo -n "- Project '$g': ";
  if [ ! -d "${dir:?}" ]; then
    echo "doesn't have a purge_lists subdirectory. No files will be removed.";
    continue;
  fi ;
  if [ -f "${dir:?}/path_summary.txt" ]; then
    echo "has files that will be removed." ;
  elif [ -f "${dir:?}/nothing-to-remove-for-your-project" ]; then
    echo "is not included in the automatic cleaning.";
  else
    echo "is unclear, based on this script. Check with Service desk what to do.";
  fi ;
done

List your files

To get a simple list of all file paths in your purge list, simply give the path_summary.txt file path as an argument:

# List all files in your purge list:
lcleaner "/scratch/purge_lists/${my_project:?}/path_summary.txt"

If your path_summary.txt is big (over 100 MB in size), it may take some time to execute the tool. You can save time and resources by saving the result into an output file:

# List all files in your purge list into an output file in your home folder:
lcleaner --out-file ~/purge_list "/scratch/purge_lists/${my_project:?}/path_summary.txt"

# Alternatively, you can redirect the standard output with the bash shell:
lcleaner "/scratch/purge_lists/${my_project:?}/path_summary.txt" > ~/purge_list

# Check the output with less, or your preferred text editor
less ~/purge_list

If you want to search for a specific file or directory, you can use grep to achieve that. You can either search the path_summary.txt file directly, or if you saved the output of lcleaner somewhere, using the commands above, you can use that file.

# Search for directories to check if they are included in the purge list
my_project="project_2001659" # Replace with your own project name!
grep "/scratch/${my_project:?}/important-dir" "/scratch/purge_lists/${my_project:?}/path_summary.txt"
# Or search the purge_list if you saved it:
grep "/scratch/${my_project:?}/important-dir" ~/purge_list

# If there are no matches, grep will not print anything.

Find the biggest files on the list

LCleaner has an option to sort the files by size. This option is called --sort-by-size and always sorts in a decending order (i.e., biggest files first). If you want to see the size of the files when they are printed, use the --csv option. By default, only the file paths are printed. You can also limit the output to include a given number of files with the --limit N parameter, where N is the number of lines you want to see.

# Print the file paths to be purged in size order:
lcleaner --sort-by-size "/scratch/purge_lists/${my_project:?}/path_summary.txt"

# Print the 10 biggest files:
lcleaner --sort-by-size --limit 10 "/scratch/purge_lists/${my_project:?}/path_summary.txt"

# Print the 10 biggest files, and their sizes in bytes:
lcleaner --sort-by-size --limit 10 --csv "/scratch/purge_lists/${my_project:?}/path_summary.txt"

Delete your purge list files

We encourage you to delete the files you do not need, instead of waiting for the automatic cleaning to take place. If you are happy with purging all of the files that were listed in the path_summary.txt file, you can run the following command:

Warning-label

The commands in this section will delete your files! Be sure that you have reviewed the list of files to remove carefully! Also make sure that you have backed up the files you wish to save (outside the cluster) prior to running the commands. This operation is irreversible.

Note

The deletion process may take a considerable amount of time (several hours, depending on the amount of files), so it is best to start it within a screen or tmux session, so that you can disconnect from your SSH session while the deletion keeps running.

# Start a screen session
screen
# Delete all of the files on your purge list:
# Replace the "/path/to/my/path_summary.txt" with the path to your project's path_summary.txt
lcleaner -0 /path/to/my/path_summary.txt | xargs -0 -n 50 rm -vf --
# Then you can press "Ctrl + a" and then "d" to disconnect from the screen and keep
# the deletion running in the background.
# Run "screen -r" to reattach your screen.
# Close the screen session by typing "exit" in the shell.

If you want to delete only a part of the files, e.g., inside a certain directory, you can for example use a command like this:

# Delete only files on the list which are inside /scratch/$my_project/delete-this-dir/
screen lcleaner -0 /path/to/my/path_summary.txt | grep -zZ "/scratch/${my_project:?}/delete-this-dir/" | xargs -0 -n 50 rm -vf --
# Ctrl + a, d to detach from the screen.

LCleaner output formats

If you want to see the size of the files that are about to be purged, you can use either the JSON or the CSV formats. Be aware that if you want to run multiple output formats at the same time, you need to specify an output file path as well. Using the -0 or --nullbyte parameters will output the file paths separated by a null byte, which may be useful to avoid problems with whitespace in the file paths.

# Print your purge list as CSV output with file paths and sizes.
# Note that the CSV format also prints a header row.
lcleaner --csv "/scratch/purge_lists/${my_project:?}/path_summary.txt"

# Print your purge list as JSON output with file paths and sizes:
lcleaner --json "/scratch/purge_lists/${my_project:?}/path_summary.txt"
# TIP: You can pipe the output into the jq program to prettify the output.
# The dot at the end is a mandatory argument to jq.
lcleaner --json "/scratch/purge_lists/${my_project:?}/path_summary.txt" | jq .

# Output both JSON and CSV into purge_list.json and purge_list.csv:
lcleaner --json --csv --out-file purge_list "/scratch/purge_lists/${my_project:?}/path_summary.txt"

# Output file paths separated by null bytes:
lcleaner -0 "/scratch/purge_lists/${my_project:?}/path_summary.txt"
# Usually you will want to pipe null-byte-separated output into "xargs -0" and do some
# further processing with it. For example like this:
lcleaner -0 --limit 3 "/scratch/purge_lists/${my_project:?}/path_summary.txt" \
  | xargs -0 -Ifilepath echo "I should run: rm -vf 'filepath'"

Output examples:

# Plain text:
[westersu@puhti-login11 ~]$ lcleaner path_summary.txt | head -3
/scratch/westersu/my-old-files/file1
/scratch/westersu/my-old-files/file2
/scratch/westersu/my-old-files/file3

# CSV:
[westersu@puhti-login11 ~]$ lcleaner --csv path_summary.txt | head -4
"path","size"
"/scratch/westersu/my-old-files/file1","1704"
"/scratch/westersu/my-old-files/file2","452"
"/scratch/westersu/my-old-files/file3","4951"

# JSON, piped into jq:
[westersu@puhti-login11 ~]$ lcleaner --json path_summary.txt | jq .
{
  "lustre_files": [
    {
      "size": 1704,
      "path": "/scratch/westersu/my-old-files/file1"
    },
    ...
  ]
}

# Null byte xargs:
[westersu@puhti-login11 ~]$ lcleaner -0 --limit 3 path_summary.txt \
>   | xargs -0 -Ifilepath echo "I should run: rm -vf 'filepath'"
I should run: rm -vf '/scratch/westersu/my-old-files/file1'
I should run: rm -vf '/scratch/westersu/my-old-files/file2'
I should run: rm -vf '/scratch/westersu/my-old-files/file3'

Notes on LCleaner usage

This section details some things that may be good to know about how LCleaner behaves, or why the command examples above are architected the way they are.

Sometimes lcleaner prints errors about lines it wasn't able to parse. If there are errors, a warning will be printed at the end, indicating that there was at least one error. The warnings will say something like: "We detected N errors during the execution. Please check the logs, for more information!" The errors indicate which line number the problematic text was on, so you can go and check it manually.
- Tip: To print only a specific line, e.g., line 123 of the path_summary.txt, you can use this command: sed -n 123p /path/to/path_summary.txt
To capture the logging of lcleaner, you can redirect the standard error output stream into a file. This may be useful if you experience problems, and would like help to troubleshoot the situation.
- lcleaner --log-level debug path_summary.txt 2> ~/lcleaner-debug-$(date +%s).log
The use of -0 both with lcleaner and xargs in the example commands on this page is recommended in order to avoid problems with file names that include whitespace.
LCleaner also has some administrative functionality, which is not intended and in some cases will not work for unprivileged users. Anything which mentions the --admin-mode flag can safely be ignored.

Troubleshooting LCleaner

If you notice any bugs, please report them to CSC Service Desk.

Last update: August 9, 2023