Managing data on Puhti and Mahti scratch disks

An important task for all users on Puhti and Mahti is to manage what data resides in project folders in scratch. These folders are only intended as temporary storage space for data that is in active use. All other data should be removed, or stored in other, more suitable storage systems. Users are not expected to use all of their quota; the maximum quota is only meant for short-term bursts.

Also note that:

  • A Lustre parallel file system starts to lose performance when more than approximately 70% of the disk space is used, and the more the disks fill up, the slower the performance gets. CSC has allocated more quota than there is space, so it is not possible for all users to use their scratch folders for longer-term storage.
  • There are no backups of the scratch disk area. Do not rely on it as the only copy of your research data.
  • Removing files decreases the BU consumption of your project, since from 2022 onwards you are billed for actual disk usage rather than for quota.

We kindly ask all users to help to keep disk usage manageable, and performance reasonable. Please do the following tasks:

  • Remove files that are not needed anymore in your project's scratch folder. Note that we cannot bring back files that you delete by mistake, so do these operations carefully!
  • Compress files if it reduces file size. ASCII text files usually compress very well. Test with one file first. If the file size drops by 50%, go ahead and compress all similar files. See here for available compression tools.
  • Move files that are not in active use now, but that need to be available later during the project. The typical model is to move the files to Allas. We recommend using a-tools for small to medium-sized data transfers, in particular when you have a large number of small files. These tools make the usage of Allas safer, and can make your data management easier. For very large data transfers we recommend using rclone. A tutorial for data transfer is available at allas-examples.
  • Archive files that should be available longer than the lifetime of compute projects. Options for this include, for example, your organization's own storage systems, or IDA safe storage for research data.
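
The "test with one file first" advice above can be sketched as a small shell function. This is only a minimal sketch, assuming gzip as the compressor; the function name compressed_percent and the generated sample file are hypothetical and used purely for illustration.

```shell
# Print the gzip-compressed size of a file as a percentage of its original size.
# compressed_percent is a hypothetical helper name, not a CSC tool.
compressed_percent() {
    local file="$1"
    local original compressed
    original=$(wc -c < "$file")
    # Compress to standard output only, so the original file is left untouched.
    compressed=$(gzip -c "$file" | wc -c)
    if [ "$original" -eq 0 ]; then
        echo 0
        return
    fi
    echo $(( compressed * 100 / original ))
}

# Example: build a repetitive text file and check how well it compresses.
sample=$(mktemp)
for i in $(seq 1 2000); do echo "step $i energy -1.2345 converged"; done > "$sample"
percent=$(compressed_percent "$sample")
if [ "$percent" -le 50 ]; then
    echo "Compressed size is ${percent}% of the original: worth compressing similar files"
else
    echo "Compressed size is ${percent}% of the original: little to gain"
fi
rm -f "$sample"
```

Because the function writes the compressed data to a pipe rather than to disk, the test itself does not consume any extra scratch space.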

Identifying where you have data

If you have a large number of files, analyzing how much data you have in different folders can be time-consuming and also heavy on the file system. Our recommendations for tools that can show the amount of data in folders:

  • Avoid using find options like -size or similar
  • Avoid using du
  • Do use lue or lfs find --lazy

CSC has developed a tool called LUE (Lustre usage explorer) that reports the approximate amount of data in folders. Read the documentation at LUE before using it. lfs find --lazy has some edge cases where it can be as heavy as du, or can silently fail to get correct size information. Run man lfs-find for further instructions and information on its limitations.

Note

No matter what tool you use, you should never try to list or process all files in your project or scratch folder with a single command. Instead, you should run commands on specific subdirectories with a limited number of files and amount of data. The total amount of used data is available from the csc-workspaces command.

Automatic removal of files

There is a policy of removing files older than 90 days from scratch (not projappl) to ensure that only actively used data resides on the disk. This policy has not yet been fully implemented, but we plan to put the cleaning procedure fully into use later in 2022. The first stage of automatic removal is implemented on Puhti in June 2022: all files that have not been accessed since 2019 and 2020 will be deleted on July 1, 2022.

Files that will be deleted in the next cleanup are listed in so-called "purge list" files. These are split up by project, and can be found on Lustre at the locations below. Only members of the project groups can access the project directories.

  • /scratch/purge_lists/<PROJECT NAME>/path_summary.txt
  • /fmi/scratch/purge_lists/<PROJECT NAME>/path_summary.txt (only on Puhti)

The file system tools that CSC uses to generate the list of files to remove produce output that is quite verbose and difficult to read. By using the LCleaner tool described in the next section, users can get the relevant information in a more user-friendly format.

Using LCleaner to check which files will be automatically removed

LCleaner is a tool developed by CSC to help you discover which of your project's files have been targeted for automatic removal.

Run lcleaner --help on the login nodes to see what options LCleaner supports.

LCleaner examples

List your files

To get a simple list of all file paths in your purge list, simply give the path_summary.txt file path as an argument:

# List all files in your purge list:
my_project="project_2001659" # Replace with your own project name
lcleaner /scratch/purge_lists/${my_project}/path_summary.txt

If your path_summary.txt is big (over 100 MB in size), it may take some time to execute the tool. You can save time and resources by saving the result into an output file:

# List all files in your purge list into an output file in your home folder:
lcleaner --out-file ~/purge_list /scratch/purge_lists/${my_project}/path_summary.txt

# Alternatively, you can redirect the standard output with the bash shell:
lcleaner /scratch/purge_lists/${my_project}/path_summary.txt > ~/purge_list

# Check the output with less, or your preferred text editor
less ~/purge_list

If you want to search for a specific file or directory, you can use grep. You can either search the path_summary.txt file directly, or, if you saved the output of lcleaner with the commands above, search that file instead.

# Search for directories to check if they are included in the purge list
my_project="project_2001659" # Replace with your own project name!
grep "/scratch/$my_project/important-dir" /scratch/purge_lists/$my_project/path_summary.txt
# Or search the purge_list if you saved it:
grep "/scratch/$my_project/important-dir" ~/purge_list

# If there are no matches, grep will not print anything.
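
When several directories need checking, the search above can be repeated in a loop. The sketch below assumes a saved plain-text purge list with one path per line, as produced by the lcleaner commands above; the function name count_purged and the directory names in the usage comment are hypothetical. It uses grep -F so that characters such as dots in a directory name are matched literally rather than as a regular expression.

```shell
# Count how many purge-list entries lie under a given directory.
# Arguments: <purge list file> <directory path>
# count_purged is a hypothetical helper name, not part of LCleaner.
count_purged() {
    local list="$1" dir="$2"
    # -F: treat the path as a literal string; -c: print only the number of matching lines.
    # grep exits non-zero when nothing matches, so "|| true" keeps the function usable under "set -e".
    grep -c -F -- "${dir%/}/" "$list" || true
}

# Example usage with hypothetical directory names (not run here):
# my_project="project_2001659"  # Replace with your own project name
# for dir in important-dir results-2021; do
#     echo "$dir: $(count_purged ~/purge_list "/scratch/$my_project/$dir") files on the purge list"
# done
```

A count of 0 means that no file under that directory is currently on the purge list.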

Find the biggest files on the list

LCleaner has an option to sort the files by size. This option is called --sort-by-size and always sorts in descending order (i.e., biggest files first). By default, only the file paths are printed. If you want to see the size of the files when they are printed, use the --csv option. You can also limit the output to a given number of files with the --limit N parameter, where N is the number of lines you want to see.

# Print the file paths to be purged in size order:
lcleaner --sort-by-size /scratch/purge_lists/${my_project}/path_summary.txt

# Print the 10 biggest files:
lcleaner --sort-by-size --limit 10 /scratch/purge_lists/${my_project}/path_summary.txt

# Print the 10 biggest files, and their sizes in bytes:
lcleaner --sort-by-size --limit 10 --csv /scratch/purge_lists/${my_project}/path_summary.txt
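
To see how much space a purge would free in total, the sizes in the CSV output can be summed. This is a minimal sketch, assuming the CSV layout shown in the output examples on this page (a header row, then quoted "path","size" pairs with the size in bytes as the last field) and paths that do not contain newlines; the function name sum_csv_sizes is hypothetical.

```shell
# Sum the "size" column of lcleaner-style CSV read from standard input.
# sum_csv_sizes is a hypothetical helper name, not part of LCleaner.
sum_csv_sizes() {
    awk -F'","' '
        NR == 1 { next }              # skip the "path","size" header row
        {
            size = $NF                # the size is the last quoted field
            gsub(/"/, "", size)       # drop the closing quote
            total += size
        }
        END { printf "%.0f\n", total }
    '
}

# Example usage on a login node (not run here):
# lcleaner --csv /scratch/purge_lists/${my_project}/path_summary.txt | sum_csv_sizes
```

Taking the last field rather than the second one keeps the sum correct even if a path itself happens to contain the "," sequence.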

Delete your purge list files

We encourage you to delete the files you do not need, instead of waiting for the automatic cleaning to take place. If you are happy with purging all of the files that were listed in the path_summary.txt file, you can run the following command:

Note

The commands in this section will delete your files! Be sure that you have reviewed the list of files to remove, and that you have backed up the files you wish to save outside of the cluster prior to running them. This operation is irreversible.

Note

The deletion process may take a considerable amount of time (several hours, depending on the number of files), so it is best to start it within a screen or tmux session. That way you can disconnect from your SSH session while the deletion keeps running.

# Start a screen session
screen
# Delete all of the files on your purge list:
# Replace the "/path/to/my/path_summary.txt" with the path to your project's path_summary.txt
lcleaner -0 /path/to/my/path_summary.txt | xargs -0 -n 50 rm -vf --
# Then you can press "Ctrl + a" and then "d" to disconnect from the screen and keep
# the deletion running in the background.
# Run "screen -r" to reattach your screen.
# Close the screen session by typing "exit" in the shell.

If you want to delete only some of the files, e.g., those inside a certain directory, you can use a command like this:

# Delete only files on the list which are inside /scratch/$my_project/delete-this-dir/
# Start a screen session first, then run the whole pipeline inside it:
screen
lcleaner -0 /path/to/my/path_summary.txt | grep -zZ "/scratch/$my_project/delete-this-dir/" | xargs -0 -n 50 rm -vf --
# Ctrl + a, d to detach from the screen.
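
Before running a destructive pipeline like the ones above, it can be useful to check how many paths it would act on. The following is a minimal sketch, assuming null-terminated input such as the output of grep -z; count_nul_entries is a hypothetical helper name.

```shell
# Count null-terminated entries on standard input.
# count_nul_entries is a hypothetical helper name, not part of LCleaner.
count_nul_entries() {
    # Keep only the null bytes and count them; each entry ends with one.
    # The final tr strips padding that some wc implementations add.
    tr -cd '\0' | wc -c | tr -d ' '
}

# Example usage (not run here): count the paths that the filtered deletion would touch.
# grep -z terminates every record it prints with a null byte, so counting
# null bytes counts paths, even if a path contains spaces or newlines.
# lcleaner -0 /path/to/my/path_summary.txt \
#   | grep -z "/scratch/$my_project/delete-this-dir/" \
#   | count_nul_entries
```

If the count is much larger or smaller than expected, review the grep pattern before replacing the counting step with the rm command.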

LCleaner output formats

If you want to see the size of the files that are about to be purged, you can use either the JSON or the CSV format. Be aware that if you want to produce multiple output formats at the same time, you need to specify an output file path as well. Using the -0 or --nullbyte parameter will output the file paths separated by a null byte, which may be useful to avoid problems with whitespace in the file paths.

# Print your purge list as CSV output with file paths and sizes.
# Note that the CSV format also prints a header row.
lcleaner --csv /scratch/purge_lists/${my_project}/path_summary.txt

# Print your purge list as JSON output with file paths and sizes:
lcleaner --json /scratch/purge_lists/${my_project}/path_summary.txt
# TIP: You can pipe the output into the jq program to prettify the output.
# The dot at the end is a mandatory argument to jq.
lcleaner --json /scratch/purge_lists/${my_project}/path_summary.txt | jq .

# Output both JSON and CSV into purge_list.json and purge_list.csv:
lcleaner --json --csv --out-file purge_list /scratch/purge_lists/${my_project}/path_summary.txt

# Output file paths separated by null bytes:
lcleaner -0 /scratch/purge_lists/${my_project}/path_summary.txt
# Usually you will want to pipe null-byte-separated output into "xargs -0" and do some
# further processing with it. For example like this:
lcleaner -0 --limit 3 /scratch/purge_lists/${my_project}/path_summary.txt \
  | xargs -0 -Ifilepath echo "I should run: rm -vf 'filepath'"

Output examples:

# Plain text:
[westersu@puhti-login1 ~]$ lcleaner path_summary.txt | head -3
/scratch/westersu/my-old-files/file1
/scratch/westersu/my-old-files/file2
/scratch/westersu/my-old-files/file3

# CSV:
[westersu@puhti-login1 ~]$ lcleaner --csv path_summary.txt | head -4
"path","size"
"/scratch/westersu/my-old-files/file1","1704"
"/scratch/westersu/my-old-files/file2","452"
"/scratch/westersu/my-old-files/file3","4951"

# JSON, piped into jq:
[westersu@puhti-login1 ~]$ lcleaner --json path_summary.txt | jq .
{
  "lustre_files": [
    {
      "size": 1704,
      "path": "/scratch/westersu/my-old-files/file1"
    },
    ...
  ]
}

# Null byte xargs:
[westersu@puhti-login1 ~]$ lcleaner -0 --limit 3 path_summary.txt \
>   | xargs -0 -Ifilepath echo "I should run: rm -vf 'filepath'"
I should run: rm -vf '/scratch/westersu/my-old-files/file1'
I should run: rm -vf '/scratch/westersu/my-old-files/file2'
I should run: rm -vf '/scratch/westersu/my-old-files/file3'

Notes on LCleaner usage

This section describes some details of how LCleaner behaves, and explains why the command examples above are written the way they are.

  • Sometimes lcleaner prints errors about lines it wasn't able to parse. If there are errors, a warning will be printed at the end, indicating that there was at least one error. The warnings will say something like: "We detected N errors during the execution. Please check the logs, for more information!" The errors indicate which line number the problematic text was on, so you can go and check it manually.
    • Tip: To print only a specific line, e.g., line 123 of the path_summary.txt, you can use this command: sed -n 123p /path/to/path_summary.txt
  • To capture the logging of lcleaner, you can redirect the standard error output stream into a file. This may be useful if you experience problems, and would like help to troubleshoot the situation.
    • lcleaner --log-level debug path_summary.txt 2> ~/lcleaner-debug-$(date +%s).log
  • The use of -0 both with lcleaner and xargs in the example commands on this page is recommended in order to avoid problems with file names that include whitespace.

Last update: June 10, 2022