Common use cases

Processing data in CSC supercomputers

The CSC supercomputers provide disk environments for working with large datasets. These storage areas are, however, not intended for data that is not in active use. For example, unused files in the scratch area of Puhti are automatically removed after 90 days.

One of the main use cases of Allas is to store data while it is not actively used on the CSC supercomputers. When you start working, you stage in the data from Allas. When the data is no longer actively used, you stage it out to Allas.

On CSC supercomputers, a connection to Allas can be established with the commands:

module load allas
allas-conf

After that you can:

List the data buckets and objects in Allas: For listing we recommend a-list.

a-list

The command above lists the buckets available in Allas. To list the objects in a bucket, use the command:

a-list bucket_name

Alternatively, you can use rclone commands:

rclone lsd allas:
rclone ls allas:bucket_name

Copy data from Allas to a supercomputer (Puhti or Mahti) (stage in): For downloading we recommend a-get.

a-get bucket/object_name

or rclone copy:

rclone copy allas:bucket/object_name ./

Copy data from a supercomputer to Allas (stage out): For uploading we recommend a-put.

a-put filename

or rclone copy:

rclone copy file.dat allas:bucket_name

Note

Both a-put/a-get and rclone use the Swift protocol on Allas. It is important not to mix Swift and S3, as these protocols are not fully compatible with each other.
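The stage-in and stage-out steps above can be combined into one cycle. The following is a minimal sketch, assuming that `module load allas` and `allas-conf` have already been run in the current shell. The names `process_data`, `my_bucket`, `input.dat` and `result.dat` are hypothetical placeholders, and the `DRY_RUN` switch is an illustration rather than a feature of the Allas tools.

```shell
# Sketch of a stage-in / process / stage-out cycle using the rclone commands above.
# Assumes `module load allas` and `allas-conf` were already run in this shell.
# process_data, my_bucket, input.dat and result.dat are hypothetical placeholders.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = also execute them

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

stage_cycle() {
    local bucket="$1" input="$2" output="$3"
    run rclone copy "allas:${bucket}/${input}" ./ || return 1   # stage in
    run process_data "$input" "$output" || return 1             # placeholder analysis step
    run rclone copy "$output" "allas:${bucket}" || return 1     # stage out
}
```

Calling `stage_cycle my_bucket input.dat result.dat` with the default `DRY_RUN=1` only prints the three commands, which makes it possible to check bucket and file names before any data is moved.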

Sharing data

Sharing data, e.g. datasets or research results, is easy in the object storage. You can share these either with a limited audience, e.g. other projects, or allow access for everybody by making the data public.

The data can be accessed and shared in a variety of ways:

  • Private – default: By default, if you do not specify anything else, the contents of buckets can only be accessed by authenticated members of your project. Private/Public settings can be managed with:

  • Access control lists: Access control lists (ACLs) work on buckets, not objects. With ACLs, you can share your data with other projects in a limited manner. For example, you can grant a collaboration project authenticated read access to your datasets.

  • Public: You can also have ACLs granting public read access to data, which is useful e.g. for sharing public scientific results or public datasets.
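The three access levels above can be illustrated with the Swift command line client. The sketch below only prints the `swift post` command for each level instead of running it. It assumes that bucket read ACLs are set with the `--read-acl` option of the `swift` client, that access for another project is written as `<project_id>:*`, and that an empty ACL returns the bucket to project-only access. The function name `acl_command`, the bucket name and the project ID are hypothetical.

```shell
# Sketch: print the swift command for a given access level instead of running it.
# The bucket name and project ID passed in are hypothetical placeholders.

acl_command() {
    local level="$1" bucket="$2" project_id="${3:-}"
    case "$level" in
        private)
            # An empty read ACL leaves access to authenticated project members only.
            echo "swift post $bucket --read-acl ''" ;;
        project)
            if [ -z "$project_id" ]; then
                echo "project level needs a project ID" >&2
                return 1
            fi
            # Grant read access to all users of one other project.
            echo "swift post $bucket --read-acl '${project_id}:*'" ;;
        public)
            # Allow anonymous reads and bucket listings.
            echo "swift post $bucket --read-acl '.r:*,.rlistings'" ;;
        *)
            echo "unknown access level: $level" >&2
            return 1 ;;
    esac
}
```

For example, `acl_command public my_bucket` prints the command that would make `my_bucket` publicly readable. Check the printed command against the current Allas instructions before running it.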

Static web content

A common way to use the object storage is to store static web content, such as images, videos, audio, PDFs or other downloadable content, and to link to it from a web page, which can run either inside Allas or somewhere else.

Uploading data to Allas can be done with any of the following clients: web client, a-commands, rclone, Swift or S3.

Storing data for distributed use

There are several cases where the same data must be accessed from several locations. In these cases, staging in the data from the object storage to the individual servers or computers can be used instead of a shared file storage.

Accessing the same data via multiple CSC platforms

Since the data in the object storage is available anywhere, you can access the data via both the CSC clusters and cloud services. This makes the object storage a good place to store data as well as intermediate and final results in cases where the workflow requires the use of e.g. both Allas and Puhti.

Collecting data from different sources

It is easy to push data to the object storage from several different sources. This data can then later be processed as needed.

For example, several data collectors may push data to be processed, e.g. scientific instruments, meters, or software that harvests social media streams for scientific analysis. They can push their data into the object storage, and later virtual machines and computing jobs on Puhti can process the data.

Self-service backups of data

The object storage is also often used as a location for storing backups. It is a convenient place to push copies of database dumps.
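Pushing a database dump could look roughly like the sketch below. It assumes that the dump file already exists locally and that `rclone copyto` is used so that the object can be given a dated name. The function names, the bucket name and the `date/filename` naming scheme are hypothetical choices made for this illustration.

```shell
# Sketch: upload a database dump under a dated object name.
# Function names, bucket name and naming scheme are hypothetical.

dated_object_name() {
    # $1 = local dump file, $2 = optional date stamp (defaults to today, YYYY-MM-DD)
    local file="$1" stamp="${2:-$(date +%F)}"
    printf '%s/%s\n' "$stamp" "$(basename "$file")"
}

push_dump() {
    # $1 = bucket, $2 = local dump file, $3 = optional date stamp
    local bucket="$1" file="$2"
    rclone copyto "$file" "allas:${bucket}/$(dated_object_name "$file" "${3:-}")"
}
```

With this naming, `push_dump my_bucket /tmp/db.sql` stores each day's dump under its own date prefix, so a new upload does not overwrite the dump from an earlier day.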

Note

Allas-backup is not a real backup service. It only copies the data to another bucket in Allas, which can easily be removed or overwritten by any authenticated user.

Files larger than 5 GB

Files larger than 5 GB are divided into smaller segments during upload.

  • a-put and rclone split large files automatically: a-put

  • Using Swift, you can use the Static Large Object: swift with large files

  • s3cmd splits large files automatically: s3cmd put

After upload, s3cmd joins these segments into one large object, but with Swift-based uploads (a-put, rclone, swift) the large files are also stored as several objects. This is done automatically in a bucket that is named by adding the extension _segments to the original bucket name. For example, if you use a-put to upload a large file to bucket 123-dataset, the actual data is stored as several pieces in bucket 123-dataset_segments. The target bucket 123-dataset contains just a front object that records which segments make up the stored file. Operations performed on the front object are automatically reflected in the segments. Normally users don't need to work with the segments buckets at all, and objects inside these buckets should not be deleted or modified.
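The naming rule for the segments bucket can be written down as a small helper. This is a minimal sketch: the function names are hypothetical, and the size limit is an assumption that reads "5 GB" as 5 GiB, which may not match the exact threshold used by each client.

```shell
# Sketch: naming rule for the segments bucket and a size check for the 5 GB limit.
# Assumption: "5 GB" is taken here as 5 GiB (5 * 1024^3 bytes); the exact limit may differ.
SEGMENT_LIMIT_BYTES=$((5 * 1024 * 1024 * 1024))

segments_bucket() {
    # Name of the bucket that holds the segments for objects in bucket $1.
    printf '%s_segments\n' "$1"
}

will_be_segmented() {
    # Succeeds when a file of $1 bytes is over the assumed limit.
    [ "$1" -gt "$SEGMENT_LIMIT_BYTES" ]
}
```

For example, `segments_bucket 123-dataset` prints `123-dataset_segments`, and `will_be_segmented "$(wc -c < file.dat)"` tells whether a local file is over the assumed limit.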

Viewing

On CSC supercomputers, you can check the number of objects and the amount of stored data in your current Allas project with the command:

a-info

If you are using the s3cmd client, check your project's object storage usage:

s3cmd du -H

If you use the Swift client:

swift stat

Display how much space a bucket has used:

swift stat $bucketname
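To see the usage of every bucket at once, the two swift commands above can be combined in a loop. The sketch below assumes that `swift list` prints one bucket name per line and that `swift stat` for a bucket prints lines beginning with `Objects:` and `Bytes:`. The function name `bucket_usage` is hypothetical.

```shell
# Sketch: print object count and size for each bucket in the project.
# Assumes `swift list` prints one bucket per line and that `swift stat <bucket>`
# prints "Objects: N" and "Bytes: N" lines.

bucket_usage() {
    swift list | while IFS= read -r bucket; do
        # Redirect stdin so that swift stat cannot consume the bucket list.
        swift stat "$bucket" < /dev/null | awk -v b="$bucket" '
            $1 == "Objects:" { objects = $2 }
            $1 == "Bytes:"   { bytes = $2 }
            END { printf "%s\t%s objects\t%s bytes\n", b, objects, bytes }'
    done
}
```

Running `bucket_usage` prints one tab-separated line per bucket, which can be sorted or filtered with standard tools.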

Please contact servicedesk@csc.fi if you have questions.