Using Allas to host a data set for a research project

An example scenario of an Allas use case.

Roles of the play

Saara: A professor coordinating an inspiring research project.

Pekka: A researcher who takes care of the data management of the project.

Mats: A technician working at Analysis Service Center.

Xi and Laura: Researchers working in the research project.

Act 1. Professor Saara opens CSC projects

Professor Saara is running a large research project called HiaNo at a Finnish university. The project has just sent a set of samples to Analysis Service Center to be processed and analyzed. The analysis takes some weeks and produces 80 TB of data that the research group will use in the actual research.

Saara and Pekka, who is taking care of the data management, study the storage options provided by CSC. They decide to use the Allas service for storing and sharing the data during the research project. The data is not sensitive personal data, so Allas is suitable.

As a first step, Saara and Pekka log in to the MyCSC portal and register as CSC users.

Then Saara creates two research projects at CSC: one called Data management of the HiaNo project (project ID: project_2000444) and another called HiaNo research project (project ID: project_2000333).

Once the CSC projects are established, Saara activates the Allas, Puhti and cPouta services for both projects. As Saara knows that the default storage space of Allas (10 TB) will not be enough for the incoming data set, she sends a request for 90 TB of Allas quota for the project Data management of the HiaNo project to servicedesk@csc.fi.

Finally, Saara adds Pekka to both CSC projects and asks him to take care of the details of the incoming data.

Act 2. Creating a shared bucket

Mats from Analysis Service Center contacts Pekka, tells him that the results are available, and asks how he should deliver the data. Mats has an account at CSC (msundber in the project project_2000111) with Allas enabled, so Pekka proposes that the data be uploaded to Allas. For that purpose, Pekka creates a bucket in Allas and allows Mats to use it.

Pekka logs in to Puhti:

ssh puhti.csc.fi

and opens a connection to the data management project in Allas:

module load allas
allas-conf project_2000444

Then he creates a new bucket in Allas. There are many ways to do this; this time, Pekka does it by uploading a new file to Allas with a-put:

echo "This bucket is used to host the original data of HiaNo project sample1" > README.txt
a-put --nc -b hiano-project-sample001 README.txt
a-list hiano-project-sample001

Pekka included the project name in the bucket name (hiano-project-sample001) to make sure that the bucket name is unique in the whole Allas service. The a-list command shows that the bucket was successfully created.

Next, Pekka uses the swift post command to modify the access rights of the new bucket so that Mats (user msundber from Allas project_2000111) is able to use it.

swift post hiano-project-sample001 -r "project_2000444:*,project_2000111:msundber"
swift post hiano-project-sample001 -w "project_2000444:*,project_2000111:msundber"
swift stat hiano-project-sample001

In Allas, large files (over 5 GB) are split during the upload and stored as several objects in a separate bucket, which is normally created automatically. This bucket has the same name as the target bucket with the suffix _segments. In this example, the name would be hiano-project-sample001_segments. Normally, users should not directly interact with the segments buckets, but this case is an exception. Pekka now creates the segments bucket manually as well, to ensure that it is created (and thus owned) by the same project and to be able to set access rights for it.

a-put --nc -b hiano-project-sample001_segments README.txt
a-list hiano-project-sample001_segments
swift post hiano-project-sample001_segments -r "project_2000444:*,project_2000111:msundber"
swift post hiano-project-sample001_segments -w "project_2000444:*,project_2000111:msundber"
swift stat hiano-project-sample001_segments
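
Because the same read and write rights must be set on both the bucket and its segments bucket, the swift post calls above could be wrapped in a small loop. The following is only a sketch: share_bucket and the RUN variable are hypothetical names introduced here, not part of the Allas tools. By default the script only prints the commands it would run; setting RUN to an empty value executes them.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Run with RUN= (empty) to execute them for real.
RUN=${RUN-echo}

# share_bucket is a hypothetical helper, not an Allas command.
# It sets the same read and write ACL on a bucket and on its _segments bucket.
share_bucket() {
    bucket=$1
    acl=$2
    for b in "$bucket" "${bucket}_segments"; do
        $RUN swift post "$b" -r "$acl"
        $RUN swift post "$b" -w "$acl"
    done
}

# In dry-run mode the ACL is printed without quotes; quote it when typing the
# command by hand so that the shell does not expand the * character.
share_bucket hiano-project-sample001 "project_2000444:*,project_2000111:msundber"
```

Keeping the two buckets in one loop makes it harder to forget the segments bucket, which would leave large files unreadable for the other user.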

Now Pekka has prepared a bucket (and the corresponding segments bucket) into which Mats can upload the data. Pekka still needs to send the name of the bucket to Mats, because the normal Allas listing commands do not show the bucket to Mats, who is not a member of the project that owns it.

Act 3. Uploading data

Mats has the Allas tools installed on the front-end server of the measurement device at Analysis Service Center. Thus he can upload the data directly from the front-end server to the hiano-project-sample001 bucket in Allas:

rclone copy sample1/cannel43/aa_3278830.dat allas:hiano-project-sample001/sample1/cannel43/
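
A data set of this size is more practical to upload one directory at a time than file by file. The sketch below shows one way to do that with rclone copy, which skips files that are already identical at the destination, so a failed batch can simply be rerun. The helper upload_batch and the variables RUN and BUCKET are hypothetical names introduced here; the directory name is the one used in the example above. By default the script only prints the commands.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Run with RUN= (empty) to execute them for real.
RUN=${RUN-echo}
BUCKET=hiano-project-sample001

# upload_batch is a hypothetical helper, not an Allas command.
# It copies one local directory into the bucket under the same relative path.
upload_batch() {
    dir=$1
    $RUN rclone copy "$dir" "allas:${BUCKET}/${dir}"
}

# One batch per directory; add further directories to the list as needed.
for d in sample1/cannel43; do
    if ! upload_batch "$d"; then
        echo "upload of $d failed, rerun this batch" >&2
    fi
done
```

After a batch has finished, rclone check with the same source and destination arguments can be used to compare the local files against the objects in Allas.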

As there is a large amount of data to be transferred, the upload takes a few days and needs to be done in several batches. When Mats tells Pekka that he has finished uploading the data, Pekka closes the shared bucket:

swift post hiano-project-sample001 -r ""
swift post hiano-project-sample001_segments -r ""
swift post hiano-project-sample001 -w ""
swift post hiano-project-sample001_segments -w ""
swift stat hiano-project-sample001
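
The swift stat output can be checked by eye, but the check can also be scripted. The sketch below assumes that the output contains lines labelled "Read ACL:" and "Write ACL:", as the swift command line client prints them for a container; acl_is_closed is a hypothetical helper introduced here, not an Allas command.

```shell
#!/bin/sh
# acl_is_closed is a hypothetical helper, not an Allas command.
# It reads the output of "swift stat <bucket>" from standard input and
# succeeds only if both the Read ACL and the Write ACL fields are empty.
# Note: a field that is missing altogether is also treated as empty.
acl_is_closed() {
    stat_output=$(cat)
    read_acl=$(printf '%s\n' "$stat_output" | sed -n 's/^ *Read ACL: *//p')
    write_acl=$(printf '%s\n' "$stat_output" | sed -n 's/^ *Write ACL: *//p')
    [ -z "$read_acl" ] && [ -z "$write_acl" ]
}

# Intended use, after "module load allas" and "allas-conf project_2000444":
#   swift stat hiano-project-sample001 | acl_is_closed && echo "bucket is closed"
#   swift stat hiano-project-sample001_segments | acl_is_closed && echo "segments bucket is closed"
```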

Act 4. Using the data in research

Once the data is available, the actual analysis work begins. There will be several users using the data set during the research project. Pekka knows that if all users use the data with full access rights (read and write), there is a danger that somebody accidentally deletes or overwrites some part of the data. Thus, it is agreed that while the data is hosted by the data management project (project_2000444), the researchers access the data through the HiaNo research project (project_2000333).

Pekka gives the project project_2000333 read access, but no write access, to the hiano-project-sample001 bucket:

module load allas
allas-conf project_2000444
swift post hiano-project-sample001 -r "project_2000333:*,project_2000444:*"
swift post hiano-project-sample001_segments -r "project_2000333:*,project_2000444:*"

Xi and Laura can now start working with the data. They register using the MyCSC portal, after which Saara, who is the Principal Investigator, adds them to the CSC project HiaNo research project (project_2000333).

Xi and Laura need to revisit MyCSC and accept the services of the research project. After that, they can download the research data they need to any environment that is able to connect to Allas: Puhti, a virtual machine in cPouta, or their own laptop. As new researchers join the project, Saara adds them to project_2000333, so that they can access the data.

Because storing data in Allas consumes billing units, Saara needs to check the balance in MyCSC from time to time and, if needed, apply for more billing units (80 TB consumes 700,800 billing units per year). Fortunately, HiaNo is an academic research project, so Saara does not need to pay for the billing units.
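
The yearly figure above corresponds to a rate of one billing unit per terabyte per hour. That rate is inferred here from the numbers in the text and should be checked against the current CSC pricing before it is used for planning. A minimal sketch of the arithmetic, with annual_bu as a hypothetical helper:

```shell
#!/bin/sh
# annual_bu is a hypothetical helper for estimating storage costs.
# Usage: annual_bu <terabytes> [billing units per TB per hour]
# The default rate of 1 is an assumption inferred from the text above.
annual_bu() {
    tb=$1
    bu_per_tb_hour=${2:-1}
    # terabytes * 24 hours * 365 days * rate
    echo $(( tb * 24 * 365 * bu_per_tb_hour ))
}

annual_bu 80    # prints 700800
```

The same function gives 788400 for the full 90 TB quota, which is the upper bound Saara would need to budget for if the quota were filled.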

Act 5. The end

After four years of intensive research that has expanded to several institutes in Finland and abroad, the HiaNo project has produced a few theses and many high-quality publications (all acknowledging the use of CSC resources).

The data is no longer actively used. A part of the data that was imported to Allas has been published in international research databases. Some datasets have been moved to IDA, so that a DOI identifier and metadata can be linked to the data to make it reusable by other researchers. Some data can now be deleted, and some of the remaining parts can be moved to the buckets of the new HiaNo2 project.

At this stage, Pekka cleans the remaining data objects from Allas, after which Saara informs CSC that the project can be closed.