The S3 client
This chapter describes how to use the Allas object storage service with the s3cmd command line client. This client uses the S3 protocol, which differs from the Swift protocol used in the Rclone, swift and a-commands examples. Normally, data uploaded with S3 can also be used with the Swift protocol. However, files larger than 5 GB that were uploaded to Allas with Swift cannot be downloaded with the S3 protocol.
From the user perspective, one of the main differences between S3 and Swift protocols is that Swift based connections remain valid for eight hours at a time, but with S3, the connection remains permanently open. The permanent connection is practical in many ways but it has a security aspect: if your CSC account is compromised, so is the object storage space.
The syntax of the s3cmd command:
s3cmd -options command parameters
The most commonly used s3cmd commands:
| Command | Function |
|---|---|
| mb | Create a bucket |
| put | Upload an object |
| ls | List objects and buckets |
| get | Download objects and buckets |
| del | Remove objects or buckets |
| md5sum | Get the checksum |
| signurl | Create a temporary URL |
| put -P | Make an object public |
| setacl --acl-grant | Manage access rights |
The table above lists only the most essential s3cmd commands. For a more complete list, visit the s3cmd manual page or type:
s3cmd --help
Getting started with s3cmd
If you use Allas on Puhti or Mahti, all required packages and software are already installed. In this case you can skip this chapter and proceed to the section Configuring S3 connection in supercomputers.
To configure an s3cmd connection, you need to have OpenStack and s3cmd installed in your environment.
OpenStack s3cmd installation:
Using yum:
sudo yum update
sudo yum install python3
sudo pip3 install python-openstackclient
sudo yum install s3cmd
Using apt:
sudo apt install python3-pip
sudo pip3 install python-openstackclient
sudo apt install restic
curl https://rclone.org/install.sh | sudo bash
sudo pip3 install s3cmd
python3 virtualenv pip3 install s3cmd s3cmd
Configuring S3 connection in local computer
Once you have OpenStack and s3cmd installed in your environment, you can download the allas_conf script to set up the S3 connection to your Allas project.
wget https://raw.githubusercontent.com/CSCfi/allas-cli-utils/master/allas_conf
source allas_conf --mode s3cmd --user your-csc-username
Note that you should use the --user option to define your CSC username. The configuration command first asks for your CSC password and then asks you to choose an Allas project. After that, the tool creates a key file for the S3 connection and stores it in the default location (.s3cfg in your home directory).
Configuring S3 connection in supercomputers
To use s3cmd in Puhti and Mahti, you must first configure the connection:
module load allas
allas-conf --mode s3cmd
The configuration process first asks for your CSC password. Then it lists your Allas projects and asks you to select the project to be used. The configuration information is stored in the file $HOME/.s3cfg. This configuration only needs to be defined once. In the future, s3cmd will automatically use the object storage connection described in the .s3cfg file. If you wish to change the Allas project that s3cmd uses, you need to run the configuration command again.
You can use the S3 credentials, stored in the .s3cfg file, in other services too. You can check the currently used access_key and secret_key with the command:
grep key $HOME/.s3cfg
If you use these keys in other services, you should make sure that the keys always remain private. Anyone who has access to these two keys can access and modify all the data that the project has in Allas.
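When another tool needs just one of the values, reading it by exact name avoids printing both keys to the terminal. The sketch below assumes the usual `name = value` layout of the .s3cfg file; the function name `s3cfg_value` is hypothetical and is not part of s3cmd or allas-conf.

```shell
# Hypothetical helper: print the value of one setting from an s3cmd
# configuration file. Assumes lines of the form "name = value".
# Usage: s3cfg_value NAME [FILE]   (FILE defaults to $HOME/.s3cfg)
s3cfg_value() {
    cfg_file="${2:-$HOME/.s3cfg}"
    # Match NAME at the start of a line, then "=", and print the value
    # without surrounding whitespace. Only the first match is printed.
    sed -n "s/^[[:space:]]*${1}[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*\$/\1/p" "$cfg_file" | head -n 1
}

# Example (not executed here):
#   export AWS_ACCESS_KEY_ID="$(s3cfg_value access_key)"
```

Since the file holds the secret key, it is also reasonable to keep it readable only by you, for example with `chmod 600 $HOME/.s3cfg`.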
If needed, you can deactivate an S3 key pair with the command:
Create buckets and upload objects
Create a new bucket:
s3cmd mb s3://my_bucket
Upload a file to a bucket:
s3cmd put my_file s3://my_bucket
List objects and buckets
List all buckets in a project:
s3cmd ls
List all objects in a bucket:
s3cmd ls s3://my_bucket
Display information about a bucket:
s3cmd info s3://my_bucket
Display information about an object:
s3cmd info s3://my_bucket/my_file
Download objects and buckets
Download an object:
s3cmd get s3://my_bucket/my_file new_file_name
The parameter new_file_name is optional. It defines a new name for the downloaded file.
Using the command md5sum, you can check that the file has not been changed or corrupted:
$ md5sum my_file new_file_name
39bcb6992e461b269b95b3bda303addf  my_file
39bcb6992e461b269b95b3bda303addf  new_file_name
In the above example, the checksums match between the original and downloaded file.
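For scripts, comparing the two checksums by eye is not practical. As a minimal sketch, the comparison can be wrapped in a small function; the name `same_md5` is hypothetical, and the code assumes the coreutils `md5sum` command is available.

```shell
# Hypothetical helper: succeed (exit status 0) only when both files
# have the same MD5 checksum. Returns 2 if either file is unreadable.
# Usage: same_md5 FILE_A FILE_B
same_md5() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # Reading from stdin makes md5sum print "<checksum>  -", so the
    # first space-separated field is the checksum alone.
    sum_a=$(md5sum < "$1" | cut -d ' ' -f 1)
    sum_b=$(md5sum < "$2" | cut -d ' ' -f 1)
    [ "$sum_a" = "$sum_b" ]
}

# Example (not executed here):
#   same_md5 my_file new_file_name && echo "checksums match"
```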
Download an entire bucket:
s3cmd get -r s3://my_bucket/
Copy an object to another bucket. Note that you should use these commands only for objects that were uploaded to Allas with the S3 protocol:
s3cmd cp s3://sourcebucket/objectname s3://destinationbucket
$ s3cmd cp s3://bigbucket/bigfish s3://my-new-bucket
remote copy: 's3://bigbucket/bigfish' -> 's3://my-new-bucket/bigfish'
Rename the file while copying it:
$ s3cmd cp s3://bigbucket/bigfish s3://my-new-bucket/newname
remote copy: 's3://bigbucket/bigfish' -> 's3://my-new-bucket/newname'
Delete objects and buckets
Delete an object:
s3cmd del s3://my_bucket/my_file
Delete a bucket:
s3cmd rb s3://my_bucket
Note: You can only delete empty buckets.
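Because of this restriction, removing a bucket that still contains data takes two steps: delete the objects, then delete the bucket. The sketch below combines them. The function name and the `S3CMD` variable are assumptions of this example; setting `S3CMD="echo s3cmd"` prints the commands instead of running them, which is a sensible precaution since the deletion cannot be undone.

```shell
# Assumption of this sketch: S3CMD can be overridden, for example with
# "echo s3cmd", to preview the commands without touching any data.
S3CMD="${S3CMD:-s3cmd}"

# Hypothetical helper: delete every object in a bucket, then the bucket.
# Usage: remove_bucket_with_contents BUCKET_NAME
remove_bucket_with_contents() {
    [ -n "$1" ] || { echo "bucket name required" >&2; return 1; }
    # --recursive together with --force removes all objects in the bucket.
    $S3CMD del --recursive --force "s3://$1" || return 1
    # The bucket is now empty, so it can be removed.
    $S3CMD rb "s3://$1"
}

# Example (not executed here):
#   S3CMD="echo s3cmd" remove_bucket_with_contents my_bucket
```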
s3cmd and public objects
In this example, the object salmon.jpg in the pseudo folder fishes is made public:
$ s3cmd put fishes/salmon.jpg s3://my_fishbucket/fishes/salmon.jpg -P
Public URL of the object is: https://a3s.fi/my_fishbucket/fishes/salmon.jpg
Giving another project read access to a bucket
You can control access rights using the command s3cmd setacl. This command requires the UUID (universally unique identifier) of the project you want to grant access to. Project members can check their project ID in https://pouta.csc.fi/dashboard/identity/ or using the command openstack project show. For example, on Puhti and Mahti:
module load allas
allas-conf -k --mode s3cmd
openstack project show $OS_PROJECT_NAME
With s3cmd, read and write access can be controlled for both buckets and objects.
The following command gives the project with UUID 3d5b0ae8e724b439a4cd16d1290 read access to my_fishbucket, but not to the objects inside it:
s3cmd setacl --acl-grant=read:3d5b0ae8e724b439a4cd16d1290 s3://my_fishbucket
Similarly, the following command gives write access to just a single object:
s3cmd setacl --acl-grant=write:3d5b0ae8e724b439a4cd16d1290 s3://my_fishbucket/bigfish
If you want to modify the access permissions of all the objects in a bucket, you can add the option --recursive to the command:
s3cmd setacl --recursive --acl-grant=read:3d5b0ae8e724b439a4cd16d1290 s3://my_fishbucket
You can check the access permissions with s3cmd info:
$ s3cmd info s3://my_fishbucket|grep -i acl
ACL: other_project_uuid: READ
ACL: my_project_uuid: FULL_CONTROL
The option --acl-revoke can be used to remove read or write access:
s3cmd setacl --recursive --acl-revoke=read:$other_project_uuid s3://my_fishbucket
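Since a bucket-level grant does not cover the objects inside the bucket, sharing a whole bucket for reading takes both of the setacl forms shown above. A minimal sketch that runs them together follows. The function name `grant_read` and the `S3CMD` override are assumptions of this example, and objects uploaded after the grant would presumably need the command to be run again.

```shell
# Assumption of this sketch: S3CMD can be overridden, for example with
# "echo s3cmd", to preview the commands without changing any ACLs.
S3CMD="${S3CMD:-s3cmd}"

# Hypothetical helper: give another project read access to a bucket
# and to the objects currently in it.
# Usage: grant_read PROJECT_UUID BUCKET_NAME
grant_read() {
    [ -n "$1" ] && [ -n "$2" ] || {
        echo "usage: grant_read PROJECT_UUID BUCKET_NAME" >&2
        return 1
    }
    # Bucket-level grant: the other project can list the bucket.
    $S3CMD setacl --acl-grant="read:$1" "s3://$2" || return 1
    # Object-level grant for every object already in the bucket.
    $S3CMD setacl --recursive --acl-grant="read:$1" "s3://$2"
}

# Example (not executed here):
#   grant_read 3d5b0ae8e724b439a4cd16d1290 my_fishbucket
```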
The shared objects and buckets can be used with both S3 and Swift based tools. Note, however, that listing commands show only buckets owned by your project. To use shared buckets and objects, you must know the names of the buckets.
In the example above, a user from project 3d5b0ae8e724b439a4cd16d1290 will not see the shared my_fishbucket with the command:
s3cmd ls
However, she can list the content of the bucket with the command:
s3cmd ls s3://my_fishbucket
In the Pouta web UI, a user can move to a shared bucket by defining the bucket name in the URL. Move to some bucket of your project and replace the bucket name at the end of the URL with the name of the shared bucket.
In this example, we store a simple dataset in Allas using s3cmd.
First, create a new bucket. The command s3cmd ls reveals that the object storage is empty at first. Then, use the command s3cmd mb to create a new bucket called fish-bucket.
$ s3cmd ls
ls
$ s3cmd mb s3://fish-bucket
mb s3://fish-bucket/
Bucket 's3://fish-bucket/' created
$ s3cmd ls
ls
2018-03-12 13:01  s3://fish-bucket
It is recommended to collect the data to be stored as larger units and compress it before uploading it to the system.
In this example, we store the Bowtie2 indices and the genome of the zebrafish (Danio rerio) in the fish bucket. Running ls -lh shows that the index files are available in the current directory:
$ ls -lh
total 3.2G
-rw------- 1 kkayttaj csc 440M Mar 12 13:41 Danio_rerio.1.bt2
-rw------- 1 kkayttaj csc 327M Mar 12 13:41 Danio_rerio.2.bt2
-rw------- 1 kkayttaj csc 217K Mar 12 13:20 Danio_rerio.3.bt2
-rw------- 1 kkayttaj csc 327M Mar 12 13:20 Danio_rerio.4.bt2
-rw------- 1 kkayttaj csc 1.3G Mar 12 13:13 Danio_rerio.GRCz10.dna.toplevel.fa
-rw------- 1 kkayttaj csc 440M Mar 12 14:03 Danio_rerio.rev.1.bt2
-rw------- 1 kkayttaj csc 327M Mar 12 14:03 Danio_rerio.rev.2.bt2
-rw------- 1 kkayttaj csc 599K Mar 12 13:13 log
The data is collected and compressed as a single file using the tar command:
tar zcf zebrafish.tgz Danio_rerio*
The size of the resulting file is about 2 GB. The compressed file can be uploaded to the fish bucket using the command s3cmd put:
$ ls -lh zebrafish.tgz
-rw------- 1 kkayttaj csc 9.3G Mar 12 15:23 zebrafish.tgz
$ s3cmd put zebrafish.tgz s3://fish-bucket
put zebrafish.tgz s3://fish-bucket
upload: 'zebrafish.tgz' -> 's3://fish-bucket/zebrafish.tgz' [1 of 1]
 2081306836 of 2081306836 100% in 39s 50.16 MB/s done
$ s3cmd ls s3://fish-bucket
ls s3://fish-bucket
2019-10-01 12:11 9982519261 s3://fish-bucket/zebrafish.tgz
Uploading 2 GB of data takes time. Retrieve the uploaded file:
s3cmd get s3://fish-bucket/zebrafish.tgz
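To confirm later that a retrieved archive is intact, one option is to store a checksum file next to the archive when it is created and upload both. The sketch below is one way to do that with tar and md5sum; the function name `pack_with_checksum` and the `.md5` file naming are assumptions of this example, not part of s3cmd.

```shell
# Hypothetical helper: pack files into one compressed archive and store
# its MD5 checksum next to it as ARCHIVE.md5.
# Usage: pack_with_checksum ARCHIVE.tgz FILE...
pack_with_checksum() {
    archive="$1"
    shift
    tar zcf "$archive" "$@" || return 1
    # Run md5sum inside the archive's directory so that the checksum file
    # contains a bare file name and stays valid when both files are moved.
    (
        cd "$(dirname "$archive")" &&
        md5sum "$(basename "$archive")" > "$(basename "$archive").md5"
    )
}

# Example (not executed here):
#   pack_with_checksum zebrafish.tgz Danio_rerio*
#   s3cmd put zebrafish.tgz zebrafish.tgz.md5 s3://fish-bucket
#   # after downloading both files into the same directory:
#   md5sum -c zebrafish.tgz.md5
```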
By default, this bucket can only be accessed by the project members. However, using the command s3cmd setacl, you can make the file publicly available.
First make the fish bucket public:
s3cmd setacl --acl-public s3://fish-bucket
Then make the zebrafish genome file public:
s3cmd setacl --acl-public s3://fish-bucket/zebrafish.tgz
The syntax of the URL of the file:
In this case, the file would be accessible using the link https://a3s.fi/fish-bucket/zebrafish.tgz
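Both public links in this chapter follow the same pattern: the a3s.fi address, the bucket name and the object name. As a small sketch under that assumption, the link for any public object can be composed as follows. The function name `allas_public_url` is hypothetical, and no URL encoding is applied, so names containing spaces or special characters would need extra handling.

```shell
# Hypothetical helper: print the public URL of an object, following the
# https://a3s.fi/BUCKET/OBJECT pattern of the examples in this chapter.
# Usage: allas_public_url BUCKET OBJECT
allas_public_url() {
    printf 'https://a3s.fi/%s/%s\n' "$1" "$2"
}

# Example (not executed here): check that the object really is public.
#   curl -sfI "$(allas_public_url fish-bucket zebrafish.tgz)" > /dev/null \
#       && echo "publicly readable"
```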
Publishing objects temporarily with signed URLs
With the command s3cmd signurl, an object in Allas can be published temporarily with a URL that includes an access token for added security.
In the previous example, the object s3://fish-bucket/zebrafish.tgz was made permanently accessible through a simple static URL. With signurl, the same object could be shared more securely and only for a limited time. For example, the command:
s3cmd signurl s3://fish-bucket/zebrafish.tgz +3600
would print out a URL that remains valid for 3600 s (1 h). In this case, the URL produced by the command above would look something like:
Setting up an object lifecycle
To delete (expire) objects automatically, a lifecycle policy can be set up for an Allas bucket. Objects in the bucket are handled according to the lifecycle policy whenever they match its conditions. The conditions can match on a prefix and/or on tags of the object. A lifecycle policy is especially well suited for cases where data needs to be removed as a "maintenance" measure after certain intervals.
Before setting up the lifecycle policy, please check with your department/team that it correctly represents the retention policy for the data in the project (legal or regulatory constraints).
The following lifecycle policy has two rules. Let's name it mypolicy.xml:
<?xml version="1.0" ?>
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Rule>
    <ID>1-days-expiration</ID>
    <Status>Enabled</Status>
    <Expiration>
      <Days>1</Days>
    </Expiration>
    <Filter>
      <Tag>
        <Key>days</Key>
        <Value>1</Value>
      </Tag>
    </Filter>
  </Rule>
  <Rule>
    <ID>30-days-expiration</ID>
    <Status>Enabled</Status>
    <Expiration>
      <Days>30</Days>
    </Expiration>
    <Filter>
      <Tag>
        <Key>days</Key>
        <Value>30</Value>
      </Tag>
    </Filter>
  </Rule>
</LifecycleConfiguration>
Alternatively, one can set the policies using a prefix, which can be thought of as the equivalent of a folder. Both methods can also be combined.
<?xml version="1.0" ?>
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Rule>
    <ID>Daily</ID>
    <Status>Enabled</Status>
    <Prefix>daily/</Prefix>
    <Expiration>
      <Days>30</Days>
    </Expiration>
  </Rule>
  <Rule>
    <ID>Weekly</ID>
    <Status>Enabled</Status>
    <Prefix>weekly/</Prefix>
    <Expiration>
      <Days>365</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>
To set this lifecycle policy on our bucket, we use the s3cmd setlifecycle command:
s3cmd setlifecycle mypolicy.xml s3://MY_BUCKET
We can verify the current policy with:
s3cmd getlifecycle s3://MY_BUCKET
To review the bucket (or object), use s3cmd info:
s3cmd info s3://MY_BUCKET
s3://MY_BUCKET/ (bucket):
   Location:  cpouta-production
   Payer:     BucketOwner
   Expiration Rule: objects with key prefix 'weekly/' will expire in '365' day(s) after creation
   Policy:    none
   CORS:      none
   ACL:       project_xxxxxxx: FULL_CONTROL
In order to put your object(s) under the lifecycle policy, you may utilize tags and/or prefixes.
- Tagging is done by adding a header with the format x-amz-tagging:KEY=VALUE.
- Prefix can be considered as a "folder".
Let's see the following cases:
# Should be removed in 24 hours per rule ID: 1-days-expiration
s3cmd --add-header=x-amz-tagging:days=1 put MY_FILE_01.tar.gz s3://MY_BUCKET/
s3cmd --add-header=x-amz-tagging:days=1 put MY_FILE_02.tar.gz s3://MY_BUCKET/gone-in-one-day/
# Should be removed in 30 days per rule ID: 30-days-expiration
s3cmd --add-header=x-amz-tagging:days=30 put MY_FILE_03.tar.gz s3://MY_BUCKET/
# Should be removed in 30 days per rule ID: Daily
s3cmd put MY_FILE_04.tar.gz s3://MY_BUCKET/daily/
# Should be removed in 365 days per rule ID: Weekly
s3cmd put MY_FILE_05.tar.gz s3://MY_BUCKET/weekly/
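Writing the policy XML by hand is error prone when only the prefix and the number of days change. As a minimal sketch, a single prefix-based rule in the same layout as the Daily and Weekly rules can be generated with a shell function. The function name `prefix_expiration_policy` is hypothetical, and the rule ID and prefix are inserted as given, without XML escaping.

```shell
# Hypothetical helper: print a lifecycle policy with one prefix-based
# expiration rule.
# Usage: prefix_expiration_policy RULE_ID PREFIX DAYS > mypolicy.xml
prefix_expiration_policy() {
    # Refuse an empty, non-numeric or zero number of days.
    case "$3" in
        ''|*[!0-9]*|0)
            echo "DAYS must be a positive integer" >&2
            return 1
            ;;
    esac
    cat <<EOF
<?xml version="1.0" ?>
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Rule>
    <ID>$1</ID>
    <Status>Enabled</Status>
    <Prefix>$2</Prefix>
    <Expiration>
      <Days>$3</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>
EOF
}

# Example (not executed here):
#   prefix_expiration_policy Daily daily/ 30 > mypolicy.xml
#   s3cmd setlifecycle mypolicy.xml s3://MY_BUCKET
```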
Other references to setting up a lifecycle:
- RedHat developer guide for Ceph storage.
- Creating an intelligent object storage system with Ceph’s Object Lifecycle Management
- Multiple lifecycles - s3cmd
- Surprise entry for the above found at cloud.blog.csc.fi
Last edited Tue Mar 16 2021