The S3 client

This chapter describes how to use the Allas object storage service with the s3cmd command line client. s3cmd uses the S3 protocol, which differs from the Swift protocol used in the Rclone, swift and a-commands examples. Data uploaded with S3 can normally be used with the Swift protocol too. However, files larger than 5 GB that were uploaded to Allas with Swift cannot be downloaded with the S3 protocol.

From the user's perspective, one of the main differences between the S3 and Swift protocols is that Swift-based connections remain valid for eight hours at a time, whereas an S3 connection remains permanently open. The permanent connection is practical in many ways, but it has a security implication: if your CSC account is compromised, so is the object storage space.

The syntax of the s3cmd command:

s3cmd -options command parameters

The most commonly used s3cmd commands:

s3cmd command	Function
mb	Create a bucket
put	Upload an object
ls	List objects and buckets
get	Download objects and buckets
cp	Copy an object
del	Remove objects
rb	Remove an empty bucket
md5sum	Get the checksum
info	View metadata
signurl	Create a temporary URL
put -P	Make an object public
setacl --acl-grant	Manage access rights

The table above lists only the most essential s3cmd commands. For a more complete list, visit the s3cmd manual page or type:

s3cmd -h

Getting started with s3cmd

If you use Allas on Puhti or Mahti, all required packages and software are already installed. In this case you can skip this chapter and proceed to the section Configuring S3 connection in supercomputers.

To configure an s3cmd connection, you need to have OpenStack and s3cmd installed in your environment.

Installing OpenStack and s3cmd:

Fedora/RHEL derivatives:

sudo yum update
sudo yum install python3
sudo pip3 install python-openstackclient
sudo yum install s3cmd

Debian derivatives:

sudo apt install python3-pip
sudo pip3 install python-openstackclient
sudo apt install restic
curl https://rclone.org/install.sh | sudo bash
sudo pip3 install s3cmd

OSX:

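# s3cmd-env is only an example name for the virtual environment
python3 -m venv s3cmd-env
source s3cmd-env/bin/activate
pip3 install s3cmd
s3cmd --version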

Please refer to http://s3tools.org/download and http://s3tools.org/usage for upstream documentation.

Configuring S3 connection in local computer

Once you have OpenStack and s3cmd installed in your environment, you can download the allas_conf script to set up the S3 connection to your Allas project.

wget https://raw.githubusercontent.com/CSCfi/allas-cli-utils/master/allas_conf
source allas_conf --mode S3 --user your-csc-username

Note that you should use the --user option to define your CSC username. The configuration command first asks for your CSC password and then asks you to choose an Allas project. After that, the tool creates a key file for the S3 connection and stores it in the default location (.s3cfg in your home directory).

Configuring S3 connection in supercomputers

To use s3cmd on Puhti and Mahti, you must first configure the connection:

module load allas
allas-conf --mode S3

The configuration process first asks for your CSC password. Then it lists your Allas projects and asks you to select the project to be used. The configuration information is stored in the file $HOME/.s3cfg. This configuration only needs to be defined once. After that, s3cmd automatically uses the object storage connection described in the .s3cfg file. If you wish to change the Allas project that s3cmd uses, you need to run the configuration command again.

You can also use the S3 credentials stored in the .s3cfg file in other services. You can check the currently used access_key and secret_key with the command:

grep key $HOME/.s3cfg

If you use these keys in other services, you should make sure that the keys always remain private. Anyone who has access to these two keys can access and modify all the data that the project has in Allas.
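
Because the key pair gives full access to the project's data, it can be worth checking that the configuration file is readable only by you. The following is a minimal sketch of such a check, not part of the Allas tools: the function name s3cfg_is_private is made up for this illustration, and stat -c assumes GNU coreutils (Linux).

```shell
# Hypothetical helper: succeed only if the permission bits of the file
# are 600 or 400, i.e. group and others have no access.
# Assumes GNU stat (Linux); macOS uses a different stat syntax.
s3cfg_is_private() {
    mode=$(stat -c '%a' "$1") || return 2
    case "$mode" in
        600|400) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example use (commented out so that nothing is changed by accident):
# s3cfg_is_private "$HOME/.s3cfg" || chmod 600 "$HOME/.s3cfg"
```

File permissions protect the keys only on the machine where the file is stored. If the keys are copied to another service, they need to be protected there separately.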

If needed, you can deactivate an S3 key pair with the command:

allas-conf --s3remove

Create buckets and upload objects

Create a new bucket:

s3cmd mb s3://my_bucket

Upload a file to a bucket:

s3cmd put my_file s3://my_bucket

List objects and buckets

List all buckets in a project:

s3cmd ls

List all objects in a bucket:

s3cmd ls s3://my_bucket

Display information about a bucket:

s3cmd info s3://my_bucket

Display information about an object:

s3cmd info s3://my_bucket/my_file

Download objects and buckets

Download an object:

s3cmd get s3://my_bucket/my_file new_file_name

The parameter new_file_name is optional. It defines a new name for the downloaded file.

Using the command md5sum, you can check that the file has not been changed or corrupted:

$ md5sum my_file new_file_name
   39bcb6992e461b269b95b3bda303addf  my_file
   39bcb6992e461b269b95b3bda303addf  new_file_name

In the above example, the checksums match between the original and downloaded file.
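
The visual comparison above can also be automated. The sketch below assumes that the md5sum command from GNU coreutils is available; the function name same_md5 is hypothetical.

```shell
# Hypothetical helper: compare the MD5 checksums of two local files.
# Returns 0 if they match, 1 if they differ, 2 if a file is missing.
same_md5() {
    [ -f "$1" ] && [ -f "$2" ] || return 2
    sum_a=$(md5sum "$1" | cut -d ' ' -f 1)
    sum_b=$(md5sum "$2" | cut -d ' ' -f 1)
    [ "$sum_a" = "$sum_b" ]
}

# Example use:
# same_md5 my_file new_file_name && echo "checksums match"
```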

Download an entire bucket:

s3cmd get -r s3://my_bucket/

Move objects

Copy an object to another bucket. Note that you should use these commands only for objects that were uploaded to Allas with the S3 protocol:

s3cmd cp s3://sourcebucket/objectname s3://destinationbucket

For example:

$ s3cmd cp s3://bigbucket/bigfish s3://my-new-bucket
remote copy: 's3://bigbucket/bigfish' -> 's3://my-new-bucket/bigfish'

Rename the file while copying it:

$ s3cmd cp s3://bigbucket/bigfish s3://my-new-bucket/newname
remote copy: 's3://bigbucket/bigfish' -> 's3://my-new-bucket/newname'

Delete objects and buckets

Delete an object:

s3cmd del s3://my_bucket/my_file

Delete a bucket:

s3cmd rb s3://my_bucket

Note: You can only delete empty buckets.
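
Because only empty buckets can be deleted, removing a bucket that still contains objects takes two steps. The sketch below wraps them in one function. The function name is hypothetical, and the example assumes that your s3cmd version supports the --recursive and --force options of the del command. Use it with care, since the objects are removed permanently.

```shell
# Hypothetical helper: delete every object in a bucket, then the bucket.
# WARNING: the data is removed permanently.
remove_bucket_with_contents() {
    bucket="$1"
    # --recursive removes all objects in the bucket; --force confirms
    # that the whole content of the bucket is really to be deleted.
    s3cmd del --recursive --force "s3://$bucket" || return 1
    # The bucket is now empty and can be removed.
    s3cmd rb "s3://$bucket"
}

# Example use:
# remove_bucket_with_contents my_bucket
```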

s3cmd and public objects

In this example, the object salmon.jpg in the pseudo folder fishes is made public:

$ s3cmd put fishes/salmon.jpg s3://my_fishbucket/fishes/salmon.jpg -P
Public URL of the object is: https://a3s.fi/my_fishbucket/fishes/salmon.jpg

Giving another project read access to a bucket

You can control access rights using the command s3cmd setacl. This command requires the UUID (universally unique identifier) of the project you want to grant access to. Project members can check their project ID at https://pouta.csc.fi/dashboard/identity/ or with the command openstack project show. For example, on Puhti and Mahti:

module load allas
allas-conf -k --mode s3cmd
openstack project show $OS_PROJECT_NAME

With s3cmd, read and write access can be controlled for both buckets and objects.

The following command gives the project with UUID 3d5b0ae8e724b439a4cd16d1290 read access to my_fishbucket, but not to the objects inside it:

s3cmd setacl --acl-grant=read:3d5b0ae8e724b439a4cd16d1290 s3://my_fishbucket

Similarly, the following command gives write access to a single object only:

s3cmd setacl --acl-grant=write:3d5b0ae8e724b439a4cd16d1290 s3://my_fishbucket/bigfish

If you want to modify the access permissions of all the objects in a bucket, you can add the option --recursive to the command:

s3cmd setacl --recursive --acl-grant=read:3d5b0ae8e724b439a4cd16d1290 s3://my_fishbucket

You can check the access permissions with s3cmd info:

$ s3cmd info s3://my_fishbucket|grep -i acl
   ACL:       other_project_uuid: READ
   ACL:       my_project_uuid: FULL_CONTROL

The option --acl-revoke can be used to remove read or write access:

s3cmd setacl --recursive --acl-revoke=read:$other_project_uuid s3://my_fishbucket

The shared objects and buckets can be used with both S3 and Swift based tools. Note, however, that listing commands show only the buckets owned by your project. To use shared buckets and objects, you must know the names of the buckets.

In the example above, a user from project 3d5b0ae8e724b439a4cd16d1290 will not see the shared bucket my_fishbucket with the command:

s3cmd ls

However, the user can list the content of the bucket with the command:

s3cmd ls s3://my_fishbucket

In the Pouta web UI, a user can open a shared bucket by defining the bucket name in the URL. Open one of your own project's buckets and replace the bucket name at the end of the URL with the name of the shared bucket:

https://pouta.csc.fi/dashboard/project/containers/container/my_fishbucket

Use example

In this example, we store a simple dataset in Allas using s3cmd.

First, create a new bucket. The command s3cmd ls reveals that the object storage is empty at first. Then, use the command s3cmd mb to create a new bucket called fish-bucket.

$ s3cmd ls
ls

$ s3cmd mb s3://fish-bucket
mb s3://fish-bucket/
Bucket 's3://fish-bucket/' created

$ s3cmd ls
ls
2018-03-12 13:01  s3://fish-bucket

It is recommended to collect the data to be stored as larger units and compress it before uploading it to the system.

In this example, we store the Bowtie2 indices and the genome of the zebrafish (Danio rerio) in the fish bucket. Running ls -lh shows that the index files are available in the current directory:

$ ls -lh
total 3.2G
-rw------- 1 kkayttaj csc 440M Mar 12 13:41 Danio_rerio.1.bt2
-rw------- 1 kkayttaj csc 327M Mar 12 13:41 Danio_rerio.2.bt2
-rw------- 1 kkayttaj csc 217K Mar 12 13:20 Danio_rerio.3.bt2
-rw------- 1 kkayttaj csc 327M Mar 12 13:20 Danio_rerio.4.bt2
-rw------- 1 kkayttaj csc 1.3G Mar 12 13:13 Danio_rerio.GRCz10.dna.toplevel.fa
-rw------- 1 kkayttaj csc 440M Mar 12 14:03 Danio_rerio.rev.1.bt2
-rw------- 1 kkayttaj csc 327M Mar 12 14:03 Danio_rerio.rev.2.bt2
-rw------- 1 kkayttaj csc 599K Mar 12 13:13 log

The data is collected and compressed as a single file using the tar command:

tar zcf zebrafish.tgz Danio_rerio*
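
Before uploading, it can be useful to confirm that the archive contains the files you expect. The sketch below packs the files and then lists the content of the new archive. The function name pack_dataset is hypothetical.

```shell
# Hypothetical helper: pack the given files into a gzip-compressed tar
# archive and list the content of the archive as a check.
# Arguments: archive name, followed by the files to include.
pack_dataset() {
    archive="$1"
    shift
    tar zcf "$archive" "$@" || return 1
    tar ztf "$archive"
}

# Example use:
# pack_dataset zebrafish.tgz Danio_rerio*
```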

The size of the resulting file is about 2 GB. The compressed file can be uploaded to the fish bucket using the command s3cmd put:

$ ls -lh zebrafish.tgz
-rw------- 1 kkayttaj csc 2.0G Mar 12 15:23 zebrafish.tgz

$ s3cmd put zebrafish.tgz s3://fish-bucket
put zebrafish.tgz s3://fish-bucket
upload: 'zebrafish.tgz' -> 's3://fish-bucket/zebrafish.tgz'  [1 of 1]
 2081306836 of 2081306836   100% in   39s    50.16 MB/s  done

$ s3cmd ls s3://fish-bucket
ls s3://fish-bucket
2019-10-01 12:11 2081306836   s3://fish-bucket/zebrafish.tgz

Uploading 2 GB of data takes time. Retrieve the uploaded file:

s3cmd get s3://fish-bucket/zebrafish.tgz

By default, this bucket can only be accessed by the project members. However, using the command s3cmd setacl, you can make the file publicly available.

First make the fish bucket public:

s3cmd setacl --acl-public s3://fish-bucket

Then make the zebrafish genome file public:

s3cmd setacl --acl-public s3://fish-bucket/zebrafish.tgz

The syntax of the URL of the file:

https://a3s.fi/bucket_name/object_name

In this case, the file would be accessible using the link https://a3s.fi/fish-bucket/zebrafish.tgz
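
The URL pattern above can be written as a small helper. This is only a sketch: the function name public_url is hypothetical, and it assumes that the bucket and object names contain no characters that need URL encoding.

```shell
# Hypothetical helper: print the public URL of an object, following the
# pattern https://a3s.fi/bucket_name/object_name.
# Assumes names without characters that would need URL encoding.
public_url() {
    printf 'https://a3s.fi/%s/%s\n' "$1" "$2"
}

public_url fish-bucket zebrafish.tgz
# prints: https://a3s.fi/fish-bucket/zebrafish.tgz
```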

Publishing objects temporarily with signed URLs

With the command s3cmd signurl, an object in Allas can be published temporarily with a URL that includes an access token for added security.

In the previous example, the object s3://fish-bucket/zebrafish.tgz was made permanently accessible through a simple static URL. With signurl, the same object could be shared more securely and only for a limited time. For example, the command:

s3cmd signurl s3://fish-bucket/zebrafish.tgz +3600

would print out a URL that remains valid for 3600 s (1 h). The URL produced by the command above would look something like:

https://fish-bucket.a3s.fi/zebrafish.tgz?AWSAccessKeyId=78e6021a086d52f092b3b2b23bfd7a67&Expires=1599835116&Signature=OLyyCY14s%2F0HxKOOd108mldINyE%3D
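
The Expires parameter of the signed URL is a Unix timestamp (seconds since 1970-01-01 UTC), so the offset +3600 results in a timestamp one hour after the moment the command was run. The sketch below extracts that value from a signed URL so that you can check when a link stops working. The function name signed_url_expiry is hypothetical.

```shell
# Hypothetical helper: print the value of the Expires parameter
# (seconds since the Unix epoch) of a signed URL.
signed_url_expiry() {
    printf '%s\n' "$1" | sed -n 's/.*[?&]Expires=\([0-9][0-9]*\).*/\1/p'
}

url='https://fish-bucket.a3s.fi/zebrafish.tgz?AWSAccessKeyId=78e6021a086d52f092b3b2b23bfd7a67&Expires=1599835116&Signature=OLyyCY14s%2F0HxKOOd108mldINyE%3D'
expires=$(signed_url_expiry "$url")
echo "$expires"
# prints: 1599835116

# With GNU date, the timestamp can be converted to a readable UTC time:
# date -u -d "@$expires"
```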

Setting up an object lifecycle

To delete (expire) objects automatically, a lifecycle policy can be set up for an Allas bucket. Objects in the bucket are handled according to the lifecycle policy when they match its conditions. Matching conditions can be defined by a prefix and/or tags of the object. A lifecycle policy is especially well suited for cases where data needs to be removed as a "maintenance" measure after certain intervals.

Warning

Before setting up the lifecycle policy, please check with your department or team that it correctly represents the retention policy for the data in the project (for example, legal or regulatory constraints).

The following lifecycle policy contains two rules. Let's name the file mypolicy.xml.

<?xml version="1.0" ?>
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
   <Rule>
      <ID>1-days-expiration</ID>
      <Status>Enabled</Status>
      <Expiration>
         <Days>1</Days>
      </Expiration>
      <Filter>
         <Tag>
            <Key>days</Key>
            <Value>1</Value>
         </Tag>
      </Filter>
   </Rule>
   <Rule>
      <ID>30-days-expiration</ID>
      <Status>Enabled</Status>
      <Expiration>
         <Days>30</Days>
      </Expiration>
      <Filter>
         <Tag>
            <Key>days</Key>
            <Value>30</Value>
         </Tag>
      </Filter>
   </Rule>
</LifecycleConfiguration>

Alternatively, the policies can be set using a prefix, which can be thought of as the equivalent of a folder. Both methods can also be combined using the <And> tag.

<?xml version="1.0" ?>
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
   <Rule>
      <ID>Daily</ID>
      <Status>Enabled</Status>
      <Prefix>daily/</Prefix>
      <Expiration>
         <Days>30</Days>
      </Expiration>
   </Rule>
   <Rule>
      <ID>Weekly</ID>
      <Status>Enabled</Status>
      <Prefix>weekly/</Prefix>
      <Expiration>
         <Days>365</Days>
      </Expiration>
   </Rule>
</LifecycleConfiguration>
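
As a separate illustration of the <And> tag, the sketch below writes a policy file in which a single rule applies only to objects that are under the prefix daily/ and also carry the tag days=30. It follows the general S3 lifecycle syntax. The file name combined-policy.xml and the rule ID are hypothetical, and the rule has not been verified against Allas, so test it with non-critical data first.

```shell
# Write a lifecycle policy whose single rule matches only objects that
# are under the prefix "daily/" AND carry the tag days=30.
# The file name and the rule ID are examples only.
cat > combined-policy.xml <<'EOF'
<?xml version="1.0" ?>
<LifecycleConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
   <Rule>
      <ID>daily-tagged-30-days</ID>
      <Status>Enabled</Status>
      <Filter>
         <And>
            <Prefix>daily/</Prefix>
            <Tag>
               <Key>days</Key>
               <Value>30</Value>
            </Tag>
         </And>
      </Filter>
      <Expiration>
         <Days>30</Days>
      </Expiration>
   </Rule>
</LifecycleConfiguration>
EOF
```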

To apply a lifecycle policy file to our bucket, we use the setlifecycle sub-command:

s3cmd setlifecycle mypolicy.xml s3://MY_BUCKET

We can verify the current policy with the getlifecycle sub-command:

s3cmd getlifecycle s3://MY_BUCKET

The bucket (or an object) can be reviewed with the info sub-command:

s3cmd info s3://MY_BUCKET

s3://MY_BUCKET/ (bucket):
   Location:  cpouta-production
   Payer:     BucketOwner
   Expiration Rule: objects with key prefix 'weekly/' will expire in '365' day(s) after creation
   Policy:    none
   CORS:      none
   ACL:       project_xxxxxxx: FULL_CONTROL

In order to put your object(s) under the lifecycle policy, you may utilize tags and/or prefixes.

Tagging is done by adding a header with the format x-amz-tagging:KEY=VALUE.
A prefix can be considered a "folder".

Let's see the following cases:

# Should be removed in 24 hours per rule ID: 1-days-expiration
s3cmd --add-header=x-amz-tagging:days=1 put MY_FILE_01.tar.gz s3://MY_BUCKET/
s3cmd --add-header=x-amz-tagging:days=1 put MY_FILE_02.tar.gz s3://MY_BUCKET/gone-in-one-day/

# Should be removed in 30 days per rule ID: 30-days-expiration
s3cmd --add-header=x-amz-tagging:days=30 put MY_FILE_03.tar.gz s3://MY_BUCKET/

# Should be removed in 30 days per rule ID: Daily
s3cmd put MY_FILE_04.tar.gz s3://MY_BUCKET/daily/

# Should be removed in 365 days per rule ID: Weekly
s3cmd put MY_FILE_05.tar.gz s3://MY_BUCKET/weekly/
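
The tagged uploads above can be wrapped in a small function so that the tag format is written in only one place. This is a sketch only: the function name put_with_expiry_tag is hypothetical, and the tag takes effect only if the bucket has a lifecycle rule for the same days value.

```shell
# Hypothetical helper: upload a file with a "days" tag so that the
# matching tag-based lifecycle rule applies to it.
# Arguments: local file, destination (s3://bucket/ or s3://bucket/prefix/),
# and the number of days used as the tag value.
put_with_expiry_tag() {
    case "$3" in
        ''|*[!0-9]*)
            echo "days must be a positive integer" >&2
            return 2 ;;
    esac
    s3cmd --add-header="x-amz-tagging:days=$3" put "$1" "$2"
}

# Example use:
# put_with_expiry_tag MY_FILE_03.tar.gz s3://MY_BUCKET/ 30
```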

Other references to setting up a lifecycle:

RedHat developer guide for Ceph storage.
Creating an intelligent object storage system with Ceph’s Object Lifecycle Management
Multiple lifecycles - s3cmd
Surprise entry for the above found at cloud.blog.csc.fi

Limit bucket access to specific IP addresses

You can limit access to a bucket to specific IP addresses by defining a policy.

Warning

Be careful not to block your own access to the bucket. If you do, you can no longer access the bucket or fix the policy.

In the following IP policy example, access to the bucket POLICY-EXAMPLE-BUCKET is allowed only from the IP subnet 86.50.164.0/24: the policy denies all S3 actions for requests that do not come from that subnet. Let's name the policy file myippolicy.json.

{
    "Version": "2012-10-17",
    "Id": "S3PolicyExample",
    "Statement": [
        {
            "Sid": "IPAllow",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::POLICY-EXAMPLE-BUCKET",
                "arn:aws:s3:::POLICY-EXAMPLE-BUCKET/*"
            ],
            "Condition": {
                "NotIpAddress": {
                    "aws:SourceIp": "86.50.164.0/24"
                }
            }
        }
    ]
}

To apply the IP policy file to our bucket, we use the setpolicy sub-command:

s3cmd setpolicy myippolicy.json s3://POLICY-EXAMPLE-BUCKET

The current policy can be viewed with the info sub-command.

We can delete the current policy with the delpolicy sub-command:

s3cmd delpolicy s3://POLICY-EXAMPLE-BUCKET
s3://POLICY-EXAMPLE-BUCKET/: Policy deleted

Last update: July 24, 2024