
Multi-attach Cinder volumes

Warning

By default the quota is set to 0; you must request it by sending an email to servicedesk@csc.fi

It is possible to attach and mount the same Cinder volume to more than one VM at the same time. This means that each of the VMs will be able to read and write to the same block device. This is similar to what a SAN allows you to achieve.

Multi attach

This feature has several advantages and disadvantages. On one hand, it allows sharing files among VMs without any kind of intermediary server, which you would need with solutions like NFS or GlusterFS. This reduces the number of VMs needed, and thus means less maintenance and fewer single points of failure. On the other hand, it requires running what is called a clustered file system, like Oracle Cluster File System 2 (OCFS2) or Red Hat Global File System 2 (GFS2). These file systems need a cluster of connected daemons that coordinate the read and write operations on the files. Each VM runs a copy of the daemon; there is no master, but a quorum-based system. The choice between the two file systems depends on the use case and on vendor preferences. In our tests GFS2 seemed more suitable for Red Hat based systems and OCFS2 for Debian based ones, but your mileage may vary.

Warning

The configuration, maintenance and operation of these file systems is not a trivial task. The guides below are a starting point and do not cover all possibilities; for more comprehensive information, please check the upstream documentation.

Create and attach a volume

quota

Make sure that you have available quota for this kind of volume.
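
If your project is already set up for the OpenStack CLI (see the CLI section below), you can inspect the current quota with, for example:

    $ openstack quota show

Look for the volume-related entries; the exact field names for the standard.multiattach type may vary depending on the client version.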

WebUI

  1. Go to the Volume page of Pouta.

  2. Click on "+Create Volume".

  3. Create a volume as you would for any other type of volume. Set the Volume Name and Size (GiB) as desired.

  4. Change the Type to standard.multiattach.

  5. Click on "Create Volume".

Create Volume Multiattach

not supported

You cannot attach a volume to multiple VMs from the WebUI; there you can only see its status. Attaching a volume to multiple VMs can only be done using the CLI.

CLI

Before doing this, you need to install the OpenStack client and source your project's openrc file.
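
A minimal sketch of installing the client in a Python virtual environment (assuming Python 3, venv and pip are available; the osclient directory name is arbitrary):

    $ python3 -m venv osclient
    $ source osclient/bin/activate
    $ pip install python-openstackclient

With the client installed and the openrc file sourced, the steps are: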

  1. Create a multi attach volume:

    openstack volume create --size <size_in_GB> --type standard.multiattach <volume_name>
    
    Replace <volume_name> with the name you want to give to the volume, and <size_in_GB> with the size in gigabytes you want the volume to have.

  2. Attach the volume to a VM node:

    openstack --os-compute-api-version 2.60 server add volume "<VM_name>" <volume_name>
    
    Replace <volume_name> with the name of the volume you created in the previous step, and <VM_name> with the name of the VM node. When doing this for a cluster of VMs, you need to run the command once per VM, as in the worked example below.
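
For illustration, creating a 100 GiB multi-attach volume named shared-data and attaching it to two hypothetical VMs named node-1 and node-2 would look like this (names and size are just examples):

    $ openstack volume create --size 100 --type standard.multiattach shared-data
    $ openstack --os-compute-api-version 2.60 server add volume "node-1" shared-data
    $ openstack --os-compute-api-version 2.60 server add volume "node-2" shared-data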

GFS2 as an example

The Global File System 2 (GFS2 for short) is a file system currently developed by Red Hat. It uses DLM, the distributed lock manager, to coordinate file system operations among the nodes in the cluster. The actual data is read and written directly on the shared block device.

Warning

GFS2 supports up to 16 nodes connected to the same volume.

GFS2 with DLM

GFS2 ansible install

We have written a small Ansible cinder-multiattach playbook that creates a cluster of nodes and installs a shared GFS2 file system on them. The playbook is intended as a guide and demo; it is not production ready. For example, there is a manual step: attaching the volume to each node. The Ansible playbook will create a cluster of VMs and install the requested file system on them. The end result will be the same volume mounted in every VM. The quick start commands are these:

$ source ~/Downloads/project_XXXXXXX-openrc.sh
Please enter your OpenStack Password for project project_XXXXXXX as user YYYYYYYY: 

$ ansible-playbook main.yml -e fs='gfs2'

$ for i in $(seq 1 16);
do
    openstack --os-compute-api-version 2.60 server add volume "cinder-gfs2-$i" multi-attach-test-gfs2
done

$ ansible-playbook main.yml -e fs='gfs2'

You need to run Ansible twice due to a bug in the openstack.cloud.server_volume module, which can only attach the volume to a single VM and fails for the others.

If you already have a cluster of VMs, or want to manually create them, it is still possible to use the gfs2 Ansible role. The steps are simple:

  1. Create and attach the volume. See the manual Create and attach a volume from above.

  2. Create a standard Ansible inventory like this one:

    [all]
    <VM_name> ansible_host=192.168.1.XXX ansible_user=<user>
    # ...
    [all:vars]
    ansible_ssh_common_args='-J <jumphost>'
    

    In the example above you need to replace <VM_name> with the name of the VM, the IP 192.168.1.XXX with the correct IP of the VM, and <user> with the corresponding user name. You need one line per VM node that you want to include in the cluster. Finally, if you are using a jump host, you need to replace <jumphost> with its connection information, like ubuntu@177.51.170.99.

  3. Create a playbook (main-gfs2.yml in this example) like this one:

    ---
    
    - name: Configure VMs
      hosts: all
      gather_facts: true
      become: true
      roles:
        - role: hosts
        - role: gfs2
    

    This will run two roles: the hosts role creates a /etc/hosts file in every VM with the IPs and names of all the VMs, and the gfs2 role installs and configures the cluster.

  4. And run it:

    $ ansible-playbook main-gfs2.yml -i inventory.ini
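
Once the playbooks have finished, you can check that the shared file system is mounted on every node with an ad-hoc Ansible command against the same inventory. This is a quick sanity check that assumes the role mounts the volume at /mnt, as in the manual instructions below:

    $ ansible all -i inventory.ini -m command -a "df -h /mnt"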
    

GFS2 manual install

In order to install GFS2, you need to follow a few steps:

  1. Install the VM nodes. There are no special considerations in this step, other than making sure the nodes can see each other on the network (this is the default behaviour of VM nodes created in the same Pouta project), and that they are installed with the same distribution version. We have tested this with Ubuntu 22.04 and AlmaLinux 9; other distributions and versions might also work, but we have not tested them.

  2. Create and attach the volume. See the manual Create and attach a volume from above.

  3. Install the GFS2 software. This step is distribution dependent.

    1. For AlmaLinux and other Red Hat based distributions you just need to enable two repositories and install a few packages:
    sudo dnf config-manager --enable highavailability resilientstorage
    sudo dnf install pacemaker corosync pcs dlm gfs2-utils

    2. For Ubuntu and other Debian based distributions you just need to install three packages:
    sudo apt install gfs2-utils dlm-controld linux-modules-extra-$(uname -r)
    
  4. Make sure that every node's host name can be resolved on every other node. In Pouta, the simplest way is to use /etc/hosts, where each host gets a line similar to:

    <ip> <vm_name>
    
  5. You need to create a /etc/corosync/corosync.conf file that lists every host in the cluster. Check the corosync.conf manual page for a complete guide to the file. A minimal working example follows the template below (shown as the Jinja2 template used by the Ansible role; when writing the file by hand, replace the {% for %} loop with one node block per VM):

    totem {
      version: 2
      cluster_name: gfs_cluster
      secauth: off
      transport: udpu
    }
    
    nodelist {
      {% for host in groups['all'] %}
      node {
        ring0_addr: {{ host }}
        nodeid: {{ groups['all'].index(host)+1 }}
      }
      {% endfor %}
    }
    
    quorum {
      provider: corosync_votequorum
    }
    
    logging {
      to_logfile: yes
      logfile: /var/log/cluster/corosync.log
      to_syslog: yes
    }
    

    As you can see, for every node you need to provide its name (the same one that was used in /etc/hosts) and a node id number. The node id has to be unique for every node; ideally use consecutive numbers.

  6. Create the file system. You need to do this in only one of the VM nodes.

    mkfs.gfs2 -p lock_dlm -t gfs_cluster:mygfs2 -j <number_instances> /dev/vdb
    
    Replace <number_instances> with the number of VM nodes in the cluster. Also pay attention and double check that /dev/vdb is the proper volume device. In principle vdb is going to be the first volume attached to a VM, but this might not be true in all cases.

  7. You now need to enable and start the dlm service.

    sudo systemctl enable dlm
    sudo systemctl start dlm
    
  8. Finally mount the volume in each node:

    sudo mount -L gfs_cluster:mygfs2 /mnt
    

    This command uses the label and not the volume device to mount it, because the label is guaranteed not to change upon reboots. The label is the same one that you used in the -t option of the mkfs.gfs2 command. You can also mount by UUID with the -U option.

    blkid

    In order to see the label and UUID of every volume and device attached to the VM node, you can use blkid:

    $ blkid
    /dev/vdb: LABEL="gfs_cluster:mygfs2" UUID="e957d002-bd85-4645-9c0d-a929603849b7" BLOCK_SIZE="4096" TYPE="gfs2"
    /dev/vda15: LABEL_FATBOOT="UEFI" LABEL="UEFI" UUID="B027-A52A" BLOCK_SIZE="512" TYPE="vfat" PARTUUID="416111ff-1583-491f-856d-410d48caa103"
    /dev/vda1: LABEL="cloudimg-rootfs" UUID="caa1508a-4bb6-4126-a072-7d5db157c351" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="2ff71b36-d15b-4310-a7ed-5258e990345d"
    

    It is recommended to add an entry to /etc/fstab so the volume is mounted automatically when the node reboots, for example a line like this:

    LABEL=gfs_cluster:mygfs2 /mnt gfs2 defaults 0 0
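
To check that the cluster behaves as expected, you can, for example, write a file on one node and read it back from another. This is only a quick sanity check, assuming the volume is mounted at /mnt on every node:

    # on one node
    $ echo "hello from the cluster" | sudo tee /mnt/hello.txt

    # on any other node
    $ cat /mnt/hello.txt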
    

GFS2 FAQ

  • How to add more nodes?

    It is possible to add new nodes to a GFS2 cluster. The supported limit is 16 nodes.

    First you need to make sure there are enough journals; GFS2 needs one journal per node that mounts the file system. Use gfs2_edit to get the total number of journals:

    sudo gfs2_edit -p jindex /dev/vdb | grep journal
    

    If it is not enough, you can easily add more with gfs2_jadd:

    $ sudo gfs2_jadd -j 1 /mnt
    Filesystem: /mnt
    Old journals: 15
    New journals: 16
    

    Secondly, create the new node, install the required software, and attach the volume using the OpenStack API. The process is described above.

    Then you need to edit the file /etc/corosync/corosync.conf in every node and add an entry for the new one:

    node {
     ring0_addr: cinder-gfs2-16
     nodeid: 16
    }
    

    Once the file is updated, you need to unmount the volume and restart the dlm and corosync daemons on every node in the cluster.

    Finally, you just need to mount the volume:

    sudo mount -L gfs_cluster:mygfs2 /mnt
    

    It is recommended to add an entry to /etc/fstab so the volume is mounted automatically when the node reboots, for example a line like this:

    LABEL=gfs_cluster:mygfs2 /mnt gfs2 defaults 0 0
    
  • What happens if a VM gets disconnected?

    This covers two different use cases: a temporary and/or unexpected disconnection, and a permanent one.

    For a temporary and unexpected disconnection, the cluster should be able to deal with this kind of issue automatically. After the node is back, you need to check that everything came back to normal. In some cases the automatic mount of the volume can fail; if so, mount the volume as explained above.

    If the disconnection is temporary but expected, for example to update the kernel version, unmount the volume on the node (sudo umount /mnt) before rebooting it. This is not required, but recommended.

    For a permanent disconnection of a VM, you need to do the inverse of the process for adding a new node. Unmount the volume (sudo umount /mnt), remove the entry for this VM from the /etc/corosync/corosync.conf file of every node, and finally restart the daemons on every node. This needs to be done because it affects the quorum count of the cluster.

  • Is it possible to mount a node as read-only?

    Yes, GFS2 has the "spectator mode":

    spectator
       Mount  this filesystem using a special form of read-only mount.  The mount does not
       use one of the filesystem's journals. The node is unable to  recover  journals  for
       other nodes.
    
    norecovery
       A synonym for spectator
    

    So, just run this command:

    sudo mount /dev/vdb /mnt -t gfs2 -o spectator
    
    -t gfs2 is not strictly necessary, as mount can detect the file system type, but it is recommended in order to avoid mounting the wrong file system. Then double check that the mount went as expected:

    $ mount | grep /mnt
    /dev/vdb on /mnt type gfs2 (ro,relatime,spectator,rgrplvb)
    

OCFS2 as a second example

The Oracle Cluster File System version 2 (OCFS2) is a shared disk file system developed by Oracle Corporation and released under the GNU General Public License. Although it is a different code base developed by a different vendor, the approach is the same as with GFS2:

OCFS2

A single volume is attached to a cluster of VM nodes, allowing data reads and writes to be done directly, with a daemon running in each VM node coordinating the read and write operations.

OCFS2 ansible install

Like with GFS2, the Ansible playbook will create a cluster of VMs and install the requested file system on them. The end result will be the same volume mounted in every VM. The process is very similar to the GFS2 instructions. The quick start commands are these:

$ source ~/Downloads/project_XXXXXXX-openrc.sh
Please enter your OpenStack Password for project project_XXXXXXX as user YYYYYYYY: 

$ ansible-playbook main.yml -e fs='ocfs2'

$ for i in $(seq 1 16);
do
    openstack --os-compute-api-version 2.60 server add volume "cinder-ocfs2-$i" multi-attach-test-ocfs2
done

$ ansible-playbook main.yml -e fs='ocfs2'

You need to run Ansible twice due to a bug in the openstack.cloud.server_volume module, which can only attach the volume to a single VM and fails for the others.

If you already have a cluster of VMs, or want to manually create them, it is still possible to use the ocfs2 Ansible role. The steps are simple:

  1. Create and attach the volume. See the manual Create and attach a volume from above.

  2. Create a standard Ansible inventory like this one:

    [all]
    <VM_name> ansible_host=192.168.1.XXX ansible_user=<user>
    # ...
    [all:vars]
    ansible_ssh_common_args='-J <jumphost>'
    

    In the example above you need to replace <VM_name> with the name of the VM, the IP 192.168.1.XXX with the correct IP of the VM, and <user> with the corresponding user name. You need one line per VM node that you want to include in the cluster. Finally, if you are using a jump host, you need to replace <jumphost> with its connection information, like ubuntu@177.51.170.99.

  3. Create a playbook (main-ocfs2.yml in this example) like this one:

    ---
    
    - name: Configure VMs
      hosts: all
      gather_facts: true
      become: true
      roles:
        - role: hosts
        - role: ocfs2
    

    This will run two roles: the hosts role creates a /etc/hosts file in every VM with the IPs and names of all the VMs, and the ocfs2 role installs and configures the cluster.

  4. And run it:

    $ ansible-playbook main-ocfs2.yml -i inventory.ini
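
Once the playbooks have finished, you can check on every node that /mnt is indeed an OCFS2 mount, for example with an ad-hoc Ansible command (assuming the role mounts the volume at /mnt, as in the manual instructions below):

    $ ansible all -i inventory.ini -m command -a "findmnt /mnt"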
    

OCFS2 manual install

In order to install OCFS2, you need to follow a few steps:

  1. Install the VM nodes. There are no special considerations in this step, other than making sure the nodes can see each other on the network (this is the default behaviour of VM nodes created in the same Pouta project), and that they are installed with the same distribution version. We have tested this with Ubuntu 22.04 and AlmaLinux 9; other distributions and versions might also work, but we have not tested them. This guide will use Ubuntu as an example.
    AlmaLinux requires installing a specific Oracle kernel; more information is available in the FAQ.

  2. Create and attach the volume. See the manual Create and attach a volume from above.

  3. Install the OCFS2 software:

    sudo apt install ocfs2-tools linux-modules-extra-<kernel_version> linux-image-$(uname -r)
    
    We have tested this with <kernel_version> == 6.5.0-21-generic, but newer versions should work as well or better.

  4. Make sure that every node's host name can be resolved on every other node. In Pouta, the simplest way is to use /etc/hosts, where each host gets a line similar to:

    <ip> <vm_name>
    
  5. Enable ocfs2 in every node using:

    sudo dpkg-reconfigure ocfs2-tools
    
  6. Create the file system. You need to do this in only one of the VM nodes.

    mkfs.ocfs2 -N <number_instances> /dev/vdb
    

    Replace <number_instances> with the number of VM nodes in the cluster. Also pay attention and double check that /dev/vdb is the proper volume device. In principle vdb is going to be the first volume attached to a VM, but this might not be true in all cases.

  7. Generate the file /etc/ocfs2/cluster.conf. A minimal working example follows the template below (shown as the Jinja2 template used by the Ansible role; when writing the file by hand, write one node block per VM and set node_count to the number of VMs):

    {% for host in groups['all'] %}
    node:
      ip_port = 7777
      ip_address = {{ hostvars[host]['ansible_host'] }}
      number = {{ groups['all'].index(host)+1 }}
      name = {{ host }}
      cluster = ocfs2
    {% endfor %}
    cluster:
      node_count = {{ number_instances }}
      name = ocfs2
    
  8. Reboot so the kernel you installed is taken into use. Make sure that the ocfs2 service is up and running (systemctl status ocfs2).

  9. Finally mount the volume in each node:

    sudo mount /dev/vdb /mnt
    
    As the device name may change at any moment, it is recommended to use the UUID for any serious deployment. You can get the UUID using the command blkid:

    $ sudo blkid /dev/vdb
    /dev/vdb: UUID="785134b8-4782-4a1f-8f2a-40bbe7b7b5d2" BLOCK_SIZE="4096" TYPE="ocfs2"
    
    In this case the mount command would be sudo mount -U 785134b8-4782-4a1f-8f2a-40bbe7b7b5d2 /mnt
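
    As with GFS2, it is recommended to add an entry to /etc/fstab so the volume is mounted automatically when the node reboots. A minimal sketch, using the UUID from the example above and the _netdev option so the mount waits for the network and cluster stack:

    UUID=785134b8-4782-4a1f-8f2a-40bbe7b7b5d2 /mnt ocfs2 _netdev,defaults 0 0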

OCFS2 FAQ

  • How to add more nodes?

    It is possible to add more nodes to an OCFS2 cluster, but it requires downtime.

    First you need to increase the number of slots using tunefs.ocfs2. Before that, you need to unmount the volume on every VM node. These are the two commands you need to run; the second one only needs to be executed on a single node:

    sudo umount /mnt
    sudo tunefs.ocfs2 -N 25 /dev/vdb
    

    Secondly, create the new node, install the required software, and attach the volume using the OpenStack API. The process is described above.

    Then you need to edit the file /etc/ocfs2/cluster.conf in every node and add an entry for the new one:

    node:
      ip_port = 7777
      ip_address = <ip_address>
      number = <number>
      name = <vm_name>
      cluster = ocfs2
    

    Replace <ip_address> with the address of the new server, <vm_name> with its name, and <number> with the node id number. The node id has to be unique for every node; ideally use consecutive numbers.

    Once the file is updated, you need to unmount the volume and restart the ocfs2 service on every node in the cluster. Lastly, remount the volume on every VM node.

  • What happens if a VM gets disconnected?

    This covers two different use cases: a temporary and/or unexpected disconnection, and a permanent one. It is very similar to the GFS2 situation.

    For a temporary and unexpected disconnection, the cluster should be able to deal with this kind of issue automatically. After the node is back, you need to check that everything came back to normal. In some cases the automatic mount of the volume can fail; if so, mount the volume as explained above.

    If the disconnection is temporary but expected, for example to update the kernel version, unmount the volume on the node (sudo umount /mnt) before rebooting it. This is not required, but recommended.

    For a permanent disconnection of a VM, you need to do the inverse of the process for adding a new node. Unmount the volume (sudo umount /mnt), remove the entry for this VM from the /etc/ocfs2/cluster.conf file of every node, and finally restart the daemons on every node. This needs to be done because it affects the quorum count of the cluster.

  • Is it possible to mount a node as read-only?

    Yes, it is possible to mount the volume as read-only. It is as simple as:

    sudo mount /dev/vdb /mnt -o ro
    

    After that, you can check that it was indeed mounted as read-only by:

    mount | grep /mnt
    /dev/vdb on /mnt type ocfs2 (ro,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,coherency=full,user_xattr,acl)
    
    Also, as you can see in the output above, the default behaviour is to remount the file system as read-only when any error occurs (errors=remount-ro). See mount.ocfs2 for more options.

  • I want to install the Oracle kernel on a Red Hat 9 based distro

    You can find more information here on how to install the Oracle Linux repository. Once it is set up, you can install the Oracle UEK kernel with these commands:

    First

    sudo dnf install oraclelinux-release-el9
    

    And then

    sudo dnf install kernel-uek
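
    After installing, reboot the node and check which kernel is running; UEK kernel versions typically contain "uek" in their name:

    $ uname -r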
    

Upstream documentation


Last update: June 18, 2024