Creating a GlusterFS Cluster for VMs

The best GlusterFS layout I have found for VMs is a distributed replicated volume with sharding enabled. I use laptops with two drives in each one. Let’s say we have four laptops with two drives each; we would do something like what I have listed below.

First, let’s create a partition on each disk: run fdisk /dev/sda and fdisk /dev/sdb to create a single partition on each. Next, format each partition as XFS and create mount points for the disks:

mkfs.xfs -i size=512 /dev/sda1
mkfs.xfs -i size=512 /dev/sdb1
mkdir /mnt/disk1
mkdir /mnt/disk2
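
If you would rather script the partitioning than step through fdisk interactively, something like the following should work. This is just a sketch, assuming blank disks and a GPT label:

# Create a GPT label and one partition spanning each disk
parted -s /dev/sda mklabel gpt mkpart primary xfs 0% 100%
parted -s /dev/sdb mklabel gpt mkpart primary xfs 0% 100%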

Now we can use blkid to list the UUIDs of sda1 and sdb1 so that we can add them to fstab. My fstab looks something like this (your UUIDs will be different). The allocsize option pre-allocates file space 64MB at a time to limit fragmentation and improve performance. The noatime option prevents the access-time attribute from being updated every time a file is touched, which also helps performance. The nofail option keeps a failed disk from preventing the system from booting.

# /etc/fstab
UUID=3edc7ec8-303a-42c6-9937-16ef37068c72 /mnt/disk2 xfs defaults,allocsize=64m,noatime,nofail 0 1
UUID=b8906693-27ba-466b-9c39-8066aa765d2e /mnt/disk1 xfs defaults,allocsize=64m,noatime,nofail 0 1
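
To grab those UUIDs, point blkid at each partition. The output shown here is an abbreviated example; yours will differ:

blkid /dev/sda1 /dev/sdb1
# /dev/sda1: UUID="b8906693-27ba-466b-9c39-8066aa765d2e" TYPE="xfs"
# /dev/sdb1: UUID="3edc7ec8-303a-42c6-9937-16ef37068c72" TYPE="xfs"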

Now, I did something funky with my fstab because I wanted to mount my bricks under the volume name so that I could have different volumes on the same disks. So I added these lines to my fstab (my volume name is “prod”).

# /etc/fstab
/mnt/disk1/prod/brick1 /mnt/gluster/prod/brick1 none bind 0 0
/mnt/disk2/prod/brick2 /mnt/gluster/prod/brick2 none bind 0 0

The bind mounts need their source and target directories to exist, so mount the disks and create the directories first:

mount /mnt/disk1
mount /mnt/disk2
mkdir -p /mnt/disk1/prod/brick1
mkdir -p /mnt/disk2/prod/brick2
mkdir -p /mnt/gluster/prod/brick1
mkdir -p /mnt/gluster/prod/brick2

Now we can mount everything else (the bind mounts).

mount -a

Make sure that everything is mounted properly.

df -h /mnt/gluster/prod/brick1

Make sure that you see /dev/sda1 in the Filesystem column. If not, just reboot and fstab will mount everything in the proper order.

Now let’s create the gluster cluster.
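
If the four nodes are not already joined into a trusted storage pool, that has to happen before the volume can be created. A quick sketch, run from gluster1 and assuming the hostnames gluster1 through gluster4 resolve on every node:

systemctl enable --now glusterd   # run this on every node
gluster peer probe gluster2
gluster peer probe gluster3
gluster peer probe gluster4
gluster peer status               # should show the other three peers connected

With the pool in place, create the volume: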

gluster volume create prod replica 2 \
  gluster1:/mnt/gluster/prod/brick1 gluster2:/mnt/gluster/prod/brick1 \
  gluster3:/mnt/gluster/prod/brick1 gluster4:/mnt/gluster/prod/brick1 \
  gluster1:/mnt/gluster/prod/brick2 gluster2:/mnt/gluster/prod/brick2 \
  gluster3:/mnt/gluster/prod/brick2 gluster4:/mnt/gluster/prod/brick2

By specifying the bricks in this order, we ensure that gluster1 and gluster2 are paired with each other, and gluster3 and gluster4 are paired with each other: with replica 2, Gluster builds each replica set from consecutive pairs of bricks in the list. If we listed both of gluster1’s bricks successively, they would replicate to each other and we would be unable to sustain a failure of the gluster1 node. So we alternate servers, which also helps performance by spreading replica traffic across machines. Note that recent GlusterFS releases will prompt for confirmation when creating a replica 2 volume, since two-way replication is prone to split-brain.
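
Once the volume exists, you can sanity-check the pairing with gluster volume info, which lists bricks in replica-set order (with replica 2, bricks 1 and 2 form the first set, bricks 3 and 4 the second, and so on):

gluster volume info prod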

Now, the only thing left is to tune some parameters meant for VM workloads. For each parameter below, we will use a command like the following (a loop that applies the full list appears after it):

gluster volume set prod storage.fips-mode-rchecksum on

Here are my options:

performance.readdir-ahead: on
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
cluster.granular-entry-heal: on
network.ping-timeout: 20
features.shard-block-size: 64MB
client.event-threads: 4
server.event-threads: 4
cluster.data-self-heal-algorithm: full
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
server.allow-insecure: on
features.shard: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: on
cluster.eager-lock: enable
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
cluster.locking-scheme: granular
performance.low-prio-threads: 32
cluster.choose-local: off
storage.fips-mode-rchecksum: on
config.transport: tcp
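
Rather than typing each gluster volume set command by hand, you can apply the whole list in a loop. A sketch using a here-document with the exact options above:

while read -r opt val; do
  gluster volume set prod "$opt" "$val"
done <<'EOF'
performance.readdir-ahead on
performance.client-io-threads on
nfs.disable on
transport.address-family inet
cluster.granular-entry-heal on
network.ping-timeout 20
features.shard-block-size 64MB
client.event-threads 4
server.event-threads 4
cluster.data-self-heal-algorithm full
cluster.shd-max-threads 8
cluster.shd-wait-qlength 10000
server.allow-insecure on
features.shard on
cluster.server-quorum-type server
cluster.quorum-type auto
network.remote-dio on
cluster.eager-lock enable
performance.io-cache off
performance.read-ahead off
performance.quick-read off
cluster.locking-scheme granular
performance.low-prio-threads 32
cluster.choose-local off
storage.fips-mode-rchecksum on
config.transport tcp
EOF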

Now we can fire up the gluster volume with the following command.

gluster volume start prod
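
Once the volume is running, each VM host mounts it with the GlusterFS FUSE client. A sketch, assuming a hypothetical mount point of /mnt/vms and the glusterfs-fuse package installed:

gluster volume status prod   # confirm all eight bricks show as online
mkdir -p /mnt/vms
mount -t glusterfs gluster1:/prod /mnt/vms

You can also pass -o backup-volfile-servers=gluster2:gluster3 to the mount so that it still comes up if gluster1 happens to be down.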