In this post, we describe how we installed Ceph v12.2.0 (codename Luminous) on the Pulpos cluster.
In a surprising move, Red Hat released Ceph 12.2.0 on August 29, 2017, way ahead of their original schedule (Luminous was originally planned for release in Spring 2018!). Luminous is the current Long Term Stable (LTS) release of Ceph, replacing both the previous stable release, Kraken (Ceph v11.2), and the previous LTS release, Jewel (Ceph v10.2). Luminous introduces many major changes from Kraken and Jewel, and upgrading from an earlier release is non-trivial; so we’ll perform a clean re-installation of Luminous on Pulpos.
3) Install ceph-fuse on the client node (pulpo-dtn);
4) Install ceph-deploy on the admin node (pulpo-admin).
Let’s verify that Luminous is installed on all the nodes:
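A quick way to do so is to run the following on each node and confirm that it reports version 12.2.0:

```
ceph --version
```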
Ceph-Deploy
We use ceph-deploy to deploy Luminous on the Pulpos cluster. ceph-deploy is an easy and quick tool to set up and take down a Ceph cluster. It uses ssh to gain access to the other Ceph nodes from the admin node (pulpo-admin), and then uses the underlying Python scripts to automate the manual process of Ceph installation on each node. One can also use a generic deployment system, such as Puppet, Chef or Ansible, to deploy Ceph. I am particularly interested in ceph-ansible, the Ansible playbook for Ceph, and may try it in the near future.
1) We use the directory /root/Pulpos on the admin node to maintain the configuration files and keys
2) Create a cluster, with pulpo-mon01 as the initial monitor node (We’ll add 2 monitors shortly):
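This corresponds to the standard ceph-deploy invocation, run from /root/Pulpos on pulpo-admin:

```
cd /root/Pulpos
ceph-deploy new pulpo-mon01
```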
which generates ceph.conf & ceph.mon.keyring in the directory.
3) Append the following 2 lines to ceph.conf
The public_network is 10 Gb/s and cluster_network 40 Gb/s (see Pulpos Networks)
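The actual subnets are specific to Pulpos and aren’t reproduced here; with placeholder CIDRs, the two lines look like this:

```
# placeholder subnets -- substitute the actual Pulpos 10 Gb/s and 40 Gb/s networks
public_network = 192.168.1.0/24
cluster_network = 192.168.2.0/24
```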
4) Append the following 2 lines to ceph.conf (to allow deletion of pools):
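In Luminous, the option that permits pool deletion is mon_allow_pool_delete; the two lines are presumably along these lines:

```
[mon]
mon allow pool delete = true
```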
5) Deploy the initial monitor(s) and gather the keys:
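Presumably the standard invocation:

```
ceph-deploy mon create-initial
```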
which generates ceph.client.admin.keyring, ceph.bootstrap-osd.keyring, ceph.bootstrap-mds.keyring & ceph.bootstrap-rgw.keyring in the directory.
6) Copy the configuration file and admin key to all the nodes
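Presumably with ceph-deploy admin; the OSD node names below are placeholders, since they aren’t listed in this post:

```
# pulpo-osd01 & pulpo-osd02 are placeholder names for the OSD nodes
ceph-deploy admin pulpo-admin pulpo-mon01 pulpo-mds01 pulpo-dtn pulpo-osd01 pulpo-osd02
```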
which copies ceph.client.admin.keyring & ceph.conf to the directory /etc/ceph on all the nodes.
7) Add 2 more monitors, on pulpo-mds01 & pulpo-admin, respectively:
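Presumably with two separate invocations of ceph-deploy mon add:

```
ceph-deploy mon add pulpo-mds01
ceph-deploy mon add pulpo-admin
```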
It seems that we can only add one at a time.
8) Deploy a manager daemon on each of the monitor nodes:
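Most likely:

```
ceph-deploy mgr create pulpo-mon01 pulpo-mds01 pulpo-admin
```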
ceph-mgr is a new daemon introduced in Luminous, and is a required part of any Luminous deployment.
2) We use the following Bash script (zap-disks.sh) to zap the disks on the OSD nodes (Caution: device names can and do change!):
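The original script isn’t reproduced here; a minimal sketch, assuming hypothetical OSD host names (pulpo-osd01, pulpo-osd02) and HDD device names, would look like this:

```
#!/bin/bash
# zap-disks.sh (sketch) -- host and HDD device names below are assumptions,
# not the actual Pulpos values; adjust the lists before use.
for host in pulpo-osd01 pulpo-osd02; do
    # zap the 8TB SATA HDDs
    for disk in sdb sdc sdd sde; do
        ceph-deploy disk zap ${host}:${disk}
    done
    # zap the two NVMe SSDs
    ceph-deploy disk zap ${host}:nvme0n1 ${host}:nvme1n1
done
```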
3) We then use the following Bash script (create-osd.sh) to create OSDs on the OSD nodes:
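Again the original script isn’t reproduced here. One way to do this with Luminous’s ceph-disk tool, using the same hypothetical host and device names as above, is sketched below; the actual script may well differ:

```
#!/bin/bash
# create-osd.sh (sketch) -- host and HDD device names are assumptions.
for host in pulpo-osd01 pulpo-osd02; do
    # one bluestore OSD per 8TB SATA HDD, with its DB & WAL partitions
    # carved out of the first NVMe SSD (/dev/nvme0n1)
    for disk in sdb sdc sdd sde; do
        ssh ${host} ceph-disk prepare --bluestore \
            --block.db /dev/nvme0n1 --block.wal /dev/nvme0n1 /dev/${disk}
    done
    # one bluestore OSD on the second NVMe SSD
    ssh ${host} ceph-disk prepare --bluestore /dev/nvme1n1
done
```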
The goals were to (on each of the OSD nodes):
Create an OSD on each of the 8TB SATA HDDs, using the new bluestore backend;
Use a partition on the first NVMe SSD (/dev/nvme0n1) as the WAL device and another partition as the DB device for each of the OSDs on the HDDs;
Create an OSD on the second NVMe SSD (/dev/nvme1n1), using the new bluestore backend.
Let’s verify we have achieved our goals:
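For example (not necessarily the exact commands from the original post):

```
ceph osd tree            # each OSD is listed with its device class (hdd or nvme)
ceph osd df              # per-OSD utilization
ssh pulpo-osd01 lsblk    # the DB/WAL partitions on the first NVMe SSD (placeholder host name)
```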
Each DB partition is only 1GB in size and each WAL partition is only 576MB. So there is plenty of space left on the first NVMe SSD (the total capacity is 1.1TB). We may create a new partition there to benchmark the NVMe SSD in the near future.
One nice new feature introduced in Luminous is CRUSH device class.
Luminous automatically associates the OSDs backed by HDDs with the hdd device class, and the OSDs backed by NVMes with the nvme device class. So we no longer need to manually modify the CRUSH map (as in Kraken and earlier Ceph releases) in order to place different pools on different OSDs!
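For instance, we can list the automatically created device classes, and see the CLASS column in the OSD tree:

```
ceph osd crush class ls
ceph osd tree
```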
On Pulpos, we’d like to create the following pools for CephFS:
an Erasure Code data pool on the OSDs backed by HDDs;
a replicated metadata pool on the OSDs backed by HDDs;
a replicated pool on the OSDs backed by NVMes, as the cache tier of the Erasure Code data pool.
Creating an Erasure Code data pool
The default erasure code profile sustains the loss of a single OSD.
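It can be inspected with:

```
ceph osd erasure-code-profile get default
```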
Let’s create a new erasure code profile pulpo_ec:
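The k and m values below are placeholders (the profile we actually used isn’t reproduced here); the essential bit is the device class:

```
# k & m are placeholders -- choose values to match the number of OSD hosts
ceph osd erasure-code-profile set pulpo_ec \
    k=2 m=1 \
    crush-failure-domain=host \
    crush-device-class=hdd
```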
The important parameter is crush-device-class=hdd, which sets hdd as the device class for the profile. So a pool created with this profile will only use the OSDs backed by HDDs.
Create the Erasure Code data pool for CephFS, with the pulpo_ec profile:
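Something along these lines:

```
# the pg_num/pgp_num of 1024 is a placeholder -- pick values to suit the cluster
ceph osd pool create cephfs_data 1024 1024 erasure pulpo_ec
```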
which also generates a new CRUSH rule with the same name cephfs_data. We note in passing a terminology change: what was called a CRUSH ruleset in Kraken and earlier is now called a CRUSH rule; and the parameter crush_ruleset in the old ceph commands has been replaced with crush_rule!
Let’s check the CRUSH rule for pool cephfs_data:
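For example:

```
ceph osd pool get cephfs_data crush_rule
ceph osd crush rule dump cephfs_data
```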
By default, erasure coded pools only work with applications like RGW that perform full object writes and appends. A new feature introduced in Luminous allows partial writes for an erasure coded pool, which may be enabled with a per-pool setting. This lets RBD and CephFS store their data in an erasure coded pool! Let’s enable overwrites for the pool cephfs_data:
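The per-pool setting is allow_ec_overwrites:

```
ceph osd pool set cephfs_data allow_ec_overwrites true
```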
Creating a replicated metadata pool
As stated earlier, the goal is to create a replicated metadata pool for CephFS on the OSDs backed by HDDs.
However, the default CRUSH rule for replicated pools, replicated_rule, will use all types of OSDs, no matter whether they are backed by HDDs or by NVMes:
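This can be seen by dumping the rule:

```
ceph osd crush rule dump replicated_rule
```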
Here is the syntax for creating a new replication rule:
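Roughly:

```
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain-type> [<device-class>]
```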
Let’s create a new replication rule, pulpo_hdd, that targets the hdd device class (root is default and bucket type host):
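That is:

```
ceph osd crush rule create-replicated pulpo_hdd default host hdd
```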
Check the rule:
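For example:

```
ceph osd crush rule dump pulpo_hdd
```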
We can now create the metadata pool using the CRUSH rule pulpo_hdd:
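Something like this:

```
# the pg_num/pgp_num of 512 is a placeholder
ceph osd pool create cephfs_metadata 512 512 replicated pulpo_hdd
```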
Let’s verify that pool cephfs_metadata indeed uses rule pulpo_hdd:
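For example:

```
ceph osd pool get cephfs_metadata crush_rule
```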
We note in passing that because we have enabled overwrites for the Erasure Code data pool, we could create a CephFS at this point:
which is a marked improvement over Kraken. We, however, will delay the creation of CephFS until we’ve added a cache tier to the data pool.
Adding Cache Tiering to the data pool
The goal is to create a replicated pool on the OSDs backed by the NVMes, as the cache tier of the Erasure Code data pool for the CephFS.
1) Create a new replication rule, pulpo_nvme, that targets the nvme device class (root is default and bucket type host):
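That is:

```
ceph osd crush rule create-replicated pulpo_nvme default host nvme
```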
Check the rule:
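For example:

```
ceph osd crush rule dump pulpo_nvme
```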
2) Create the replicated cache pool:
By default, the replication size is 3. But 2 is sufficient for the cache pool.
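A sketch of the two steps (the pool name cephfs_cache and the pg_num are placeholders, not necessarily what we used on Pulpos):

```
# "cephfs_cache" and the pg_num are placeholders
ceph osd pool create cephfs_cache 128 128 replicated pulpo_nvme
ceph osd pool set cephfs_cache size 2
```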
One can list all the placement groups of the cache pool (pool 3):
and get the placement group map for a particular placement group:
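For example:

```
ceph pg ls-by-pool cephfs_cache   # cephfs_cache is the placeholder pool name from above
ceph pg map 3.0                   # 3.0 is just an example PG ID in pool 3
```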
3) Create the cache tier:
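The standard cache-tiering sequence looks like this (again, cephfs_cache is a placeholder name; a cache pool also needs at least a hit_set_type, plus sizing targets such as target_max_bytes):

```
ceph osd tier add cephfs_data cephfs_cache
ceph osd tier cache-mode cephfs_cache writeback
ceph osd tier set-overlay cephfs_data cephfs_cache
# cache pools require a HitSet type
ceph osd pool set cephfs_cache hit_set_type bloom
```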
Creating CephFS
Now we are ready to create the Ceph Filesystem:
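The command takes the filesystem name plus the metadata and data pools; the name pulpos below is a guess:

```
# "pulpos" is a guess at the filesystem name
ceph fs new pulpos cephfs_metadata cephfs_data
```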
A serious bug!
Unfortunately, there is a serious bug lurking in the current version of Luminous (v12.2.0)! If we check the status of the Ceph cluster, we are told that all placement groups are both inactive and unclean!
Same with ceph health:
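The commands themselves are just:

```
ceph -s
ceph health
```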
However, if we query any placement group that is supposedly inactive and unclean, we find it to be actually both active and clean. Take, for example, pg 1.31d:
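For example:

```
ceph pg 1.31d query
```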
We hope this bug will be fixed soon!
Mounting CephFS on clients
There are 2 ways to mount CephFS on a client: using either the kernel CephFS driver or ceph-fuse. The fuse client is the easiest way to get up-to-date code, while the kernel client will often give better performance.
On a client, e.g., pulpo-dtn, create the mount point:
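Presumably simply:

```
mkdir /mnt/pulpos
```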
Kernel CephFS driver
The Ceph Storage Cluster runs with authentication turned on by default. We need a file containing the secret key (i.e., not the keyring itself).
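A sketch of the two steps follows; the secret-file path is arbitrary, and any of the three monitors could be given to mount:

```
# extract the admin key into a plain secret file
ceph auth get-key client.admin > /etc/ceph/admin.secret
chmod 600 /etc/ceph/admin.secret

# mount CephFS with the kernel driver
mount -t ceph pulpo-mon01:6789:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret
```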
And here is another bug in the current version of Luminous (v12.2.0): when CephFS is mounted, the mount point doesn’t show up in the output of df; and although we can list the mount point specifically with df -h /mnt/pulpos, the size of the filesystem is reported as 0!
Nonetheless, we can read and write to the CephFS just fine!
ceph-fuse
Make sure the ceph-fuse package is installed. We’ve already installed the package on pulpo-dtn, using Ansible.
cephx authentication is on by default. Ensure that the client host has a copy of the Ceph configuration file and a keyring with CAPS for the Ceph metadata server. pulpo-dtn already has a copy of these 2 files. NOTE: ceph-fuse uses the keyring rather than a secret file for authentication!
Then we can use the ceph-fuse command to mount the CephFS as a FUSE (Filesystem in Userspace):
on pulpo-dtn:
or more redundantly:
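For example (the second form spells out options that ceph-fuse would otherwise pick up from /etc/ceph):

```
# simple form
ceph-fuse /mnt/pulpos

# or, more redundantly, with explicit monitor and keyring
ceph-fuse -m pulpo-mon01:6789 -k /etc/ceph/ceph.client.admin.keyring /mnt/pulpos
```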
There are 2 options to automate mounting ceph-fuse: fstab or systemd.
2) ceph-fuse@.service and ceph-fuse.target systemd units are available. To mount CephFS as a FUSE on /mnt/pulpos, using systemctl:
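That is:

```
systemctl start ceph-fuse@/mnt/pulpos
```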
To create a persistent mount point:
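That is:

```
systemctl enable ceph-fuse@-mnt-pulpos
```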
NOTE here the command must be systemctl enable ceph-fuse@-mnt-pulpos. If we run systemctl enable ceph-fuse@/mnt/pulpos instead, we’ll get an error “Failed to execute operation: Unit name pulpos is not valid.” However, when starting the service, we can run either systemctl start ceph-fuse@/mnt/pulpos or systemctl start ceph-fuse@-mnt-pulpos!
Lastly, we note the same bug in the current version of Luminous (v12.2.0): when CephFS is mounted using ceph-fuse, the mount point doesn’t show up in the output of df; and although we can list the mount point specifically with df -h /mnt/pulpos, the size of the filesystem is reported as 0!