In this post, we describe how we cleanly installed Ceph v12.2.1 (codename Luminous) on the Pulpos cluster.
We installed Ceph v12.2.0 on Pulpos in late August. We created a Ceph Filesystem (CephFS), using 3 RADOS pools:
an Erasure Code data pool on the OSDs backed by HDDs
a replicated metadata pool on the OSDs backed by HDDs
a replicated pool on the OSDs backed by NVMes, as the cache tier of the Erasure Code data pool
It worked remarkably well. However, the cache tier added a lot of complexity to the architecture without appearing to add much to performance! So when Ceph v12.2.1 was released on September 28, 2017, I decided to wipe the slate clean and create from scratch a much simpler Ceph Filesystem, using just 2 RADOS pools:
a replicated data pool on the OSDs backed by HDDs, using partitions on NVMes as the WAL and DB devices
a replicated metadata pool on the OSDs backed by NVMes
3) Install ceph-fuse on the client node (pulpo-dtn);
4) Install ceph-deploy on the admin node (pulpo-admin).
Let’s verify that Luminous is installed on all the nodes:
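A quick check (this can be run on each node by hand, or fanned out with Ansible; the Ansible invocation is just one way to do it):

```
ceph --version
```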
Ceph-Deploy
We use ceph-deploy to deploy Luminous on the Pulpos cluster. ceph-deploy is an easy and quick tool to set up and take down a Ceph cluster. It uses SSH to gain access to the other Ceph nodes from the admin node (pulpo-admin), then uses Python scripts under the hood to automate the manual steps of installing Ceph on each node. One can also use a generic deployment system, such as Puppet, Chef or Ansible, to deploy Ceph. I am particularly interested in ceph-ansible, the Ansible playbook for Ceph, and may try it in the near future.
1) We use the directory /root/Pulpos on the admin node to maintain the configuration files and keys
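Presumably:

```
mkdir -p /root/Pulpos
cd /root/Pulpos
```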
2) Create a cluster, with pulpo-mon01 as the initial monitor node (We’ll add 2 monitors shortly):
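Run from within /root/Pulpos:

```
ceph-deploy new pulpo-mon01
```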
which generates ceph.conf & ceph.mon.keyring in the directory.
3) Append the following 2 lines to ceph.conf
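The exact subnets are specific to Pulpos and are not reproduced here; the two lines take the form (CIDRs are placeholders):

```
public_network = <10Gb/s public subnet CIDR>
cluster_network = <40Gb/s cluster subnet CIDR>
```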
The public_network is 10 Gb/s and cluster_network 40 Gb/s (see Pulpos Networks)
4) Append the following 2 lines to ceph.conf (to allow deletion of pools):
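Likely along these lines (the section placement is an assumption; mon_allow_pool_delete is the option Luminous checks before it allows a pool to be deleted):

```
[mon]
mon allow pool delete = true
```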
5) Deploy the initial monitor(s) and gather the keys:
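That is:

```
ceph-deploy mon create-initial
```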
which generates ceph.client.admin.keyring, ceph.bootstrap-osd.keyring, ceph.bootstrap-mds.keyring & ceph.bootstrap-rgw.keyring in the directory.
6) Copy the configuration file and admin key to all the nodes
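Presumably something like this (the OSD hostnames are placeholders; the other hostnames appear elsewhere in this post):

```
ceph-deploy admin pulpo-admin pulpo-mon01 pulpo-mds01 pulpo-dtn pulpo-osd01 pulpo-osd02 pulpo-osd03
```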
which copies ceph.client.admin.keyring & ceph.conf to the directory /etc/ceph on all the nodes.
7) Add 2 more monitors, on pulpo-mds01 & pulpo-admin, respectively:
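Something like:

```
ceph-deploy mon add pulpo-mds01
ceph-deploy mon add pulpo-admin
```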
It seems that we can only add one at a time.
8) Deploy a manager daemon on each of the monitor nodes:
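Presumably:

```
ceph-deploy mgr create pulpo-mon01 pulpo-mds01 pulpo-admin
```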
ceph-mgr is a new daemon introduced in Luminous, and is a required part of any Luminous deployment.
2) We use the following Bash script (zap-disks.sh) to zap the disks on the OSD nodes (Caution: device names can and do change!):
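A sketch of what zap-disks.sh might look like, assuming the ceph-deploy 1.5.x host:disk syntax; the hostnames and device names are placeholders and must be adjusted to the actual hardware:

```bash
#!/bin/bash
# zap-disks.sh -- a sketch; hostnames and device names are placeholders
for host in pulpo-osd01 pulpo-osd02 pulpo-osd03; do
    # zap the SATA HDDs (adjust the device list; make sure not to zap the system disk!)
    for dev in sd{a..h}; do
        ceph-deploy disk zap ${host}:${dev}
    done
    # zap the two NVMe SSDs
    for dev in nvme0n1 nvme1n1; do
        ceph-deploy disk zap ${host}:${dev}
    done
done
```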
3) We then use the following Bash script (create-osd.sh) to create OSDs on the OSD nodes:
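A sketch of what create-osd.sh might look like, assuming ceph-deploy 1.5.x syntax with bluestore support (flags differ between ceph-deploy versions), and that ceph-disk is allowed to carve the small WAL & DB partitions out of the first NVMe SSD automatically; hostnames and HDD device names are placeholders:

```bash
#!/bin/bash
# create-osd.sh -- a sketch; hostnames and device names are placeholders
for host in pulpo-osd01 pulpo-osd02 pulpo-osd03; do
    # one bluestore OSD per 8TB SATA HDD, with its WAL & DB on the first NVMe SSD
    for dev in sd{a..h}; do
        ceph-deploy osd create --bluestore \
            --block-wal /dev/nvme0n1 --block-db /dev/nvme0n1 \
            ${host}:${dev}
    done
    # one bluestore OSD on the second NVMe SSD
    ceph-deploy osd create --bluestore ${host}:nvme1n1
done
```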
The goals were, on each of the OSD nodes, to:
Create an OSD on each of the 8TB SATA HDDs, using the new bluestore backend;
Use a partition on the first NVMe SSD (/dev/nvme0n1) as the WAL device and another partition as the DB device for each of the OSDs on the HDDs;
Create an OSD on the second NVMe SSD (/dev/nvme1n1), using the new bluestore backend.
Let’s verify we have achieved our goals:
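For example (a sketch; run on the admin node and on an OSD node, respectively):

```bash
# On the admin node: list the OSDs, their hosts and device classes
ceph osd tree

# On an OSD node: each HDD-backed OSD should have block.db & block.wal symlinks
# pointing at partitions on the first NVMe SSD
ls -l /var/lib/ceph/osd/ceph-*/block*
```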
Each DB partition is only 1GB in size and each WAL partition is only 576MB. So there is plenty of space left on the first NVMe SSD (the total capacity is 1.1TB). We may create a new partition there to benchmark the NVMe SSD in the near future.
One nice new feature introduced in Luminous is CRUSH device class.
Luminous automatically associates the OSDs backed by HDDs with the hdd device class, and the OSDs backed by NVMes with the nvme device class. So we no longer need to manually modify the CRUSH map (as in Kraken and earlier Ceph releases) in order to place different pools on different OSDs!
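One way to inspect the automatic assignment:

```
ceph osd tree             # the CLASS column shows each OSD's device class
ceph osd crush class ls   # list all device classes present in the cluster
```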
We'll take advantage of this feature for the 2 RADOS pools of our CephFS:
a replicated data pool on the OSDs backed by HDDs
a replicated metadata pool on the OSDs backed by NVMes
Creating a replicated data pool
However, the default CRUSH rule for replicated pools, replicated_rule, will use all OSDs, no matter whether they are backed by HDDs or by NVMes:
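For example:

```
ceph osd crush rule dump replicated_rule
```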
Here is the syntax for creating a new replication rule:
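As of Luminous:

```
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain-type> [<device-class>]
```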
Let’s create a new replication rule, pulpo_hdd, that targets the hdd device class (root is default and bucket type host):
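That is:

```
ceph osd crush rule create-replicated pulpo_hdd default host hdd
```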
Check the rule:
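```
ceph osd crush rule dump pulpo_hdd
```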
We can now create the data pool, cephfs_data, using the CRUSH rule pulpo_hdd:
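Presumably something like this (the pool name and the initial PG count are taken from the discussion below):

```
ceph osd pool create cephfs_data 1024 1024 replicated pulpo_hdd
```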
Here we set the number of placement groups to 1024, which was a bit too low, as we got a warning from ceph -s:
Let’s double the PG (Placement Groups) number:
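Presumably:

```
ceph osd pool set cephfs_data pg_num 2048
ceph osd pool set cephfs_data pgp_num 2048
```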
Let’s verify that pool cephfs_data indeed uses rule pulpo_hdd:
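```
ceph osd pool get cephfs_data crush_rule
```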
By default, the replication size is 3, but 2 is sufficient for the data pool.
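So we presumably reduce it with:

```
ceph osd pool set cephfs_data size 2
```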
Creating a replicated metadata pool
1) Create a new replication rule, pulpo_nvme, that targets the nvme device class (root is default and bucket type host):
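That is:

```
ceph osd crush rule create-replicated pulpo_nvme default host nvme
```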
Check the rule:
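```
ceph osd crush rule dump pulpo_nvme
```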
2) Create the replicated metadata pool:
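Presumably something like this (the initial PG count of 512 is an assumption; we only know from below that it was later doubled):

```
ceph osd pool create cephfs_metadata 512 512 replicated pulpo_nvme
```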
By default, the replication size is 3. But 2 is sufficient for the metadata pool.
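So we presumably reduce it with:

```
ceph osd pool set cephfs_metadata size 2
```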
Let’s double the PG number:
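Presumably (assuming the pool was created with 512 PGs):

```
ceph osd pool set cephfs_metadata pg_num 1024
ceph osd pool set cephfs_metadata pgp_num 1024
```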
Let’s check the status of our Ceph cluster again:
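```
ceph -s
```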
One can list all the placement groups of the metadata pool (pool 2):
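```
ceph pg ls-by-pool cephfs_metadata
```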
and get the placement group map for a particular placement group:
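For example (the pgid here is arbitrary):

```
ceph pg map 2.3f
```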
Let’s verify that pool cephfs_metadata indeed uses rule pulpo_nvme:
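```
ceph osd pool get cephfs_metadata crush_rule
```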
Creating CephFS
Now we are ready to create the Ceph Filesystem:
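Presumably (the filesystem name pulpos is an assumption, inferred from the mount point /mnt/pulpos used below):

```
ceph fs new pulpos cephfs_metadata cephfs_data
```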
By the way, the default maximum file size is 1TiB:
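Which can be checked with (again assuming the filesystem name pulpos):

```
ceph fs get pulpos | grep max_file_size
```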
Let’s raise it to 2TiB:
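Presumably (2 TiB = 2199023255552 bytes):

```
ceph fs set pulpos max_file_size 2199023255552
```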
Limitations
Unfortunately, there is a serious bug lurking in the current version of Luminous (v12.2.1)! If we check the status of the Ceph cluster, we are told that all placement groups are both inactive and unclean!
Same with ceph health:
However, if we query any placement group that is supposedly inactive and unclean, we find it to be actually both active and clean. Take, for example, pg 1.7cd:
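```
ceph pg 1.7cd query
```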
Mounting CephFS on clients
There are 2 ways to mount CephFS on a client: either the kernel CephFS driver or ceph-fuse. The FUSE client is the easiest way to get up-to-date code, while the kernel client will often give better performance.
On a client, e.g., pulpo-dtn, create the mount point:
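For example:

```
mkdir -p /mnt/pulpos
```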
Kernel CephFS driver
The Ceph Storage Cluster runs with authentication turned on by default. We need a file containing the secret key (i.e., not the keyring itself).
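A sketch of the procedure (the use of client.admin and the path of the secret file are assumptions):

```bash
# Extract the plain secret key from the admin keyring into a secret file
ceph auth get-key client.admin > /etc/ceph/admin.secret
chmod 600 /etc/ceph/admin.secret

# Mount CephFS with the kernel driver (any monitor address will do)
mount -t ceph pulpo-mon01:6789:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret
```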
And here is another bug in the current version of Luminous (v12.2.1): when CephFS is mounted, the mount point doesn’t show up in the output of df; and although we can list the mount point specifically with df -h /mnt/pulpos, the size of the filesystem is reported as 0!
Nonetheless, we can read and write to the CephFS just fine!
ceph-fuse
Make sure the ceph-fuse package is installed. We’ve already installed the package on pulpo-dtn, using Ansible.
cephx authentication is on by default. Ensure that the client host has a copy of the Ceph configuration file and a keyring with CAPS for the Ceph metadata server. pulpo-dtn already has a copy of these 2 files. NOTE: ceph-fuse uses the keyring rather than a secret file for authentication!
Then we can use the ceph-fuse command to mount the CephFS as a FUSE (Filesystem in Userspace):
on pulpo-dtn:
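```
ceph-fuse /mnt/pulpos
```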
or more redundantly:
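perhaps with the monitors and keyring spelled out explicitly (a guess at what the fuller invocation looked like):

```
ceph-fuse -k /etc/ceph/ceph.client.admin.keyring -m pulpo-mon01:6789,pulpo-mds01:6789,pulpo-admin:6789 /mnt/pulpos
```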
There are 2 options to automate mounting ceph-fuse: fstab or systemd.
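1) For the fstab option, the entry would look something like this (a sketch based on the fuse.ceph mount helper and the admin client; the exact option syntax varies between Ceph releases):

```
# /etc/fstab
id=admin  /mnt/pulpos  fuse.ceph  defaults,_netdev  0  0
```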
2) ceph-fuse@.service and ceph-fuse.target systemd units are available. To mount CephFS as a FUSE on /mnt/pulpos, using systemctl:
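For example:

```
systemctl start ceph-fuse@/mnt/pulpos
```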
To create a persistent mount point:
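```
systemctl enable ceph-fuse@-mnt-pulpos
```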
NOTE: here the command must be systemctl enable ceph-fuse@-mnt-pulpos. If we run systemctl enable ceph-fuse@/mnt/pulpos instead, we’ll get an error “Failed to execute operation: Unit name pulpos is not valid.” However, when starting the service, we can run either systemctl start ceph-fuse@/mnt/pulpos or systemctl start ceph-fuse@-mnt-pulpos!
Lastly, we note the same bug in the current version of Luminous (v12.2.1): when CephFS is mounted using ceph-fuse, the mount point doesn’t show up in the output of df; and although we can list the mount point specifically with df -h /mnt/pulpos, the size of the filesystem is reported as 0!