In this post, we describe how we installed Ceph v11.2 (codename Kraken) on the Pulpos cluster.
As of this writing, the current stable release of Ceph is Kraken (Ceph v11.2). Kraken, however, is not an LTS (Long Term Stable) release, so it will only receive bugfixes and backports until the next stable release, Luminous, is completed in the spring of 2017. Every other stable release of Ceph is an LTS and will receive updates until two LTS releases have been published. The current Ceph LTS is Jewel (Ceph v10.2), and the next stable release, Luminous (Ceph v12.2), will be an LTS as well.
3) Install ceph-fuse on the client node (pulpo-dtn);
4) Install ceph-deploy on the admin node (pulpo-admin).
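Assuming a yum-based system with the Ceph Kraken repository already configured (we manage packages with Ansible), these installs amount to something like:

```bash
# on the client node (pulpo-dtn)
yum install -y ceph-fuse

# on the admin node (pulpo-admin)
yum install -y ceph-deploy
```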
Let’s verify that Kraken is installed on all the nodes:
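For example, on each node:

```bash
ceph --version    # should report ceph version 11.2.x
```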
ceph-deploy
We use ceph-deploy to deploy Kraken on the Pulpos cluster. ceph-deploy is a quick and easy tool for setting up and tearing down a Ceph cluster. It uses SSH to access the other Ceph nodes from the admin node (pulpo-admin), then uses underlying Python scripts to automate the manual process of installing Ceph on each node. One can also use a generic deployment system, such as Puppet, Chef or Ansible, to deploy Ceph. I am particularly interested in ceph-ansible, the Ansible playbook for Ceph, and may try it in the near future.
1) We use the directory /root/Pulpos on the admin node to maintain the configuration files and keys:
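ceph-deploy writes its output into the current working directory, so the commands below are run from there:

```bash
mkdir -p /root/Pulpos
cd /root/Pulpos
```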
2) Create a cluster, with pulpo-mon01 as the initial monitor node (We’ll add 2 monitors shortly):
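This is the standard ceph-deploy invocation:

```bash
ceph-deploy new pulpo-mon01
```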
which generates ceph.conf & ceph.mon.keyring in the directory.
3) Append the following 2 lines to ceph.conf
The public_network is on the 10 Gb/s network and the cluster_network on the 40 Gb/s network (see Pulpos Networks).
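The entries look like the following; the CIDRs here are placeholders for the actual Pulpos subnets:

```
public_network = 192.168.1.0/24    # placeholder for the 10 Gb/s public subnet
cluster_network = 192.168.2.0/24   # placeholder for the 40 Gb/s cluster subnet
```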
4) Append the following 2 lines to ceph.conf (to allow deletion of pools):
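Most likely a [mon] section enabling pool deletion, along these lines (an assumption):

```
[mon]
mon_allow_pool_delete = true
```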
5) Deploy the initial monitor(s) and gather the keys:
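That is:

```bash
ceph-deploy mon create-initial
```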
which generates ceph.client.admin.keyring, ceph.bootstrap-osd.keyring, ceph.bootstrap-mds.keyring & ceph.bootstrap-rgw.keyring in the directory.
6) Copy the configuration file and admin key to all the nodes:
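With ceph-deploy admin (the host list below is assumed to be the full Pulpos roster):

```bash
ceph-deploy admin pulpo-admin pulpo-mon01 pulpo-mds01 pulpo-osd01 pulpo-osd02 pulpo-osd03 pulpo-dtn
```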
which copies ceph.client.admin.keyring & ceph.conf to the directory /etc/ceph on all the nodes.
7) Add 2 more monitors, on pulpo-mds01 & pulpo-admin, respectively:
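Monitors are added one at a time:

```bash
ceph-deploy mon add pulpo-mds01
ceph-deploy mon add pulpo-admin
```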
2) We use the following Bash script (zap-disks.sh) to zap the disks on the OSD nodes (Caution: device names can and do change!):
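A sketch of the script, run from the admin node (the HDD device names are placeholders and must match the actual drive layout):

```bash
#!/bin/bash
# zap-disks.sh -- wipe all data disks on the OSD nodes (sketch; HDD names are placeholders)
for host in pulpo-osd01 pulpo-osd02 pulpo-osd03; do
    # the 8TB SATA HDDs
    for dev in sdb sdc sdd sde; do
        ceph-deploy disk zap ${host}:${dev}
    done
    # the two NVMe SSDs
    ceph-deploy disk zap ${host}:nvme0n1
    ceph-deploy disk zap ${host}:nvme1n1
done
```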
3) We then use the following Bash script (create-osd.sh), sketched after the list of goals below, to create OSDs on the OSD nodes.
The goals were, on each of the OSD nodes, to:
Create an OSD on each of the 8TB SATA HDDs, using the default filestore backend;
Use a partition on the first NVMe SSD (/dev/nvme0n1) as the journal for each of the OSDs on the HDDs;
Create an OSD on the second NVMe SSD (/dev/nvme1n1), using the default filestore backend.
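A sketch of the script, again run from the admin node (HDD device names are placeholders). Since filestore is the default backend in Kraken, no object-store flag is needed; and passing the whole first NVMe SSD as the journal argument lets ceph-disk carve out a new (default 5GB) journal partition for each OSD:

```bash
#!/bin/bash
# create-osd.sh -- create the OSDs on all OSD nodes (sketch; HDD names are placeholders)
for host in pulpo-osd01 pulpo-osd02 pulpo-osd03; do
    # one filestore OSD per 8TB SATA HDD, with its journal on the first NVMe SSD
    for dev in sdb sdc sdd sde; do
        ceph-deploy osd create ${host}:${dev}:nvme0n1
    done
    # one filestore OSD on the second NVMe SSD (journal co-located on the same device)
    ceph-deploy osd create ${host}:nvme1n1
done
```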
Let’s see if we have achieved our goals:
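For instance:

```bash
ceph osd tree    # on the admin node: one OSD per HDD plus one per NVMe SSD under each host
lsblk            # on an OSD node: the journal partitions on the first NVMe SSD
```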
It looks about right!
Each journal partition is only 5GB in size. So there is plenty of space left on the first NVMe SSD (the total capacity is 1.1TB). We may create a new partition there to benchmark the NVMe SSD in the near future.
Changing pg_num
Let’s check the health of the Ceph cluster:
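e.g.:

```bash
ceph -s
ceph osd lspools    # only the default 'rbd' pool exists at this point
```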
At this point, only one default pool, rbd, exists. But the default pg_num is too small!
The recommended pg_num for a Ceph cluster of Pulpos’ size is 1024. Let’s change both pg_num and pgp_num:
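i.e.:

```bash
ceph osd pool set rbd pg_num 1024
ceph osd pool set rbd pgp_num 1024   # if this complains that PGs are still being created, retry after a moment
```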
Wait for a couple of minutes; then check the health again:
Healthy now!
Modifying CRUSH map
Unlike Luminous, Kraken has no concept of CRUSH device class, so it doesn't differentiate between OSDs backed by HDDs and OSDs backed by NVMe SSDs.
Thus, by default, the CRUSH algorithm will pseudo-randomly store a pool's data on OSDs across the cluster, including both the OSDs backed by HDDs and those backed by NVMes! Because of the significant difference in speed between HDDs and NVMes, this will result in imbalance. A better way is to place different pools on different OSDs. In order to do that with Kraken, one must manually edit the CRUSH map.
1) Get the current CRUSH map in compiled form (crushmap-0.bin):
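The CRUSH map is fetched with:

```bash
ceph osd getcrushmap -o crushmap-0.bin
```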
2) Decompile the CRUSH map to a text file (crushmap-0.txt):
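Using crushtool:

```bash
crushtool -d crushmap-0.bin -o crushmap-0.txt
```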
Here is the decompiled CRUSH map (crushmap-0.txt):
We can see that at this point, only the default CRUSH ruleset 0 exists. We can verify that the default pool rbd uses the default ruleset 0:
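In Kraken the pool property is still called crush_ruleset (it was renamed to crush_rule in Luminous):

```bash
ceph osd pool get rbd crush_ruleset
```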
3) Edit the CRUSH map. Here is the new CRUSH map in decompiled form (crushmap-2.txt):
A quick summary of the modifications:
We replace the old host bucket pulpo-osd01 (which contained both HDD OSDs and NVMe OSD) with pulpo-osd01-hdd (which only contains HDD OSDs) and pulpo-osd01-nvme (which contains the single NVMe OSD in the node);
We make similar modifications for pulpo-osd02 & pulpo-osd03;
We remove the old root bucket default;
We add a new root bucket hdd, which contains all OSDs backed by HDDs;
We add a new root bucket nvme, which contains all OSDs backed by NVMes;
We modify the default replicated_ruleset to take the root bucket hdd (so it’ll only use OSDs backed by HDDs).
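To make the changes take effect, the edited map is compiled and injected back into the cluster, roughly:

```bash
crushtool -c crushmap-2.txt -o crushmap-2.bin
ceph osd setcrushmap -i crushmap-2.bin
```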
Next, we will create three pools for CephFS:
an Erasure Code data pool on the OSDs backed by HDDs
a replicated metadata pool on the OSDs backed by HDDs
a replicated pool on the OSDs backed by NVMes, as the cache tier of the Erasure Code data pool
Creating an Erasure Code data pool
The default erasure code profile sustains the loss of a single OSD.
Let’s create a new erasure code profile pulpo_ec:
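A sketch (k and m are placeholders for the actual data/coding chunk counts; the failure domain is assumed to be host):

```bash
ceph osd erasure-code-profile set pulpo_ec \
    k=2 m=1 \
    ruleset-failure-domain=host \
    ruleset-root=hdd
ceph osd erasure-code-profile get pulpo_ec
```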
The important parameter is ruleset-root=hdd, which sets hdd as the root bucket for the CRUSH ruleset. So a pool created with this profile will only use the OSDs backed by HDDs.
Create the Erasure Code data pool for CephFS, with the pulpo_ec profile:
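Something along these lines (the PG count is a placeholder):

```bash
ceph osd pool create cephfs_data 1024 1024 erasure pulpo_ec
```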
which also generates a new CRUSH ruleset with the same name cephfs_data.
Let’s check the ruleset for pool cephfs_data:
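Again via crush_ruleset:

```bash
ceph osd pool get cephfs_data crush_ruleset
```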
Creating a replicated metadata pool
Create the replicated metadata pool for CephFS, using the default CRUSH ruleset replicated_ruleset:
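Something like (the PG count is a placeholder):

```bash
ceph osd pool create cephfs_metadata 512 512 replicated replicated_ruleset
```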
Let’s verify the ruleset for pool cephfs_metadata:
NOTE: at this point, if we try to create a CephFS, we’ll get an error!
So in Kraken, if the data pool is erasure-coded, it is required to add a ‘cache tier’ to the data pool for CephFS to work!
Adding Cache Tiering to the data pool
The goal is to create a replicated pool on the OSDs backed by the NVMes, as the cache tier of the Erasure Code data pool for the CephFS.
1) Create a new CRUSH ruleset for the cache pool:
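One way is ceph osd crush rule create-simple, rooted at the nvme bucket (the rule name cephfs_cache is our choice here):

```bash
ceph osd crush rule create-simple cephfs_cache nvme host
```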
2) Create the replicated cache pool:
By default, the replication size is 3. But 2 is sufficient for the cache pool.
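A sketch (the pool name and PG count are our choices):

```bash
ceph osd pool create cephfs_cache 128 128 replicated cephfs_cache
ceph osd pool set cephfs_cache size 2
```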
One can list all the placement groups of the cache pool (pool 3):
and get the placement group map for a particular placement group:
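For example (the PG id below is just an illustration, using the cache pool's id 3):

```bash
ceph pg ls-by-pool cephfs_cache
ceph pg map 3.0
```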
3) Create the cache tier:
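The standard tiering commands, assuming writeback mode (a writeback tier also typically needs a hit_set configured):

```bash
ceph osd tier add cephfs_data cephfs_cache
ceph osd tier cache-mode cephfs_cache writeback
ceph osd tier set-overlay cephfs_data cephfs_cache
ceph osd pool set cephfs_cache hit_set_type bloom
```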
Creating CephFS
Now we can create the Ceph Filesystem:
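With ceph fs new (the filesystem name pulpos is an assumption):

```bash
ceph fs new pulpos cephfs_metadata cephfs_data
ceph fs ls
```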
Mounting CephFS on clients
There are 2 ways to mount CephFS on a client: using either the kernel CephFS driver or ceph-fuse. The FUSE client is the easiest way to get up-to-date code, while the kernel client will often give better performance.
On a client, e.g., pulpo-dtn, create the mount point:
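We use /mnt/pulpos:

```bash
mkdir -p /mnt/pulpos
```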
Kernel CephFS driver
The Ceph Storage Cluster runs with authentication turned on by default. We need a file containing the secret key (i.e., not the keyring itself).
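A sketch (the secret-file path and the monitor address are assumptions):

```bash
# extract the admin key into a plain secret file
ceph auth get-key client.admin > /etc/ceph/admin.secret
chmod 600 /etc/ceph/admin.secret

# mount CephFS with the kernel driver
mount -t ceph pulpo-mon01:6789:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret
```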
ceph-fuse
Make sure the ceph-fuse package is installed. We’ve already installed the package on pulpo-dtn, using Ansible.
cephx authentication is on by default. Ensure that the client host has a copy of the Ceph configuration file and a keyring with CAPS for the Ceph metadata server. pulpo-dtn already has a copy of these 2 files. NOTE: ceph-fuse uses the keyring rather than a secret file for authentication!
Then we can use the ceph-fuse command to mount the CephFS as a FUSE (Filesystem in Userspace):
on pulpo-dtn:
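In its simplest form, with the monitor address and credentials taken from /etc/ceph:

```bash
ceph-fuse /mnt/pulpos
```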
or, more verbosely:
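For example, spelling out the monitor, client name and keyring explicitly (values assumed from this setup):

```bash
ceph-fuse -m pulpo-mon01:6789 -n client.admin -k /etc/ceph/ceph.client.admin.keyring /mnt/pulpos
```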
There are 2 options to automate mounting ceph-fuse: fstab or systemd.
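1) For fstab, an entry along these lines can be added to /etc/fstab (Kraken-era fuse.ceph syntax; the id and mount options are assumptions):

```
id=admin  /mnt/pulpos  fuse.ceph  defaults,_netdev  0  0
```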
2) ceph-fuse@.service and ceph-fuse.target systemd units are available. To mount CephFS as a FUSE on /mnt/pulpos, using systemctl:
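For example:

```bash
systemctl start ceph-fuse@-mnt-pulpos
df -h /mnt/pulpos    # verify the mount
```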
To create a persistent mount point:
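That is:

```bash
systemctl enable ceph-fuse@-mnt-pulpos
```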
NOTE: here the command must be systemctl enable ceph-fuse@-mnt-pulpos. If we run systemctl enable ceph-fuse@/mnt/pulpos instead, we’ll get an error: “Failed to execute operation: Unit name pulpos is not valid.” However, when starting the service, we can run either systemctl start ceph-fuse@/mnt/pulpos or systemctl start ceph-fuse@-mnt-pulpos!