In this post, we describe how we installed Ceph v11.2 (codename Kraken) on the Pulpos cluster.
As of this writing, the current stable release of Ceph is Kraken (Ceph v11.2). Kraken, however, is not an LTS (Long Term Stable) release, so it will only receive bugfixes and backports until the next stable release, Luminous, is completed in the spring of 2017. Every other stable release of Ceph is an LTS and will receive updates until two LTS releases have been published. The current Ceph LTS is Jewel (Ceph v10.2), and the next stable release, Luminous (Ceph v12.2), will be an LTS as well.
3) Install ceph-fuse on the client node (pulpo-dtn);
4) Install ceph-deploy on the admin node (pulpo-admin).
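Assuming a yum-based system with the Ceph Kraken repository already configured (we manage packages with Ansible), these installs amount to something like:

```bash
# on the client node (pulpo-dtn)
yum install -y ceph-fuse

# on the admin node (pulpo-admin)
yum install -y ceph-deploy
```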
Let’s verify that Kraken is installed on all the nodes:
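For example, on each node:

```bash
ceph --version    # should report ceph version 11.2.x
```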
ceph-deploy
We use ceph-deploy to deploy Kraken on the Pulpos cluster. ceph-deploy is a quick and easy tool for setting up and tearing down a Ceph cluster. It uses SSH to access the other Ceph nodes from the admin node (pulpo-admin), then uses underlying Python scripts to automate the manual process of installing Ceph on each node. One can also use a generic deployment system, such as Puppet, Chef or Ansible, to deploy Ceph. I am particularly interested in ceph-ansible, the Ansible playbook for Ceph, and may try it in the near future.
1) We use the directory /root/Pulpos on the admin node to maintain the configuration files and keys:
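ceph-deploy writes its output into the current working directory, so the commands below are run from there:

```bash
mkdir -p /root/Pulpos
cd /root/Pulpos
```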
2) Create a cluster, with pulpo-mon01 as the initial monitor node (We’ll add 2 monitors shortly):
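This is the standard ceph-deploy invocation:

```bash
ceph-deploy new pulpo-mon01
```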
which generates ceph.conf & ceph.mon.keyring in the directory.
3) Append the following 2 lines to ceph.conf
The public_network is on the 10 Gb/s network and the cluster_network on the 40 Gb/s network (see Pulpos Networks).
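The entries look like the following; the CIDRs here are placeholders for the actual Pulpos subnets:

```
public_network = 192.168.1.0/24    # placeholder for the 10 Gb/s public subnet
cluster_network = 192.168.2.0/24   # placeholder for the 40 Gb/s cluster subnet
```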
4) Append the following 2 lines to ceph.conf (to allow deletion of pools):
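Most likely a [mon] section enabling pool deletion, along these lines (an assumption):

```
[mon]
mon_allow_pool_delete = true
```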
5) Deploy the initial monitor(s) and gather the keys:
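That is:

```bash
ceph-deploy mon create-initial
```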
which generates ceph.client.admin.keyring, ceph.bootstrap-osd.keyring, ceph.bootstrap-mds.keyring & ceph.bootstrap-rgw.keyring in the directory.
6) Copy the configuration file and admin key to all the nodes:
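With ceph-deploy admin (the host list below is assumed to be the full Pulpos roster):

```bash
ceph-deploy admin pulpo-admin pulpo-mon01 pulpo-mds01 pulpo-osd01 pulpo-osd02 pulpo-osd03 pulpo-dtn
```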
which copies ceph.client.admin.keyring & ceph.conf to the directory /etc/ceph on all the nodes.
7) Add 2 more monitors, on pulpo-mds01 & pulpo-admin, respectively:
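Monitors are added one at a time:

```bash
ceph-deploy mon add pulpo-mds01
ceph-deploy mon add pulpo-admin
```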
2) We use the following Bash script (zap-disks.sh) to zap the disks on the OSD nodes (Caution: device names can and do change!):
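A sketch of the script, run from the admin node (the HDD device names are placeholders and must match the actual drive layout):

```bash
#!/bin/bash
# zap-disks.sh -- wipe all data disks on the OSD nodes (sketch; HDD names are placeholders)
for host in pulpo-osd01 pulpo-osd02 pulpo-osd03; do
    # the 8TB SATA HDDs
    for dev in sdb sdc sdd sde; do
        ceph-deploy disk zap ${host}:${dev}
    done
    # the two NVMe SSDs
    ceph-deploy disk zap ${host}:nvme0n1
    ceph-deploy disk zap ${host}:nvme1n1
done
```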
3) We then use the following Bash script (create-osd.sh), sketched after the list of goals below, to create OSDs on the OSD nodes.
The goals were, on each of the OSD nodes, to:
Create an OSD on each of the 8TB SATA HDDs, using the default filestore backend;
Use a partition on the first NVMe SSD (/dev/nvme0n1) as the journal for each of the OSDs on the HDDs;
Create an OSD on the second NVMe SSD (/dev/nvme1n1), using the default filestore backend.
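A sketch of the script, again run from the admin node (HDD device names are placeholders). Since filestore is the default backend in Kraken, no object-store flag is needed; and passing the whole first NVMe SSD as the journal argument lets ceph-disk carve out a new (default 5GB) journal partition for each OSD:

```bash
#!/bin/bash
# create-osd.sh -- create the OSDs on all OSD nodes (sketch; HDD names are placeholders)
for host in pulpo-osd01 pulpo-osd02 pulpo-osd03; do
    # one filestore OSD per 8TB SATA HDD, with its journal on the first NVMe SSD
    for dev in sdb sdc sdd sde; do
        ceph-deploy osd create ${host}:${dev}:nvme0n1
    done
    # one filestore OSD on the second NVMe SSD (journal co-located on the same device)
    ceph-deploy osd create ${host}:nvme1n1
done
```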
Let’s see if we have achieved our goals:
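For instance:

```bash
ceph osd tree    # on the admin node: one OSD per HDD plus one per NVMe SSD under each host
lsblk            # on an OSD node: the journal partitions on the first NVMe SSD
```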
It looks about right!
Each journal partition is only 5GB in size. So there is plenty of space left on the first NVMe SSD (the total capacity is 1.1TB). We may create a new partition there to benchmark the NVMe SSD in the near future.
Changing pg_num
Let’s check the health of the Ceph cluster:
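e.g.:

```bash
ceph -s
ceph osd lspools    # only the default 'rbd' pool exists at this point
```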
At this point, only one default pool, rbd, exists. But the default pg_num is too small!
The recommended pg_num for a Ceph cluster of Pulpos’ size is 1024. Let’s change both pg_num and pgp_num:
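i.e.:

```bash
ceph osd pool set rbd pg_num 1024
ceph osd pool set rbd pgp_num 1024   # if this complains that PGs are still being created, retry after a moment
```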
Wait for a couple of minutes; then check the health again:
Healthy now!
Modifying CRUSH map
Unlike Luminous, Kraken has no concept of CRUSH device class, so it doesn't differentiate between OSDs backed by HDDs and OSDs backed by NVMe SSDs.
Thus, by default, the CRUSH algorithm will pseudo-randomly store a pool's data on OSDs across the cluster, including both the OSDs backed by HDDs and those backed by NVMes! Because of the significant difference in speed between HDDs and NVMes, this will result in imbalance. A better way is to place different pools on different OSDs. In order to do that with Kraken, one must manually edit the CRUSH map.
1) Get the current CRUSH map in compiled form (crushmap-0.bin):
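The CRUSH map is fetched with:

```bash
ceph osd getcrushmap -o crushmap-0.bin
```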
2) Decompile the CRUSH map to a text file (crushmap-0.txt):
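Using crushtool:

```bash
crushtool -d crushmap-0.bin -o crushmap-0.txt
```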
Here is the decompiled CRUSH map (crushmap-0.txt):
We can see that at this point, only the default CRUSH ruleset 0 exists. We can verify that the default pool rbd uses the default ruleset 0:
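In Kraken the pool property is still called crush_ruleset (it was renamed to crush_rule in Luminous):

```bash
ceph osd pool get rbd crush_ruleset
```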
3) Edit the CRUSH map. Here is the new CRUSH map in decompiled form (crushmap-2.txt):
A quick summary of the modifications:
We replace the old host bucket pulpo-osd01 (which contained both HDD OSDs and NVMe OSD) with pulpo-osd01-hdd (which only contains HDD OSDs) and pulpo-osd01-nvme (which contains the single NVMe OSD in the node);
We make similar modifications for pulpo-osd02 & pulpo-osd03;
We remove the old root bucket default;
We add a new root bucket hdd, which contains all OSDs backed by HDDs;
We add a new root bucket nvme, which contains all OSDs backed by NVMes;
We modify the default replicated_ruleset to take the root bucket hdd (so it’ll only use OSDs backed by HDDs).
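To make the changes take effect, the edited map is compiled and injected back into the cluster, roughly:

```bash
crushtool -c crushmap-2.txt -o crushmap-2.bin
ceph osd setcrushmap -i crushmap-2.bin
```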
Next, we will create three pools for CephFS:
an Erasure Code data pool on the OSDs backed by HDDs
a replicated metadata pool on the OSDs backed by HDDs
a replicated pool on the OSDs backed by NVMes, as the cache tier of the Erasure Code data pool
Creating an Erasure Code data pool
The default erasure code profile sustains the loss of a single OSD.
Let’s create a new erasure code profile pulpo_ec:
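A sketch (k and m are placeholders for the actual data/coding chunk counts; the failure domain is assumed to be host):

```bash
ceph osd erasure-code-profile set pulpo_ec \
    k=2 m=1 \
    ruleset-failure-domain=host \
    ruleset-root=hdd
ceph osd erasure-code-profile get pulpo_ec
```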
The important parameter is ruleset-root=hdd, which sets hdd as the root bucket for the CRUSH ruleset. So a pool created with this profile will only use the OSDs backed by HDDs.
Create the Erasure Code data pool for CephFS, with the pulpo_ec profile:
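Something along these lines (the PG count is a placeholder):

```bash
ceph osd pool create cephfs_data 1024 1024 erasure pulpo_ec
```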
which also generates a new CRUSH ruleset with the same name cephfs_data.
Let’s check the ruleset for pool cephfs_data:
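Again via crush_ruleset:

```bash
ceph osd pool get cephfs_data crush_ruleset
```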
Creating a replicated metadata pool
Create the replicated metadata pool for CephFS, using the default CRUSH ruleset replicated_ruleset:
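Something like (the PG count is a placeholder):

```bash
ceph osd pool create cephfs_metadata 512 512 replicated replicated_ruleset
```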
Let’s verify the ruleset for pool cephfs_metadata:
NOTE: at this point, if we try to create a CephFS, we’ll get an error!
So in Kraken, if the data pool is erasure-coded, it is required to add a ‘cache tier’ to the data pool for CephFS to work!
Adding Cache Tiering to the data pool
The goal is to create a replicated pool on the OSDs backed by the NVMes, as the cache tier of the Erasure Code data pool for the CephFS.
1) Create a new CRUSH ruleset for the cache pool:
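One way is ceph osd crush rule create-simple, rooted at the nvme bucket (the rule name cephfs_cache is our choice here):

```bash
ceph osd crush rule create-simple cephfs_cache nvme host
```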
2) Create the replicated cache pool:
By default, the replication size is 3. But 2 is sufficient for the cache pool.
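A sketch (the pool name and PG count are our choices):

```bash
ceph osd pool create cephfs_cache 128 128 replicated cephfs_cache
ceph osd pool set cephfs_cache size 2
```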
One can list all the placement groups of the cache pool (pool 3):
and get the placement group map for a particular placement group:
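For example (the PG id below is just an illustration, using the cache pool's id 3):

```bash
ceph pg ls-by-pool cephfs_cache
ceph pg map 3.0
```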
3) Create the cache tier:
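The standard tiering commands, assuming writeback mode (a writeback tier also typically needs a hit_set configured):

```bash
ceph osd tier add cephfs_data cephfs_cache
ceph osd tier cache-mode cephfs_cache writeback
ceph osd tier set-overlay cephfs_data cephfs_cache
ceph osd pool set cephfs_cache hit_set_type bloom
```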
Creating CephFS
Now we can create the Ceph Filesystem:
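With ceph fs new (the filesystem name pulpos is an assumption):

```bash
ceph fs new pulpos cephfs_metadata cephfs_data
ceph fs ls
```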
Mounting CephFS on clients
There are 2 ways to mount CephFS on a client: using either the kernel CephFS driver or ceph-fuse. The FUSE client is the easiest way to get up-to-date code, while the kernel client will often give better performance.
On a client, e.g., pulpo-dtn, create the mount point:
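We use /mnt/pulpos:

```bash
mkdir -p /mnt/pulpos
```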
Kernel CephFS driver
The Ceph Storage Cluster runs with authentication turned on by default. We need a file containing the secret key (i.e., not the keyring itself).
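A sketch (the secret-file path and the monitor address are assumptions):

```bash
# extract the admin key into a plain secret file
ceph auth get-key client.admin > /etc/ceph/admin.secret
chmod 600 /etc/ceph/admin.secret

# mount CephFS with the kernel driver
mount -t ceph pulpo-mon01:6789:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret
```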
ceph-fuse
Make sure the ceph-fuse package is installed. We’ve already installed the package on pulpo-dtn, using Ansible.
cephx authentication is on by default. Ensure that the client host has a copy of the Ceph configuration file and a keyring with CAPS for the Ceph metadata server. pulpo-dtn already has a copy of these 2 files. NOTE: ceph-fuse uses the keyring rather than a secret file for authentication!
Then we can use the ceph-fuse command to mount the CephFS as a FUSE (Filesystem in Userspace):
on pulpo-dtn:
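In its simplest form, with the monitor address and credentials taken from /etc/ceph:

```bash
ceph-fuse /mnt/pulpos
```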
or, more verbosely:
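For example, spelling out the monitor, client name and keyring explicitly (values assumed from this setup):

```bash
ceph-fuse -m pulpo-mon01:6789 -n client.admin -k /etc/ceph/ceph.client.admin.keyring /mnt/pulpos
```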
There are 2 options to automate mounting ceph-fuse: fstab or systemd.
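1) For fstab, an entry along these lines can be added to /etc/fstab (Kraken-era fuse.ceph syntax; the id and mount options are assumptions):

```
id=admin  /mnt/pulpos  fuse.ceph  defaults,_netdev  0  0
```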
2) ceph-fuse@.service and ceph-fuse.target systemd units are available. To mount CephFS as a FUSE on /mnt/pulpos, using systemctl:
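For example:

```bash
systemctl start ceph-fuse@-mnt-pulpos
df -h /mnt/pulpos    # verify the mount
```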
To create a persistent mount point:
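That is:

```bash
systemctl enable ceph-fuse@-mnt-pulpos
```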
NOTE: here the command must be systemctl enable ceph-fuse@-mnt-pulpos. If we run systemctl enable ceph-fuse@/mnt/pulpos instead, we’ll get an error: “Failed to execute operation: Unit name pulpos is not valid.” However, when starting the service, we can run either systemctl start ceph-fuse@/mnt/pulpos or systemctl start ceph-fuse@-mnt-pulpos!