Installing Ceph v12.2.1 (Luminous) on Pulpos

October 9, 2017 | Ceph Storage Provisioning

In this post, we describe how we cleanly installed Ceph v12.2.1 (codename Luminous) on the Pulpos cluster.

We installed Ceph v12.2.0 on Pulpos in late August. We created a Ceph Filesystem (CephFS), using 3 RADOS pools:

  1. an Erasure Code data pool on the OSDs backed by HDDs
  2. a replicated metadata pool on the OSDs backed by HDDs
  3. a replicated pool on the OSDs backed by NVMes, as the cache tier of the Erasure Code data pool

It worked remarkably well. However, the cache tier added a lot of complexity to the architecture without appearing to add much to performance! When Ceph v12.2.1 was released on September 28, 2017, I decided to wipe the slate clean and create from scratch a much simpler Ceph Filesystem, using just 2 RADOS pools:

  1. a replicated data pool on the OSDs backed by HDDs, using partitions on NVMes as the WAL and DB devices
  2. a replicated metadata pool on the OSDs backed by NVMes

Purging old Ceph installation

In order to start with a clean slate, we ran the following Bash script (start-over.sh) on the admin node (pulpo-admin) to purge old Ceph packages, and erase old Ceph data and configuration:

#!/bin/bash

cd ~/Pulpos

# Uninstall ceph-fuse on the client
ssh pulpo-dtn "umount /mnt/pulpos; yum erase -y ceph-fuse"

# http://docs.ceph.com/docs/master/start/quick-ceph-deploy/#starting-over
ceph-deploy purge pulpo-admin pulpo-dtn pulpo-mon01 pulpo-mds01 pulpo-osd01 pulpo-osd02 pulpo-osd03
ceph-deploy purgedata pulpo-admin pulpo-dtn pulpo-mon01 pulpo-mds01 pulpo-osd01 pulpo-osd02 pulpo-osd03
ceph-deploy forgetkeys

# More cleanups
ansible -m command -a "yum erase -y libcephfs2 python-cephfs librados2 python-rados librbd1 python-rbd librgw2 python-rgw" all
ansible -m shell -a "rm -rf /etc/systemd/system/ceph*.target.wants" all
yum erase -y ceph-deploy
rm -f ceph*

We then rebooted all the nodes.

Installing Luminous packages

A new kernel and many updated packages have been released in the interim. We first used a simple Ansible playbook to upgrade all packages.
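
The upgrade playbook itself is not reproduced here; an ad-hoc Ansible equivalent (a rough sketch, using the same host pattern all as elsewhere in this post) would be:

[root@pulpo-admin ~]# ansible -m yum -a "name=* state=latest" all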

Then we used a simple Ansible playbook to install Ceph v12.2.1 (Luminous) packages on all nodes in Pulpos, performing the following tasks:

1) Add a Yum repository for Luminous (/etc/yum.repos.d/ceph.repo) on all the nodes:

[Ceph]
name=Ceph packages for $basearch
baseurl=https://download.ceph.com/rpm-luminous/el7/$basearch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
priority=2

[Ceph-noarch]
name=Ceph noarch packages
baseurl=https://download.ceph.com/rpm-luminous/el7/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
priority=2

[ceph-source]
name=Ceph source packages
baseurl=https://download.ceph.com/rpm-luminous/el7/SRPMS
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
priority=2

2) Install Ceph RPM packages on all the nodes;

3) Install ceph-fuse on the client node (pulpo-dtn);

4) Install ceph-deploy on the admin node (pulpo-admin).
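
The playbook itself is not shown here, but a rough shell equivalent of these tasks (a sketch; the local ceph.repo path and the pulpo-dtn.local host pattern are assumptions based on the Ansible output below) would be:

# 1) push the Yum repo file above to every node
ansible -m copy -a "src=ceph.repo dest=/etc/yum.repos.d/ceph.repo" all
# 2) install the Ceph packages on all the nodes
ansible -m yum -a "name=ceph state=present" all
# 3) install ceph-fuse on the client only
ansible -m yum -a "name=ceph-fuse state=present" pulpo-dtn.local
# 4) install ceph-deploy on the admin node (run locally on pulpo-admin)
yum install -y ceph-deploy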

Let’s verify that Luminous is installed on all the nodes:

[root@pulpo-admin ~]# ansible -m command -a "ceph --version" all
pulpo-dtn.local | SUCCESS | rc=0 >>
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)

pulpo-admin.local | SUCCESS | rc=0 >>
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)

pulpo-mds01.local | SUCCESS | rc=0 >>
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)

pulpo-mon01.local | SUCCESS | rc=0 >>
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)

pulpo-osd01.local | SUCCESS | rc=0 >>
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)

pulpo-osd02.local | SUCCESS | rc=0 >>
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)

pulpo-osd03.local | SUCCESS | rc=0 >>
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)

Ceph-Deploy

We use ceph-deploy to deploy Luminous on the Pulpos cluster. ceph-deploy is a quick and easy tool for setting up and tearing down a Ceph cluster. It uses SSH to access the other Ceph nodes from the admin node (pulpo-admin), then runs Python scripts that automate the manual steps of installing Ceph on each node. One can also use a generic deployment system, such as Puppet, Chef or Ansible, to deploy Ceph. I am particularly interested in ceph-ansible, the Ansible playbook for Ceph, and may try it in the near future.

1) We use the directory /root/Pulpos on the admin node to maintain the configuration files and keys:

[root@pulpo-admin ~]# cd ~/Pulpos/

2) Create a cluster, with pulpo-mon01 as the initial monitor node (we’ll add 2 more monitors shortly):

[root@pulpo-admin Pulpos]# ceph-deploy new pulpo-mon01

which generates ceph.conf & ceph.mon.keyring in the directory.

3) Append the following 2 lines to ceph.conf

public_network = 128.114.86.0/24
cluster_network = 192.168.40.0/24

The public_network runs at 10 Gb/s and the cluster_network at 40 Gb/s (see Pulpos Networks).

4) Append the following 2 lines to ceph.conf (to allow deletion of pools):

[mon]
mon_allow_pool_delete = true
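
For reference, after these edits our ceph.conf looks roughly like the following (the fsid matches the cluster id reported later by ceph -s; the auth lines are the defaults written by ceph-deploy new; the mon_host address is an assumption based on the mount commands further below):

[global]
fsid = 5f675e57-a4dc-4425-ab4e-2e46f605411d
mon_initial_members = pulpo-mon01
mon_host = 128.114.86.4
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = 128.114.86.0/24
cluster_network = 192.168.40.0/24

[mon]
mon_allow_pool_delete = true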

5) Deploy the initial monitor(s) and gather the keys:

[root@pulpo-admin Pulpos]# ceph-deploy mon create-initial

which generates ceph.client.admin.keyring, ceph.bootstrap-osd.keyring, ceph.bootstrap-mds.keyring & ceph.bootstrap-rgw.keyring in the directory.

6) Copy the configuration file and admin key to all the nodes

[root@pulpo-admin Pulpos]# ceph-deploy admin pulpo-admin pulpo-dtn pulpo-mon01 pulpo-mds01 pulpo-osd01 pulpo-osd02 pulpo-osd03

which copies ceph.client.admin.keyring & ceph.conf to the directory /etc/ceph on all the nodes.

7) Add 2 more monitors, on pulpo-mds01 & pulpo-admin, respectively:

[root@pulpo-admin Pulpos]# ceph-deploy mon add pulpo-mds01
[root@pulpo-admin Pulpos]# ceph-deploy mon add pulpo-admin

It seems that we can only add one at a time.

8) Deploy a manager daemon on each of the monitor nodes:

[root@pulpo-admin Pulpos]# ceph-deploy mgr create pulpo-mon01 pulpo-mds01 pulpo-admin

ceph-mgr is a new daemon introduced in Luminous, and is a required part of any Luminous deployment.
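
As an aside (we did not do this on Pulpos), a quick way to see the manager in action is its module system; for example, the optional dashboard module can be enabled and its URL queried with:

[root@pulpo-admin Pulpos]# ceph mgr module enable dashboard
[root@pulpo-admin Pulpos]# ceph mgr services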

Adding OSDs

1) List the disks on the OSD nodes:

[root@pulpo-admin Pulpos]# ssh pulpo-osd01 ceph-disk list

or

[root@pulpo-admin Pulpos]# ssh pulpo-osd01 lsblk

2) We use the following Bash script (zap-disks.sh) to zap the disks on the OSD nodes (Caution: device names can and do change!):

#!/bin/bash

for i in {1..3}
do
  for x in {a..l}
  do
     # zap the twelve 8TB SATA drives
     ceph-deploy disk zap pulpo-osd0${i}:sd${x}
  done
  for j in {0..1}
  do
     # zap the two NVMe SSDs
     ceph-deploy disk zap pulpo-osd0${i}:nvme${j}n1
  done
done

3) We then use the following Bash script (create-osd.sh) to create OSDs on the OSD nodes:

#!/bin/bash

### HDDs
for i in {1..3}
do
  for x in {a..l}
  do
     ceph-deploy osd prepare --bluestore --block-db /dev/nvme0n1 --block-wal /dev/nvme0n1 pulpo-osd0${i}:sd${x}
     ceph-deploy osd activate pulpo-osd0${i}:sd${x}1
     sleep 10
  done
done

### NVMe
for i in {1..3}
do
  ceph-deploy osd prepare --bluestore pulpo-osd0${i}:nvme1n1
  ceph-deploy osd activate pulpo-osd0${i}:nvme1n1p1
  sleep 10
done

The goals were, on each of the OSD nodes, to:

  • Create an OSD on each of the twelve 8TB SATA HDDs, using the new bluestore backend;
  • Use a partition on the first NVMe SSD (/dev/nvme0n1) as the WAL device and another partition as the DB device for each of the OSDs on the HDDs;
  • Create an OSD on the second NVMe SSD (/dev/nvme1n1), also using the bluestore backend.

Let’s verify we have achieved our goals:

[root@pulpo-admin Pulpos]# ssh pulpo-osd01 ceph-disk list
/dev/nvme0n1 :
 /dev/nvme0n1p1 ceph block.db, for /dev/sda1
 /dev/nvme0n1p10 ceph block.wal, for /dev/sde1
 /dev/nvme0n1p11 ceph block.db, for /dev/sdf1
 /dev/nvme0n1p12 ceph block.wal, for /dev/sdf1
 /dev/nvme0n1p13 ceph block.db, for /dev/sdg1
 /dev/nvme0n1p14 ceph block.wal, for /dev/sdg1
 /dev/nvme0n1p15 ceph block.db, for /dev/sdh1
 /dev/nvme0n1p16 ceph block.wal, for /dev/sdh1
 /dev/nvme0n1p17 ceph block.db, for /dev/sdi1
 /dev/nvme0n1p18 ceph block.wal, for /dev/sdi1
 /dev/nvme0n1p19 ceph block.db, for /dev/sdj1
 /dev/nvme0n1p2 ceph block.wal, for /dev/sda1
 /dev/nvme0n1p20 ceph block.wal, for /dev/sdj1
 /dev/nvme0n1p21 ceph block.db, for /dev/sdk1
 /dev/nvme0n1p22 ceph block.wal, for /dev/sdk1
 /dev/nvme0n1p23 ceph block.db, for /dev/sdl1
 /dev/nvme0n1p24 ceph block.wal, for /dev/sdl1
 /dev/nvme0n1p3 ceph block.db, for /dev/sdb1
 /dev/nvme0n1p4 ceph block.wal, for /dev/sdb1
 /dev/nvme0n1p5 ceph block.db, for /dev/sdc1
 /dev/nvme0n1p6 ceph block.wal, for /dev/sdc1
 /dev/nvme0n1p7 ceph block.db, for /dev/sdd1
 /dev/nvme0n1p8 ceph block.wal, for /dev/sdd1
 /dev/nvme0n1p9 ceph block.db, for /dev/sde1
/dev/nvme1n1 :
 /dev/nvme1n1p1 ceph data, active, cluster ceph, osd.36, block /dev/nvme1n1p2
 /dev/nvme1n1p2 ceph block, for /dev/nvme1n1p1
/dev/sda :
 /dev/sda1 ceph data, active, cluster ceph, osd.0, block /dev/sda2, block.db /dev/nvme0n1p1, block.wal /dev/nvme0n1p2
 /dev/sda2 ceph block, for /dev/sda1
/dev/sdb :
 /dev/sdb1 ceph data, active, cluster ceph, osd.1, block /dev/sdb2, block.db /dev/nvme0n1p3, block.wal /dev/nvme0n1p4
 /dev/sdb2 ceph block, for /dev/sdb1
/dev/sdc :
 /dev/sdc1 ceph data, active, cluster ceph, osd.2, block /dev/sdc2, block.db /dev/nvme0n1p5, block.wal /dev/nvme0n1p6
 /dev/sdc2 ceph block, for /dev/sdc1
/dev/sdd :
 /dev/sdd1 ceph data, active, cluster ceph, osd.3, block /dev/sdd2, block.db /dev/nvme0n1p7, block.wal /dev/nvme0n1p8
 /dev/sdd2 ceph block, for /dev/sdd1
/dev/sde :
 /dev/sde1 ceph data, active, cluster ceph, osd.4, block /dev/sde2, block.db /dev/nvme0n1p9, block.wal /dev/nvme0n1p10
 /dev/sde2 ceph block, for /dev/sde1
/dev/sdf :
 /dev/sdf1 ceph data, active, cluster ceph, osd.5, block /dev/sdf2, block.db /dev/nvme0n1p11, block.wal /dev/nvme0n1p12
 /dev/sdf2 ceph block, for /dev/sdf1
/dev/sdg :
 /dev/sdg1 ceph data, active, cluster ceph, osd.6, block /dev/sdg2, block.db /dev/nvme0n1p13, block.wal /dev/nvme0n1p14
 /dev/sdg2 ceph block, for /dev/sdg1
/dev/sdh :
 /dev/sdh1 ceph data, active, cluster ceph, osd.7, block /dev/sdh2, block.db /dev/nvme0n1p15, block.wal /dev/nvme0n1p16
 /dev/sdh2 ceph block, for /dev/sdh1
/dev/sdi :
 /dev/sdi1 ceph data, active, cluster ceph, osd.8, block /dev/sdi2, block.db /dev/nvme0n1p17, block.wal /dev/nvme0n1p18
 /dev/sdi2 ceph block, for /dev/sdi1
/dev/sdj :
 /dev/sdj1 ceph data, active, cluster ceph, osd.9, block /dev/sdj2, block.db /dev/nvme0n1p19, block.wal /dev/nvme0n1p20
 /dev/sdj2 ceph block, for /dev/sdj1
/dev/sdk :
 /dev/sdk1 ceph data, active, cluster ceph, osd.10, block /dev/sdk2, block.db /dev/nvme0n1p21, block.wal /dev/nvme0n1p22
 /dev/sdk2 ceph block, for /dev/sdk1
/dev/sdl :
 /dev/sdl1 ceph data, active, cluster ceph, osd.11, block /dev/sdl2, block.db /dev/nvme0n1p23, block.wal /dev/nvme0n1p24
 /dev/sdl2 ceph block, for /dev/sdl1

Each DB partition is only 1GB in size and each WAL partition is only 576MB. So there is plenty of space left on the first NVMe SSD (the total capacity is 1.1TB). We may create a new partition there to benchmark the NVMe SSD in the near future.
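
These partition sizes are governed by the BlueStore settings bluestore_block_db_size and bluestore_block_wal_size. If we wanted larger DB/WAL partitions, one way (a sketch of an approach we did not take here) would be to set them in ceph.conf, and push it to the OSD nodes, before preparing the OSDs:

[osd]
# sizes are in bytes; 10 GiB DB and 2 GiB WAL per OSD (illustrative values)
bluestore_block_db_size = 10737418240
bluestore_block_wal_size = 2147483648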

Let’s check the cluster status:

[root@pulpo-admin Pulpos]# ceph -s
cluster:
  id:     5f675e57-a4dc-4425-ab4e-2e46f605411d
  health: HEALTH_OK

services:
  mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
  mgr: pulpo-mon01(active), standbys: pulpo-mds01, pulpo-admin
  osd: 39 osds: 39 up, 39 in

data:
  pools:   0 pools, 0 pgs
  objects: 0 objects, 0 bytes
  usage:   0 kB used, 0 kB / 0 kB avail
  pgs:


CRUSH device class

One nice new feature introduced in Luminous is CRUSH device class.

[root@pulpo-admin Pulpos]# ceph osd tree
ID CLASS WEIGHT    TYPE NAME            STATUS REWEIGHT PRI-AFF
-1       265.29291 root default
-3        88.43097     host pulpo-osd01
 0   hdd   7.27829         osd.0            up  1.00000 1.00000
 1   hdd   7.27829         osd.1            up  1.00000 1.00000
 2   hdd   7.27829         osd.2            up  1.00000 1.00000
 3   hdd   7.27829         osd.3            up  1.00000 1.00000
 4   hdd   7.27829         osd.4            up  1.00000 1.00000
 5   hdd   7.27829         osd.5            up  1.00000 1.00000
 6   hdd   7.27829         osd.6            up  1.00000 1.00000
 7   hdd   7.27829         osd.7            up  1.00000 1.00000
 8   hdd   7.27829         osd.8            up  1.00000 1.00000
 9   hdd   7.27829         osd.9            up  1.00000 1.00000
10   hdd   7.27829         osd.10           up  1.00000 1.00000
11   hdd   7.27829         osd.11           up  1.00000 1.00000
36  nvme   1.09149         osd.36           up  1.00000 1.00000
-5        88.43097     host pulpo-osd02
12   hdd   7.27829         osd.12           up  1.00000 1.00000
13   hdd   7.27829         osd.13           up  1.00000 1.00000
14   hdd   7.27829         osd.14           up  1.00000 1.00000
15   hdd   7.27829         osd.15           up  1.00000 1.00000
16   hdd   7.27829         osd.16           up  1.00000 1.00000
17   hdd   7.27829         osd.17           up  1.00000 1.00000
18   hdd   7.27829         osd.18           up  1.00000 1.00000
19   hdd   7.27829         osd.19           up  1.00000 1.00000
20   hdd   7.27829         osd.20           up  1.00000 1.00000
21   hdd   7.27829         osd.21           up  1.00000 1.00000
22   hdd   7.27829         osd.22           up  1.00000 1.00000
23   hdd   7.27829         osd.23           up  1.00000 1.00000
37  nvme   1.09149         osd.37           up  1.00000 1.00000
-7        88.43097     host pulpo-osd03
24   hdd   7.27829         osd.24           up  1.00000 1.00000
25   hdd   7.27829         osd.25           up  1.00000 1.00000
26   hdd   7.27829         osd.26           up  1.00000 1.00000
27   hdd   7.27829         osd.27           up  1.00000 1.00000
28   hdd   7.27829         osd.28           up  1.00000 1.00000
29   hdd   7.27829         osd.29           up  1.00000 1.00000
30   hdd   7.27829         osd.30           up  1.00000 1.00000
31   hdd   7.27829         osd.31           up  1.00000 1.00000
32   hdd   7.27829         osd.32           up  1.00000 1.00000
33   hdd   7.27829         osd.33           up  1.00000 1.00000
34   hdd   7.27829         osd.34           up  1.00000 1.00000
35   hdd   7.27829         osd.35           up  1.00000 1.00000
38  nvme   1.09149         osd.38           up  1.00000 1.00000

Luminous automatically associates the OSDs backed by HDDs with the hdd device class, and the OSDs backed by NVMes with the nvme device class. So we no longer need to manually modify the CRUSH map (as in Kraken and earlier Ceph releases) in order to place different pools on different OSDs!

[root@pulpo-admin Pulpos]# ceph osd crush class ls
[
    "hdd",
    "nvme"
]
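
We can also list the OSDs that belong to a given device class; per the tree above, the nvme class should contain osd.36, osd.37 and osd.38:

[root@pulpo-admin Pulpos]# ceph osd crush class ls-osd nvme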

Adding an MDS

Add a Metadata server:

[root@pulpo-admin Pulpos]# ceph-deploy mds create pulpo-mds01

As alluded to earlier, the goal is to create a Ceph Filesystem (CephFS), using 2 RADOS pools:

  1. a replicated data pool on the OSDs backed by HDDs
  2. a replicated metadata pool on the OSDs backed by NVMes

Creating a replicated data pool

However, the default CRUSH rule for replicated pools, replicated_rule, uses all OSDs, regardless of whether they are backed by HDDs or by NVMes:

[root@pulpo-admin Pulpos]# ceph osd crush rule dump replicated_rule
{
    "rule_id": 0,
    "rule_name": "replicated_rule",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

Here is the syntax for creating a new replication rule:

osd crush rule create-replicated <name>      create crush rule <name> for replicated
 <root> <type> {<class>}                      pool to start from <root>, replicate
                                              across buckets of type <type>, using a
                                              choose mode of <firstn|indep> (default
                                              firstn; indep best for erasure pools)

Let’s create a new replication rule, pulpo_hdd, that targets the hdd device class (root is default and bucket type host):

[root@pulpo-admin Pulpos]# ceph osd crush rule create-replicated pulpo_hdd default host hdd

Check the rule:

[root@pulpo-admin Pulpos]# ceph osd crush rule dump pulpo_hdd
{
    "rule_id": 1,
    "rule_name": "pulpo_hdd",
    "ruleset": 1,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -2,
            "item_name": "default~hdd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

We can now create the data pool using the CRUSH rule pulpo_hdd:

[root@pulpo-admin Pulpos]# ceph osd pool create cephfs_data 1024 1024 replicated pulpo_hdd
pool 'cephfs_data' created

Here we set the number of placement groups to 1024, which turned out to be a bit too low, as we got a warning from ceph -s:

[root@pulpo-admin Pulpos]# ceph -s
  cluster:
    id:     5f675e57-a4dc-4425-ab4e-2e46f605411d
    health: HEALTH_WARN
            Reduced data availability: 1152 pgs inactive
            Degraded data redundancy: 1152 pgs unclean
            too few PGs per OSD (29 < min 30)

  services:
    mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
    mgr: pulpo-mon01(active), standbys: pulpo-mds01, pulpo-admin
    mds: pulpos-1/1/1 up  {0=pulpo-mds01=up:active}
    osd: 39 osds: 39 up, 39 in

  data:
    pools:   2 pools, 1152 pgs
    objects: 0 objects, 0 bytes
    usage:   0 kB used, 0 kB / 0 kB avail
    pgs:     100.000% pgs unknown
             1152 unknown

Let’s double the PG (Placement Groups) number:

[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_data pg_num
pg_num: 1024
[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_data pgp_num
pgp_num: 1024
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_data pg_num 2048
set pool 1 pg_num to 2048
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_data pgp_num 2048
set pool 1 pgp_num to 2048

Let’s verify that pool cephfs_data indeed uses rule pulpo_hdd:

[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_data crush_rule
crush_rule: pulpo_hdd

By default, the replication size is 3. But 2 is sufficient for the data pool.

[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_data size
size: 3
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_data size 2
set pool 1 size to 2

Creating a replicated metadata pool

1) Create a new replication rule, pulpo_nvme, that targets the nvme device class (root is default and bucket type host):

[root@pulpo-admin Pulpos]# ceph osd crush rule create-replicated pulpo_nvme default host nvme

Check the rule:

[root@pulpo-admin ~]# ceph osd crush rule ls
replicated_rule
pulpo_hdd
pulpo_nvme

[root@pulpo-admin ~]# ceph osd crush rule dump pulpo_nvme
{
    "rule_id": 2,
    "rule_name": "pulpo_nvme",
    "ruleset": 2,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -12,
            "item_name": "default~nvme"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

2) Create the replicated metadata pool:

[root@pulpo-admin Pulpos]# ceph osd pool create cephfs_metadata 128 128 replicated pulpo_nvme
pool 'cephfs_metadata' created

By default, the replication size is 3. But 2 is sufficient for the metadata pool.

[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_metadata size
size: 3
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_metadata size 2
set pool 2 size to 2

Let’s double the PG number:

[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_metadata pg_num
pg_num: 128
[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_metadata pgp_num
pgp_num: 128
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_metadata pg_num 256
set pool 2 pg_num to 256
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_metadata pgp_num 256
set pool 2 pgp_num to 256

Let’s check the status of our Ceph cluster again:

[root@pulpo-admin Pulpos]# ceph -s
  cluster:
    id:     5f675e57-a4dc-4425-ab4e-2e46f605411d
    health: HEALTH_WARN
            Reduced data availability: 2176 pgs inactive
            Degraded data redundancy: 2176 pgs unclean

  services:
    mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
    mgr: pulpo-mon01(active), standbys: pulpo-mds01, pulpo-admin
    mds: pulpos-1/1/1 up  {0=pulpo-mds01=up:active}
    osd: 39 osds: 39 up, 39 in

  data:
    pools:   2 pools, 2304 pgs
    objects: 0 objects, 0 bytes
    usage:   0 kB used, 0 kB / 0 kB avail
    pgs:     100.000% pgs unknown
             2304 unknown

One can list all the placement groups of the metadata pool (pool 2):

[root@pulpo-admin Pulpos]# ceph pg dump | grep '^2\.'

and get the placement group map for a particular placement group:

[root@pulpo-admin Pulpos]# ceph pg map 2.1
osdmap e169 pg 2.1 (2.1) -> up [36,38] acting [36,38]

Let’s verify that pool cephfs_metadata indeed uses rule pulpo_nvme:

[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_metadata crush_rule
crush_rule: pulpo_nvme

Creating CephFS

Now we are ready to create the Ceph Filesystem:

[root@pulpo-admin Pulpos]# ceph fs new pulpos cephfs_metadata cephfs_data
new fs with metadata pool 2 and data pool 1

By the way, the default maximum file size is 1TiB:

[root@pulpo-admin ~]# ceph fs get pulpos | grep 1099511627776
max_file_size   1099511627776

Let’s raise it to 2TiB (2 × 2^40 = 2,199,023,255,552 bytes):

[root@pulpo-admin ~]# ceph fs set pulpos max_file_size 2199023255552
[root@pulpo-admin ~]# ceph fs get pulpos | grep max_file_size
max_file_size   2199023255552

Limitations

Unfortunately, there is a serious bug lurking in the current version of Luminous (v12.2.1)! If we check the status of the Ceph cluster, we are told that all placement groups are both inactive and unclean!

[root@pulpo-admin Pulpos]# ceph -s
cluster:
  id:     5f675e57-a4dc-4425-ab4e-2e46f605411d
  health: HEALTH_WARN
          Reduced data availability: 2176 pgs inactive
          Degraded data redundancy: 2176 pgs unclean

services:
  mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
  mgr: pulpo-mon01(active), standbys: pulpo-mds01, pulpo-admin
  mds: pulpos-1/1/1 up  {0=pulpo-mds01=up:active}
  osd: 39 osds: 39 up, 39 in

data:
  pools:   2 pools, 2304 pgs
  objects: 0 objects, 0 bytes
  usage:   0 kB used, 0 kB / 0 kB avail
  pgs:     100.000% pgs unknown
           2304 unknown

Same with ceph health detail:

[root@pulpo-admin Pulpos]# ceph health detail
HEALTH_WARN Reduced data availability: 2304 pgs inactive; Degraded data redundancy: 2304 pgs unclean
PG_AVAILABILITY Reduced data availability: 2304 pgs inactive
    pg 1.7cd is stuck inactive for 223.224011, current state unknown, last acting []
    pg 1.7ce is stuck inactive for 223.224011, current state unknown, last acting []
    pg 1.7cf is stuck inactive for 223.224011, current state unknown, last acting []
...
PG_DEGRADED Degraded data redundancy: 2304 pgs unclean
    pg 1.7cd is stuck unclean for 223.224011, current state unknown, last acting []
    pg 1.7ce is stuck unclean for 223.224011, current state unknown, last acting []
    pg 1.7cf is stuck unclean for 223.224011, current state unknown, last acting []
...

However, if we query any placement group that is supposedly inactive and unclean, we find it to be actually both active and clean. Take, for example, pg 1.7cd:

[root@pulpo-admin Pulpos]# ceph pg 1.7cd query
{
    "state": "active+clean",
    "snap_trimq": "[]",
    "epoch": 178,
    "up": [
        5,
        13
    ],
    "acting": [
        5,
        13
    ],
    ...
}

Mounting CephFS on clients

There are 2 ways to mount CephFS on a client: the kernel CephFS driver, or ceph-fuse. The FUSE client is the easiest way to get up-to-date code, while the kernel client will often give better performance.

On a client, e.g., pulpo-dtn, create the mount point:

[root@pulpo-dtn ~]# mkdir /mnt/pulpos

Kernel CephFS driver

The Ceph Storage Cluster runs with authentication turned on by default. We need a file containing the secret key (i.e., not the keyring itself).

0) Create the secret file, and save it as /etc/ceph/admin.secret.
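
For example (one way to do it; pulpo-dtn already has the admin keyring, so we can simply ask the cluster for the key):

[root@pulpo-dtn ~]# ceph auth get-key client.admin > /etc/ceph/admin.secret
[root@pulpo-dtn ~]# chmod 600 /etc/ceph/admin.secret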

1) We can use the mount command to mount CephFS with the kernel driver:

[root@pulpo-dtn ~]# mount -t ceph 128.114.86.4:6789:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret

or more redundantly

[root@pulpo-dtn ~]# mount -t ceph 128.114.86.4:6789,128.114.86.5:6789,128.114.86.2:6789:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret

With this method, we need to specify the monitor host IP address(es) and port number(s).

2) Or we can use the simple helper mount.ceph, which resolves monitor hostname(s) into IP address(es):

[root@pulpo-dtn ~]# mount.ceph pulpo-mon01:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret

or more redundantly

[root@pulpo-dtn ~]# mount.ceph pulpo-mon01,pulpo-mds01,pulpo-admin:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret

3) To mount CephFS automatically on startup, we can add the following to /etc/fstab:

128.114.86.4:6789,128.114.86.5:6789,128.114.86.2:6789:/  /mnt/pulpos  ceph  name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev  0  2

And here is another bug in the current version of Luminous (v12.2.1): when CephFS is mounted, the mount point doesn’t show up in the output of df; and although we can list the mount point explicitly with df -h /mnt/pulpos, the size of the filesystem is reported as 0!

[root@pulpo-dtn ~]# df -h /mnt/pulpos
Filesystem                                Size  Used Avail Use% Mounted on
128.114.86.4,128.114.86.5,128.114.86.2:/     0     0     0    - /mnt/pulpos

Nonetheless, we can read and write to the CephFS just fine!

ceph-fuse

Make sure the ceph-fuse package is installed. We’ve already installed the package on pulpo-dtn, using Ansible.

cephx authentication is on by default. Ensure that the client host has a copy of the Ceph configuration file and a keyring with CAPS for the Ceph metadata server; pulpo-dtn already has copies of these 2 files. Note that ceph-fuse uses the keyring rather than a secret file for authentication!

Then we can use the ceph-fuse command to mount CephFS as a FUSE (Filesystem in Userspace) on pulpo-dtn:

[root@pulpo-dtn ~]# ceph-fuse -m 128.114.86.4:6789 /mnt/pulpos
ceph-fuse[3699]: starting ceph client2017-09-05 11:13:45.398103 7f8891150040 -1 init, newargv = 0x7f889b82ee40 newargc=9

ceph-fuse[3699]: starting fuse

or more redundantly:

[root@pulpo-dtn ~]# ceph-fuse -m pulpo-mon01:6789,pulpo-mds01:6789,pulpo-admin:6789 /mnt/pulpos

There are 2 options to automate mounting ceph-fuse: fstab or systemd.

1) We can add the following to /etc/fstab (see http://docs.ceph.com/docs/luminous/cephfs/fstab/#fuse):

none  /mnt/pulpos  fuse.ceph  ceph.id=admin,defaults,_netdev  0  0

2) ceph-fuse@.service and ceph-fuse.target systemd units are available. To mount CephFS as a FUSE on /mnt/pulpos, using systemctl:

[root@pulpo-dtn ~]# systemctl start ceph-fuse@/mnt/pulpos.service

To create a persistent mount point:

[root@pulpo-dtn ~]# systemctl enable ceph-fuse.target
Created symlink from /etc/systemd/system/remote-fs.target.wants/ceph-fuse.target to /usr/lib/systemd/system/ceph-fuse.target.
Created symlink from /etc/systemd/system/ceph.target.wants/ceph-fuse.target to /usr/lib/systemd/system/ceph-fuse.target.

[root@pulpo-dtn ~]# systemctl enable ceph-fuse@-mnt-pulpos
Created symlink from /etc/systemd/system/ceph-fuse.target.wants/ceph-fuse@-mnt-pulpos.service to /usr/lib/systemd/system/ceph-fuse@.service.

Note that here the command must be systemctl enable ceph-fuse@-mnt-pulpos. If we run systemctl enable ceph-fuse@/mnt/pulpos instead, we’ll get an error “Failed to execute operation: Unit name pulpos is not valid.” However, when starting the service, we can run either systemctl start ceph-fuse@/mnt/pulpos or systemctl start ceph-fuse@-mnt-pulpos!
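
The dash form is just the systemd-escaped version of the mount path, which can be generated with systemd-escape:

[root@pulpo-dtn ~]# systemd-escape --path /mnt/pulpos
mnt-pulpos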

Lastly, we note the same bug in the current version of Luminous (v12.2.1): when CephFS is mounted using ceph-fuse, the mount point doesn’t show up in the output of df; and although we can list the mount point explicitly with df -h /mnt/pulpos, the size of the filesystem is reported as 0!

[root@pulpo-dtn ~]# df -h /mnt/pulpos
Filesystem      Size  Used Avail Use% Mounted on
ceph-fuse          0     0     0    - /mnt/pulpos

Curiously, after a few days the problem went away on its own and df reported the correct sizes.

Upgrading to v12.2.2

Ceph v12.2.2 (Luminous) was released on December 1, 2017. The upgrade is a breeze:

[root@pulpo-admin ~]# yum clean all
[root@pulpo-admin ~]# yum -y update

[root@pulpo-admin ~]# tentakel "yum clean all"
[root@pulpo-admin ~]# tentakel "yum -y update"

[root@pulpo-admin ~]# reboot
[root@pulpo-admin ~]# tentakel reboot
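
After the reboots, we can confirm that the installed packages and the running daemons are all at v12.2.2, e.g. (ceph versions, new in Luminous, reports the versions of the running daemons):

[root@pulpo-admin ~]# ansible -m command -a "ceph --version" all
[root@pulpo-admin ~]# ceph versions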