Installing Ceph v12.2 (Luminous) on Pulpos
August 30, 2017 | Ceph Storage Provisioning

In this post, we describe how we installed Ceph v12.2.0 (codename Luminous) on the Pulpos cluster.
In a surprising move, Red Hat released Ceph 12.2.0 on August 29, 2017, way ahead of the original schedule; Luminous was originally planned for release in Spring 2018! Luminous is the current Long Term Stable (LTS) release of Ceph, replacing both the previous stable release, Kraken (Ceph v11.2), and the previous LTS release, Jewel (Ceph v10.2). Luminous introduces many major changes from Kraken and Jewel, and upgrading from an earlier release is non-trivial. So we’ll perform a clean re-installation of Luminous on Pulpos.
- Purging old Ceph installation
- Installing Luminous packages
- Ceph-Deploy
- Adding OSDs
- CRUSH device class
- Adding an MDS
- Creating an Erasure Code data pool
- Creating a replicated metadata pool
- Adding Cache Tiering to the data pool
- Creating CephFS
- A serious bug!
- Mounting CephFS on clients
Purging old Ceph installation
In order to start with a clean slate, we ran the following Bash script (start-over.sh) on the admin node (pulpo-admin) to purge old Ceph packages and erase old Ceph data and configuration:
#!/bin/bash
cd ~/Pulpos
# Uninstall ceph-fuse on the client
ssh pulpo-dtn "umount /mnt/pulpos; yum erase -y ceph-fuse"
# http://docs.ceph.com/docs/master/start/quick-ceph-deploy/#starting-over
ceph-deploy purge pulpo-admin pulpo-dtn pulpo-mon01 pulpo-mds01 pulpo-osd01 pulpo-osd02 pulpo-osd03
ceph-deploy purgedata pulpo-admin pulpo-dtn pulpo-mon01 pulpo-mds01 pulpo-osd01 pulpo-osd02 pulpo-osd03
ceph-deploy forgetkeys
# More cleanups
ansible -m command -a "yum erase -y libcephfs2 python-cephfs librados2 python-rados librbd1 python-rbd librgw2 python-rgw" all
ansible -m shell -a "rm -rf /etc/systemd/system/ceph*.target.wants" all
yum erase -y ceph-deploy
rm -f ceph*
We then rebooted all the nodes.
Installing Luminous packages
We use a simple Ansible playbook to install the Ceph v12.2 (Luminous) packages on all nodes in Pulpos, performing the following tasks (a rough shell sketch of these tasks is given after the list):
1) Add a Yum repository for Luminous (/etc/yum.repos.d/ceph.repo) on all the nodes:
[Ceph]
name=Ceph packages for $basearch
baseurl=https://download.ceph.com/rpm-luminous/el7/$basearch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
priority=2
[Ceph-noarch]
name=Ceph noarch packages
baseurl=https://download.ceph.com/rpm-luminous/el7/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
priority=2
[ceph-source]
name=Ceph source packages
baseurl=https://download.ceph.com/rpm-luminous/el7/SRPMS
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
priority=2
2) Install the Ceph RPM packages on all the nodes;
3) Install ceph-fuse on the client node (pulpo-dtn);
4) Install ceph-deploy on the admin node (pulpo-admin).
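Here is the rough shell sketch of those four tasks promised above (the actual playbook is ordinary Ansible YAML and is not reproduced here; the host patterns and the local ceph.repo source file are our assumptions):
# 1) distribute the Yum repo file to all the nodes
ansible -m copy -a "src=ceph.repo dest=/etc/yum.repos.d/ceph.repo" all
# 2) install the Ceph RPM packages on all the nodes
ansible -m command -a "yum install -y ceph" all
# 3) install ceph-fuse on the client node
ansible -m command -a "yum install -y ceph-fuse" pulpo-dtn.local
# 4) install ceph-deploy on the admin node (run locally on pulpo-admin)
yum install -y ceph-deploy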
Let’s verify that Luminous is installed on all the nodes:
[root@pulpo-admin ~]# ansible -m command -a "ceph --version" all
pulpo-admin.local | SUCCESS | rc=0 >>
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
pulpo-mon01.local | SUCCESS | rc=0 >>
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
pulpo-dtn.local | SUCCESS | rc=0 >>
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
pulpo-mds01.local | SUCCESS | rc=0 >>
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
pulpo-osd01.local | SUCCESS | rc=0 >>
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
pulpo-osd02.local | SUCCESS | rc=0 >>
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
pulpo-osd03.local | SUCCESS | rc=0 >>
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
Ceph-Deploy
We use ceph-deploy to deploy Luminous on the Pulpos cluster. ceph-deploy is a quick and easy tool for setting up and tearing down a Ceph cluster. It uses SSH to gain access to the other Ceph nodes from the admin node (pulpo-admin), then uses Python scripts to automate the manual process of installing Ceph on each node. One can also use a generic deployment system, such as Puppet, Chef or Ansible, to deploy Ceph. I am particularly interested in ceph-ansible, the Ansible playbook for Ceph, and may try it in the near future.
1) We use the directory /root/Pulpos on the admin node to maintain the configuration files and keys:
[root@pulpo-admin ~]# cd ~/Pulpos/
2) Create a cluster, with pulpo-mon01 as the initial monitor node (we’ll add 2 more monitors shortly):
[root@pulpo-admin Pulpos]# ceph-deploy new pulpo-mon01
which generates ceph.conf & ceph.mon.keyring in the directory.
3) Append the following 2 lines to ceph.conf:
public_network = 128.114.86.0/24
cluster_network = 192.168.40.0/24
The public_network is 10 Gb/s and the cluster_network is 40 Gb/s (see Pulpos Networks).
4) Append the following 2 lines to ceph.conf (to allow deletion of pools):
[mon]
mon_allow_pool_delete = true
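For reference, both appends (steps 3 and 4) can be made with a single heredoc from the working directory; a minimal sketch, assuming ceph.conf still ends inside the [global] section written by ceph-deploy new:
cd ~/Pulpos
cat >> ceph.conf <<'EOF'
public_network = 128.114.86.0/24
cluster_network = 192.168.40.0/24

[mon]
mon_allow_pool_delete = true
EOF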
5) Deploy the initial monitor(s) and gather the keys:
[root@pulpo-admin Pulpos]# ceph-deploy mon create-initial
which generates ceph.client.admin.keyring, ceph.bootstrap-osd.keyring, ceph.bootstrap-mds.keyring & ceph.bootstrap-rgw.keyring in the directory.
6) Copy the configuration file and admin key to all the nodes
[root@pulpo-admin Pulpos]# ceph-deploy admin pulpo-admin pulpo-dtn pulpo-mon01 pulpo-mds01 pulpo-osd01 pulpo-osd02 pulpo-osd03
which copies ceph.client.admin.keyring & ceph.conf to the directory /etc/ceph on all the nodes.
7) Add 2 more monitors, on pulpo-mds01 & pulpo-admin, respectively:
[root@pulpo-admin Pulpos]# ceph-deploy mon add pulpo-mds01
[root@pulpo-admin Pulpos]# ceph-deploy mon add pulpo-admin
It seems that we can only add one monitor at a time.
8) Deploy a manager daemon on each of the monitor nodes:
[root@pulpo-admin Pulpos]# ceph-deploy mgr create pulpo-mon01 pulpo-mds01 pulpo-admin
ceph-mgr is a new daemon introduced in Luminous, and is a required part of any Luminous deployment.
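Once deployed, the managers can be inspected with the standard CLI, for example:
ceph mgr dump        # shows the active mgr and any standbys
ceph mgr module ls   # lists enabled and available mgr modules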
Adding OSDs
1) List the disks on the OSD nodes:
[root@pulpo-admin Pulpos]# ssh pulpo-osd01 ceph-disk list
or
[root@pulpo-admin Pulpos]# ssh pulpo-osd01 lsblk
2) We use the following Bash script (zap-disks.sh) to zap the disks on the OSD nodes (Caution: device names can and do change!):
#!/bin/bash
for i in {1..3}
do
for x in {a..l}
do
# zap the twelve 8TB SATA drives
ceph-deploy disk zap pulpo-osd0${i}:sd${x}
done
for j in {0..1}
do
# zap the two NVMe SSDs
ceph-deploy disk zap pulpo-osd0${i}:nvme${j}n1
done
done
3) We then use the following Bash script (create-osd.sh) to create OSDs on the OSD nodes:
#!/bin/bash
### HDDs
for i in {1..3}
do
for x in {a..l}
do
ceph-deploy osd prepare --bluestore --block-db /dev/nvme0n1 --block-wal /dev/nvme0n1 pulpo-osd0${i}:sd${x}
ceph-deploy osd activate pulpo-osd0${i}:sd${x}1
sleep 10
done
done
### NVMe
for i in {1..3}
do
ceph-deploy osd prepare --bluestore pulpo-osd0${i}:nvme1n1
ceph-deploy osd activate pulpo-osd0${i}:nvme1n1p1
sleep 10
done
The goals were to (on each of the OSD nodes):
- Create an OSD on each of the twelve 8TB SATA HDDs, using the new BlueStore backend;
- Use a partition on the first NVMe SSD (/dev/nvme0n1) as the WAL device and another partition as the DB device for each of the OSDs on the HDDs;
- Create an OSD on the second NVMe SSD (/dev/nvme1n1), also using the BlueStore backend.
Let’s verify we have achieved our goals:
[root@pulpo-admin Pulpos]# ssh pulpo-osd01 ceph-disk list
/dev/nvme0n1 :
/dev/nvme0n1p1 ceph block.db, for /dev/sda1
/dev/nvme0n1p10 ceph block.wal, for /dev/sde1
/dev/nvme0n1p11 ceph block.db, for /dev/sdf1
/dev/nvme0n1p12 ceph block.wal, for /dev/sdf1
/dev/nvme0n1p13 ceph block.db, for /dev/sdg1
/dev/nvme0n1p14 ceph block.wal, for /dev/sdg1
/dev/nvme0n1p15 ceph block.db, for /dev/sdh1
/dev/nvme0n1p16 ceph block.wal, for /dev/sdh1
/dev/nvme0n1p17 ceph block.db, for /dev/sdi1
/dev/nvme0n1p18 ceph block.wal, for /dev/sdi1
/dev/nvme0n1p19 ceph block.db, for /dev/sdj1
/dev/nvme0n1p2 ceph block.wal, for /dev/sda1
/dev/nvme0n1p20 ceph block.wal, for /dev/sdj1
/dev/nvme0n1p21 ceph block.db, for /dev/sdk1
/dev/nvme0n1p22 ceph block.wal, for /dev/sdk1
/dev/nvme0n1p23 ceph block.db, for /dev/sdl1
/dev/nvme0n1p24 ceph block.wal, for /dev/sdl1
/dev/nvme0n1p3 ceph block.db, for /dev/sdb1
/dev/nvme0n1p4 ceph block.wal, for /dev/sdb1
/dev/nvme0n1p5 ceph block.db, for /dev/sdc1
/dev/nvme0n1p6 ceph block.wal, for /dev/sdc1
/dev/nvme0n1p7 ceph block.db, for /dev/sdd1
/dev/nvme0n1p8 ceph block.wal, for /dev/sdd1
/dev/nvme0n1p9 ceph block.db, for /dev/sde1
/dev/nvme1n1 :
/dev/nvme1n1p1 ceph data, active, cluster ceph, osd.36, block /dev/nvme1n1p2
/dev/nvme1n1p2 ceph block, for /dev/nvme1n1p1
/dev/sda :
/dev/sda1 ceph data, active, cluster ceph, osd.0, block /dev/sda2, block.db /dev/nvme0n1p1, block.wal /dev/nvme0n1p2
/dev/sda2 ceph block, for /dev/sda1
/dev/sdb :
/dev/sdb1 ceph data, active, cluster ceph, osd.1, block /dev/sdb2, block.db /dev/nvme0n1p3, block.wal /dev/nvme0n1p4
/dev/sdb2 ceph block, for /dev/sdb1
/dev/sdc :
/dev/sdc1 ceph data, active, cluster ceph, osd.2, block /dev/sdc2, block.db /dev/nvme0n1p5, block.wal /dev/nvme0n1p6
/dev/sdc2 ceph block, for /dev/sdc1
/dev/sdd :
/dev/sdd1 ceph data, active, cluster ceph, osd.3, block /dev/sdd2, block.db /dev/nvme0n1p7, block.wal /dev/nvme0n1p8
/dev/sdd2 ceph block, for /dev/sdd1
/dev/sde :
/dev/sde1 ceph data, active, cluster ceph, osd.4, block /dev/sde2, block.db /dev/nvme0n1p9, block.wal /dev/nvme0n1p10
/dev/sde2 ceph block, for /dev/sde1
/dev/sdf :
/dev/sdf1 ceph data, active, cluster ceph, osd.5, block /dev/sdf2, block.db /dev/nvme0n1p11, block.wal /dev/nvme0n1p12
/dev/sdf2 ceph block, for /dev/sdf1
/dev/sdg :
/dev/sdg1 ceph data, active, cluster ceph, osd.6, block /dev/sdg2, block.db /dev/nvme0n1p13, block.wal /dev/nvme0n1p14
/dev/sdg2 ceph block, for /dev/sdg1
/dev/sdh :
/dev/sdh1 ceph data, active, cluster ceph, osd.7, block /dev/sdh2, block.db /dev/nvme0n1p15, block.wal /dev/nvme0n1p16
/dev/sdh2 ceph block, for /dev/sdh1
/dev/sdi :
/dev/sdi1 ceph data, active, cluster ceph, osd.8, block /dev/sdi2, block.db /dev/nvme0n1p17, block.wal /dev/nvme0n1p18
/dev/sdi2 ceph block, for /dev/sdi1
/dev/sdj :
/dev/sdj1 ceph data, active, cluster ceph, osd.9, block /dev/sdj2, block.db /dev/nvme0n1p19, block.wal /dev/nvme0n1p20
/dev/sdj2 ceph block, for /dev/sdj1
/dev/sdk :
/dev/sdk1 ceph data, active, cluster ceph, osd.10, block /dev/sdk2, block.db /dev/nvme0n1p21, block.wal /dev/nvme0n1p22
/dev/sdk2 ceph block, for /dev/sdk1
/dev/sdl :
/dev/sdl1 ceph data, active, cluster ceph, osd.11, block /dev/sdl2, block.db /dev/nvme0n1p23, block.wal /dev/nvme0n1p24
/dev/sdl2 ceph block, for /dev/sdl1
Each DB partition is only 1GB in size and each WAL partition is only 576MB. So there is plenty of space left on the first NVMe SSD (the total capacity is 1.1TB). We may create a new partition there to benchmark the NVMe SSD in the near future.
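Had we wanted differently sized DB/WAL partitions, the sizes that ceph-disk carves out can be controlled from ceph.conf before the OSDs are prepared. A minimal sketch (we did not do this; the values are purely illustrative):
# append BlueStore sizing options to ceph.conf, then push the config to the OSD nodes
cat >> ceph.conf <<'EOF'

[osd]
bluestore_block_db_size = 32212254720
bluestore_block_wal_size = 2147483648
EOF
ceph-deploy --overwrite-conf config push pulpo-osd01 pulpo-osd02 pulpo-osd03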
Let’s check the cluster status:
[root@pulpo-admin Pulpos]# ceph -s
cluster:
id: e18516bf-39cb-4670-9f13-88ccb7d19769
health: HEALTH_OK
services:
mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
mgr: pulpo-mon01(active), standbys: pulpo-mds01, pulpo-admin
osd: 39 osds: 39 up, 39 in
data:
pools: 0 pools, 0 pgs
objects: 0 objects, 0 bytes
usage: 0 kB used, 0 kB / 0 kB avail
pgs:
CRUSH device class
One nice new feature introduced in Luminous is CRUSH device class.
[root@pulpo-admin Pulpos]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 265.29291 root default
-3 88.43097 host pulpo-osd01
0 hdd 7.27829 osd.0 up 1.00000 1.00000
1 hdd 7.27829 osd.1 up 1.00000 1.00000
2 hdd 7.27829 osd.2 up 1.00000 1.00000
3 hdd 7.27829 osd.3 up 1.00000 1.00000
4 hdd 7.27829 osd.4 up 1.00000 1.00000
5 hdd 7.27829 osd.5 up 1.00000 1.00000
6 hdd 7.27829 osd.6 up 1.00000 1.00000
7 hdd 7.27829 osd.7 up 1.00000 1.00000
8 hdd 7.27829 osd.8 up 1.00000 1.00000
9 hdd 7.27829 osd.9 up 1.00000 1.00000
10 hdd 7.27829 osd.10 up 1.00000 1.00000
11 hdd 7.27829 osd.11 up 1.00000 1.00000
36 nvme 1.09149 osd.36 up 1.00000 1.00000
-5 88.43097 host pulpo-osd02
12 hdd 7.27829 osd.12 up 1.00000 1.00000
13 hdd 7.27829 osd.13 up 1.00000 1.00000
14 hdd 7.27829 osd.14 up 1.00000 1.00000
15 hdd 7.27829 osd.15 up 1.00000 1.00000
16 hdd 7.27829 osd.16 up 1.00000 1.00000
17 hdd 7.27829 osd.17 up 1.00000 1.00000
18 hdd 7.27829 osd.18 up 1.00000 1.00000
19 hdd 7.27829 osd.19 up 1.00000 1.00000
20 hdd 7.27829 osd.20 up 1.00000 1.00000
21 hdd 7.27829 osd.21 up 1.00000 1.00000
22 hdd 7.27829 osd.22 up 1.00000 1.00000
23 hdd 7.27829 osd.23 up 1.00000 1.00000
37 nvme 1.09149 osd.37 up 1.00000 1.00000
-7 88.43097 host pulpo-osd03
24 hdd 7.27829 osd.24 up 1.00000 1.00000
25 hdd 7.27829 osd.25 up 1.00000 1.00000
26 hdd 7.27829 osd.26 up 1.00000 1.00000
27 hdd 7.27829 osd.27 up 1.00000 1.00000
28 hdd 7.27829 osd.28 up 1.00000 1.00000
29 hdd 7.27829 osd.29 up 1.00000 1.00000
30 hdd 7.27829 osd.30 up 1.00000 1.00000
31 hdd 7.27829 osd.31 up 1.00000 1.00000
32 hdd 7.27829 osd.32 up 1.00000 1.00000
33 hdd 7.27829 osd.33 up 1.00000 1.00000
34 hdd 7.27829 osd.34 up 1.00000 1.00000
35 hdd 7.27829 osd.35 up 1.00000 1.00000
38 nvme 1.09149 osd.38 up 1.00000 1.00000
Luminous automatically associates the OSDs backed by HDDs with the hdd device class, and the OSDs backed by NVMe SSDs with the nvme device class. So we no longer need to manually modify the CRUSH map (as in Kraken and earlier Ceph releases) in order to place different pools on different OSDs!
[root@pulpo-admin Pulpos]# ceph osd crush class ls
[
"hdd",
"nvme"
]
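The automatic classification can also be overridden by hand, should an OSD ever be misdetected; a sketch (osd.36 and the ssd class are arbitrary examples, not something we needed to do):
ceph osd crush rm-device-class osd.36        # a class must be cleared before it can be changed
ceph osd crush set-device-class ssd osd.36   # assign the desired device class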
Adding an MDS
Add a Metadata server:
[root@pulpo-admin Pulpos]# ceph-deploy mds create pulpo-mds01
The goal is to create a Ceph Filesystem (CephFS), using 3 RADOS pools:
- an Erasure Code data pool on the OSDs backed by HDDs
- a replicated metadata pool on the OSDs backed by HDDs
- a replicated pool on the OSDs backed by NVMes, as the cache tier of the Erasure Code data pool
Creating an Erasure Code data pool
The default erasure code profile sustains the loss of a single OSD.
[root@pulpo-admin Pulpos]# ceph osd erasure-code-profile ls
default
[root@pulpo-admin Pulpos]# ceph osd erasure-code-profile get default
k=2
m=1
plugin=jerasure
technique=reed_sol_van
Let’s create a new erasure code profile, pulpo_ec:
[root@pulpo-admin Pulpos]# ceph osd erasure-code-profile set pulpo_ec k=2 m=1 crush-device-class=hdd plugin=jerasure technique=reed_sol_van
[root@pulpo-admin Pulpos]# ceph osd erasure-code-profile ls
default
pulpo_ec
[root@pulpo-admin Pulpos]# ceph osd erasure-code-profile get pulpo_ec
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=2
m=1
plugin=jerasure
technique=reed_sol_van
w=8
The important parameter is crush-device-class=hdd, which sets hdd as the device class for the profile. So a pool created with this profile will only use the OSDs backed by HDDs.
Create the Erasure Code data pool for CephFS, with the pulpo_ec profile:
[root@pulpo-admin Pulpos]# ceph osd pool create cephfs_data 1024 1024 erasure pulpo_ec
pool 'cephfs_data' created
which also generates a new CRUSH rule with the same name, cephfs_data. We note in passing a terminology change: what was called a CRUSH ruleset in Kraken and earlier is now called a CRUSH rule, and the parameter crush_ruleset in the old ceph commands has been replaced by crush_rule!
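In concrete terms (generic placeholders, shown only to illustrate the renamed parameter):
# Kraken and earlier:
#   ceph osd pool set <pool> crush_ruleset <ruleset-id>
# Luminous:
ceph osd pool set <pool> crush_rule <rule-name>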
Let’s check the CRUSH rule for pool cephfs_data:
[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_data crush_rule
crush_rule: cephfs_data
[root@pulpo-admin Pulpos]# ceph osd dump | grep "^pool" | grep "crush_rule 1"
pool 1 'cephfs_data' erasure size 3 min_size 3 crush_rule 1 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 160 flags hashpspool stripe_width 8192
[root@pulpo-admin Pulpos]# ceph osd crush rule ls
replicated_rule
cephfs_data
[root@pulpo-admin Pulpos]# ceph osd crush rule dump cephfs_data
{
"rule_id": 1,
"rule_name": "cephfs_data",
"ruleset": 1,
"type": 3,
"min_size": 3,
"max_size": 3,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 5
},
{
"op": "set_choose_tries",
"num": 100
},
{
"op": "take",
"item": -2,
"item_name": "default~hdd"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
By default, erasure-coded pools only work with workloads like RGW that perform full object writes and appends. A new feature introduced in Luminous allows partial writes for an erasure-coded pool, which may be enabled with a per-pool setting. This lets RBD and CephFS store their data in an erasure-coded pool! Let’s enable overwrites for the pool cephfs_data:
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_data allow_ec_overwrites true
set pool 1 allow_ec_overwrites to true
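As a double check, the overwrite flag should now appear among the pool’s flags in the OSD map (we expect it to be listed as ec_overwrites):
ceph osd dump | grep "^pool 1 'cephfs_data'"   # look for ec_overwrites among the flags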
Creating a replicated metadata pool
As stated earlier, the goal is to create a replicated metadata pool for CephFS on the OSDs backed by HDDs.
However, the default CRUSH rule for replicated pools, replicated_rule, will use all OSDs, no matter whether they are backed by HDDs or by NVMe SSDs:
[root@pulpo-admin Pulpos]# ceph osd crush rule dump replicated_rule
{
"rule_id": 0,
"rule_name": "replicated_rule",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
Here is the syntax for creating a new replication rule:
osd crush rule create-replicated <name> <root> <type> {<class>}
    create crush rule <name> for replicated pool to start from <root>,
    replicate across buckets of type <type>, using a choose mode of
    <firstn|indep> (default firstn; indep best for erasure pools)
Let’s create a new replication rule, pulpo_hdd, that targets the hdd device class (the root is default and the bucket type is host):
[root@pulpo-admin Pulpos]# ceph osd crush rule create-replicated pulpo_hdd default host hdd
Check the rule:
[root@pulpo-admin Pulpos]# ceph osd crush rule dump pulpo_hdd
{
"rule_id": 2,
"rule_name": "pulpo_hdd",
"ruleset": 2,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -2,
"item_name": "default~hdd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
We can now create the metadata pool using the CRUSH rule pulpo_hdd:
[root@pulpo-admin Pulpos]# ceph osd pool create cephfs_metadata 1024 1024 replicated pulpo_hdd
pool 'cephfs_metadata' created
Let’s verify that pool cephfs_metadata indeed uses rule pulpo_hdd:
[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_metadata crush_rule
crush_rule: pulpo_hdd
We note in passing that because we have enabled overwrites for the Erasure Code data pool, we could create a CephFS at this point:
# ceph fs new pulpos cephfs_metadata cephfs_data
which is a marked improvement over Kraken. We, however, will delay the creation of CephFS until we’ve added a cache tier to the data pool.
Adding Cache Tiering to the data pool
The goal is to create a replicated pool on the OSDs backed by the NVMes, as the cache tier of the Erasure Code data pool for the CephFS.
1) Create a new replication rule, pulpo_nvme, that targets the nvme device class (the root is default and the bucket type is host):
[root@pulpo-admin Pulpos]# ceph osd crush rule create-replicated pulpo_nvme default host nvme
Check the rule:
[root@pulpo-admin ~]# ceph osd crush rule ls
replicated_rule
cephfs_data
pulpo_hdd
pulpo_nvme
[root@pulpo-admin ~]# ceph osd crush rule dump pulpo_nvme
{
"rule_id": 3,
"rule_name": "pulpo_nvme",
"ruleset": 3,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -12,
"item_name": "default~nvme"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
2) Create the replicated cache pool:
[root@pulpo-admin Pulpos]# ceph osd pool create cephfs_cache 128 128 replicated pulpo_nvme
pool 'cephfs_cache' created
By default, the replication size is 3, but 2 is sufficient for the cache pool.
[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_cache size
size: 3
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_cache size 2
set pool 3 size to 2
One can list all the placement groups of the cache pool (pool 3):
# ceph pg dump | grep '^3\.'
and get the placement group map for a particular placement group:
# ceph pg map 3.5c
osdmap e173 pg 3.5c (3.5c) -> up [36,38] acting [36,38]
3) Create the cache tier:
[root@pulpo-admin Pulpos]# ceph osd tier add cephfs_data cephfs_cache
pool 'cephfs_cache' is now (or already was) a tier of 'cephfs_data'
[root@pulpo-admin Pulpos]# ceph osd tier cache-mode cephfs_cache writeback
set cache-mode for pool 'cephfs_cache' to writeback
[root@pulpo-admin Pulpos]# ceph osd tier set-overlay cephfs_data cephfs_cache
overlay for 'cephfs_data' is now (or already was) 'cephfs_cache'
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_cache hit_set_type bloom
set pool 3 hit_set_type to bloom
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_cache hit_set_count 12
set pool 3 hit_set_count to 12
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_cache hit_set_period 14400
set pool 3 hit_set_period to 14400
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_cache target_max_bytes 1099511627776
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_cache target_max_objects 1000000
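We left the flushing and eviction behavior of the cache tier at its defaults. If needed, those thresholds can be tuned per pool; a sketch with illustrative values (not settings we applied):
ceph osd pool set cephfs_cache cache_target_dirty_ratio 0.4   # begin flushing at 40% dirty
ceph osd pool set cephfs_cache cache_target_full_ratio 0.8    # begin evicting at 80% full
ceph osd pool set cephfs_cache cache_min_flush_age 600        # minimum object age (seconds) before flushing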
Creating CephFS
Now we are ready to create the Ceph Filesystem:
[root@pulpo-admin Pulpos]# ceph fs new pulpos cephfs_metadata cephfs_data
new fs with metadata pool 2 and data pool 1
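A quick sanity check that the new filesystem and its MDS are in place (both are standard ceph subcommands):
ceph fs ls      # should list pulpos with cephfs_metadata and cephfs_data
ceph mds stat   # should show pulpo-mds01 as the active MDS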
A serious bug!
Unfortunately, there is a serious bug lurking in the current version of Luminous (v12.2.0)! If we check the status of the Ceph cluster, we are told that all placement groups are both inactive and unclean!
[root@pulpo-admin Pulpos]# ceph -s
cluster:
id: e18516bf-39cb-4670-9f13-88ccb7d19769
health: HEALTH_WARN
Reduced data availability: 2176 pgs inactive
Degraded data redundancy: 2176 pgs unclean
services:
mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
mgr: pulpo-mon01(active), standbys: pulpo-mds01, pulpo-admin
mds: pulpos-1/1/1 up {0=pulpo-mds01=up:active}
osd: 39 osds: 39 up, 39 in
data:
pools: 3 pools, 2176 pgs
objects: 0 objects, 0 bytes
usage: 0 kB used, 0 kB / 0 kB avail
pgs: 100.000% pgs unknown
2176 unknown
Same with ceph health:
[root@pulpo-admin Pulpos]# ceph health detail
HEALTH_WARN Reduced data availability: 2176 pgs inactive; Degraded data redundancy: 2176 pgs unclean
PG_AVAILABILITY Reduced data availability: 2176 pgs inactive
pg 1.31d is stuck inactive for 65861.646865, current state unknown, last acting []
pg 1.31e is stuck inactive for 65861.646865, current state unknown, last acting []
pg 1.31f is stuck inactive for 65861.646865, current state unknown, last acting []
...
PG_DEGRADED Degraded data redundancy: 2176 pgs unclean
pg 1.31d is stuck unclean for 65861.646865, current state unknown, last acting []
pg 1.31e is stuck unclean for 65861.646865, current state unknown, last acting []
pg 1.31f is stuck unclean for 65861.646865, current state unknown, last acting []
...
However, if we query any placement group that is supposedly inactive and unclean, we find it to be actually both active and clean. Take, for example, pg 1.31d:
[root@pulpo-admin Pulpos]# ceph pg 1.31d query
{
"state": "active+clean",
"snap_trimq": "[]",
"epoch": 180,
"up": [
17,
25,
6
],
"acting": [
17,
25,
6
],
...
}
We hope this bug will be fixed soon!
Mounting CephFS on clients
There are 2 ways to mount CephFS on a client: using either the kernel CephFS driver or ceph-fuse. The FUSE client is the easiest way to get up-to-date code, while the kernel client will often give better performance.
On a client, e.g., pulpo-dtn, create the mount point:
[root@pulpo-dtn ~]# mkdir /mnt/pulpos
Kernel CephFS driver
The Ceph Storage Cluster runs with authentication turned on by default. We need a file containing the secret key (i.e., not the keyring itself).
0) Create the secret file and save it as /etc/ceph/admin.secret.
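One way to create it (a sketch): extract the client.admin key from the keyring that was copied to the client earlier, and protect the file:
ceph auth get-key client.admin > /etc/ceph/admin.secret
chmod 600 /etc/ceph/admin.secret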
1) We can use the mount command to mount CephFS with the kernel driver:
[root@pulpo-dtn ~]# mount -t ceph 128.114.86.4:6789:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret
or more redundantly
[root@pulpo-dtn ~]# mount -t ceph 128.114.86.4:6789,128.114.86.5:6789,128.114.86.2:6789:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret
With this method, we need to specify the monitor host IP address(es) and port number(s).
2) Or we can use the simple helper mount.ceph, which resolves monitor hostname(s) into IP address(es):
[root@pulpo-dtn ~]# mount.ceph pulpo-mon01:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret
or more redundantly
[root@pulpo-dtn ~]# mount.ceph pulpo-mon01,pulpo-mds01,pulpo-admin:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret
3) To mount CephFS automatically on startup, we can add the following to /etc/fstab:
128.114.86.4:6789,128.114.86.5:6789,128.114.86.2:6789:/ /mnt/pulpos ceph name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev 0 2
And here is another bug in the current version of Luminous (v12.2.0): when CephFS is mounted, the mount point doesn’t show up in the output of df; and although we can list the mount point explicitly with df -h /mnt/pulpos, the size of the filesystem is reported as 0!
[root@pulpo-dtn ~]# df -h /mnt/pulpos
Filesystem Size Used Avail Use% Mounted on
128.114.86.4,128.114.86.5,128.114.86.2:/ 0 0 0 - /mnt/pulpos
Nonetheless, we can read and write to the CephFS just fine!
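For instance, a crude read/write sanity check (illustrative commands, not a benchmark):
dd if=/dev/zero of=/mnt/pulpos/ddtest.img bs=1M count=1024   # write a 1 GiB file
md5sum /mnt/pulpos/ddtest.img                                # read it back
rm -f /mnt/pulpos/ddtest.img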
ceph-fuse
Make sure the ceph-fuse package is installed. We’ve already installed the package on pulpo-dtn, using Ansible.
cephx authentication is on by default. Ensure that the client host has a copy of the Ceph configuration file and a keyring with CAPS for the Ceph Metadata Server. pulpo-dtn already has copies of these 2 files. NOTE: ceph-fuse uses the keyring rather than a secret file for authentication!
Then we can use the ceph-fuse command to mount CephFS as a FUSE (Filesystem in Userspace) on pulpo-dtn:
[root@pulpo-dtn ~]# ceph-fuse -m 128.114.86.4:6789 /mnt/pulpos
ceph-fuse[3699]: starting ceph client2017-09-05 11:13:45.398103 7f8891150040 -1 init, newargv = 0x7f889b82ee40 newargc=9
ceph-fuse[3699]: starting fuse
or more redundantly:
[root@pulpo-dtn ~]# ceph-fuse -m pulpo-mon01:6789,pulpo-mds01:6789,pulpo-admin:6789 /mnt/pulpos
There are 2 options to automate mounting ceph-fuse: fstab or systemd.
1) We can add the following to /etc/fstab (see http://docs.ceph.com/docs/luminous/cephfs/fstab/#fuse):
none /mnt/pulpos fuse.ceph ceph.id=admin,defaults,_netdev 0 0
2) The ceph-fuse@.service and ceph-fuse.target systemd units are available. To mount CephFS as a FUSE on /mnt/pulpos using systemctl:
[root@pulpo-dtn ~]# systemctl start ceph-fuse@/mnt/pulpos.service
To create a persistent mount point:
[root@pulpo-dtn ~]# systemctl enable ceph-fuse.target
Created symlink from /etc/systemd/system/remote-fs.target.wants/ceph-fuse.target to /usr/lib/systemd/system/ceph-fuse.target.
Created symlink from /etc/systemd/system/ceph.target.wants/ceph-fuse.target to /usr/lib/systemd/system/ceph-fuse.target.
[root@pulpo-dtn ~]# systemctl enable ceph-fuse@-mnt-pulpos
Created symlink from /etc/systemd/system/ceph-fuse.target.wants/ceph-fuse@-mnt-pulpos.service to /usr/lib/systemd/system/ceph-fuse@.service.
NOTE: here the command must be systemctl enable ceph-fuse@-mnt-pulpos. If we run systemctl enable ceph-fuse@/mnt/pulpos instead, we’ll get the error “Failed to execute operation: Unit name pulpos is not valid.” However, when starting the service, we can run either systemctl start ceph-fuse@/mnt/pulpos or systemctl start ceph-fuse@-mnt-pulpos!
Lastly, we note the same bug in the current version of Luminous (v12.2.0): when CephFS is mounted using ceph-fuse, the mount point doesn’t show up in the output of df; and although we can list the mount point explicitly with df -h /mnt/pulpos, the size of the filesystem is reported as 0!
[root@pulpo-dtn ~]# df -h /mnt/pulpos
Filesystem Size Used Avail Use% Mounted on
ceph-fuse 0 0 0 - /mnt/pulpos