Installing Ceph v12.2.1 (Luminous) on Pulpos
October 9, 2017 | Ceph Storage Provisioning
In this post, we describe how we cleanly installed Ceph v12.2.1 (codename Luminous) on the Pulpos cluster.
We installed Ceph v12.2.0 on Pulpos in late August. We created a Ceph Filesystem (CephFS), using 3 RADOS pools:
- an Erasure Code data pool on the OSDs backed by HDDs
- a replicated metadata pool on the OSDs backed by HDDs
- a replicated pool on the OSDs backed by NVMes, as the cache tier of the Erasure Code data pool
It worked remarkably well. However, the cache tier added a lot of complexity to the architecture without adding much to performance! So when Ceph v12.2.1 was released on September 28, 2017, I decided to wipe the slate clean and create from scratch a much simpler Ceph Filesystem, using just 2 RADOS pools:
- a replicated data pool on the OSDs backed by HDDs, using partitions on NVMes as the WAL and DB devices
- a replicated metadata pool on the OSDs backed by NVMes
- Purging old Ceph installation
- Installing Luminous packages
- Ceph-Deploy
- Adding OSDs
- CRUSH device class
- Adding an MDS
- Creating a replicated data pool
- Creating a replicated metadata pool
- Creating CephFS
- Limitations
- Mounting CephFS on clients
- Upgrading to v12.2.2
Purging old Ceph installation
In order to start with a clean slate, we ran the following Bash script (start-over.sh) on the admin node (pulpo-admin) to purge old Ceph packages, and erase old Ceph data and configuration:
#!/bin/bash
cd ~/Pulpos
# Uninstall ceph-fuse on the client
ssh pulpo-dtn "umount /mnt/pulpos; yum erase -y ceph-fuse"
# http://docs.ceph.com/docs/master/start/quick-ceph-deploy/#starting-over
ceph-deploy purge pulpo-admin pulpo-dtn pulpo-mon01 pulpo-mds01 pulpo-osd01 pulpo-osd02 pulpo-osd03
ceph-deploy purgedata pulpo-admin pulpo-dtn pulpo-mon01 pulpo-mds01 pulpo-osd01 pulpo-osd02 pulpo-osd03
ceph-deploy forgetkeys
# More cleanups
ansible -m command -a "yum erase -y libcephfs2 python-cephfs librados2 python-rados librbd1 python-rbd librgw2 python-rgw" all
ansible -m shell -a "rm -rf /etc/systemd/system/ceph*.target.wants" all
yum erase -y ceph-deploy
rm -f ceph*
We then rebooted all the nodes.
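The reboot can be driven from the admin node with the same ad-hoc Ansible pattern used in the script above; a minimal sketch (the host pattern is an assumption based on our inventory names):
# reboot every node except the admin node itself
ansible -m command -a "systemctl reboot" 'all:!pulpo-admin.local'
# then reboot the admin node
systemctl reboot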
Installing Luminous packages
A new kernel and many updated packages have been released in the interim. We first used a simple Ansible playbook to upgrade all packages.
Then we used a simple Ansible playbook to install Ceph v12.2.1 (Luminous) packages on all nodes in Pulpos, performing the following tasks:
1) Add a Yum repository for Luminous (/etc/yum.repos.d/ceph.repo) on all the nodes:
[Ceph]
name=Ceph packages for $basearch
baseurl=https://download.ceph.com/rpm-luminous/el7/$basearch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
priority=2
[Ceph-noarch]
name=Ceph noarch packages
baseurl=https://download.ceph.com/rpm-luminous/el7/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
priority=2
[ceph-source]
name=Ceph source packages
baseurl=https://download.ceph.com/rpm-luminous/el7/SRPMS
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://download.ceph.com/keys/release.asc
priority=2
2) Install Ceph RPM packages on all the nodes;
3) Install ceph-fuse on the client node (pulpo-dtn);
4) Install ceph-deploy on the admin node (pulpo-admin).
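The playbook itself is not reproduced here; as a rough sketch, the same four tasks could be expressed as Ansible ad-hoc commands (the repo file source path and the host patterns below are illustrative assumptions, not the actual playbook):
# push the Yum repo file shown above to every node
ansible -m copy -a "src=ceph.repo dest=/etc/yum.repos.d/ceph.repo" all
# install the Ceph packages everywhere
ansible -m yum -a "name=ceph state=present" all
# ceph-fuse only on the client
ansible -m yum -a "name=ceph-fuse state=present" pulpo-dtn.local
# ceph-deploy only on the admin node
ansible -m yum -a "name=ceph-deploy state=present" pulpo-admin.local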
Let’s verify that Luminous is installed on all the nodes:
[root@pulpo-admin ~]# ansible -m command -a "ceph --version" all
pulpo-dtn.local | SUCCESS | rc=0 >>
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
pulpo-admin.local | SUCCESS | rc=0 >>
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
pulpo-mds01.local | SUCCESS | rc=0 >>
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
pulpo-mon01.local | SUCCESS | rc=0 >>
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
pulpo-osd01.local | SUCCESS | rc=0 >>
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
pulpo-osd02.local | SUCCESS | rc=0 >>
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
pulpo-osd03.local | SUCCESS | rc=0 >>
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
Ceph-Deploy
We use ceph-deploy to deploy Luminous on the Pulpos cluster. ceph-deploy is an easy and quick tool to set up and take down a Ceph cluster. It uses ssh to gain access to the other Ceph nodes from the admin node (pulpo-admin), and then uses the underlying Python scripts to automate the manual process of Ceph installation on each node. One can also use a generic deployment system, such as Puppet, Chef or Ansible, to deploy Ceph. I am particularly interested in ceph-ansible, the Ansible playbook for Ceph, and may try it in the near future.
1) We use the directory /root/Pulpos on the admin node to maintain the configuration files and keys:
[root@pulpo-admin ~]# cd ~/Pulpos/
2) Create a cluster, with pulpo-mon01 as the initial monitor node (we'll add 2 more monitors shortly):
[root@pulpo-admin Pulpos]# ceph-deploy new pulpo-mon01
which generates ceph.conf & ceph.mon.keyring in the directory.
3) Append the following 2 lines to ceph.conf:
public_network = 128.114.86.0/24
cluster_network = 192.168.40.0/24
The public_network is 10 Gb/s and the cluster_network is 40 Gb/s (see Pulpos Networks).
4) Append the following 2 lines to ceph.conf (to allow deletion of pools):
[mon]
mon_allow_pool_delete = true
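Once the monitors are up (after step 5 below), we can double-check that the option has taken effect on a monitor via its admin socket, which should report "true"; for example (assuming the monitor is named mon.pulpo-mon01):
[root@pulpo-mon01 ~]# ceph daemon mon.pulpo-mon01 config get mon_allow_pool_delete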
5) Deploy the initial monitor(s) and gather the keys:
[root@pulpo-admin Pulpos]# ceph-deploy mon create-initial
which generated ceph.client.admin.keyring, ceph.bootstrap-osd.keyring, ceph.bootstrap-mds.keyring & ceph.bootstrap-rgw.keyring in the directory.
6) Copy the configuration file and admin key to all the nodes
[root@pulpo-admin Pulpos]# ceph-deploy admin pulpo-admin pulpo-dtn pulpo-mon01 pulpo-mds01 pulpo-osd01 pulpo-osd02 pulpo-osd03
which copies ceph.client.admin.keyring & ceph.conf to the directory /etc/ceph on all the nodes.
7) Add 2 more monitors, on pulpo-mds01 & pulpo-admin, respectively:
[root@pulpo-admin Pulpos]# ceph-deploy mon add pulpo-mds01
[root@pulpo-admin Pulpos]# ceph-deploy mon add pulpo-admin
It seems that we can only add one at a time.
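A quick way to confirm that all 3 monitors have joined the quorum (just an example check):
[root@pulpo-admin Pulpos]# ceph mon stat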
8) Deploy a manager daemon on each of the monitor nodes:
[root@pulpo-admin Pulpos]# ceph-deploy mgr create pulpo-mon01 pulpo-mds01 pulpo-admin
ceph-mgr is a new daemon introduced in Luminous, and is a required part of any Luminous deployment.
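Besides collecting cluster metrics, ceph-mgr hosts optional modules; for instance, the built-in dashboard can be enabled and located with the following commands (shown here only as an example, not a step we necessarily take on Pulpos):
[root@pulpo-admin Pulpos]# ceph mgr module enable dashboard
[root@pulpo-admin Pulpos]# ceph mgr services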
Adding OSDs
1) List the disks on the OSD nodes:
[root@pulpo-admin Pulpos]# ssh pulpo-osd01 ceph-disk list
or
[root@pulpo-admin Pulpos]# ssh pulpo-osd01 lsblk
2) We use the following Bash script (zap-disks.sh) to zap the disks on the OSD nodes (Caution: device names can and do change!):
#!/bin/bash
for i in {1..3}
do
for x in {a..l}
do
# zap the twelve 8TB SATA drives
ceph-deploy disk zap pulpo-osd0${i}:sd${x}
done
for j in {0..1}
do
# zap the two NVMe SSDs
ceph-deploy disk zap pulpo-osd0${i}:nvme${j}n1
done
done
3) We then use the following Bash script (create-osd.sh) to create OSDs on the OSD nodes:
#!/bin/bash
### HDDs
for i in {1..3}
do
for x in {a..l}
do
ceph-deploy osd prepare --bluestore --block-db /dev/nvme0n1 --block-wal /dev/nvme0n1 pulpo-osd0${i}:sd${x}
ceph-deploy osd activate pulpo-osd0${i}:sd${x}1
sleep 10
done
done
### NVMe
for i in {1..3}
do
ceph-deploy osd prepare --bluestore pulpo-osd0${i}:nvme1n1
ceph-deploy osd activate pulpo-osd0${i}:nvme1n1p1
sleep 10
done
The goals were to (on each of the OSD nodes):
- Create an OSD on each of the 8TB SATA HDDs, using the new bluestore backend;
- Use a partition on the first NVMe SSD (/dev/nvme0n1) as the WAL device and another partition as the DB device for each of the OSDs on the HDDs;
- Create an OSD on the second NVMe SSD (/dev/nvme1n1), using the new bluestore backend.
Let’s verify we have achieved our goals:
[root@pulpo-admin Pulpos]# ssh pulpo-osd01 ceph-disk list
/dev/nvme0n1 :
/dev/nvme0n1p1 ceph block.db, for /dev/sda1
/dev/nvme0n1p10 ceph block.wal, for /dev/sde1
/dev/nvme0n1p11 ceph block.db, for /dev/sdf1
/dev/nvme0n1p12 ceph block.wal, for /dev/sdf1
/dev/nvme0n1p13 ceph block.db, for /dev/sdg1
/dev/nvme0n1p14 ceph block.wal, for /dev/sdg1
/dev/nvme0n1p15 ceph block.db, for /dev/sdh1
/dev/nvme0n1p16 ceph block.wal, for /dev/sdh1
/dev/nvme0n1p17 ceph block.db, for /dev/sdi1
/dev/nvme0n1p18 ceph block.wal, for /dev/sdi1
/dev/nvme0n1p19 ceph block.db, for /dev/sdj1
/dev/nvme0n1p2 ceph block.wal, for /dev/sda1
/dev/nvme0n1p20 ceph block.wal, for /dev/sdj1
/dev/nvme0n1p21 ceph block.db, for /dev/sdk1
/dev/nvme0n1p22 ceph block.wal, for /dev/sdk1
/dev/nvme0n1p23 ceph block.db, for /dev/sdl1
/dev/nvme0n1p24 ceph block.wal, for /dev/sdl1
/dev/nvme0n1p3 ceph block.db, for /dev/sdb1
/dev/nvme0n1p4 ceph block.wal, for /dev/sdb1
/dev/nvme0n1p5 ceph block.db, for /dev/sdc1
/dev/nvme0n1p6 ceph block.wal, for /dev/sdc1
/dev/nvme0n1p7 ceph block.db, for /dev/sdd1
/dev/nvme0n1p8 ceph block.wal, for /dev/sdd1
/dev/nvme0n1p9 ceph block.db, for /dev/sde1
/dev/nvme1n1 :
/dev/nvme1n1p1 ceph data, active, cluster ceph, osd.36, block /dev/nvme1n1p2
/dev/nvme1n1p2 ceph block, for /dev/nvme1n1p1
/dev/sda :
/dev/sda1 ceph data, active, cluster ceph, osd.0, block /dev/sda2, block.db /dev/nvme0n1p1, block.wal /dev/nvme0n1p2
/dev/sda2 ceph block, for /dev/sda1
/dev/sdb :
/dev/sdb1 ceph data, active, cluster ceph, osd.1, block /dev/sdb2, block.db /dev/nvme0n1p3, block.wal /dev/nvme0n1p4
/dev/sdb2 ceph block, for /dev/sdb1
/dev/sdc :
/dev/sdc1 ceph data, active, cluster ceph, osd.2, block /dev/sdc2, block.db /dev/nvme0n1p5, block.wal /dev/nvme0n1p6
/dev/sdc2 ceph block, for /dev/sdc1
/dev/sdd :
/dev/sdd1 ceph data, active, cluster ceph, osd.3, block /dev/sdd2, block.db /dev/nvme0n1p7, block.wal /dev/nvme0n1p8
/dev/sdd2 ceph block, for /dev/sdd1
/dev/sde :
/dev/sde1 ceph data, active, cluster ceph, osd.4, block /dev/sde2, block.db /dev/nvme0n1p9, block.wal /dev/nvme0n1p10
/dev/sde2 ceph block, for /dev/sde1
/dev/sdf :
/dev/sdf1 ceph data, active, cluster ceph, osd.5, block /dev/sdf2, block.db /dev/nvme0n1p11, block.wal /dev/nvme0n1p12
/dev/sdf2 ceph block, for /dev/sdf1
/dev/sdg :
/dev/sdg1 ceph data, active, cluster ceph, osd.6, block /dev/sdg2, block.db /dev/nvme0n1p13, block.wal /dev/nvme0n1p14
/dev/sdg2 ceph block, for /dev/sdg1
/dev/sdh :
/dev/sdh1 ceph data, active, cluster ceph, osd.7, block /dev/sdh2, block.db /dev/nvme0n1p15, block.wal /dev/nvme0n1p16
/dev/sdh2 ceph block, for /dev/sdh1
/dev/sdi :
/dev/sdi1 ceph data, active, cluster ceph, osd.8, block /dev/sdi2, block.db /dev/nvme0n1p17, block.wal /dev/nvme0n1p18
/dev/sdi2 ceph block, for /dev/sdi1
/dev/sdj :
/dev/sdj1 ceph data, active, cluster ceph, osd.9, block /dev/sdj2, block.db /dev/nvme0n1p19, block.wal /dev/nvme0n1p20
/dev/sdj2 ceph block, for /dev/sdj1
/dev/sdk :
/dev/sdk1 ceph data, active, cluster ceph, osd.10, block /dev/sdk2, block.db /dev/nvme0n1p21, block.wal /dev/nvme0n1p22
/dev/sdk2 ceph block, for /dev/sdk1
/dev/sdl :
/dev/sdl1 ceph data, active, cluster ceph, osd.11, block /dev/sdl2, block.db /dev/nvme0n1p23, block.wal /dev/nvme0n1p24
/dev/sdl2 ceph block, for /dev/sdl1
Each DB partition is only 1GB in size and each WAL partition is only 576MB. So there is plenty of space left on the first NVMe SSD (the total capacity is 1.1TB). We may create a new partition there to benchmark the NVMe SSD in the near future.
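Those sizes are simply the ceph-disk defaults. Had we wanted bigger DB/WAL partitions, we could have set the sizes (in bytes) in ceph.conf on the OSD nodes before preparing the OSDs; an illustrative sketch (the values below are examples, not what Pulpos uses):
[osd]
bluestore_block_db_size = 32212254720    # 30 GiB per DB partition (example value)
bluestore_block_wal_size = 2147483648    # 2 GiB per WAL partition (example value)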
Let’s check the cluster status:
[root@pulpo-admin Pulpos]# ceph -s
cluster:
id: 5f675e57-a4dc-4425-ab4e-2e46f605411d
health: HEALTH_OK
services:
mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
mgr: pulpo-mon01(active), standbys: pulpo-mds01, pulpo-admin
osd: 39 osds: 39 up, 39 in
data:
pools: 0 pools, 0 pgs
objects: 0 objects, 0 bytes
usage: 0 kB used, 0 kB / 0 kB avail
pgs:
CRUSH device class
One nice new feature introduced in Luminous is CRUSH device class.
[root@pulpo-admin Pulpos]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 265.29291 root default
-3 88.43097 host pulpo-osd01
0 hdd 7.27829 osd.0 up 1.00000 1.00000
1 hdd 7.27829 osd.1 up 1.00000 1.00000
2 hdd 7.27829 osd.2 up 1.00000 1.00000
3 hdd 7.27829 osd.3 up 1.00000 1.00000
4 hdd 7.27829 osd.4 up 1.00000 1.00000
5 hdd 7.27829 osd.5 up 1.00000 1.00000
6 hdd 7.27829 osd.6 up 1.00000 1.00000
7 hdd 7.27829 osd.7 up 1.00000 1.00000
8 hdd 7.27829 osd.8 up 1.00000 1.00000
9 hdd 7.27829 osd.9 up 1.00000 1.00000
10 hdd 7.27829 osd.10 up 1.00000 1.00000
11 hdd 7.27829 osd.11 up 1.00000 1.00000
36 nvme 1.09149 osd.36 up 1.00000 1.00000
-5 88.43097 host pulpo-osd02
12 hdd 7.27829 osd.12 up 1.00000 1.00000
13 hdd 7.27829 osd.13 up 1.00000 1.00000
14 hdd 7.27829 osd.14 up 1.00000 1.00000
15 hdd 7.27829 osd.15 up 1.00000 1.00000
16 hdd 7.27829 osd.16 up 1.00000 1.00000
17 hdd 7.27829 osd.17 up 1.00000 1.00000
18 hdd 7.27829 osd.18 up 1.00000 1.00000
19 hdd 7.27829 osd.19 up 1.00000 1.00000
20 hdd 7.27829 osd.20 up 1.00000 1.00000
21 hdd 7.27829 osd.21 up 1.00000 1.00000
22 hdd 7.27829 osd.22 up 1.00000 1.00000
23 hdd 7.27829 osd.23 up 1.00000 1.00000
37 nvme 1.09149 osd.37 up 1.00000 1.00000
-7 88.43097 host pulpo-osd03
24 hdd 7.27829 osd.24 up 1.00000 1.00000
25 hdd 7.27829 osd.25 up 1.00000 1.00000
26 hdd 7.27829 osd.26 up 1.00000 1.00000
27 hdd 7.27829 osd.27 up 1.00000 1.00000
28 hdd 7.27829 osd.28 up 1.00000 1.00000
29 hdd 7.27829 osd.29 up 1.00000 1.00000
30 hdd 7.27829 osd.30 up 1.00000 1.00000
31 hdd 7.27829 osd.31 up 1.00000 1.00000
32 hdd 7.27829 osd.32 up 1.00000 1.00000
33 hdd 7.27829 osd.33 up 1.00000 1.00000
34 hdd 7.27829 osd.34 up 1.00000 1.00000
35 hdd 7.27829 osd.35 up 1.00000 1.00000
38 nvme 1.09149 osd.38 up 1.00000 1.00000
Luminous automatically associates the OSDs backed by HDDs with the hdd device class, and the OSDs backed by NVMes with the nvme device class. So we no longer need to manually modify the CRUSH map (as in Kraken and earlier Ceph releases) in order to place different pools on different OSDs!
[root@pulpo-admin Pulpos]# ceph osd crush class ls
[
"hdd",
"nvme"
]
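If the automatic detection ever mislabels a device, the class can be corrected by hand; for example (osd.36 used purely for illustration):
[root@pulpo-admin Pulpos]# ceph osd crush rm-device-class osd.36
[root@pulpo-admin Pulpos]# ceph osd crush set-device-class nvme osd.36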
Adding an MDS
Add a Metadata server:
[root@pulpo-admin Pulpos]# ceph-deploy mds create pulpo-mds01
As alluded to earlier, the goal is to create a Ceph Filesystem (CephFS), using 2 RADOS pools:
- a replicated data pool on the OSDs backed by HDDs
- a replicated metadata pool on the OSDs backed by NVMes
Creating a replicated data pool
However, the default CRUSH rule for replicated pools, replicated_rule, will use all types of OSDs, no matter whether they are backed by HDDs or by NVMes:
[root@pulpo-admin Pulpos]# ceph osd crush rule dump replicated_rule
{
"rule_id": 0,
"rule_name": "replicated_rule",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
Here is the syntax for creating a new replication rule:
osd crush rule create-replicated <name> <root> <type> {<class>}
    create crush rule <name> for replicated pool to start from <root>,
    replicate across buckets of type <type>, using a choose mode of
    <firstn|indep> (default firstn; indep best for erasure pools)
Let’s create a new replication rule, pulpo_hdd, that targets the hdd device class (root is default and bucket type is host):
[root@pulpo-admin Pulpos]# ceph osd crush rule create-replicated pulpo_hdd default host hdd
Check the rule:
[root@pulpo-admin Pulpos]# ceph osd crush rule dump pulpo_hdd
{
"rule_id": 1,
"rule_name": "pulpo_hdd",
"ruleset": 1,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -2,
"item_name": "default~hdd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
We can now create the data pool, cephfs_data, using the CRUSH rule pulpo_hdd:
[root@pulpo-admin Pulpos]# ceph osd pool create cephfs_data 1024 1024 replicated pulpo_hdd
pool 'cephfs_data' created
Here we set the number of placement groups to 1024, which was a bit too low, as we got a warning from ceph -s:
[root@pulpo-admin Pulpos]# ceph -s
cluster:
id: 5f675e57-a4dc-4425-ab4e-2e46f605411d
health: HEALTH_WARN
Reduced data availability: 1152 pgs inactive
Degraded data redundancy: 1152 pgs unclean
too few PGs per OSD (29 < min 30)
services:
mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
mgr: pulpo-mon01(active), standbys: pulpo-mds01, pulpo-admin
mds: pulpos-1/1/1 up {0=pulpo-mds01=up:active}
osd: 39 osds: 39 up, 39 in
data:
pools: 2 pools, 1152 pgs
objects: 0 objects, 0 bytes
usage: 0 kB used, 0 kB / 0 kB avail
pgs: 100.000% pgs unknown
1152 unknown
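A common rule of thumb is to target roughly 100 PGs per OSD, i.e. pg_num ≈ (number of OSDs × 100) / pool size, rounded up to a power of two. With 36 HDD OSDs and the replication size of 2 that we settle on below, that works out to about:
[root@pulpo-admin Pulpos]# echo $(( 36 * 100 / 2 ))
1800
Rounding 1800 up to the next power of two gives 2048.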
Let’s double the PG (Placement Groups) number:
[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_data pg_num
pg_num: 1024
[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_data pgp_num
pgp_num: 1024
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_data pg_num 2048
set pool 1 pg_num to 2048
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_data pgp_num 2048
set pool 1 pgp_num to 2048
Let’s verify that pool cephfs_data indeed uses rule pulpo_hdd:
[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_data crush_rule
crush_rule: pulpo_hdd
By default, the replication size is 3. But 2 is sufficient for the data pool.
[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_data size
size: 3
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_data size 2
set pool 1 size to 2
Creating a replicated metadata pool
1) Create a new replication rule, pulpo_nvme, that targets the nvme device class (root is default and bucket type is host):
[root@pulpo-admin Pulpos]# ceph osd crush rule create-replicated pulpo_nvme default host nvme
Check the rule:
[root@pulpo-admin ~]# ceph osd crush rule ls
replicated_rule
pulpo_hdd
pulpo_nvme
[root@pulpo-admin ~]# ceph osd crush rule dump pulpo_nvme
{
"rule_id": 2,
"rule_name": "pulpo_nvme",
"ruleset": 2,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -12,
"item_name": "default~nvme"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
2) Create the replicated metadata pool:
[root@pulpo-admin Pulpos]# ceph osd pool create cephfs_metadata 128 128 replicated pulpo_nvme
pool 'cephfs_metadata' created
By default, the replication size is 3. But 2 is sufficient for the metadata pool.
[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_metadata size
size: 3
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_metadata size 2
set pool 2 size to 2
Let’s double the PG number:
[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_metadata pg_num
pg_num: 128
[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_metadata pgp_num
pgp_num: 128
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_metadata pg_num 256
set pool 2 pg_num to 256
[root@pulpo-admin Pulpos]# ceph osd pool set cephfs_metadata pgp_num 256
set pool 2 pgp_num to 256
Let’s check the status of our Ceph cluster again:
[root@pulpo-admin Pulpos]# ceph -s
cluster:
id: 5f675e57-a4dc-4425-ab4e-2e46f605411d
health: HEALTH_WARN
Reduced data availability: 2176 pgs inactive
Degraded data redundancy: 2176 pgs unclean
services:
mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
mgr: pulpo-mon01(active), standbys: pulpo-mds01, pulpo-admin
mds: pulpos-1/1/1 up {0=pulpo-mds01=up:active}
osd: 39 osds: 39 up, 39 in
data:
pools: 2 pools, 2304 pgs
objects: 0 objects, 0 bytes
usage: 0 kB used, 0 kB / 0 kB avail
pgs: 100.000% pgs unknown
2304 unknown
One can list all the placement groups of the metadata pool (pool 2):
[root@pulpo-admin Pulpos]# ceph pg dump | grep '^2\.'
and get the placement group map for a particular placement group:
[root@pulpo-admin Pulpos]# ceph pg map 2.1
osdmap e169 pg 2.1 (2.1) -> up [36,38] acting [36,38]
Let’s verify that pool cephfs_metadata indeed uses rule pulpo_nvme:
[root@pulpo-admin Pulpos]# ceph osd pool get cephfs_metadata crush_rule
crush_rule: pulpo_nvme
Creating CephFS
Now we are ready to create the Ceph Filesystem:
[root@pulpo-admin Pulpos]# ceph fs new pulpos cephfs_metadata cephfs_data
new fs with metadata pool 2 and data pool 1
By the way, the default maximum file size is 1TiB:
[root@pulpo-admin ~]# ceph fs get pulpos | grep 1099511627776
max_file_size 1099511627776
Let’s raise it to 2TiB:
[root@pulpo-admin ~]# ceph fs set pulpos max_file_size 2199023255552
[root@pulpo-admin ~]# ceph fs get pulpos | grep max_file_size
max_file_size 2199023255552
Limitations
Unfortunately, there is a serious bug lurking in the current version of Luminous (v12.2.1)! If we check the status of the Ceph cluster, we are told that all placement groups are both inactive and unclean!
[root@pulpo-admin Pulpos]# ceph -s
cluster:
id: 5f675e57-a4dc-4425-ab4e-2e46f605411d
health: HEALTH_WARN
Reduced data availability: 2176 pgs inactive
Degraded data redundancy: 2176 pgs unclean
services:
mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
mgr: pulpo-mon01(active), standbys: pulpo-mds01, pulpo-admin
mds: pulpos-1/1/1 up {0=pulpo-mds01=up:active}
osd: 39 osds: 39 up, 39 in
data:
pools: 2 pools, 2304 pgs
objects: 0 objects, 0 bytes
usage: 0 kB used, 0 kB / 0 kB avail
pgs: 100.000% pgs unknown
2304 unknown
Same with ceph health:
[root@pulpo-admin Pulpos]# ceph health detail
HEALTH_WARN Reduced data availability: 2304 pgs inactive; Degraded data redundancy: 2304 pgs unclean
PG_AVAILABILITY Reduced data availability: 2304 pgs inactive
pg 1.7cd is stuck inactive for 223.224011, current state unknown, last acting []
pg 1.7ce is stuck inactive for 223.224011, current state unknown, last acting []
pg 1.7cf is stuck inactive for 223.224011, current state unknown, last acting []
...
PG_DEGRADED Degraded data redundancy: 2304 pgs unclean
pg 1.7cd is stuck unclean for 223.224011, current state unknown, last acting []
pg 1.7ce is stuck unclean for 223.224011, current state unknown, last acting []
pg 1.7cf is stuck unclean for 223.224011, current state unknown, last acting []
...
However, if we query any placement group that is supposedly inactive and unclean, we find it to be actually both active and clean. Take, for example, pg 1.7cd:
[root@pulpo-admin Pulpos]# ceph pg 1.7cd query
{
"state": "active+clean",
"snap_trimq": "[]",
"epoch": 178,
"up": [
5,
13
],
"acting": [
5,
13
],
...
}
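One can spot-check a handful of the "stuck" PGs in one go; each of them reports active+clean. For example:
[root@pulpo-admin Pulpos]# for pg in 1.7cd 1.7ce 1.7cf; do ceph pg $pg query | grep -m1 '"state"'; done
    "state": "active+clean",
    "state": "active+clean",
    "state": "active+clean",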
Mounting CephFS on clients
There are 2 ways to mount CephFS on a client: using either the kernel CephFS driver or ceph-fuse. The FUSE client is the easiest way to get up-to-date code, while the kernel client will often give better performance.
On a client, e.g., pulpo-dtn, create the mount point:
[root@pulpo-dtn ~]# mkdir /mnt/pulpos
Kernel CephFS driver
The Ceph Storage Cluster runs with authentication turned on by default. We need a file containing the secret key (i.e., not the keyring itself).
0) Create the secret file and save it as /etc/ceph/admin.secret.
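Since pulpo-dtn already has the admin keyring (pushed earlier by ceph-deploy admin), one way to generate the secret file is:
[root@pulpo-dtn ~]# ceph auth get-key client.admin > /etc/ceph/admin.secret
[root@pulpo-dtn ~]# chmod 600 /etc/ceph/admin.secret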
1) We can use the mount command to mount CephFS with the kernel driver:
[root@pulpo-dtn ~]# mount -t ceph 128.114.86.4:6789:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret
or more redundantly
[root@pulpo-dtn ~]# mount -t ceph 128.114.86.4:6789,128.114.86.5:6789,128.114.86.2:6789:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret
With this method, we need to specify the monitor host IP address(es) and port number(s).
2) Or we can use the simple helper mount.ceph, which resolves monitor hostname(s) into IP address(es):
[root@pulpo-dtn ~]# mount.ceph pulpo-mon01:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret
or more redundantly
[root@pulpo-dtn ~]# mount.ceph pulpo-mon01,pulpo-mds01,pulpo-admin:/ /mnt/pulpos -o name=admin,secretfile=/etc/ceph/admin.secret
3) To mount CephFS automatically on startup, we can add the following to /etc/fstab:
128.114.86.4:6789,128.114.86.5:6789,128.114.86.2:6789:/ /mnt/pulpos ceph name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev 0 2
And here is another bug in the current version of Luminous (v12.2.1): when CephFS is mounted, the mount point doesn’t show up in the output of df; and although we can list the mount point specifically with df -h /mnt/pulpos, the size of the filesystem is reported as 0!
[root@pulpo-dtn ~]# df -h /mnt/pulpos
Filesystem Size Used Avail Use% Mounted on
128.114.86.4,128.114.86.5,128.114.86.2:/ 0 0 0 - /mnt/pulpos
Nonetheless, we can read and write to the CephFS just fine!
ceph-fuse
Make sure the ceph-fuse package is installed. We’ve already installed the package on pulpo-dtn, using Ansible.
cephx authentication is on by default. Ensure that the client host has a copy of the Ceph configuration file and a keyring with CAPS for the Ceph metadata server. pulpo-dtn already has a copy of these 2 files. NOTE: ceph-fuse uses the keyring rather than a secret file for authentication!
Then we can use the ceph-fuse command to mount CephFS as a FUSE (Filesystem in Userspace), on pulpo-dtn:
[root@pulpo-dtn ~]# ceph-fuse -m 128.114.86.4:6789 /mnt/pulpos
ceph-fuse[3699]: starting ceph client2017-09-05 11:13:45.398103 7f8891150040 -1 init, newargv = 0x7f889b82ee40 newargc=9
ceph-fuse[3699]: starting fuse
or more redundantly:
[root@pulpo-dtn ~]# ceph-fuse -m pulpo-mon01:6789,pulpo-mds01:6789,pulpo-admin:6789 /mnt/pulpos
There are 2 options to automate mounting ceph-fuse: fstab or systemd.
1) We can add the following to /etc/fstab (see http://docs.ceph.com/docs/luminous/cephfs/fstab/#fuse):
none /mnt/pulpos fuse.ceph ceph.id=admin,defaults,_netdev 0 0
2) The ceph-fuse@.service and ceph-fuse.target systemd units are available. To mount CephFS as a FUSE on /mnt/pulpos using systemctl:
[root@pulpo-dtn ~]# systemctl start ceph-fuse@/mnt/pulpos.service
To create a persistent mount point:
[root@pulpo-dtn ~]# systemctl enable ceph-fuse.target
Created symlink from /etc/systemd/system/remote-fs.target.wants/ceph-fuse.target to /usr/lib/systemd/system/ceph-fuse.target.
Created symlink from /etc/systemd/system/ceph.target.wants/ceph-fuse.target to /usr/lib/systemd/system/ceph-fuse.target.
[root@pulpo-dtn ~]# systemctl enable ceph-fuse@-mnt-pulpos
Created symlink from /etc/systemd/system/ceph-fuse.target.wants/ceph-fuse@-mnt-pulpos.service to /usr/lib/systemd/system/ceph-fuse@.service.
NOTE: here the command must be systemctl enable ceph-fuse@-mnt-pulpos. If we run systemctl enable ceph-fuse@/mnt/pulpos instead, we’ll get the error “Failed to execute operation: Unit name pulpos is not valid.” However, when starting the service, we can run either systemctl start ceph-fuse@/mnt/pulpos or systemctl start ceph-fuse@-mnt-pulpos!
Lastly, we note the same bug in the current version of Luminous (v12.2.1): when CephFS is mounted using ceph-fuse, the mount point doesn’t show up in the output of df; and although we can list the mount point specifically with df -h /mnt/pulpos, the size of the filesystem is reported as 0!
[root@pulpo-dtn ~]# df -h /mnt/pulpos
Filesystem Size Used Avail Use% Mounted on
ceph-fuse 0 0 0 - /mnt/pulpos
Oddly enough, after a few days the df output becomes OK!
Upgrading to v12.2.2
Ceph v12.2.2 (Luminous) was released on December 1, 2017. The upgrade is a breeze:
[root@pulpo-admin ~]# yum clean all
[root@pulpo-admin ~]# yum -y update
[root@pulpo-admin ~]# tentakel "yum clean all"
[root@pulpo-admin ~]# tentakel "yum -y update"
[root@pulpo-admin ~]# reboot
[root@pulpo-admin ~]# tentakel reboot
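Once everything is back up, we can verify that all daemons are indeed running 12.2.2; for example:
[root@pulpo-admin ~]# ceph versions
[root@pulpo-admin ~]# tentakel "ceph --version"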