Notice

This document is for a development version of Ceph.

Overview

The ceph-volume tool aims to be a single-purpose command-line tool for deploying logical volumes as OSDs, while maintaining an API similar to that of ceph-disk for preparing, activating, and creating OSDs.

It deviates from ceph-disk by neither interacting with nor relying on the udev rules that come installed for Ceph. These rules allow automatic detection of previously set up devices, which are in turn fed into ceph-disk to activate them.

Cephadm shell

Do not run ceph-volume from a container session that was started with cephadm shell while relying on that shell’s default bind mounts. By design, cephadm shell omits several host bind mounts that ceph-volume expects (for example, /run/udev and /run/lvm). Invoking ceph-volume in that environment is likely to fail or behave incorrectly.
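As a rough pre-flight sketch, the check below reports which expected paths are absent in the current session. The helper name missing_paths is hypothetical and is not part of cephadm or ceph-volume. Only the two paths named above are checked; a path that exists is not proof that it was bind-mounted from the host, so treat a clean result as a hint rather than a guarantee.

```shell
# missing_paths PATH...: print every argument that does not exist,
# one per line. Returns 0 if all paths exist, 1 otherwise.
# (Hypothetical helper, not a cephadm or ceph-volume command.)
missing_paths() {
    rc=0
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            printf '%s\n' "$p"
            rc=1
        fi
    done
    return "$rc"
}

# Check the two paths mentioned in the text before invoking ceph-volume.
if missing_paths /run/udev /run/lvm >/dev/null; then
    echo "expected paths are present"
else
    echo "expected paths are missing; ceph-volume is likely to misbehave here"
fi
```

Running this inside a default cephadm shell session and again inside a session started with extra mounts is a quick way to see the difference between the two environments.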

Running ceph-volume yourself, outside of what ceph orch / cephadm drives, is not the normal operational path: it is mainly for debugging, testing, or development.

Note

Advanced use only

If you truly understand the implications, you can extend the default container environment by passing cephadm shell a single -m (or --mount) option followed by every bind mount you need, for example:

cephadm shell -m /dev:/dev /run/udev:/run/udev /sys:/sys /run/lvm:/run/lvm /run/lock/lvm:/run/lock/lvm /:/rootfs

From inside that shell, if you still need the client.bootstrap-osd keyring (cephadm shell does not expose it by default), you can obtain it with the cluster tools available in the container, for example:

ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring

Prefer doing this inside the enriched shell rather than generating key material on the host and bind-mounting it in: the latter is easy to get wrong, can leave sensitive files behind on the host, and is generally more intrusive than running the same command from within the shell session.

Replacing ceph-disk

The ceph-disk tool was created at a time when the project was required to support many different types of init systems (upstart, sysvinit, etc.) while being able to discover devices. This caused the tool to concentrate initially (and exclusively afterwards) on GPT partitions, and specifically on GPT GUIDs, which were used to label devices uniquely in order to answer questions like:

  • Is this device a journal?

  • Is this an encrypted data partition?

  • Was the device left partially prepared?

To answer these questions, ceph-disk used udev rules that matched the GUIDs and called ceph-disk, which ended up in a back-and-forth between the ceph-disk systemd unit and the ceph-disk executable. The process was very unreliable and time-consuming (a timeout of close to three hours per OSD had to be put in place), and would cause OSDs to not come up at all during the boot process of a node.

These problems were hard to debug, or even to reproduce, given the asynchronous behavior of udev.

Because the worldview of ceph-disk had to be GPT partitions exclusively, it couldn’t work with other technologies like LVM or similar device mapper devices. It was ultimately decided to create something modular, starting with LVM support, with the ability to expand to other technologies as needed.

GPT partitions are simple?

Although partitions in general are simple to reason about, ceph-disk partitions were not simple by any means. They required a tremendous number of special flags in order to work correctly with the device discovery workflow. Here is an example call to create a data partition:

/sbin/sgdisk --largest-new=1 --change-name=1:ceph data --partition-guid=1:f0fc39fd-eeb2-49f1-b922-a11939cf8a0f --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/sdb

Not only were these partitions hard to create, they also required devices to be exclusively owned by Ceph. For example, in some cases a special partition would be created when devices were encrypted, and it would contain unencrypted keys. This was ceph-disk domain knowledge, which does not fit a “GPT partitions are simple” understanding. Here is an example of that special partition being created:

/sbin/sgdisk --new=5:0:+10M --change-name=5:ceph lockbox --partition-guid=5:None --typecode=5:fb3aabf9-d25f-47cc-bf5e-721d181642be --mbrtogpt -- /dev/sdad
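To illustrate how a type GUID acts as a label, here is a minimal lookup sketch. It covers only the two --typecode values that appear in the commands above and names them after the --change-name values in those same commands. The function name partition_role is hypothetical, and ceph-disk recognized more type codes than these two, so this is not a complete or authoritative table.

```shell
# partition_role GUID: map a GPT partition type GUID to the partition
# name used alongside it in the sgdisk examples above.
# (Hypothetical helper; only the two GUIDs from this page are listed.)
partition_role() {
    # Normalize to lowercase so the match is case-insensitive.
    guid=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$guid" in
        89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be) echo "ceph data" ;;
        fb3aabf9-d25f-47cc-bf5e-721d181642be) echo "ceph lockbox" ;;
        *) echo "unknown" ;;
    esac
}

partition_role fb3aabf9-d25f-47cc-bf5e-721d181642be
```

Every question ceph-disk needed to answer about a device required another entry of this kind, plus a udev rule to match it, which is part of why the scheme was not as simple as plain partitions suggest.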

Modularity

ceph-volume was designed to be a modular tool because we anticipate that people will provision hardware devices in many different ways, all of which we need to consider. There are already two: legacy ceph-disk devices that are still in use and have GPT partitions (handled by simple), and LVM. SPDK devices, where NVMe devices are managed directly from userspace, are on the immediate horizon; LVM won’t work there since the kernel isn’t involved at all.

ceph-volume lvm

By making use of LVM tags, the lvm subcommand is able to store, and later rediscover and query, the devices associated with OSDs so that they can be activated.
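The sketch below shows the general idea of reading one value out of a tag string. LVM reports a logical volume's tags as a comma-separated list (for example via lvs --noheadings -o lv_name,lv_tags). The helper name lv_tag_value is hypothetical, and the tag names and values in the sample string are assumptions made up for illustration, not output captured from a real cluster.

```shell
# lv_tag_value TAGS KEY: print the value of KEY from a comma-separated
# list of key=value tags, the format in which lvs prints lv_tags.
# (Hypothetical helper, not part of ceph-volume.)
lv_tag_value() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS='=' read -r key value; do
        if [ "$key" = "$2" ]; then
            printf '%s\n' "$value"
        fi
    done
}

# Sample tag string with made-up values (assumed tag names).
sample_tags='ceph.osd_id=0,ceph.osd_fsid=11111111-2222-3333-4444-555555555555,ceph.type=block'

lv_tag_value "$sample_tags" ceph.osd_id
```

Because the metadata travels with the logical volume itself, rediscovery is a synchronous query against LVM rather than a chain of udev events.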

LVM performance penalty

In short: we haven’t been able to notice any significant performance penalties associated with the change to LVM. Working closely with LVM also made support for other device mapper technologies a given: there is no technical difficulty in working with anything that can sit below a Logical Volume.
