This document is for a development version of Ceph.
By default, Ceph pools are created with the type “replicated”. In replicated-type pools, every object is copied to multiple disks (this multiple copying is the “replication”).
In contrast, erasure-coded pools use a method of data protection that is different from replication. In erasure coding, data is broken into fragments of two kinds: data blocks and parity blocks. If a drive fails or becomes corrupted, the parity blocks are used to rebuild the data. At scale, erasure coding saves space relative to replication.
In this documentation, data blocks are referred to as “data chunks” and parity blocks are referred to as “coding chunks”.
Erasure codes are also called “forward error correction codes”. The first forward error correction code was developed in 1950 by Richard Hamming at Bell Laboratories.
Creating a sample erasure coded pool
The simplest erasure coded pool is equivalent to RAID5 and requires at least three hosts:
ceph osd pool create ecpool erasure
pool 'ecpool' created
echo ABCDEFGHI | rados --pool ecpool put NYAN -
rados --pool ecpool get NYAN -
Erasure code profiles
The default erasure code profile can sustain the loss of two OSDs. In terms of durability, this profile is equivalent to a replicated pool of size three, but it requires 2TB of raw storage to hold 1TB of data instead of 3TB. The default profile can be displayed with this command:
ceph osd erasure-code-profile get default
k=2
m=2
plugin=jerasure
crush-failure-domain=host
technique=reed_sol_van
The default erasure-coded pool, the profile of which is displayed here, is not the same as the simplest erasure-coded pool.
The default erasure-coded pool has two data chunks (k) and two coding chunks (m). The profile of the default erasure-coded pool is “k=2 m=2”.
The simplest erasure-coded pool has two data chunks (k) and one coding chunk (m). The profile of the simplest erasure-coded pool is “k=2 m=1”.
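The capacity figures quoted for the default profile follow from the ratio (k + m) / k. The following is a minimal Python sketch of that arithmetic only; the function names are illustrative and are not part of any Ceph API.

```python
def ec_raw_capacity(usable_tb: float, k: int, m: int) -> float:
    """Raw capacity needed to store usable_tb in an erasure-coded pool.

    Each object is split into k data chunks and m coding chunks of equal
    size, so raw usage is (k + m) / k times the stored data.
    """
    return usable_tb * (k + m) / k


def replicated_raw_capacity(usable_tb: float, size: int) -> float:
    """Raw capacity needed to store usable_tb in a replicated pool."""
    return usable_tb * size


# Default profile (k=2, m=2) versus a replicated pool of size three.
print(ec_raw_capacity(1, k=2, m=2))        # 2.0
print(replicated_raw_capacity(1, size=3))  # 3

# Simplest profile (k=2, m=1).
print(ec_raw_capacity(1, k=2, m=1))        # 1.5
```

The sketch ignores per-object padding and metadata, so real clusters will use somewhat more raw space than these figures suggest.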
Choosing the right profile is important because the profile cannot be modified after the pool is created. If you find that you need an erasure-coded pool with a profile different from the one you have created, you must create a new pool with a different (and presumably more carefully considered) profile, and then move all objects from the wrongly configured pool to the newly created pool.
The most important parameters of the profile are K, M and crush-failure-domain because they define the storage overhead and the data durability. For example, if the desired architecture must sustain the loss of two racks with a storage overhead of 67%, the following profile can be defined:
ceph osd erasure-code-profile set myprofile \
   k=3 \
   m=2 \
   crush-failure-domain=rack
ceph osd pool create ecpool erasure myprofile
echo ABCDEFGHI | rados --pool ecpool put NYAN -
rados --pool ecpool get NYAN -
The NYAN object will be divided into three data chunks (K=3) and two additional coding chunks will be created (M=2). The value of M defines how many OSDs can be lost simultaneously without losing any data. The crush-failure-domain=rack setting will create a CRUSH rule that ensures no two chunks are stored in the same rack.
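The relationship between K, M, the 67% overhead and the number of racks can be checked with a few lines of arithmetic. This is a hedged sketch: the ECProfile class below is hypothetical and only models the numbers discussed here, not the actual profile handling in Ceph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ECProfile:
    k: int  # number of data chunks
    m: int  # number of coding chunks

    @property
    def total_chunks(self) -> int:
        # With one chunk per failure domain, this is also the minimum
        # number of failure domains (racks in this example) required.
        return self.k + self.m

    @property
    def overhead_percent(self) -> float:
        # Extra raw space relative to the stored data.
        return 100 * self.m / self.k

    @property
    def failures_tolerated(self) -> int:
        # Any m chunks can be lost without losing data.
        return self.m


profile = ECProfile(k=3, m=2)
print(profile.total_chunks)             # 5
print(round(profile.overhead_percent))  # 67
print(profile.failures_tolerated)       # 2
```

Because no two chunks may share a rack, a profile with k=3 and m=2 needs at least five racks before all chunks of an object can be placed.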
More information can be found in the erasure code profiles documentation.
Erasure Coding with Overwrites
By default, erasure coded pools work only with operations that perform full object writes and appends, such as those issued by RGW.
Since Luminous, partial writes for an erasure coded pool may be enabled with a per-pool setting. This lets RBD and CephFS store their data in an erasure coded pool:
ceph osd pool set ec_pool allow_ec_overwrites true
This can only be enabled on a pool residing on bluestore OSDs, since bluestore’s checksumming is used to detect bitrot or other corruption during deep-scrub. In addition to being unsafe, using filestore with ec overwrites yields low performance compared to bluestore.
Erasure coded pools do not support omap, so to use them with RBD and CephFS you must instruct them to store their data in an ec pool, and their metadata in a replicated pool. For RBD, this means using the erasure coded pool as the --data-pool during image creation:
rbd create --size 1G --data-pool ec_pool replicated_pool/image_name
For CephFS, an erasure coded pool can be set as the default data pool during file system creation or via file layouts.
Erasure coded pool and cache tiering
Erasure coded pools require more resources than replicated pools and lack some functionality, such as omap. To overcome these limitations, one can set up a cache tier in front of the erasure coded pool.
For instance, suppose the pool hot-storage is made of fast storage:
ceph osd tier add ecpool hot-storage
ceph osd tier cache-mode hot-storage writeback
ceph osd tier set-overlay ecpool hot-storage
These commands place the hot-storage pool as a tier of ecpool in writeback mode, so that every write and read to ecpool actually uses hot-storage and benefits from its flexibility and speed.
More information can be found in the cache tiering documentation.
Erasure coded pool recovery
If an erasure coded pool loses some shards, it must recover them from the others. This generally involves reading from the remaining shards, reconstructing the data, and writing it to the new peer. In Octopus and later releases, erasure coded pools can recover as long as there are at least K shards available. (With fewer than K shards, you have actually lost data!)
Prior to Octopus, erasure coded pools required at least min_size shards to be available, even if min_size was greater than K. (We generally recommend min_size be K+2 or more to prevent loss of writes and data.) This conservative decision was made out of an abundance of caution when designing the new pool mode, but it also meant that pools with lost OSDs but no data loss were unable to recover and go active without manual intervention to change the min_size.
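The difference between the two recovery rules can be expressed as a small predicate. This is a simplified model of the behavior described in this section, written for illustration; the function name and its parameters are assumptions and do not correspond to Ceph internals.

```python
def can_recover(available_shards: int, k: int, min_size: int,
                octopus_or_later: bool) -> bool:
    """Simplified model of the recovery precondition described above."""
    if available_shards < k:
        # Fewer than K shards: the data cannot be reconstructed at all.
        return False
    if octopus_or_later:
        # K shards are enough to recover.
        return True
    # Before Octopus, recovery also waited for min_size shards.
    return available_shards >= min_size


# Hypothetical pool: k=4, m=2, min_size=5, two OSDs lost (4 shards remain).
print(can_recover(4, k=4, min_size=5, octopus_or_later=True))   # True
print(can_recover(4, k=4, min_size=5, octopus_or_later=False))  # False
```

In the pre-Octopus case the pool in this example would stay inactive until min_size was lowered by hand, even though no data had been lost.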
Glossary
chunk
When the encoding function is called, it returns chunks of the same size: data chunks, which can be concatenated to reconstruct the original object, and coding chunks, which can be used to rebuild a lost chunk.
K
The number of data chunks, i.e. the number of chunks into which the original object is divided. For instance, if K = 2, a 10KB object will be divided into K objects of 5KB each.
M
The number of coding chunks, i.e. the number of additional chunks computed by the encoding functions. If there are 2 coding chunks, 2 OSDs can be out without losing data.
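The roles of data chunks and coding chunks can be illustrated with a toy encoder. The sketch below uses plain XOR parity with a single coding chunk (M=1) as a stand-in; it is not the reed_sol_van technique used by the jerasure plugin, and the encode and rebuild helpers are hypothetical names chosen for this example.

```python
from typing import List, Optional


def encode(data: bytes, k: int = 2) -> List[bytes]:
    """Split data into k equal data chunks plus one XOR coding chunk."""
    chunk_size = -(-len(data) // k)             # ceiling division
    padded = data.ljust(chunk_size * k, b"\0")  # pad so chunks are equal
    chunks = [padded[i * chunk_size:(i + 1) * chunk_size] for i in range(k)]
    parity = bytearray(chunk_size)
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return chunks + [bytes(parity)]


def rebuild(chunks: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the single missing chunk (marked None) by XOR of the rest."""
    missing = chunks.index(None)
    present = [c for c in chunks if c is not None]
    restored = bytearray(len(present[0]))
    for chunk in present:
        for i, byte in enumerate(chunk):
            restored[i] ^= byte
    result = list(chunks)
    result[missing] = bytes(restored)
    return result


obj = bytes(range(256)) * 40          # a 10KB (10240-byte) object
chunks = encode(obj, k=2)
print([len(c) for c in chunks])       # [5120, 5120, 5120]

damaged = [chunks[0], None, chunks[2]]  # lose one data chunk
repaired = rebuild(damaged)
print(b"".join(repaired[:2]) == obj)    # True
```

With K=2 the 10KB object yields two 5KB data chunks and one 5KB coding chunk, and concatenating the data chunks returns the original object. Tolerating more than one lost chunk requires a code such as Reed-Solomon rather than simple XOR.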