Notice

This document is for a development version of Ceph.

Config Settings

See Block Device for additional details.

Generic IO Settings

rbd_compression_hint

Hint to send to the OSDs on write operations. If set to compressible and the OSD bluestore_compression_mode setting is passive, the OSD will attempt to compress data. If set to incompressible and the OSD compression setting is aggressive, the OSD will not attempt to compress data.

type:

str

runtime updatable:

true

default:

none

valid choices:

none

compressible

incompressible

rbd_read_from_replica_policy

Policy for determining which OSD will receive read operations. If set to default, each PG’s primary OSD will always be used for read operations. If set to balance, read operations will be sent to a randomly selected OSD within the replica set for replica pools or directly to the shard storing the data for erasure-coded pools. If set to localize, read operations will be sent to the closest OSD as determined by the CRUSH map. Unlike rbd_balance_snap_reads and rbd_localize_snap_reads or rbd_balance_parent_reads and rbd_localize_parent_reads, it affects all read operations, not just snap or parent. Note: this feature requires the cluster to be configured with a minimum compatible OSD release of Octopus.

type:

str

runtime updatable:

true

default:

default

valid choices:

default

balance

localize

rbd_default_order

This configures the default object size for new images. The value is used as a power of two, meaning default_object_size = 2 ^ rbd_default_order. Configure a value between 12 and 25 (inclusive), translating to 4KiB lower and 32MiB upper limit.

type:

uint

runtime updatable:

true

default:

22

Cache Settings

The user space implementation of the Ceph block device (i.e., librbd) cannot take advantage of the Linux page cache, so it includes its own in-memory caching, called “RBD caching.” RBD caching behaves just like well-behaved hard disk caching. When the OS sends a barrier or a flush request, all dirty data is written to the OSDs. This means that using write-back caching is just as safe as using a well-behaved physical hard disk with a VM that properly sends flushes (i.e. Linux kernel >= 2.6.32). The cache uses a Least Recently Used (LRU) algorithm, and in write-back mode it can coalesce contiguous requests for better throughput.

The librbd cache is enabled by default and supports three different cache policies: write-around, write-back, and write-through. Writes return immediately under both the write-around and write-back policies, unless there are more than rbd_cache_max_dirty unwritten bytes to the storage cluster. The write-around policy differs from the write-back policy in that it does not attempt to service read requests from the cache, unlike the write-back policy, and is therefore faster for high performance write workloads. Under the write-through policy, writes return only when the data is on disk on all replicas, but reads may come from the cache.

Prior to receiving a flush request, the cache behaves like a write-through cache to ensure safe operation for older operating systems that do not send flushes to ensure crash consistent behavior.

If the librbd cache is disabled, writes and reads go directly to the storage cluster, and writes return only when the data is on disk on all replicas.

Note

The cache is in memory on the client, and each RBD image has its own. Since the cache is local to the client, there’s no coherency if there are others accessing the image. Running GFS or OCFS on top of RBD will not work with caching enabled.

Option settings for RBD should be set in the [client] section of your configuration file or the central config store. These settings include:

rbd_cache

Enable caching for RADOS Block Device (RBD).

type:

bool

runtime updatable:

true

default:

true

rbd_cache_policy

Select the caching policy for librbd.

type:

str

runtime updatable:

false

default:

writearound

valid choices:

writethrough

writeback

writearound

rbd_cache_writethrough_until_flush

Start out in writethrough mode, and switch to writeback after the first flush request is received. Enabling is a conservative but safe strategy in case VMs running on RBD volumes are too old to send flushes, like the virtio driver in Linux kernels older than 2.6.32.

type:

bool

runtime updatable:

true

default:

true

rbd_cache_size

The per-volume RBD client cache size in bytes.

type:

size

runtime updatable:

true

default:

32Mi

policies:

write-back and write-through

rbd_cache_max_dirty

The dirty limit in bytes at which the cache triggers write-back. If 0, uses write-through caching.

type:

size

runtime updatable:

true

default:

24Mi

constraint:

Must be less than rbd_cache_size.

policies:

write-around and write-back

rbd_cache_target_dirty

The dirty target before the cache begins writing data to the data storage. Does not block writes to the cache.

type:

size

runtime updatable:

true

default:

16Mi

constraint:

Must be less than rbd_cache_max_dirty.

policies:

write-back

rbd_cache_max_dirty_age

The number of seconds dirty data is in the cache before writeback starts.

type:

float

runtime updatable:

true

default:

1.0

policies:

write-back

Read-ahead Settings

librbd supports read-ahead/prefetching to optimize small, sequential reads. This should normally be handled by the guest OS in the case of a VM, but boot loaders may not issue efficient reads. Read-ahead is automatically disabled if caching is disabled or if the policy is write-around.

rbd_readahead_trigger_requests

number of sequential requests necessary to trigger readahead

type:

uint

runtime updatable:

true

default:

10

rbd_readahead_max_bytes

Maximum size of a read-ahead request. If zero, read-ahead is disabled.

type:

size

runtime updatable:

true

default:

512Ki

rbd_readahead_disable_after_bytes

After this many bytes have been read from an RBD image, read-ahead is disabled for that image until it is closed. This allows the guest OS to take over read-ahead once it is booted. If zero, read-ahead stays enabled.

type:

size

runtime updatable:

true

default:

50Mi

Image Features

RBD supports advanced features which can be specified via the command line when creating images or the default features can be configured via rbd_default_features = <sum of feature numeric values> or rbd_default_features = <comma-delimited list of CLI values>.

Layering

Description:: Layering enables cloning.
Internal value:: 1
CLI value:: layering
Added in:: v0.52 (Bobtail)
KRBD support:: since v3.10
Default:: yes

Striping v2

Description:: Striping spreads data across multiple objects. Striping helps with parallelism for sequential read/write workloads.
Internal value:: 2
CLI value:: striping
Added in:: v0.55 (Bobtail)
KRBD support:: since v3.10 (default striping only, “fancy” striping added in v4.17)
Default:: yes

Exclusive locking

Description:: When enabled, it requires a client to acquire a lock on an object before making a write. Exclusive lock should only be enabled when a single client is accessing an image at any given time.
Internal value:: 4
CLI value:: exclusive-lock
Added in:: v0.92 (Hammer)
KRBD support:: since v4.9
Default:: yes

Object map

Description:: Object map support depends on exclusive lock support. Block devices are thin provisioned, which means that they only store data that actually has been written, ie. they are sparse. Object map support helps track which objects actually exist (have data stored on a device). Enabling object map support speeds up I/O operations for cloning, importing and exporting a sparsely populated image, and deleting.
Internal value:: 8
CLI value:: object-map
Added in:: v0.93 (Hammer)
KRBD support:: since v5.3
Default:: yes

Fast-diff

Description:: Fast-diff support depends on object map support and exclusive lock support. It adds another property to the object map, which makes it much faster to generate diffs between snapshots of an image. It is also much faster to calculate the actual data usage of a snapshot or volume (rbd du).
Internal value:: 16
CLI value:: fast-diff
Added in:: v9.0.1 (Infernalis)
KRBD support:: since v5.3
Default:: yes

Deep-flatten

Description:: Deep-flatten enables rbd flatten to work on all snapshots of an image, in addition to the image itself. Without it, snapshots of an image will still rely on the parent, so the parent cannot be deleted until the snapshots are first deleted. Deep-flatten makes a parent independent of its clones, even if they have snapshots, at the expense of using additional OSD device space.
Internal value:: 32
CLI value:: deep-flatten
Added in:: v9.0.2 (Infernalis)
KRBD support:: since v5.1
Default:: yes

Journaling

Description:: Journaling support depends on exclusive lock support. Journaling records all modifications to an image in the order they occur. RBD mirroring can utilize the journal to replicate a crash-consistent image to a remote cluster. It is best to let rbd-mirror manage this feature only as needed, as enabling it long term may result in substantial additional OSD space consumption.
Internal value:: 64
CLI value:: journaling
Added in:: v10.0.1 (Jewel)
KRBD support:: no
Default:: no

Data pool

Description:: On erasure-coded pools, the image data block objects need to be stored on a separate pool from the image metadata.
Internal value:: 128
Added in:: v11.1.0 (Kraken)
KRBD support:: since v4.11
Default:: no

Operations

Description:: Used to restrict older clients from performing certain maintenance operations against an image (e.g. clone, snap create).
Internal value:: 256
Added in:: v13.0.2 (Mimic)
KRBD support:: since v4.16

Migrating

Description:: Used to restrict older clients from opening an image when it is in migration state.
Internal value:: 512
Added in:: v14.0.1 (Nautilus)
KRBD support:: no

Non-primary

Description:: Used to restrict changes to non-primary images using snapshot-based mirroring.
Internal value:: 1024
Added in:: v15.2.0 (Octopus)
KRBD support:: no

Clone Settings

rbd_default_clone_format

This sets the internal format for tracking cloned images. The value of 1 requires attaching to protected snapshots that cannot be removed until the clone is removed or flattened. The value of 2 will allow clones to be attached to any snapshot and permits removing in-use parent snapshots but requires Mimic or later clients. The default value of auto will use the v2 format if the cluster is configured to require Mimic or later clients.

type:

str

runtime updatable:

true

default:

auto

valid choices:

1

2

auto

QoS Settings

librbd supports limiting per-image IO in several ways. These all apply to a given image within a given process - the same image used in multiple places, e.g. two separate VMs, would have independent limits.

IOPS: number of I/Os per second (any type of I/O)
read IOPS: number of read I/Os per second
write IOPS: number of write I/Os per second
bps: bytes per second (any type of I/O)
read bps: bytes per second read
write bps: bytes per second written

Each of these limits operates independently of each other. They are all off by default. Every type of limit throttles I/O using a token bucket algorithm, with the ability to configure the limit (average speed over time) and potential for a higher rate (a burst) for a short period of time (burst_seconds). When any of these limits is reached, and there is no burst capacity left, librbd reduces the rate of that type of I/O to the limit.

For example, if a read bps limit of 100MB was configured, but writes were not limited, writes could proceed as quickly as possible, while reads would be throttled to 100MB/s on average. If a read bps burst of 150MB was set, and read burst seconds was set to five seconds, reads could proceed at 150MB/s for up to five seconds before dropping back to the 100MB/s limit.

The following options configure these throttles:

rbd_qos_iops_limit

the desired limit of IO operations per second

type:

uint

runtime updatable:

true

default:

0

rbd_qos_iops_burst

the desired burst limit of IO operations

type:

uint

runtime updatable:

true

default:

0

rbd_qos_iops_burst_seconds

the desired burst duration in seconds of IO operations

type:

uint

runtime updatable:

true

default:

1

min:

1

rbd_qos_read_iops_limit

the desired limit of read operations per second

type:

uint

runtime updatable:

true

default:

0

rbd_qos_read_iops_burst

the desired burst limit of read operations

type:

uint

runtime updatable:

true

default:

0

rbd_qos_read_iops_burst_seconds

the desired burst duration in seconds of read operations

type:

uint

runtime updatable:

true

default:

1

min:

1

rbd_qos_write_iops_limit

the desired limit of write operations per second

type:

uint

runtime updatable:

true

default:

0

rbd_qos_write_iops_burst

the desired burst limit of write operations

type:

uint

runtime updatable:

true

default:

0

rbd_qos_write_iops_burst_seconds

the desired burst duration in seconds of write operations

type:

uint

runtime updatable:

true

default:

1

min:

1

rbd_qos_bps_limit

the desired limit of IO bytes per second

type:

uint

runtime updatable:

true

default:

0

rbd_qos_bps_burst

the desired burst limit of IO bytes

type:

uint

runtime updatable:

true

default:

0

rbd_qos_bps_burst_seconds

the desired burst duration in seconds of IO bytes

type:

uint

runtime updatable:

true

default:

1

min:

1

rbd_qos_read_bps_limit

the desired limit of read bytes per second

type:

uint

runtime updatable:

true

default:

0

rbd_qos_read_bps_burst

the desired burst limit of read bytes

type:

uint

runtime updatable:

true

default:

0

rbd_qos_read_bps_burst_seconds

the desired burst duration in seconds of read bytes

type:

uint

runtime updatable:

true

default:

1

min:

1

rbd_qos_write_bps_limit

the desired limit of write bytes per second

type:

uint

runtime updatable:

true

default:

0

rbd_qos_write_bps_burst

the desired burst limit of write bytes

type:

uint

runtime updatable:

true

default:

0

rbd_qos_write_bps_burst_seconds

the desired burst duration in seconds of write bytes

type:

uint

runtime updatable:

true

default:

1

min:

1

rbd_qos_schedule_tick_min

This determines the minimum time (in milliseconds) at which I/Os can become unblocked if the limit of a throttle is hit. In terms of the token bucket algorithm, this is the minimum interval at which tokens are added to the bucket.

type:

uint

runtime updatable:

true

default:

50

min:

1

rbd_qos_exclude_ops

Optionally exclude ops from QoS. This setting accepts either an integer bitmask value or comma-delimited string of op names. This setting is always internally stored as an integer bitmask value. The mapping between op bitmask value and op name is as follows: +1 -> read, +2 -> write, +4 -> discard, +8 -> write_same, +16 -> compare_and_write

type:

str

runtime updatable:

true

Brought to you by the Ceph Foundation

The Ceph Documentation is a community resource funded and hosted by the non-profit Ceph Foundation. If you would like to support this and our other efforts, please consider joining now.