Notice

This document is for a development version of Ceph.

Support for RBD, RGW and CephFS in Erasure Coded Pools

Introduction

This document covers the design for enabling omap (object map) support and synchronous read operations in erasure-coded pools. These enhancements enable EC pools to support Cls methods, as well as RBD, RGW, and CephFS workloads without the need for a separate replica pool for metadata.

Current Limitations

Erasure-coded pools have previously been limited in their support for metadata operations. Specifically:

  • Omap operations (key-value metadata storage on objects) were not supported, limiting the use of EC pools for workloads requiring metadata.

  • Cls operations (server-side object class methods) were not available, preventing RBD and other advanced features from working with EC pools.

  • Synchronous read operations, which Cls operations require in order to function correctly, were not implemented in the EC backend.

These limitations prevented EC pools from being used for many important workloads, particularly RBD (RADOS Block Device) which relies heavily on both omap and Cls operations.

Feature Relationships

The two main features in this design are independent but complementary:

  • Omap Support: Enables key-value metadata storage on EC pool objects through replication across primary-capable shards with journal-based updates managed by the primary OSD.

  • Synchronous Reads: Provides synchronous read semantics in the EC backend using Boost pull-type coroutines, enabling synchronous operations without blocking threads.

Together, these features enable full support for RBD, RGW and CephFS on erasure-coded pools.

Omap Support for EC Pools

Current Limitations

In the original EC pool implementation, omap operations were not supported due to the complexity of maintaining consistent key-value metadata across erasure-coded shards. Unlike replicated pools where each replica maintains a complete copy of the omap data, EC pools distribute data across multiple shards, making metadata management more complex.

The primary challenges include:

  • Ensuring consistency of metadata across shards

  • Handling partial updates and failures

  • Maintaining performance for metadata operations

  • Supporting recovery and reconstruction scenarios

Design Approach

The omap implementation for EC pools uses a replication-based approach:

  • Omap data is replicated across all primary-capable shards in a PG

  • A journal is used to store omap updates before they are committed

  • Updates are applied atomically across all primary-capable shards

  • Consistency is maintained through the journal commit protocol

This approach provides:

  • Strong consistency guarantees for metadata operations

  • Efficient recovery through journal replay

  • Compatibility with existing omap APIs

  • Minimal impact on data path performance

Omap Architecture

Shard Distribution and Primary-Capable Shards

In an erasure-coded pool with k data shards and m parity shards, the primary-capable shards are:

  • The first data shard

  • All m parity shards

This means there are m + 1 primary-capable shards in total. For example, in a k=4, m=2 configuration, shards 0, 4, and 5 are primary-capable.
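As a rough illustration of the rule above, the following sketch computes the primary-capable shard ids for a given profile. The function name `primary_capable_shards` is hypothetical and not part of Ceph; it assumes data shards are numbered 0 to k-1 and parity shards k to k+m-1, as in the k=4, m=2 example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, not a Ceph API. Assumes data shards are numbered
// 0..k-1 and parity shards k..k+m-1.
std::vector<int> primary_capable_shards(int k, int m) {
    assert(k >= 1 && m >= 0);
    std::vector<int> shards;
    shards.push_back(0);                // the first data shard
    for (int s = k; s < k + m; ++s) {   // all m parity shards
        shards.push_back(s);
    }
    return shards;                      // m + 1 entries in total
}
```

Calling `primary_capable_shards(4, 2)` yields shards 0, 4 and 5, matching the example in the text.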

Each primary-capable shard maintains a complete copy of the omap data for objects in the PG. This replication ensures that omap data remains available even if some shards fail, and allows any primary-capable shard to serve omap read requests when acting as primary.

Journal Implementation

The ECOmapJournal is maintained on the primary OSD. The journal:

  • Records all omap updates before they are applied

  • Ensures atomic application of updates across all primary-capable replicas

  • Enables recovery in case of failures during update operations

  • Provides a consistent view of omap state during recovery

  • Handles object deletion and recreation

Object State Map

The journal maintains an object_state_map to track objects that are in the process of being deleted. This map is critical for ensuring that omap updates are written to the correct object generation when objects are deleted and recreated.

The object_state_map:

  • Tracks outstanding deletes: When a delete operation is appended to the journal, the object’s version number is added to the map along with a boolean indicating whether it’s a lost delete

  • Manages version lifecycle: The version number remains in the map until the delete is trimmed from the PG log

  • Determines generation numbers: The version number is used to calculate the generation number for any outstanding omap updates, ensuring updates are applied to the correct object generation

  • Handles object recreation: If an object is deleted and then recreated before the delete is trimmed, the map ensures omap updates target the appropriate generation

When get_generation() is called for an object, it returns:

  • The lowest version number from the object_state_map if any deletes are outstanding

  • A boolean indicating whether the delete was lost

  • NO_GEN if no deletes are outstanding for the object

This mechanism prevents omap updates from being applied to the wrong generation of an object, which could occur if an object is deleted and recreated while omap updates are still in flight.

Operations Using append_delete/trim_delete:

The object_state_map’s append_delete and trim_delete sequence is used by several operations that involve object deletion and recreation:

  • Explicit deletes: Direct object deletion operations

  • REPLACE operations: copy_from operations that atomically delete and recreate objects (see below)

  • Clone operations to non-snapshot objects: When cloning to a non-snapshot object, the target object is effectively deleted and recreated with the cloned content, requiring the same generation tracking as other delete-and-recreate operations

All of these operations follow the same pattern: a delete is appended to the object_state_map when the operation is logged, and the version is trimmed when the PG log entry is eventually removed. This ensures consistent generation tracking regardless of which operation causes the object lifecycle transition.
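The generation tracking described above can be sketched as a small in-memory structure. This is a minimal model, not the actual implementation: only the names append_delete, trim_delete, get_generation and NO_GEN come from the text, while the class name, the signatures, the `Generation` struct and the value chosen for `NO_GEN` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <string>

using Version = std::uint64_t;

// Assumed sentinel value; the text only names NO_GEN, not its value.
constexpr Version NO_GEN = std::numeric_limits<Version>::max();

// Hypothetical return type: the lowest outstanding delete version and
// whether that delete was a lost delete.
struct Generation {
    Version version;
    bool lost_delete;
};

class ObjectStateMap {
public:
    // Called when a delete-like operation (explicit delete, REPLACE,
    // clone over a non-snapshot object) is logged.
    void append_delete(const std::string& oid, Version v, bool lost) {
        deletes_[oid][v] = lost;
    }

    // Called when the corresponding PG log entry is trimmed.
    void trim_delete(const std::string& oid, Version v) {
        auto it = deletes_.find(oid);
        if (it == deletes_.end()) return;
        it->second.erase(v);
        if (it->second.empty()) deletes_.erase(it);
    }

    // Lowest outstanding delete version, or NO_GEN if there is none.
    Generation get_generation(const std::string& oid) const {
        auto it = deletes_.find(oid);
        if (it == deletes_.end()) return Generation{NO_GEN, false};
        auto lowest = it->second.begin();  // std::map keeps versions sorted
        return Generation{lowest->first, lowest->second};
    }

private:
    std::map<std::string, std::map<Version, bool>> deletes_;
};
```

In this model, an object with deletes outstanding at versions 12 and 15 reports generation 12 until that entry is trimmed, then 15, and finally NO_GEN once both are trimmed.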

Clone Operations and Outstanding Omap Updates:

Clone operations require special handling to ensure that outstanding omap updates are properly applied to the cloned object. When a clone operation is performed:

  1. A visitor pattern is used to traverse the PG log and accumulate all outstanding omap updates for the source object

  2. These accumulated omap updates are collected from journal entries that have not yet been applied to the object store

  3. All accumulated updates are then applied to the clone transaction, ensuring the cloned object receives the complete, up-to-date omap state

  4. This process ensures that the clone includes not just the omap data from the object store, but also any in-flight updates that exist only in the journal

This visitor-based approach is necessary because the journal may contain omap updates that have been logged but not yet applied to the source object’s persistent storage. Without accumulating these updates, the clone would have stale omap data, missing recent modifications. By applying all outstanding updates to the clone transaction, the system ensures that clones are created with a consistent and complete view of the object’s omap state, including both persisted data and in-flight journal entries.
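The accumulation step can be illustrated with a simplified visitor. All types here (`LogEntry`, `OutstandingOmapVisitor`, `clone_omap`) are hypothetical stand-ins, and only omap "set" updates are modelled; the real code works on PG log entries and object store transactions.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Omap = std::map<std::string, std::string>;

// Hypothetical, simplified log entry carrying omap "set" updates only.
struct LogEntry {
    std::string oid;
    Omap omap_sets;
    bool applied_to_store;  // true once the update has reached the object store
};

// Visitor that accumulates outstanding omap updates for one source object.
class OutstandingOmapVisitor {
public:
    explicit OutstandingOmapVisitor(std::string source)
        : source_(std::move(source)) {}

    void visit(const LogEntry& e) {
        if (e.oid != source_ || e.applied_to_store) return;
        for (const auto& kv : e.omap_sets) accumulated_[kv.first] = kv.second;
    }

    const Omap& accumulated() const { return accumulated_; }

private:
    std::string source_;
    Omap accumulated_;
};

// Builds the clone's omap: persisted data plus in-flight updates, in log order.
Omap clone_omap(const Omap& source_store,
                const std::vector<LogEntry>& pg_log,
                const std::string& source) {
    OutstandingOmapVisitor visitor(source);
    for (const auto& e : pg_log) visitor.visit(e);

    Omap clone = source_store;
    for (const auto& kv : visitor.accumulated()) clone[kv.first] = kv.second;
    return clone;
}
```

The clone ends up with the persisted keys overlaid by any updates still waiting in the log, while entries for other objects or already-applied entries are ignored.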

Trimmer Architecture for Post-Removal Scenarios:

The EC omap implementation uses a trimmer to determine when to apply journal updates to the underlying BlueStore. However, a special case arises when an object has just been deleted: we must avoid applying omap updates to an object that no longer exists. To handle this, a specialized trimmer architecture has been implemented:

  • TrimmerPostRemove: A base class that performs all standard trimming actions except EC omap operations. This trimmer handles PG log cleanup without attempting to apply omap updates to the object store.

  • Trimmer: Inherits from TrimmerPostRemove and overrides the ec_omap method to add EC omap update application. This is the standard trimmer used during normal PG log maintenance.

  • trim_after_remove(): A function that uses TrimmerPostRemove to trim the PG log immediately after an object has been removed. This is called in PGLog.h after a remove operation (rollbacker->trim(i)).

The key insight is that when trimming after a remove operation, we want to clean up the PG log entries but we must not apply any omap updates from those entries to BlueStore, since the object has just been deleted. By using TrimmerPostRemove (which skips the ec_omap step), the system ensures that:

  1. PG log entries are properly trimmed after object removal

  2. Omap updates from those entries are not applied to the now-deleted object

  3. The journal’s object_state_map is updated appropriately via trim_delete

  4. Normal trimming (using the full Trimmer class) continues to apply omap updates for objects that still exist

This design prevents attempting to write omap data to deleted objects while maintaining proper journal cleanup and state tracking.
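The class relationship can be sketched as follows. The class names TrimmerPostRemove and Trimmer, the ec_omap hook and trim_after_remove() come from the text; everything else (the `LogEntry` and `TrimStats` types, the method signatures, and `trim_normal`) is an assumption made so the example is self-contained.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative stand-ins; the real classes operate on PG log entries and
// object store transactions, which are not modelled here.
struct LogEntry {
    std::string oid;
    bool has_omap_update;
};

struct TrimStats {
    int entries_trimmed = 0;
    int omap_updates_applied = 0;
};

// Base class: performs the standard trim work but skips the EC omap step.
class TrimmerPostRemove {
public:
    virtual ~TrimmerPostRemove() = default;

    void trim(const LogEntry& e, TrimStats& stats) {
        ++stats.entries_trimmed;  // standard PG log cleanup
        ec_omap(e, stats);        // hook: a no-op unless overridden
    }

protected:
    virtual void ec_omap(const LogEntry&, TrimStats&) {}
};

// Normal trimmer: additionally applies the entry's omap update to the store.
class Trimmer : public TrimmerPostRemove {
protected:
    void ec_omap(const LogEntry& e, TrimStats& stats) override {
        if (e.has_omap_update) ++stats.omap_updates_applied;
    }
};

// Trim entries for an object that has just been removed: no omap writes.
TrimStats trim_after_remove(const std::vector<LogEntry>& entries) {
    TrimmerPostRemove trimmer;
    TrimStats stats;
    for (const auto& e : entries) trimmer.trim(e, stats);
    return stats;
}

// Hypothetical counterpart for normal PG log maintenance.
TrimStats trim_normal(const std::vector<LogEntry>& entries) {
    Trimmer trimmer;
    TrimStats stats;
    for (const auto& e : entries) trimmer.trim(e, stats);
    return stats;
}
```

Both paths trim the same number of log entries; only the normal path counts omap updates as applied.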

REPLACE Operation Type

To properly support the object_state_map mechanism, a new REPLACE operation type has been added to pg_log_entry_t. This operation type is used in place of the MODIFY operation type for copy_from operations where an object is deleted and recreated.

Why REPLACE is Necessary:

The copy_from operation atomically deletes an existing object and recreates it with new content. Without the REPLACE operation type, this would be logged as a simple MODIFY operation, which would not trigger the journal to track the deletion in the object_state_map. This creates a critical problem:

  • If omap updates are in flight when a copy_from occurs, they could be applied to the wrong generation of the object

  • The journal would not know that the object was deleted and recreated

  • Outstanding omap updates would target the old generation, causing data corruption or inconsistency

How REPLACE Works:

When a REPLACE operation is logged:

  1. The journal appends a delete to the object_state_map, recording the version number

  2. This ensures that any outstanding omap updates will use the correct generation number

  3. The object is then recreated with new content

  4. When the delete is eventually trimmed from the PG log, the version is removed from the object_state_map

This approach ensures that copy_from operations, which are commonly used for object cloning and migration, correctly interact with the omap journal’s generation tracking mechanism. Without REPLACE, the object_state_map would not be aware of the implicit delete, leading to potential data corruption when omap updates are applied to recreated objects.

Two-Phase Update Design

For EC pools, omap updates are persisted in PG log entries first and are then applied to the object store once all copies have been updated and the transaction can no longer be rolled back. This two-phase update approach is more efficient than reading and saving the old omap data in case the transaction has to be rolled back.

To avoid omap reads having to search PG log entries for recent updates, the ECOmapJournal tracks updates that have not yet been applied to the object store in memory. The ECOmapJournal provides a fast way of locating recent omap updates, ensuring efficient read operations while updates are in flight.

Journal entries contain:

  • A list of omap updates, each containing:
    • Operation type (set, remove, clear)

    • Bufferlist containing details about the operation (e.g. key/value pairs)

  • An optional omap header

  • A ‘clear omap’ boolean

  • The object version

The journals on primary-capable shards that are not the primary shard store the object deletion information, but not the omap updates. This allows for updates to be committed to the correct object generation.

Journal Persistence and Peering

The ECOmapJournal does not need to be persistent because the updates are also stored in the PG log entries. The journal is short-lived and volatile, containing only entries for in-flight writes that are updating an omap. Whenever a new peering interval starts, the journal is discarded. After any disruption, the Peering process will roll forward or backward each outstanding entry in the PG log so the object store will be up to date, eliminating the need for complicated reconciliation of the log and journal.

Commit Protocol

The omap commit protocol ensures that updates are applied consistently across all primary-capable shards. The protocol is divided into two phases: Apply Update and Complete Update.

Apply Update Phase

This phase is entered when the PG needs to apply omap updates. The primary adds the updates to its journal and replicates them via the PG log.

Complete Update Phase

This phase is triggered later, during PG log trim. The primary applies updates to its object store and coordinates completion across all shards.

Protocol Steps

Apply Update Phase:

  1. Primary adds the omap updates to its local ECOmapJournal as an ECOmapJournalEntry

  2. Primary encodes the omap updates into the PG log as a PG log entry

  3. Primary sends PG log entry to all other primary-capable shards

  4. Each shard stores the PG log entry in its local PG log

Complete Update Phase:

  1. Primary applies the omap updates from a PG log entry to its own object store

  2. Primary removes the corresponding journal entry

  3. Primary sends “Complete Update” messages to all primary-capable shards

  4. Each shard applies the update from its PG log to its object store

  5. Each shard sends an ACK back to the primary

This journal approach means that if a write fails before it is completed, there is nothing to roll back in the object stores. It is therefore not necessary to read and store old omap data just in case an update needs to be undone.
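The two phases can be modelled with a small in-memory simulation. This is a sketch under simplifying assumptions: messages and ACKs are replaced by direct function calls, only omap "set" updates are modelled, and all type and function names (`Shard`, `Primary`, `apply_update`, `complete_update`, `roll_back`) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Omap = std::map<std::string, std::string>;
using Version = std::uint64_t;

struct Shard {
    std::map<Version, Omap> pg_log;  // logged but not yet applied updates
    Omap store;                      // omap persisted in the object store
};

struct Primary {
    Shard local;
    std::map<Version, Omap> journal;  // volatile, in-memory only
};

// Applies one logged update to the shard's store and drops the log entry.
void apply_from_log(Shard& s, Version v) {
    auto it = s.pg_log.find(v);
    if (it == s.pg_log.end()) return;
    for (const auto& kv : it->second) s.store[kv.first] = kv.second;
    s.pg_log.erase(it);
}

// Apply Update phase: journal on the primary, PG log entry on every shard.
void apply_update(Primary& p, std::vector<Shard>& others,
                  Version v, const Omap& sets) {
    p.journal[v] = sets;
    p.local.pg_log[v] = sets;
    for (auto& s : others) s.pg_log[v] = sets;
}

// Complete Update phase: stores are only modified here.
void complete_update(Primary& p, std::vector<Shard>& others, Version v) {
    apply_from_log(p.local, v);
    p.journal.erase(v);
    for (auto& s : others) apply_from_log(s, v);
}

// A failed write before completion only discards log and journal entries.
void roll_back(Primary& p, std::vector<Shard>& others, Version v) {
    p.journal.erase(v);
    p.local.pg_log.erase(v);
    for (auto& s : others) s.pg_log.erase(v);
}
```

Because the object stores are untouched until `complete_update`, `roll_back` never needs a saved copy of the old omap values.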

Recovery and Consistency

Omap recovery is integrated into the existing EC recovery loop. When a primary-capable shard recovers:

  • The recovering shard receives omap data from the current primary or another primary-capable shard

  • Omap data is transferred as part of the normal EC recovery process

During primary failover:

  • The new primary (which must be a primary-capable shard) already has a complete copy of the omap data

  • A new journal is initialized on the new primary

  • Operations can continue without data loss

This integration with existing recovery mechanisms simplifies the implementation and ensures consistency with EC pool recovery behavior.

Omap Operations

Supported Operations

The following omap operations are supported in EC pools:

  • omap_get_keys: Retrieve all keys in the omap

  • omap_get_vals: Retrieve all key-value pairs in the omap

  • omap_get_vals_by_keys: Retrieve specific key-value pairs

  • omap_set: Set one or more key-value pairs

  • omap_rm_keys: Remove one or more keys

  • omap_clear: Remove all key-value pairs

  • omap_get_header: Retrieve the omap header

  • omap_set_header: Set the omap header

  • omap_cmp: Compare omap values with other values

These operations provide the same semantics as in replicated pools, ensuring compatibility with existing applications.

Read Operation Flow

Read operations are served from the primary OSD by:

  1. Reading the stored omap data from the local replica

  2. Applying any pending updates from the ECOmapJournal on top of the stored omap

  3. Returning the combined result to the client

Using a journal means that there is a lag between an omap update and the update being applied to the object store. Therefore, it is important that modifications in the journal are considered during client omap reads, to ensure that the correct data is returned.

The journal updates are applied in-memory during the read operation, providing low-latency access to the omap data while maintaining consistency.
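The overlay of pending journal updates on top of stored omap data can be sketched as below. The operation types (set, remove, clear) come from the journal entry description; the `OmapUpdate` struct with already-decoded fields and the `read_omap` function are assumptions, since the real entries carry an encoded bufferlist.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Omap = std::map<std::string, std::string>;

enum class OmapOp { Set, Remove, Clear };

// Hypothetical decoded form of a journal update.
struct OmapUpdate {
    OmapOp op;
    Omap sets;                      // used by Set
    std::vector<std::string> keys;  // used by Remove
};

// Returns the client-visible omap: stored data with pending journal
// updates applied in order. The stored omap itself is not modified.
Omap read_omap(const Omap& stored, const std::vector<OmapUpdate>& journal) {
    Omap view = stored;
    for (const auto& u : journal) {
        switch (u.op) {
        case OmapOp::Set:
            for (const auto& kv : u.sets) view[kv.first] = kv.second;
            break;
        case OmapOp::Remove:
            for (const auto& k : u.keys) view.erase(k);
            break;
        case OmapOp::Clear:
            view.clear();
            break;
        }
    }
    return view;
}
```

Applying the updates to a copy keeps the read path free of side effects: the object store is only changed later, when the update is completed.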

Journal Overhead

The journal introduces some performance overhead:

  • Journal Updates: Each omap update requires the journal to be updated

  • Latency: Omap operations require the primary OSD to check the journal for updates

  • Storage: Journal entries consume memory on the primary OSD

However, this overhead is acceptable given the consistency guarantees provided. Performance testing will quantify the impact and guide optimisation efforts.

Replication Impact

Replicating omap data across primary-capable shards has performance implications:

  • Network Traffic: Updates generate network traffic to multiple shards

  • Storage: Each primary-capable shard stores a complete omap replica

  • CPU: Applying updates on multiple shards consumes CPU

These costs are offset by the benefits of high availability and fast reads.

Crimson-Specific Considerations

The Crimson implementation will need to:

  • Implement omap support using Crimson’s asynchronous architecture

  • Integrate with Crimson’s seastar-based I/O framework

  • Adapt the journal mechanism to Crimson’s storage backend

  • Ensure compatibility with the classic OSD implementation

Synchronous Reads

Motivation

Synchronous read operations are required to support Cls operations in EC pools. Cls methods must execute synchronously, meaning they must complete a read operation and receive the data before proceeding with their logic. The traditional asynchronous read path in the EC backend does not provide this capability.

Additionally, synchronous reads are beneficial for:

  • Simplifying certain code paths that require sequential operations

  • Enabling synchronous semantics without blocking threads

  • Supporting future features that require synchronous data access

The key challenge is implementing synchronous semantics without blocking threads, which would harm performance and scalability.

Implementation Design

The synchronous read implementation uses Boost pull-type coroutines to provide synchronous semantics without blocking threads:

  • A Boost coroutine is created for the synchronous read operation

  • The coroutine initiates an asynchronous read and yields control

  • When the asynchronous read completes, the coroutine is resumed

  • The coroutine returns the read data to the caller

This approach provides the synchronous semantics required by Cls operations while maintaining the performance benefits of asynchronous I/O and avoiding thread blocking.

Boost Pull-Type Coroutines

The implementation uses Boost.Coroutine2 pull-type coroutines to bridge asynchronous and synchronous code. The coroutine:

  • Yields control back to the caller when waiting for I/O

  • Allows the thread to process other work while I/O is in progress

  • Resumes execution when the asynchronous operation completes

  • Provides synchronous semantics to the Cls operation

This mechanism allows Cls operations to be written in a straightforward, synchronous style while the underlying I/O remains asynchronous and non-blocking.

Integration with EC Backend

The synchronous read path integrates with the existing EC backend:

  • The ECSwitch routes synchronous reads to the EC Backend

  • The handler uses the existing asynchronous read infrastructure

  • Results are returned synchronously to the caller using the coroutine

This integration minimizes code duplication and leverages existing, well-tested read logic.

Synchronous Operation Ordering Guarantees

Synchronous operations provide strong ordering guarantees:

  • Synchronous reads block concurrent operations to the same object

  • Multiple synchronous reads to the same object are serialized

Performance Impact

Latency Considerations

Synchronous reads introduce some latency overhead compared to pure asynchronous operations:

  • Coroutine Overhead: Creating and resuming coroutines has a small CPU cost

  • Context Switching: Yielding and resuming coroutines involves context switching overhead

However, these overheads are minimal compared to the I/O latency itself. In practice, the latency impact will be negligible for most workloads.

Throughput Implications

The throughput impact of synchronous reads depends on the workload:

  • Cls-Heavy Workloads: Workloads with many Cls operations may see some impact, but the coroutine approach minimizes this

  • Mixed Workloads: Workloads with a mix of synchronous and asynchronous operations will see minimal impact

  • Pure Data Workloads: Workloads without Cls operations will see little to no impact

Performance testing will quantify these impacts and guide optimisation efforts.

Comparison with Async Operations

Synchronous reads are not intended to replace asynchronous operations. They serve a specific purpose for Cls operations and other use cases requiring synchronous semantics. The EC backend continues to use asynchronous reads for all other operations, maintaining optimal performance for the common case.

Cls, RBD, RGW and CephFS Support

Cls (Class) Support

Background

Cls (class) operations are server-side methods that execute on OSDs, enabling efficient data processing without client round-trips. Examples include:

  • RBD Operations: Image metadata management, snapshot operations

  • RGW Operations: Bucket index operations, object tagging

  • Custom Operations: User-defined server-side processing

Cls operations are inherently synchronous - they must read data, process it, and potentially write results, all within a single operation context. This synchronous nature is why they require synchronous read support in the EC backend.

Current Limitations in EC Pools

Without synchronous read support, Cls operations could not be implemented in EC pools. This prevented:

  • RBD from using EC pools for image storage

  • RGW from using EC pools for certain bucket operations

  • Custom Cls methods from working with EC pools

Enabling Cls in EC Pools

Technical Requirements

Enabling Cls in EC pools requires:

  1. Synchronous Read Support: Implemented via coroutines as described above

  2. Omap Support: Many Cls operations require omap for metadata storage

Both of these requirements are met by the features described in this document.

Integration with Synchronous Reads

Cls operations use synchronous reads through a straightforward integration:

// Simplified Cls operation using synchronous reads
int cls_method(cls_method_context_t hctx)
{
    bufferlist bl;

    // Synchronous read - returns once data is available
    int r = cls_cxx_read(hctx, 0, 1024, &bl);
    if (r < 0)
        return r;

    // Process data
    process_data(bl);

    // Write results back at offset 0
    // (cls_cxx_write takes an offset, a length and a bufferlist pointer)
    return cls_cxx_write(hctx, 0, bl.length(), &bl);
}

The cls_cxx_read function internally uses the synchronous read path, suspending the coroutine until data is available.

Integration with Omap Support

Many Cls operations require omap access for metadata:

// Cls operation using omap
int cls_method_with_omap(cls_method_context_t hctx)
{
    map<string, bufferlist> vals;
    bool more = false;  // set to true if more than 100 entries exist

    // Read up to 100 omap values, starting from the first key
    int r = cls_cxx_map_get_vals(hctx, "", "", 100, &vals, &more);
    if (r < 0)
        return r;

    // Process metadata
    process_metadata(vals);

    // Update omap
    return cls_cxx_map_set_vals(hctx, &vals);
}

The omap operations work seamlessly with Cls, providing the metadata storage required for complex operations.

RBD Support

RBD (RADOS Block Device) is a primary beneficiary of Cls support in EC pools. RBD uses Cls operations extensively for:

  • Image metadata management

  • Snapshot operations

  • Clone operations

  • Exclusive lock management

With Cls support, RBD can now use EC pools for metadata. This gives users more flexibility in how they deploy RBD, including the option to use a single EC pool for both data and metadata.

RGW Support

RGW (RADOS Gateway) benefits immensely from omap and Cls support in EC pools, as it heavily relies on these features for S3 and Swift object storage semantics. Traditionally, RGW required separate replicated pools for metadata and bucket indices. RGW uses omap and Cls operations extensively for:

  • Bucket index management (tracking objects within buckets)

  • Multipart upload state tracking and assembly

  • User quota and usage tracking

  • Object extended attributes and custom metadata

With native omap and synchronous read support in EC pools, plus a few changes to remove current restrictions, users will be able to use EC pools as metadata pools in RGW.

CephFS Support

CephFS (Ceph File System) requires robust omap support and strictly consistent reads to maintain POSIX-compliant file system semantics. Historically, the MDS (Metadata Server) required a dedicated replicated pool for its metadata backing store. CephFS relies on omap and synchronous operations for:

  • Directory object management (storing dentries as omap key-value pairs)

  • MDS journal and log storage

  • File extended attributes (xattrs) and layout metadata

  • Inode state management and lock tracking

With omap and synchronous reads available in EC pools, plus a few changes to remove current restrictions, users will be able to use an EC pool as the metadata pool in CephFS.

Testing

Test Strategy Overview

The testing strategy for these features is comprehensive and multi-layered:

  • Omap Journal Unit Tests: Test the functionality of the journal in isolation

  • Omap Integration Tests: Test omap operations and recovery in a full rados cluster

  • Cls Integration Tests: Test cls method calls in a full rados cluster

  • EC Omap in Teuthology Tests: Allow omap operations in EC pools with ceph_test_rados

  • RBD in Teuthology Tests: Test RBD in teuthology with an EC metadata pool, and a single EC pool for data and metadata

  • RGW in Teuthology Tests: Test RGW in teuthology using EC metadata pools

  • CephFS in Teuthology Tests: Test CephFS in teuthology using an EC metadata pool

A key aspect of the testing strategy is the use of common test fixtures that enable running existing tests on both replicated and FastEC pools.

Common Test Class Approach

A common test class has been implemented that:

  • Provides a unified interface for test cases

  • Supports both replicated and FastEC pools

  • Allows existing test suites to run on EC pools with minimal modifications

  • Ensures consistent test coverage across pool types

  • Reduces code duplication

This approach significantly increases test coverage while minimizing test development effort.

Existing Test Suite Integration

Integration with existing test suites includes:

  • Running existing Cls tests on EC pools

  • Enabling omap operations in EC pools for ceph_test_rados

  • Changing all uses of RBD, RGW and CephFS to use just an EC pool in Teuthology

This integration ensures comprehensive coverage with minimal new test development.

Migration and Compatibility

Release Requirements

Umbrella Release Requirement

The ability to enable omap support on EC pools will be tied to the Umbrella release. This introduces important version requirements:

  • All OSDs must be running at least the Umbrella release before omap support can be enabled on any EC pool

  • Any OSDs added to the cluster in the future must also be running at least the Umbrella release

  • Attempting to enable omap support on a cluster with pre-Umbrella OSDs will fail with an error

This requirement ensures that all OSDs in the cluster have the necessary code to support omap operations on EC pools, preventing data corruption or inconsistencies that could arise from version mismatches.

Upgrade Path

Enabling Features on Existing Pools

Existing EC pools can be upgraded to support the new features:

  1. Upgrade all OSDs to the Umbrella release or later

  2. Verify that all OSDs in the cluster are at the required version

  3. Enable EC overwrites on the pool (if not already enabled)

  4. Enable EC optimisations on the pool (if not already enabled)

  5. Enable omap support on the pool (if desired)

Existing data remains accessible throughout the upgrade. The upgrade process is designed to be non-disruptive, but the version requirement must be strictly enforced.

Backward Compatibility

Backward compatibility is maintained with important caveats:

  • Pools without omap support continue to work as before on any version

  • Clients that don’t use the new features are unaffected

  • Downgrade is not supported once omap has been enabled on an EC pool, as pre-Umbrella OSDs cannot handle omap data on EC pools

  • Pools with omap enabled require all OSDs to remain at Umbrella or later

This compatibility ensures that upgrades are safe, but downgrades are restricted to protect data integrity.

Configuration

Required Settings

To enable the new features, the following OSDMap pool settings are required:

  • allows_ec_overwrites = true

  • allows_ec_optimizations = true

  • supports_omap = true

These settings can be configured per-pool. The cluster will enforce that all OSDs are at the Umbrella release before allowing omap support to be enabled.
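The enforcement described above might look roughly like the following check. This is a sketch only: the `Release` enum, the `PoolFlags` struct and `try_enable_omap` are hypothetical, the field names simply mirror the settings listed above, and treating EC overwrites and EC optimisations as prerequisites of omap support is an assumption based on the order of the upgrade steps.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative release ordering, oldest to newest.
enum class Release { Squid, Tentacle, Umbrella };

// Field names mirror the pool settings listed above.
struct PoolFlags {
    bool allows_ec_overwrites = false;
    bool allows_ec_optimizations = false;
    bool supports_omap = false;
};

// Returns an empty string on success, or the reason the request is refused.
std::string try_enable_omap(PoolFlags& pool,
                            const std::vector<Release>& osd_releases) {
    for (Release r : osd_releases) {
        if (r < Release::Umbrella) {
            return "all OSDs must run Umbrella or later";
        }
    }
    if (!pool.allows_ec_overwrites) {
        return "EC overwrites must be enabled first";
    }
    if (!pool.allows_ec_optimizations) {
        return "EC optimizations must be enabled first";
    }
    pool.supports_omap = true;
    return "";
}
```

A cluster containing even one pre-Umbrella OSD is refused, and the pool flag is left unchanged in that case.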
