======================================================
Support for RBD, RGW and CephFS in Erasure Coded Pools
======================================================

Introduction
============

This document covers the design for enabling omap (object map) support
and synchronous read operations in erasure-coded pools.
These enhancements enable EC pools to support Cls methods, as well as
RBD, RGW, and CephFS workloads without the need for a separate replica
pool for metadata.

Current Limitations
-------------------

Erasure-coded pools have previously been limited in their support for
metadata operations. Specifically:

- **Omap operations** (key-value metadata storage on objects) were not
  supported, limiting the use of EC pools for workloads requiring metadata.
- **Cls operations** (server-side object class methods) were not available,
  preventing RBD and other advanced features from working with EC pools.
- **Synchronous read operations** were not implemented in the EC backend,
  which are required for Cls operations to function correctly.

These limitations prevented EC pools from being used for many important
workloads, particularly RBD (RADOS Block Device) which relies heavily on both
omap and Cls operations.

Feature Relationships
---------------------

The two main features in this design are independent but complementary:

- **Omap Support**: Enables key-value metadata storage on EC pool objects
  through replication across primary-capable shards with journal-based updates
  managed by the primary OSD.
- **Synchronous Reads**: Provides synchronous read semantics in the EC backend
  using Boost pull-type coroutines, enabling synchronous operations without
  blocking threads.

Together, these features enable full support for RBD, RGW and CephFS
on erasure-coded pools.


Omap Support for EC Pools
==========================

Current Limitations
-------------------

In the original EC pool implementation, omap operations were not supported due
to the complexity of maintaining consistent key-value metadata across erasure-
coded shards. Unlike replicated pools where each replica maintains a complete
copy of the omap data, EC pools distribute data across multiple shards, making
metadata management more complex.

The primary challenges include:

- Ensuring consistency of metadata across shards
- Handling partial updates and failures
- Maintaining performance for metadata operations
- Supporting recovery and reconstruction scenarios

Design Approach
---------------

The omap implementation for EC pools uses a replication-based approach:

- Omap data is **replicated** across all primary-capable shards in a PG
- A **journal** is used to store omap updates before they are committed
- Updates are applied atomically across all primary-capable shards
- Consistency is maintained through the journal commit protocol

This approach provides:

- Strong consistency guarantees for metadata operations
- Efficient recovery through journal replay
- Compatibility with existing omap APIs
- Minimal impact on data path performance

Omap Architecture
-----------------

Shard Distribution and Primary-Capable Shards
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In an erasure-coded pool with k data shards and m parity shards, the
primary-capable shards are:

- The first data shard
- All m parity shards

This means there are **m + 1 primary-capable shards** in total. For example,
in a k=4, m=2 configuration, shards 0, 4, and 5 are primary-capable.

.. ditaa::

   EC Pool (k=4, m=2)

   +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
   | Shard  |  | Shard  |  | Shard  |  | Shard  |  | Shard  |  | Shard  |
   |   0    |  |   1    |  |   2    |  |   3    |  |   4    |  |   5    |
   | (Data) |  | (Data) |  | (Data) |  | (Data) |  | (Parity|  | (Parity|
   |PRIMARY |  |        |  |        |  |        |  |PRIMARY |  |PRIMARY |
   |CAPABLE |  |        |  |        |  |        |  |CAPABLE |  |CAPABLE |
   +--------+  +--------+  +--------+  +--------+  +--------+  +--------+
       |                                               |           |
       v                                               v           v
   +--------+                                      +--------+  +--------+
   |  Omap  |                                      |  Omap  |  |  Omap  |
   | Replica|                                      | Replica|  | Replica|
   +--------+                                      +--------+  +--------+

   Primary-capable shards (0, 4, 5) maintain omap replicas

Each primary-capable shard maintains a complete copy of the omap data for
objects in the PG. This replication ensures that omap data remains available
even if some shards fail, and allows any primary-capable shard to serve omap
read requests when acting as primary.

Journal Implementation
~~~~~~~~~~~~~~~~~~~~~~

The ECOmapJournal is maintained on the primary OSD. The journal:

- Records all omap updates before they are applied
- Ensures atomic application of updates across all primary-capable replicas
- Enables recovery in case of failures during update operations
- Provides a consistent view of omap state during recovery
- Handles object deletion and recreation

Object State Map
^^^^^^^^^^^^^^^^

The journal maintains an ``object_state_map`` to track objects that are in the
process of being deleted. This map is critical for ensuring that omap updates
are written to the correct object generation when objects are deleted and
recreated.

The object_state_map:

- **Tracks outstanding deletes**: When a delete operation is appended to the
  journal, the object's version number is added to the map along with a boolean
  indicating whether it's a lost delete
- **Manages version lifecycle**: The version number remains in the map until
  the delete is trimmed from the PG log
- **Determines generation numbers**: The version number is used to calculate
  the generation number for any outstanding omap updates, ensuring updates are
  applied to the correct object generation
- **Handles object recreation**: If an object is deleted and then recreated
  before the delete is trimmed, the map ensures omap updates target the
  appropriate generation

When ``get_generation()`` is called for an object, it returns:

- The lowest version number from the object_state_map if any deletes are
  outstanding
- A boolean indicating whether the delete was lost
- ``NO_GEN`` if no deletes are outstanding for the object

This mechanism prevents omap updates from being applied to the wrong generation
of an object, which could occur if an object is deleted and recreated while
omap updates are still in flight.

**Operations Using append_delete/trim_delete:**

The object_state_map's ``append_delete`` and ``trim_delete`` sequence is used
by several operations that involve object deletion and recreation:

- **Explicit deletes**: Direct object deletion operations
- **REPLACE operations**: ``copy_from`` operations that atomically delete and
  recreate objects (see below)
- **Clone operations to non-snapshot objects**: When cloning to a non-snapshot
  object, the target object is effectively deleted and recreated with the
  cloned content, requiring the same generation tracking as other
  delete-and-recreate operations

All of these operations follow the same pattern: a delete is appended to the
object_state_map when the operation is logged, and the version is trimmed when
the PG log entry is eventually removed. This ensures consistent generation
tracking regardless of which operation causes the object lifecycle transition.

**Clone Operations and Outstanding Omap Updates:**

Clone operations require special handling to ensure that outstanding omap
updates are properly applied to the cloned object. When a clone operation is
performed:

#. A visitor pattern is used to traverse the PG log and accumulate all
   outstanding omap updates for the source object
#. These accumulated omap updates are collected from journal entries that have
   not yet been applied to the object store
#. All accumulated updates are then applied to the clone transaction, ensuring
   the cloned object receives the complete, up-to-date omap state
#. This process ensures that the clone includes not just the omap data from the
   object store, but also any in-flight updates that exist only in the journal

This visitor-based approach is necessary because the journal may contain omap
updates that have been logged but not yet applied to the source object's
persistent storage. Without accumulating these updates, the clone would have
stale omap data, missing recent modifications. By applying all outstanding
updates to the clone transaction, the system ensures that clones are created
with a consistent and complete view of the object's omap state, including both
persisted data and in-flight journal entries.

**Trimmer Architecture for Post-Removal Scenarios:**

The EC omap implementation uses a trimmer to determine when to apply journal
updates to the underlying BlueStore. However, a special case arises when an
object has just been deleted: we must avoid applying omap updates to an object
that no longer exists. To handle this, a specialized trimmer architecture has
been implemented:

- **TrimmerPostRemove**: A base class that performs all standard trimming
  actions except EC omap operations. This trimmer handles PG log cleanup
  without attempting to apply omap updates to the object store.

- **Trimmer**: Inherits from TrimmerPostRemove and overrides the ``ec_omap``
  method to add EC omap update application. This is the standard trimmer used
  during normal PG log maintenance.

- **trim_after_remove()**: A function that uses TrimmerPostRemove to trim the
  PG log immediately after an object has been removed. This is called in
  PGLog.h after a remove operation (``rollbacker->trim(i)``).

The key insight is that when trimming after a remove operation, we want to
clean up the PG log entries but we must not apply any omap updates from those
entries to BlueStore, since the object has just been deleted. By using
TrimmerPostRemove (which skips the ``ec_omap`` step), the system ensures that:

#. PG log entries are properly trimmed after object removal
#. Omap updates from those entries are not applied to the now-deleted object
#. The journal's object_state_map is updated appropriately via trim_delete
#. Normal trimming (using the full Trimmer class) continues to apply omap
   updates for objects that still exist

This design prevents attempting to write omap data to deleted objects while
maintaining proper journal cleanup and state tracking.

REPLACE Operation Type
^^^^^^^^^^^^^^^^^^^^^^

To properly support the object_state_map mechanism, a new ``REPLACE`` operation
type has been added to ``pg_log_entry_t``. This operation type is used in place
of the ``MODIFY`` operation type for ``copy_from`` operations where an object
is deleted and recreated.

**Why REPLACE is Necessary:**

The ``copy_from`` operation atomically deletes an existing object and recreates
it with new content. Without the REPLACE operation type, this would be logged
as a simple MODIFY operation, which would not trigger the journal to track the
deletion in the object_state_map. This creates a critical problem:

- If omap updates are in flight when a ``copy_from`` occurs, they could be
  applied to the wrong generation of the object
- The journal would not know that the object was deleted and recreated
- Outstanding omap updates would target the old generation, causing data
  corruption or inconsistency

**How REPLACE Works:**

When a REPLACE operation is logged:

#. The journal appends a delete to the object_state_map, recording the version
   number
#. This ensures that any outstanding omap updates will use the correct
   generation number
#. The object is then recreated with new content
#. When the delete is eventually trimmed from the PG log, the version is
   removed from the object_state_map

This approach ensures that ``copy_from`` operations, which are commonly used
for object cloning and migration, correctly interact with the omap journal's
generation tracking mechanism. Without REPLACE, the object_state_map would not
be aware of the implicit delete, leading to potential data corruption when omap
updates are applied to recreated objects.

Two-Phase Update Design
^^^^^^^^^^^^^^^^^^^^^^^

For EC pools, omap updates are persisted in PG log entries first and are then
applied to the object store once all copies have been updated and the transaction
can no longer be rolled back. This two-phase update approach is more efficient
than reading and saving the old omap data in case the transaction has to be
rolled back.

To avoid omap reads having to search PG log entries for recent updates, the
ECOmapJournal tracks updates that have not yet been applied to the object store
in memory. The ECOmapJournal provides a fast way of locating recent omap updates,
ensuring efficient read operations while updates are in flight.

Journal entries contain:

- List of omap Updates:
    - Operation type (set, remove, clear)
    - Bufferlist containing details about the operation (e.g. key/value pairs)
- An optional omap header
- A 'clear omap' boolean
- The object version

The journals on primary-capable shards that are not the primary shard store
the object deletion information, but not the omap updates. This allows for
updates to be committed to the correct object generation.

Journal Persistence and Peering
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ECOmapJournal does not need to be persistent because the updates are also
stored in the PG log entries. The journal is short-lived and volatile, containing
only entries for in-flight writes that are updating an omap. Whenever a new peering
interval starts, the journal is discarded. After any disruption, the Peering process
will roll forward or backward each outstanding entry in the PG log so the object
store will be up to date, eliminating the need for complicated reconciliation of the
log and journal.

Commit Protocol
~~~~~~~~~~~~~~~

The omap commit protocol ensures that updates are applied consistently across
all primary-capable shards. The protocol is divided into two phases: Apply
Update and Complete Update.

Apply Update Phase
^^^^^^^^^^^^^^^^^^

This is entered when the PG needs to apply omap updates. The primary adds the updates
to its journal and replicates them via the PG log:

.. ditaa::

   Primary                    Primary-Capable           Primary-Capable
                                 Shard 1                   Shard 2
      |                             |                          |
      | Store in PG Log             |                          |
      |------+                      |                          |
      |      |                      |                          |
      |<-----+                      |                          |
      |                             |                          |
      | Add to Journal              |                          |
      |------+                      |                          |
      |      |                      |                          |
      |<-----+                      |                          |
      |                             |                          |
      | Send PG Log Entry           |                          |
      |---------------------------->|                          |
      |                             |                          |
      | Send PG Log Entry           |                          |
      |------------------------------------------------------->|
      |                             |                          |
      |                             | Store in PG Log          |
      |                             |------+                   |
      |                             |      |                   |
      |                             |<-----+                   | Store in PG Log
      |                             |                          |------+
      |                         ACK |                          |      |
      |<----------------------------|                          |<-----+
      |                             |                          |
      |                             |                      ACK |
      |<-------------------------------------------------------|
      |                             |                          |

Complete Update Phase
^^^^^^^^^^^^^^^^^^^^^

Triggered later (during PG log trim). The primary applies updates to
object stores and coordinates completion across all shards:

.. ditaa::

   Primary                    Primary-Capable           Primary-Capable
                                 Shard 1                   Shard 2
      |                             |                          |
      | Apply to Object Store       |                          |
      |------+                      |                          |
      |      |                      |                          |
      |<-----+                      |                          |
      |                             |                          |
      | Remove Journal Entry        |                          |
      |------+                      |                          |
      |      |                      |                          |
      |<-----+                      |                          |
      |                             |                          |
      | Complete Update             |                          |
      |---------------------------->|                          |
      |                             |                          |
      | Complete Update             |                          |
      |------------------------------------------------------->|
      |                             |                          |
      |                             | Apply to Store           |
      |                             |------+                   |
      |                             |      |                   |
      |                             |<-----+                   | Apply to Store
      |                             |                          |------+
      |                         ACK |                          |      |
      |<----------------------------|                          |<-----+
      |                             |                          |
      |                             |                      ACK |
      |<-------------------------------------------------------|
      |                             |                          |

Protocol Steps
^^^^^^^^^^^^^^

**Apply Update Phase:**

#. Primary adds the omap updates to its local ECOmapJournal as an ECOmapJournalEntry
#. Primary encodes the omap updates into the PG log as a PG log entry
#. Primary sends PG log entry to all other primary-capable shards
#. Each shard stores the PG log entry in its local PG log

**Complete Update Phase:**

#. Primary applies the omap updates from a PG log entry to its own object store
#. Primary removes the corresponding journal entry
#. Primary sends "Complete Update" messages to all primary-capable shards
#. Each shard applies the update from its PG log to its object store
#. Each shard sends an ACK back to the primary

This journal approach means that if a write fails before it is completed,
there is nothing to rollback in the object stores. This means that it is
not necessary to read and store old omap data just incase an update needs
to be undone.

Recovery and Consistency
~~~~~~~~~~~~~~~~~~~~~~~~~

Omap recovery is integrated into the existing EC recovery loop. When a
primary-capable shard recovers:

- The recovering shard receives omap data from the current primary or another
  primary-capable shard
- Omap data is transferred as part of the normal EC recovery process

During primary failover:

- The new primary (which must be a primary-capable shard) already has a
  complete copy of the omap data
- A new journal is initialized on the new primary
- Operations can continue without data loss

This integration with existing recovery mechanisms simplifies the
implementation and ensures consistency with EC pool recovery behavior.

Omap Operations
---------------

Supported Operations
~~~~~~~~~~~~~~~~~~~~

The following omap operations are supported in EC pools:

- ``omap_get_keys``: Retrieve all keys in the omap
- ``omap_get_vals``: Retrieve all key-value pairs in the omap
- ``omap_get_vals_by_keys``: Retrieve specific key-value pairs
- ``omap_set``: Set one or more key-value pairs
- ``omap_rm_keys``: Remove one or more keys
- ``omap_clear``: Remove all key-value pairs
- ``omap_get_header``: Retrieve the omap header
- ``omap_set_header``: Set the omap header
- ``omap_cmp``: Compare omap values with other values

These operations provide the same semantics as in replicated pools, ensuring
compatibility with existing applications.

Read Operation Flow
~~~~~~~~~~~~~~

Read operations follow a simple flow:

.. ditaa::

   Client                Primary
      |                     |
      | Omap Read           |
      |-------------------->|
      |                     |
      |                     | Read Local Omap
      |                     |------+
      |                     |      |
      |                     |<-----+
      |                     |
      |                     | Apply Journal Updates
      |                     |------+
      |                     |      |
      |                     |<-----+
      |                     |
      | Return Data         |
      |<--------------------|

Read operations are served from the primary OSD by:

#. Reading the stored omap data from the local replica
#. Applying any pending updates from the ECOmapJournal on top of the stored
   omap
#. Returning the combined result to the client

Using a journal means that there is a lag between an omap update and the
update being applied to the object store. Therefore, it is important that
modifications in the journal are considered during client omap reads, to
ensure that the correct data is returned.

The journal updates are applied in-memory during the read
operation, providing low-latency access to the omap data while maintaining
consistency.

Journal Overhead
~~~~~~~~~~~~~~~~

The journal introduces some performance overhead:

- **Journal Updates**: Each omap update requires the journal to be updated
- **Latency**: Omap operations require the primary osd to check the journal for
  updates
- **Storage**: Journal entries consume memory on the primary osd

However, this overhead is acceptable given the consistency guarantees provided.
Performance testing will quantify the impact and guide optimisation efforts.

Replication Impact
~~~~~~~~~~~~~~~~~~

Replicating omap data across primary-capable shards has performance
implications:

- **Network Traffic**: Updates generate network traffic to multiple shards
- **Storage**: Each primary-capable shard stores a complete omap replica
- **CPU**: Applying updates on multiple shards consumes CPU

These costs are offset by the benefits of high availability and fast reads.

Crimson-Specific Considerations
--------------------------------

The Crimson implementation will need to:

- Implement omap support using Crimson's asynchronous architecture
- Integrate with Crimson's seastar-based I/O framework
- Adapt the journal mechanism to Crimson's storage backend
- Ensure compatibility with the classic OSD implementation


Synchronous Reads
=================

Motivation
----------

Synchronous read operations are required to support Cls operations in EC pools.
Cls methods must execute synchronously, meaning they must complete a read
operation and receive the data before proceeding with their logic. The
traditional asynchronous read path in the EC backend does not provide this
capability.

Additionally, synchronous reads are beneficial for:

- Simplifying certain code paths that require sequential operations
- Enabling synchronous semantics without blocking threads
- Supporting future features that require synchronous data access

The key challenge is implementing synchronous semantics without blocking
threads, which would harm performance and scalability.

Implementation Design
---------------------

The synchronous read implementation uses Boost pull-type coroutines to provide
synchronous semantics without blocking threads:

- A Boost coroutine is created for the synchronous read operation
- The coroutine initiates an asynchronous read and yields control
- When the asynchronous read completes, the coroutine is resumed
- The coroutine returns the read data to the caller

This approach provides the synchronous semantics required by Cls operations
while maintaining the performance benefits of asynchronous I/O and avoiding
thread blocking.

Boost Pull-Type Coroutines
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The implementation uses Boost.Coroutine2 pull-type coroutines to bridge
asynchronous and synchronous code. The coroutine:

- Yields control back to the caller when waiting for I/O
- Allows the thread to process other work while I/O is in progress
- Resumes execution when the asynchronous operation completes
- Provides synchronous semantics to the Cls operation

This mechanism allows Cls operations to be written in a straightforward,
synchronous style while the underlying I/O remains asynchronous and
non-blocking.

Integration with EC Backend
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The synchronous read path integrates with the existing EC backend:

- The ECSwitch routes synchronous reads to the EC Backend
- The handler uses the existing asynchronous read infrastructure
- Results are returned synchronously to the caller using the coroutine

This integration minimizes code duplication and leverages existing, well-tested
read logic.

Synchronous Operation Ordering Guarantees
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Synchronous operations provide strong ordering guarantees:

- Synchronous reads block concurrent operations to the same object
- Multiple synchronous reads to the same object are serialized


Performance Impact
------------------

Latency Considerations
~~~~~~~~~~~~~~~~~~~~~~

Synchronous reads introduce some latency overhead compared to pure asynchronous
operations:

- **Coroutine Overhead**: Creating and resuming coroutines has a small CPU cost
- **Context Switching**: Yielding and resuming coroutines involves context
  switching overhead

However, these overheads are minimal compared to the I/O latency itself. In
practice, the latency impact will be negligible for most workloads.

Throughput Implications
~~~~~~~~~~~~~~~~~~~~~~~~

The throughput impact of synchronous reads depends on the workload:

- **Cls-Heavy Workloads**: Workloads with many Cls operations may see some
  impact, but the coroutine approach minimizes this
- **Mixed Workloads**: Workloads with a mix of synchronous and asynchronous
  operations will see minimal impact
- **Pure Data Workloads**: Workloads without Cls operations will see very
  minimal impact

Performance testing will quantify these impacts and guide optimisation efforts.

Comparison with Async Operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Synchronous reads are not intended to replace asynchronous operations. They
serve a specific purpose for Cls operations and other use cases requiring
synchronous semantics. The EC backend continues to use asynchronous reads for
all other operations, maintaining optimal performance for the common case.


Cls, RBD, RGW and CephFS Support
================================

Cls (Class) Support
-------------------

Background
~~~~~~~~~~

Cls (class) operations are server-side methods that execute on OSDs, enabling
efficient data processing without client round-trips. Examples include:

- **RBD Operations**: Image metadata management, snapshot operations
- **RGW Operations**: Bucket index operations, object tagging
- **Custom Operations**: User-defined server-side processing

Cls operations are inherently synchronous - they must read data, process it,
and potentially write results, all within a single operation context. This
synchronous nature is why they require synchronous read support in the EC
backend.

Current Limitations in EC Pools
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Without synchronous read support, Cls operations could not be implemented in
EC pools. This prevented:

- RBD from using EC pools for image storage
- RGW from using EC pools for certain bucket operations
- Custom Cls methods from working with EC pools

Enabling Cls in EC Pools
~~~~~~~~~~~~~~~~~~~~~~~~~

Technical Requirements
^^^^^^^^^^^^^^^^^^^^^^

Enabling Cls in EC pools requires:

#. **Synchronous Read Support**: Implemented via coroutines as described above
#. **Omap Support**: Many Cls operations require omap for metadata storage

All of these requirements are met by the features described in this document.

Integration with Synchronous Reads
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Cls operations use synchronous reads through a straightforward integration:

.. code-block:: cpp

   // Simplified Cls operation using synchronous reads
   int cls_method(cls_method_context_t hctx)
   {
       bufferlist bl;

       // Synchronous read - blocks until data is available
       int r = cls_cxx_read(hctx, 0, 1024, &bl);
       if (r < 0)
           return r;

       // Process data
       process_data(bl);

       // Write results
       return cls_cxx_write(hctx, 0, bl);
   }

The ``cls_cxx_read`` function internally uses the synchronous read path,
suspending the coroutine until data is available.

Integration with Omap Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Many Cls operations require omap access for metadata:

.. code-block:: cpp

   // Cls operation using omap
   int cls_method_with_omap(cls_method_context_t hctx)
   {
       map<string, bufferlist> vals;

       // Read omap values
       int r = cls_cxx_map_get_vals(hctx, "", "", 100, &vals);
       if (r < 0)
           return r;

       // Process metadata
       process_metadata(vals);

       // Update omap
       return cls_cxx_map_set_vals(hctx, &vals);
   }

The omap operations work seamlessly with Cls, providing the metadata storage
required for complex operations.

RBD Support
-----------

RBD (RADOS Block Device) is a primary beneficiary of Cls support in EC pools.
RBD uses Cls operations extensively for:

- Image metadata management
- Snapshot operations
- Clone operations
- Exclusive lock management

With Cls support, RBD can now use EC pools for metadata. This gives the user more flexibility
about how they use RBD, including the option to use a single EC pool for data and metadata.

RGW Support
-----------

RGW (RADOS Gateway) benefits immensely from omap and Cls support in EC pools,
as it heavily relies on these features for S3 and Swift object storage semantics.
Traditionally, RGW required separate replicated pools for metadata and bucket indices.
RGW uses omap and Cls operations extensively for:

- Bucket index management (tracking objects within buckets)
- Multipart upload state tracking and assembly
- User quota and usage tracking
- Object extended attributes and custom metadata

With omap and synchronous read support natively in EC pools, as well as a few tweaks to remove
current restrictions, users will be able to use EC pools as metadata pools in RGW.

CephFS Support
--------------

CephFS (Ceph File System) requires robust omap support and strictly consistent reads
to maintain POSIX-compliant file system semantics.
Historically, the MDS (Metadata Server) required a dedicated replicated pool for its
metadata backing store. CephFS relies on omap and synchronous operations for:

- Directory object management (storing dentries as omap key-value pairs)
- MDS journal and log storage
- File extended attributes (xattrs) and layout metadata
- Inode state management and lock tracking

By bringing omap and synchronous reads to EC pools, as well as a few tweaks to remove 
current retrictions, users will be able to use an EC pool as the metadata pool in CephFS.


Testing
=======

Test Strategy Overview
----------------------

The testing strategy for these features is comprehensive and multi-layered:

- **Omap Journal Unit Tests**: Test the functionality of the journal in isolation
- **Omap Integration Tests**: Test omap operations and recovery in a full rados cluster
- **Cls Integration Tests**: Test cls method calls in a full rados cluster
- **EC Omap in Teuthology Tests**: Allow omap operations in EC pools with ceph_test_rados
- **RBD in Teuthology Tests**: Test RBD in teuthology with an EC metadata pool, and a single EC pool for data and metadata
- **RGW in Teuthology Tests**: Test RGW in teuthology using EC metadata pools
- **CephFS in Teuthology Tests**: Test CephFS in teuthology using an EC metadata pool

A key aspect of the testing strategy is the use of common test fixtures that
enable running existing tests on both replicated and fast EC pools.

Common Test Class Approach
~~~~~~~~~~~~~~~~~~~~~~~~~~~

A common test class has been implemented that:

- Provides a unified interface for test cases
- Supports both replicated and FastEC pools
- Allows existing test suites to run on EC pools with minimal modifications
- Ensures consistent test coverage across pool types
- Reduces code duplication

This approach significantly increases test coverage while minimizing test
development effort.

Existing Test Suite Integration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Integration with existing test suites includes:

- Running existing Cls tests on EC pools
- Enabling omap operations in EC pools for ceph_test_rados
- Changing all uses of RBD, RGW and CephFS to use just an EC pool in Teuthology

This integration ensures comprehensive coverage with minimal new test
development.

Migration and Compatibility
===========================

Release Requirements
--------------------

Umbrella Release Requirement
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ability to enable omap support on EC pools will be tied to the **Umbrella
release**. This introduces important version requirements:

- **All OSDs must be running at least the Umbrella release** before omap
  support can be enabled on any EC pool
- **Any OSDs added to the cluster in the future** must also be running at least
  the Umbrella release
- Attempting to enable omap support on a cluster with pre-Umbrella OSDs will
  fail with an error

This requirement ensures that all OSDs in the cluster have the necessary code
to support omap operations on EC pools, preventing data corruption or
inconsistencies that could arise from version mismatches.

Upgrade Path
------------

Enabling Features on Existing Pools
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Existing EC pools can be upgraded to support the new features:

#. Upgrade all OSDs to the Umbrella release or later
#. Verify that all OSDs in the cluster are at the required version
#. Enable EC overwrites on the pool (if not already enabled)
#. Enable EC optimisations on the pool (if not already enabled)
#. Enable omap support on the pool (if desired)
#. Existing data remains accessible throughout the upgrade

The upgrade process is designed to be non-disruptive, but the version
requirement must be strictly enforced.

Backward Compatibility
~~~~~~~~~~~~~~~~~~~~~~

Backward compatibility is maintained with important caveats:

- Pools without omap support continue to work as before on any version
- Clients that don't use the new features are unaffected
- **Downgrade is not supported** once omap has been enabled on an EC pool, as
  pre-Umbrella OSDs cannot handle omap data on EC pools
- Pools with omap enabled require all OSDs to remain at Umbrella or later

This compatibility ensures that upgrades are safe, but downgrades are
restricted to protect data integrity.

Configuration
-------------

Required Settings
~~~~~~~~~~~~~~~~~

To enable the new features, the following OSDMap pool settings are required:

- ``allows_ec_overwrites = true`
- ``allows_ec_optimizations = true``
- ``supports_omap = true``

These settings can be configured per-pool. The cluster will enforce that all
OSDs are at the Umbrella release before allowing omap support to be enabled.