Notice

This document is for a development version of Ceph.

Design Document: Ceph Erasure Coded (EC) Stretch Cluster

Note

Phased Delivery (R1, R2, …)

This document distinguishes between R1 and later release work. Sections and sub-sections are annotated accordingly. R1 delivers:

Zone-local direct reads (good path only; any failure redirects to the Primary)
Writes via Primary (all write IO traverses the inter-zone link; no Zone Primary encoding)
Recovery from Primary (Primary reads all data and writes all remote-zone shards; no inter-zone bandwidth optimization)

The labels R1, R2, etc. refer to incremental PR delivery milestones — they do not map one-to-one to Ceph named releases. Multiple delivery phases may land within a single Ceph release, or a single phase may span releases depending on development progress.

The implementation order is detailed in Section 14.

Important

Scope and Applicability

This design is an extension to Fast EC (Fast Erasure Coding) only. We make no attempt to support legacy EC implementations with this feature. All functionality described in this document requires Fast EC to be enabled.

1. Intent & High-Level Summary

The primary goal of this feature is to support a Replicated Erasure Coded (EC) configuration within a multi-zone Ceph cluster (Stretch Cluster), building upon the Fast EC infrastructure.

Note

Throughout this document, a zone refers to a group of OSDs within a single CRUSH failure domain — typically a data center, but it could be any CRUSH bucket type (e.g., rack, room). The key characteristic of a zone boundary is that traffic crossing it incurs a significant performance penalty: higher latency and/or lower bandwidth compared to intra-zone communication.

The term inter-zone link refers to the network connection between these failure domains (e.g., the dedicated link between data centers). This is typically the most bandwidth-constrained and latency-sensitive path in the cluster. There is a strong desire to minimize inter-zone link traffic, even to the extent of issuing additional zone-local reads to avoid sending data across this link.

Currently, Ceph Stretch clusters typically rely on Replica to ensure data availability across zones (e.g., data centers). While Erasure Coding provides storage efficiency, applying a standard EC profile naively across a stretch cluster introduces significant operational problems.

Consider a 2-zone stretch cluster using a standard EC profile of k=4, m=6. While the high parity count provides sufficient fault tolerance to survive a full zone loss, this configuration suffers from three fundamental inefficiencies:

Write Bandwidth Waste: Every write must distribute all k+m chunks across both zones. The coding (parity) shards are unnecessarily duplicated over the inter-zone link, consuming bandwidth that scales linearly with the number of shards.
Reads Must Cross Zones: Serving a read requires retrieving at least k chunks, which in general will span both zones. Every read operation incurs inter-zone link latency.
Recovery is Inter-Zone Link Bound: After a zone failure, recovering the lost chunks requires reading surviving chunks across the inter-zone link, placing the entire recovery burden on the most constrained network path.

This feature introduces a hybrid approach to eliminate these problems: creating Erasure Coded stripes (defined by k + m) and replicating those stripes zones times across a specified topology level (e.g., Data Center). Each zone holds a complete, independent copy of the EC stripe, meaning reads and recovery can be performed entirely within a single zone. This combines the zone-local storage efficiency of EC with the zone-level redundancy of a stretch cluster, while minimizing inter-zone link usage to write replication only.

2. Proposed Configuration (CLI) Changes

To support this topology, the EC Pool configuration interface will be expanded directly within the pool creation command. We intend to address a number of issues with the current CLI design:

Create-then-set flow. A typical setup procedure of a pool requires multiple CLI commands (e.g. create replica, then set num copies, then set stretched, etc… )
Remove the need for an EC profile. Indeed, stretch mode EC pools will be mutually exclusive with the use of a profile.

The current CLI behaviour of accepting either positional or non-positional arguments will be maintained, however no positional arguments will be added for the new functionality.

Supporting stretch EC pools will require changes to several CLI commands. The design document will focus on just the changes that will be made to the pool create CLI to give an idea of how the new CLI will work. Similar changes will be made to the CLIs that allow modification of a pool. It also expected that changes will be made to the CLIs that control stretch mode.

2.1 ceph osd pool create

The ceph osd pool create command will be extended to become a parameterized command.

For backward compatibility, the positional arguments will be maintained and can be specified along side the new paramaters.

2.2.1 Full command syntax

The full command syntax is listed here. Refer to the later sections for pool-type specific details.

ceph osd pool create {pool_name}
     [{pg_num} | --pg_num <pg_num>]
     [{pgp_num} | --pgp_num <pgp_num>]
     [{replicated | erasure} | --pool_type <replicated|erasure>]
     [{expected_num_objects} | --expected_num_objects <expected_num_objects>]

     # CRUSH Placement Options (Mutually Exclusive)
     [ {crush_rule_name} | --crush_rule_name <crush_rule_name> |
       [--crush_root <crush_root>] [--zone_failure_domain <zone_failure_domain>] [--osd_failure_domain <osd_failure_domain>] [--crush_device_class <class>] ]

     # Replica-Specific Options
     [--size <size>]

     # Erasure-Specific Options
     [--data_shards <num_data_shards>]
     [--coding_shards <num_coding_shards>]
     [--stripe_unit <stripe_unit>]
     [ {erasure_code_profile} | --profile <profile_name> ]

     # Topology and Redundancy
     [--zones <num_zones>]
     [--min_size <min_size>]

     # Autoscaling and General Config
     [--autoscale_mode=<on,off,warn>]
     [--pg_num_min <pg_num_min>]
     [--pg_num_max <pg_num_max>]
     [--bulk]
     [--target_size_bytes <target_size_bytes>]
     [--target_size_ratio <target_size_ratio>]
     [--crimson]
     [--yes_i_really_mean_it]

     # Legacy Options
     [--stretch_mode]

2.2.2 Legacy positional syntax

The following syntax, taken from the current docs, will be maintained for backward compatibility:

ceph osd pool create {pool_name} [{pg_num} [{pgp_num}]] [replicated] \
         [crush_rule_name] [expected_num_objects]

ceph osd pool create {pool_name} [{pg_num} [{pgp_num}]] erasure \
         [erasure_code_profile] [crush_rule_name] [expected_num_objects] [--autoscale_mode=<on,off,warn>]

This syntax can be used with the new parameters which do not directly conflict. For example --pg_num cannot be used with its positional counterpart.

2.2.1 Basic Parameters

These are the primary parameters required for standard deployments.

--pool_type (or positional equivalent)

Definition: The type of pool to create.
Values: replicated or erasure.
Default Value: Derived from osd_pool_default_type if omitted, but highly recommended to specify.

--size

Definition: The number of OSDs the data is striped over for each object. (Replica only)
Note: Parameter will be ignored, rather than rejected for EC pools, for backward compatibility.

--data_shards (or --k)

Definition: Within a zone, the number of OSDs the data is striped over for each object. (EC only)

--coding_shards (or --m)

Definition: Within a zone, the number of OSDs the coding shards are striped over. (EC only)

--zones

Definition: For a stretched cluster configuration defines the number of zones, each which store a full replica of the pool. (EC or Replica)
Default Value: 1
Behavior: Setting this to >1 creates a stretched pool.
A non-stretched pool achieves redundancy accross OSDs. A stretched pool creates redundancy accross zones.
Pool Size: The resulting pool size is zones × (k + m).

2.2.2 Advanced

These parameters are intended for advanced users and offer finer control over the cluster layout.

--stripe_unit

Definition: The amount of data in a shard, per stripe. Sometimes referred to as “chunk size”. (EC only)
Default Value: 16k for Fast EC, 4k for classic EC.
Purpose: Where beneficial for a well-known access pattern, the stripe unit can be tuned to any 4k-divisible value.

--zone_failure_domain

Definition: The CRUSH bucket type over which zone-redundancy is achieved.
Default Value: datacenter
Purpose: This is the CRUSH bucket type that defines a zone.
Note: Mutually exclusive with --crush_rule

--osd_failure_domain

Definition: The CRUSH bucket type over which OSD-redundancy is achieved.
Default Value: host
Note: Mutually exclusive with --crush_rule

--crush_device_class

Definition: Restrict placement to devices of a specific class (e.g., ssd or hdd), using the CRUSH device class names in the CRUSH map.
Purpose: Only required if a cluster has a mixture of different classes of OSD.
Note: Mutually exclusive with --crush_rule

--crush_root

Definition: The root of the CRUSH tree to use. (R2 only)
Default Value: The cluster root (default).
Topology and Validation: Specifying the CRUSH root to use (defaults to default), the CRUSH level for a zone (defaults to datacenter), and the number of zones (defaults to 1) is sufficient to define the pool’s placement:
- Example 1: In a cluster with 2 datacenters, specifying --zones 2 will create a stretch pool across the 2 datacenters.
- Example 2: In a cluster with 2 datacenters, specifying --crush_root DC1 will create an pool completely contained in DC1.
- Validation: If there are N datacenters with the same root and you specify a number of zones M != N, the command will fail because the specified number of zones is different from the number of zones in the CRUSH hierarchy.
- Custom Rules: If users want to use a subset of zones (e.g., a special 3-datacenter configuration), they must specify a custom CRUSH rule. A custom CRUSH rule is mutually exclusive with specifying the CRUSH root and/or CRUSH level.
Operational Restrictions: When there are stretch pools, adding or moving a CRUSH bucket that impacts the number of zones for the pool will require a yes-i-really-mean-it flag, as this is liable to break things.
Note: Mutually exclusive with --crush_rule

--min_size

Definition: The minimum number of shards required to serve I/O within a zone.
Note: The value should be within allowed specified ranges. For replica pools this range is 1-<num replicas>, for EC pools this is <num data shards>-<num data shards>+<num coding shards>.

--crush_rule

Definition: Use this CRUSH rule, instead of an auto-generated rule.
Purpose: Create a bespoke CRUSH rule for advanced use cases not covered by the auto rule generation above.
Note: Mutually exclusive with --crush_root, --osd_failure_domain and --zone_failure_domain

--profile

Definition: The legacy EC Profile to use.
Note: Mutually exclusive with --zones: Cannot be used with multi-zone configurations.

Note

The following parameters are existing, standard pool configuration options included here for completeness.

--pg_num

Definition: The total number of placement groups for the pool.

--pgp_num

Definition: The total number of placement groups for placement purposes.
Note: This should never be modified. Purely here for backward compaitbility.

--expected_num_objects

Definition: The expected number of objects for this pool, used to pre-split placement groups at pool creation.

--autoscale_mode

Definition: The auto-scaling mode for placement groups.
Values: on, off, or warn.

--pg_num_min

Definition: The minimum number of placement groups when auto-scaling is active.

--pg_num_max

Definition: The maximum number of placement groups when auto-scaling is active.

--bulk

Definition: Flags the pool as a “bulk” pool to pre-allocate more placement groups automatically.

--target_size_bytes

Definition: The expected total size of the pool in bytes, used to guide PG autoscaling.

--target_size_ratio

Definition: The expected ratio of the cluster’s total capacity this pool will consume, used for PG autoscaling.

--crimson

Definition: Flags the pool to run on Crimson OSD.
Note: Crimson OSD is experimental.

--yes_i_really_mean_it

Definition: Internal safety override flag. In the context of pool creation, it is specifically used to allow the creation of hidden or system-reserved pools whose names begin with a dot (e.g., .rgw.root).

2.2.3 Deprecated

These are used for backward compatibility only.

--stretch_mode

Definition: Deprecated parameter for Replica pools which configures two zones with half the specified number of replica copies in each zone. Not supported for EC pools.
Note: Mutually exclusive with --zones and erasure pools.

2.3 Examples

The following are examples of how the new parameterized ceph osd pool create command simplifies pool creation across different topologies.

Example 1: Basic Replicated Pool Create a standard 3-way replicated pool containing 128 placement groups:

ceph osd pool create my_rep_pool --pool_type replicated --size 3 --pg_num 128

Example 2: Standard Erasure Coded Pool Create an erasure-coded pool using a k=4, m=2 configuration (yielding a size of 6 shards) with 64 placement groups:

ceph osd pool create my_ec_pool --pool_type erasure --data_shards 4 --coding_shards 2 --pg_num 64

Example 3: Stretched Replicated Pool Create a replicated pool that spans across two datacenters, achieving a total size of 4 (2 replicas in each datacenter):

ceph osd pool create stretch_rep --pool_type replicated --size 4 --zones 2 --zone_failure_domain datacenter

Example 4: Stretched Erasure Coded Pool Create an erasure-coded pool stretched across two racks. Using a k=4, m=2 configuration per zone across 2 zones creates a total pool size of 12 shards (4 data and 2 coding per rack):

ceph osd pool create stretch_ec --pool_type erasure --data_shards 4 --coding_shards 2 --zones 2 --zone_failure_domain rack

Example 5: Single Datacenter Erasure Coded Pool Create an EC pool confined entirely to a specific datacenter using the --crush_root parameter:

ceph osd pool create dc1_ec --pool_type erasure --data_shards 4 --coding_shards 2 --crush_root DC1

Example 6: Bulk Erasure Coded Pool with Autoscaling Limits Create an EC pool where the system automatically scales the PG count but enforces a minimum boundary, marking it as a bulk pool:

ceph osd pool create bulk_ec --pool_type erasure --data_shards 6 --coding_shards 3 --autoscale_mode on --pg_num_min 128 --bulk

1. Approaches Considered but Rejected

We evaluated and rejected the following two alternative designs:

3.1 Multi-layered Backends

This approach involved re-using the existing ReplicationBackend to manage the top-level replication, with EC acting as a secondary layer.

Reason for Rejection: This would require complex, multi-layered peering logic where the replication layer interacts with the EC layer. Ensuring the correctness of these interactions is difficult. Additionally, the front-end and back-end interfaces of the two back ends differ significantly (e.g., support for synchronous reads), which would require substantial refactoring.
Advantage of Proposed Solution: Our chosen bespoke approach minimizes changes to the peering state machine and offers better potential for recovery during complex failure scenarios.

3.2 LRC (Locally Repairable Codes) Plugin

This approach involved configuring the existing LRC plugin to provide multi-zone EC semantics. While LRC is designed for locality-aware erasure coding, it does not address the core problems this design aims to solve:

Reason for Rejection:
1. Writes still cross zones: LRC distributes all chunks (data, local parity, and global parity) across the full CRUSH topology. Every write operation sends chunks over the inter-zone link, offering no bandwidth savings over standard EC.
2. Reads still cross zones: Reading the original data requires k data chunks, which are spread across all zones. LRC provides no mechanism for a single zone to independently serve reads.
3. Recovery remains primary-centralized: In the current Ceph EC infrastructure, all recovery operations are coordinated by the Primary OSD. While LRC defines local repair groups at the coding level, the EC back end would still need to be extended to delegate recovery execution to remote zones — the same complexity required by this proposal.
Advantage of Proposed Solution: The replicated EC stripe approach places a complete (k+m) set at each zone, which inherently enables zone-local reads, zone-local single-OSD recovery (without any special coding scheme), and limits inter-zone link usage to write replication. It achieves better locality than LRC with less infrastructure complexity.

4. Configuration Logic (Preliminary)

The interactions between these parameters imply a hierarchy in the CRUSH rule generation:

Top Level: The CRUSH rule selects zones buckets of type zone (e.g., select 2 data centers).
Lower Level: Inside each selected zone, the CRUSH rule selects k+m OSDs to store the chunks.

This leverages standard CRUSH mechanisms — no changes to CRUSH itself are required.

5. Multi-Zone & Topology Considerations

While this architecture is capable of supporting N-zone configurations, specific attention is given to the common 2-zone High Availability (HA) use case.

Reference Diagrams: For clarity, architectural diagrams and examples within this design documentation will primarily depict a 2-zone HA configuration (zones=2, with 2 data centers).
Logical Scalability: The design is logically N-way capable. The parameter zones is not limited to 2; the system supports any valid CRUSH topology where zones failure domains exist.
Testing Strategy: Testing will follow a phased approach:
1. Unit Tests (Initial): The unit test framework will include tests for both 2-zone (zones=2) and 3-zone (zones=3) configurations, validating the core peering and recovery logic for N-way topologies.
2. Full 2-Zone Testing (Initial Release): 2-zone configurations will be fully tested end-to-end for the initial release, covering all read, write, recovery, and failure scenarios.
3. Full 3-Zone Testing (Later Release): Full integration and real-world testing of 3-zone configurations will be deferred to a subsequent release.

5.1 Single-Zone Pools

The use case for “single-zone” pools is a Ceph cluster which is split across multiple data centers, but the redundancy is provided by the application. Here, the user can specify a pool which is restricted to a single data center. It should be noted that this means that loss of an inter-zone link will lead to an entire zone being lost (something that would not be the case for two independent clusters).

6. Topologies and Terminology

To illustrate the relationship between the replication layer, the EC layer, and the physical topology, we define the following terms and visualization.

6.1 Logical Diagram (2-Zone HA)

The following diagram illustrates a cluster configured with r=2, k=2, m=1 spanning two Data Centers.

6.2 Key Definitions

Primary

The primary OSD in the acting set for a Placement Group (PG). This OSD is responsible for coordinating writes and reads for the PG. (Standard Ceph terminology.)

Shard

The globally unique position of a chunk within the pool’s acting set. Shards are numbered 0 through zones × (k + m) - 1. In the diagram above, Zone A holds Shards 0, 1, 2 and Zone B holds Shards 3, 4, 5.

Zone-local

An OSD or shard is zone-local if it shares the same zone as another OSD or shard.

Zone Primary

An internal EC convention. The Zone Primary is the first primary-capable shard in the acting set that resides in a given zone.

These zone primaries will be used in (post-R1) stretch clusters to perform local-to-zone erasure coding. The intent is to minimize inter-zone bandwidth requirements.

Remote-zone Shard

A relative term referring to a shard in a different zone. For example: “When recovering data in Zone A, a remote-zone shard may be used to read data”

7. Read Path & Recovery Strategies

This architecture supports multiple read strategies to optimize for locality and handle failure scenarios. The strategies are presented in order of increasing complexity: starting with the baseline read-from-Primary path, then layering direct-read optimizations on top.

7.1 Read from Primary — R1

The standard read path involves the client contacting the Primary OSD.

Local Priority: The Primary will prioritize reading from local-to-zone OSDs (within its own zone) to serve the request or recover data (reconstruct the stripe). This minimizes inter-zone link traffic.

7.1.1 Zone-local Recovery

When the Primary cannot serve a read from a single zone-local shard (e.g., because a shard is missing or degraded), it reconstructs the data from the remaining zone-local shards within its zone.

7.1.2 Zone-Local Recovery

If the Primary cannot reconstruct data using only zone-local shards, it may issue direct reads to remote-zone OSDs. This cross-zone read path serves two purposes:

Backfill / Recovery: When a zone-local OSD is down or being backfilled, the Primary can read the corresponding shard from the remote zone to recover the missing data. It is likely that a zone with insufficient EC chunks to reconstruct locally would be taken offline by the stretch cluster monitor logic, but this fallback ensures availability during transient states.
Medium Error Recovery: If a zone-local OSD returns a medium error (e.g., unreadable sector), the Primary can recover the affected data by reading from remote-zone shards and reconstructing locally.

7.2 Direct Reads — Primary Zone — R1

This feature extends the existing “EC Direct Reads” capability (documented separately) to the stretch cluster topology. A client co-located with the Primary can read data directly from a zone-local shard, bypassing the Primary entirely on the good path. No new code is required — this is the standard EC Direct Read path operating over stretch-cluster shard numbering. Any failure falls back to the Primary read path (Section 7.1).

        sequenceDiagram
    title Direct Read — Primary Zone

    participant C as Client
    participant P as Primary
    participant ZS as Zone-local Shard

    C->>ZS: Direct Read
    activate ZS
    ZS->>C: Complete
    deactivate ZS

    C->>ZS: Direct Read
    activate ZS
    ZS->>C: -EAGAIN
    deactivate ZS
    C->>P: Full Read
    activate P
    P->>ZS: Read
    activate ZS
    ZS->>P: Read Done
    deactivate ZS
    P->>C: Done
    deactivate P

Failure Handling & Redirection (R1):

Any failure encountered during a direct read — whether a missing shard, -EAGAIN rejection, or medium error — results in the client redirecting the operation to the Primary (regardless of the Primary’s location).

7.3 Direct Reads — Zone-Local — R1

The client will read from a zone-local shard.

Zone-Local Shard Identification: The client determines which OSDs in the acting set reside within its own zone by comparing OSD CRUSH locations against its own. This identifies the k+m shards of the local zone. The client then selects a suitable set of data shards for the direct read.
Unavailable Zone-local OSDs: If the zone-local shards are not available (e.g., the zone-local OSDs are down or the client cannot identify any zone-local replica), the client does not attempt a direct read and instead directs the operation to the Primary. In later releases this will be improved to redirect to the local Zone Primary instead.
Direct Read Execution: If a zone-local shard is available, the client sends the read directly to that OSD. As detailed in the ec_direct_reads.rst design, it is generally acceptable for reads to overtake in-flight writes. However, the OSD is aware if it has an “uncommitted” write (a write that could potentially be rolled back) for the requested object. If such a write exists, or if the client’s operation requires strict ordering (e.g. the rwordered flag is set), the OSD rejects the operation with -EAGAIN. Data access errors (e.g., a media error on the underlying storage) also cause the OSD to reject the operation.
Failure Handling & Redirection (R1):

An -EAGAIN will cause the client to retry the op. For R1, the op will be redriven to the primary. R2 will redrive the op to the zone primary.

7.4 Read from Zone Primary — Later Release

In the R1 release, the primary will handle all failures. In later releases, an op which cannot be processed as a direct read will be directed at the zone primary instead. The zone primary will attempt to reconstruct the read data from the zone-local shards directly.

A conflict due to an uncommitted write will be continue to be handled by the primary, as the zone primary must also rejected an op if an uncommitted write exists for that object.

Zone-local Recovery: A read directed to a Zone Primary will attempt to serve the request by recovering data using only OSDs within the same data center (the remote zone).
Zone Degradation: If a zone has insufficient redundancy to reconstruct data locally, that zone should be taken offline to clients rather than serving reads that would require inter-zone link access. (How? - Needs to be covered elsewhere)
``-EAGAIN`` Behavior: The Zone Primary will return -EAGAIN only in short-lived transient conditions:

7.4.1 Safe Shard Identification & Synchronous Recovery — Later Release

By default, the Zone Primary does not inherently know which shards are consistent and safe to read. To manage this, the Peering process will be enhanced.

Synchronous Recovery Set: The existing peering Activate message will be extended to include a “Synchronous Recovery Set”. This set identifies shards that are currently undergoing recovery and may not yet be consistent.

Note

The peering state machine has a transition from Active+Clean to Recovery, which implies that recovery can start without a new peering interval. This needs further investigation to ensure the Synchronous Recovery Set is correctly maintained across such transitions.
Consistency Guarantee: Within any single epoch, the Synchronous Recovery Set can only reduce over time (shards are removed as they complete recovery). The set is treated as binary — either the full recovery set is active, or it is empty. This ensures the Zone Primary is conservatively stale: it may reject reads it could safely serve, but will never read from an inconsistent shard.
Initial Behavior:
- Shards listed in the Synchronous Recovery Set will not be used for reads by the Zone Primary.
- If the remaining available shards are insufficient to reconstruct the data, the Zone Primary will return -EAGAIN.
- Recovery messages will signal the Zone Primary when recovery completes, clearing the set.
Subsequent Enhancement:
- A mechanism will be added to allow the Zone Primary to request permission to read from a shard within the Synchronous Recovery Set.
- This request will trigger the necessary recovery for that specific object (if not already complete) before the read is permitted.

8. Write Path & Transaction Handling

This section details how write operations are managed, specifically focusing on Read-Modify-Write (RMW) sequences and the replication of write transactions to remote zones.

8.1 Direct-to-OSD Writes (Primary Encodes All Shards) — R1

In R1, all writes are coordinated exclusively by the Primary.

Mechanism: The Primary generates all zones × (k + m) coded shard writes and sends individual write operations directly to every OSD in the acting set, including those in remote zones.
RMW Reads: The “read” portion of any Read-Modify-Write cycle is performed locally by the Primary, strictly adhering to the logic defined in Section 7 (Read Path & Recovery Strategies).
Inter-zone traffic: This sends all coded shard data (including parity) over the inter-zone link. While this is not bandwidth-optimal, it requires no Zone Primary involvement in write processing and therefore no new write-path code beyond extending the existing shard fan-out to the larger acting set.

8.2 Replicate Transaction (Preferred Path) — Later Release

Note

This section is deferred to a later release. R1 sends all coded shards directly from the Primary.

        sequenceDiagram
    title Write to Primary

    participant C as Client
    participant P as Primary
    participant ZS as Zone-local Shard
    participant ZP as Zone Primary
    participant RS as Remote-zone Shard

    C->>P: Submit Op
    activate P

    P->>RP: Replicate
    activate ZP

    note over P: Replicate message must contain<br/>data and which shards are to be updated.<br/>When OSDs are recovering, the primary<br/>may need to update remote-zone shards directly.

    P->>ZS: SubRead (cache)
    activate ZS
    ZS->>P: SubReadReply
    deactivate ZS

    P->>ZS: SubWrite
    activate ZS
    ZS->>P: SubWriteReply
    deactivate ZS

    ZP->>RS: SubRead (cache)
    activate RS
    RS->>RP: SubReadReply
    deactivate RS

    ZP->>RS: SubWrite
    activate RS
    RS->>RP: SubWriteReply
    deactivate RS

    ZP->>P: ReplicateDone
    deactivate ZP

    P->>C: Complete
    deactivate P

This mechanism is designed to minimize inter-zone link bandwidth usage, which is often the most constrained resource in stretch clusters.

Mechanism: The replicate message is an extension of the existing sub-write message. Instead of sending individually coded shard data to every OSD in the remote zone, the Primary sends a copy of the PGBackend transaction (i.e., the raw write data and metadata, prior to EC encoding) to the Zone Primary.
Role of Zone Primary: Upon receipt, the Zone Primary processes the transaction through its local ECTransaction pipeline — largely unmodified code — to generate the coded shards for its zone. For writes smaller than a full stripe, this includes the Zone Primary issuing reads to its zone-local shards so that the parity update can be calculated. It then fans out the shard writes to the zone-local OSDs within its zone. This means coding (parity) data is never sent over the inter-zone link; it is computed independently at each zone.
Completion: The Zone Primary responds using the existing sub-write-reply message once all zone-local shard writes are durable.
Crash Handling: If the Zone Primary crashes mid-fan-out, any partial writes on remote-zone shards are rolled back by the existing peering process. No new crash-recovery mechanisms are required.
Benefit: This reduces inter-zone link traffic to a single transaction message per remote zone, carrying only the raw data — not the k+m coded chunks.

8.3 Client Writes Direct to Zone Primary — Later Release

Note

This section is deferred to a later release. It requires Replicate Transaction (Section 8.2) as a prerequisite, since the Zone Primary must be able to process transactions and fan out shard writes locally.

        sequenceDiagram
    title Write to Zone Primary

    participant C as Client
    participant P as Primary
    participant ZS as Zone-local Shard
    participant RC as Remote Client
    participant ZP as Zone Primary
    participant RS as Remote-zone Shard

    RC->>RP: Write op
    activate ZP

    ZP->>P: Replicate
    activate P
    P->>RP: Permission to write

    note over ZP: There is potential for pre-emptive caching here,<br/>but it adds complexity and mostly does not actually<br/>help performance, as the limiting factor is actually<br/>the Remote-zone write.

    ZP->>RS: SubRead (cache)
    activate RS
    RS->>RP: SubReadReply
    deactivate RS

    ZP->>RS: SubWrite
    activate RS
    RS->>RP: SubWriteReply
    deactivate RS

    P->>ZS: SubRead (cache)
    activate ZS
    ZS->>P: SubReadReply
    deactivate ZS

    P->>ZS: SubWrite
    activate ZS
    ZS->>P: SubWriteReply
    deactivate ZS

    P->>RP: write done
    ZP->>RC: Complete

    note over ZP: The write complete message is permitted<br/>as soon as all SubWriteReply messages have<br/>been received by the remote. It is shown<br/>here to demonstrate that it is required to<br/>complete processing on the Primary.

    ZP->>P: Remote-zone write complete
    deactivate P
    deactivate ZP

    P->>P: PG Idle.

    P->>RP: DummyOp
    activate P
    activate ZP

    P->>ZS: DummyOp
    activate ZS
    ZS->>P: Done
    deactivate ZS

    ZP->>P: Done
    deactivate P

    note over ZP: Message ordering means that remote<br/>primary does not need to wait for<br/>dummy ops to complete

    ZP->>RS: DummyOp
    activate RS
    RS->>RP: Done
    deactivate RS
    deactivate ZP

This mechanism allows a client to write to its local Zone Primary rather than sending data over the inter-zone link to the Primary. The key benefit is that the write data crosses the inter-zone link only once (Zone Primary → Primary) rather than twice (client → Primary → Zone Primary). Unlike Section 8.2 where the Primary initiates the transaction and replicates it outward, here the Zone Primary receives the client write directly and coordinates with the Primary to maintain global ordering while avoiding redundant data transfer.

Mechanism:
1. The client sends the write to its local Zone Primary. The Zone Primary stashes a local copy of the write data but does not yet fan out to zone-local shards.
2. The Zone Primary replicates the PGBackend transaction (raw write data, prior to EC encoding) to the Primary.
3. The Primary processes the transaction as normal through its local ECTransaction pipeline, performing cache reads, encoding, and fanning out shard writes to its zone-local OSDs. When the Primary reaches the step where it would normally replicate data to the originating Zone Primary (as in Section 8.2), it instead sends a write-permission message to that Zone Primary, since the Zone Primary already holds the data. Writes to any other Zone Primaries (in an zones > 2 configuration) proceed via the normal Replicate Transaction path (Section 8.2).
4. Upon receiving write-permission, the Zone Primary processes the transaction through its local ECTransaction pipeline, generating and fanning out the coded shard writes to the zone-local OSDs within its zone.
5. Once the Primary’s own local writes are durable and all other zones have confirmed durability, the Primary sends a completion message to the originating Zone Primary. The Zone Primary then responds to the client once the completion message is received and its own local writes are also durable.
Write Ordering: Strict write ordering must be maintained across all zones. Since the Primary remains the single authority for PG log sequencing:
- The Primary sequences the write as part of its normal processing in step 3. The write-permission message carries the assigned sequence number, ensuring that writes arriving at the Primary directly and writes arriving via a Zone Primary are globally ordered.
- Writes from multiple Zone Primaries (in an zones > 2 configuration) are serialized through the Primary’s normal sequencing mechanism — no special handling is required beyond the existing PG log ordering.
- The Zone Primary does not fan out to zone-local shards until it receives the write-permission message, guaranteeing that zone-local shard writes occur in the correct global order.
Benefit: For workloads where the client is co-located with a remote zone, write data traverses the inter-zone link exactly once (Zone Primary → Primary transaction replication). This halves the inter-zone bandwidth compared to R1 (client → Primary → Zone Primary), and is equivalent to Section 8.2 but with the added advantage that the client experiences local write latency for the initial acknowledgement. Crucially, the data is never sent back to the originating Zone Primary — the Primary only sends lightweight permission and completion messages.
Failure Handling: If the Zone Primary is unavailable, the client falls back to writing directly to the Primary (Section 8.1 or 8.2, depending on what is available). If the Primary is unreachable from the Zone Primary mid-transaction, the Zone Primary returns an error to the client and any stashed data is discarded (no zone-local shard writes have been issued, since write-permission was never received).
R3 Enhancement: Forward to Other Zone Primaries (``zones > 2``):

Note

This enhancement is a potential R3 feature and may not be implemented.

In topologies with more than two zones, the originating Zone Primary could forward the transaction to the other Zone Primaries in parallel with sending it to the Primary. This would improve write latency for those zones by allowing them to receive the data directly from a peer Zone Primary rather than waiting for the Primary to replicate outward. The Primary would still issue write-permission messages to all Zone Primaries to maintain global ordering, but the data transfer to additional zones would already be complete, reducing the critical path.

Complexity / Review Note: There is a known race condition here. The write-permission message originating from the Primary might overtake the data transfer sent by the originating Zone Primary. To handle this, a unique transaction ID will be required to definitively tie these two independent messages together at the receiving Zone Primary.

8.4 Hybrid Approach — Later Release

Note

Requires Replicate Transaction (Section 8.2). Deferred to a later release.

It is acknowledged that complex failure scenarios may require a combination of Replicate Transaction, Client-via-Remote-Primary, and Direct-to-OSD approaches.

Example: In a partial failure where one remote zone is healthy and another is degraded, the Primary may use Replicate Transaction for the healthy zone while simultaneously using Direct-to-OSD for the degraded zone within the same transaction lifecycle.

9. Recovery Logic

All recovery operations are centralized and coordinated by the Primary OSD.

9.1 Primary-Centric Recovery — R1

In R1, the Primary performs all recovery without delegating to Zone Primaries and without attempting to minimize inter-zone bandwidth.

Zone-local Recovery: The Primary recovers missing shards in its own zone using available zone-local chunks, exactly as standard EC recovery.
Remote-zone Recovery: For every remote zone, the Primary reads all shards required to reconstruct any missing data, encodes the missing shards locally, and pushes them directly to the target remote-zone OSDs.
No Zone Primary Coordination: The Zone Primary plays no role in recovery in R1. All cross-zone data flows from or to the Primary.
Trade-off: This approach may send more data over the inter-zone link than necessary, but it avoids the complexity of remote-delegated recovery and uses existing recovery infrastructure with minimal modification.

9.2 Cost-Based Recovery Planning — Later Release

Note

This section is deferred to a later release. R1 uses Primary-centric recovery (Section 9.1) for all scenarios.

When the system identifies an individual object (including its clones) requiring recovery, the Primary executes a per-object assessment and planning phase. This leverages the existing, reactive, object-by-object recovery mechanism rather than introducing a broad distributed orchestration layer. The overriding goal is to minimize inter-zone link bandwidth — every cross-zone transfer is expensive, so the Primary must dynamically choose the recovery strategy for each object that results in the fewest bytes crossing zone boundaries.

Assess Capabilities: For each zone, the Primary determines how many consistent chunks are available locally and how many are missing.
Cost-Based Plan Selection: The Primary evaluates the inter-zone bandwidth cost of each viable recovery strategy and selects the cheapest option. The key trade-off is between:
1. Zone Primary performs zone-local recovery with remote-zone reads — the Zone Primary reads a small number of missing shards from the Primary’s zone and reconstructs locally.
2. Primary reconstructs and pushes — the Primary reconstructs the full object and pushes the missing shards to the remote zone.
The optimal choice depends on how many shards each zone is missing.
Example: Consider a 6+2 (k=6, m=2) configuration where Zone A has 8 good shards and Zone B has only 5 good shards (3 missing). Two options:
- Option A (Zone Primary reads remotely): The Zone Primary at Zone B needs only 1 remote-zone read (it has 5 of the 6 required chunks locally and fetches 1 from Zone A) to reconstruct the 3 missing shards locally. Cost: 1 shard across the inter-zone link.
- Option B (Primary pushes): The Primary reconstructs the 3 missing shards at Zone A and pushes them to Zone B. Cost: 3 shards across the inter-zone link.
Option A is clearly cheaper. A general algorithm should compare the cross-zone transfer cost for each strategy and choose the minimum. Where inter-zone bandwidth consumption is equal, the algorithm should tie-break by optimizing for zone-local read bandwidth or the total number of operations.

(In the future, the system may adapt these priorities — e.g. explicitly optimizing for read count rather than inter-zone bandwidth on high-bandwidth links to slower rotational media — whether through auto-tuning or operator configuration. However, exploring alternative optimization modes is beyond the scope of this design.)

Note

The detailed algorithm for optimal recovery planning requires further design work. The general principle is: count the number of cross-zone shard transfers each strategy would require and choose the minimum.
Plan Construction: For each domain, the Primary constructs a “Recovery Plan” message containing specific instructions detailing which OSDs must be used to read the available chunks and which OSDs are the targets to write the recovered chunks. Where cross-zone reads are part of the plan, the message specifies which remote-zone shards to fetch.

9.3 Delegated Remote-zone Recovery — Later Release

Note

Requires Cost-Based Recovery Planning (Section 9.2). Deferred to a later release.

Remote-zone Recovery (Plan-Based): The Primary sends a per-object “Recovery Plan” to the Zone Primary (or appropriate peers), instructing them to execute the recovery for that object (and clones) locally within their domain using the provided read/write set. If the plan includes cross-zone reads, the Zone Primary fetches the specified shards before reconstructing. Because this ties directly into the existing object recovery lifecycle, any failure of a remote-zone peer during the plan simply drops the recovery op. The Primary detects the failure via standard peering or timeout mechanisms and will naturally re-assess and generate a new plan for the object.

9.4 Fallback Strategies — Later Release

Note

These strategies are part of the Cost-Based Recovery system (Section 9.2) and are deferred to a later release.

9.4.1 Remote-zone Fallback (Push Recovery)

If a remote zone cannot perform zone-local recovery, and the cost-based analysis determines that Primary-side reconstruction and push is the cheapest option:

The Primary reconstructs the full object locally (using whatever valid shards are available across the cluster).
The Primary pushes the full object data to the Zone Primary.
This push is accompanied by a set of instructions specifying which shards in the remote zone must be written.

9.4.2 Primary Domain Fallback

If the Primary’s own domain lacks sufficient chunks to recover locally:

The Primary is permitted to read required shards from remote zones to perform the reconstruction.
While this crosses zone boundaries, the cost-based planning ensures it is only chosen when it results in fewer cross-zone transfers than any alternative.

10. PG Log Handling — R1

This section describes how PG log entries are managed across replicated EC stripes, building on the FastEC design for primary-capable and non-primary shards.

10.1 Primary-Capable vs. Non-Primary Shards

The existing FastEC design distinguishes between two classes of shard:

Primary-capable shards: Maintain a full copy of the PG log, enabling them to take over as Primary if needed.
Non-primary shards: Maintain a reduced PG log, omitting entries where no write was performed to that specific shard.

For replicated EC stretch clusters, the non-primary shard selection algorithm will be extended. The following shards will be classified as primary-capable (i.e., they maintain a full PG log):

The first data shard in each replica (per zone).
All coding (parity) shards in each replica.

All remaining data shards will be non-primary shards with reduced logs.

Note

Implementation detail: it needs to be determined whether pg_temp should be used to arrange all primary-capable shards (local, then remote) ahead of non-primary-capable shards in the acting set ordering.

10.2 Independent Log Generation at Zone Primary — Later Release

Note

Log generation at the Zone Primary is only needed when Replicate Transaction writes (Section 8.2) are implemented. For R1, where the Primary sends pre-encoded shards directly, PG log entries are generated solely by the Primary and distributed with the sub-write messages as per existing behavior.

For latency optimization, the replicate transaction message (see Section 8.2) will be sent to the Zone Primary before the cache read has completed on the Primary. This means the Zone Primary cannot receive a pre-built log entry from the Primary — it must generate its own PG log entry independently from the transaction data.

Both the Primary and Zone Primary will produce equivalent log entries from the same transaction, but they are generated independently at each zone.

10.3 Log Entry Compatibility & Upgrade Safety — Later Release

The osdmap.requires_osd_release field dictates the minimum code level that OSDs can run, and therefore determines the log entry format that must be used. No new versioning mechanism is required — the existing release gating already constrains log entry compatibility across mixed-version clusters.

Because the Zone Primary generates its own log entry from the transaction data, it must populate the Object-Based Context (OBC) to obtain the old version of attributes (including object_info_t for old size). This is an additional reason the Zone Primary must be a primary-capable shard — only primary-capable shards maintain the full PG log and OBC state needed for correct log entry construction.

Debug Mode for Log Entry Equivalence

A debug mode will be implemented that compares the log entries generated by the Primary and Zone Primary for each transaction. When enabled, the Primary will include its generated log entry in the replicate transaction message, and the Zone Primary will assert that its independently generated entry is identical. This mode will be enabled by default in teuthology testing to catch any divergence early. It will be available as a runtime configuration option for production debugging but disabled by default in production due to the additional message overhead.

Warning

If a future release changes how PG log entries are derived from transactions, the debug equivalence mode provides an automated safety net. Any divergence between Primary and Zone Primary log entries will be caught immediately in CI.

11. Peering & Stretch Mode — R1

This section covers the peering, min_size, and stretch-mode integration required for replicated EC pools. The design follows the existing replica stretch cluster pattern as closely as possible, with EC-specific adaptations.

Important

R1 scope is simple two-zone (zones=2). Three-zone (zones=3) details are described for design completeness but will be implemented in a later release.

11.1 Pool Lifecycle

Note

CLI Under Review: We are reviewing the CLIs for enabling stretch mode and setting stretch mode on a pool. We aim to either get rid of it completely (i.e., infer stretch mode automatically when you create a pool with zones > 1) or just have an enable/disable stretch mode CLI with no arguments. We also will ensure the new CLI works with both stretch-EC and stretch-replica pools for R1.

A replicated EC pool implicitly operates in stretch mode. Creation and configuration will be simplified based on the final CLI iteration as noted above.

CRUSH rule: Set to the stretch CRUSH rule when stretch mode is active.
min_size: Managed by the monitor. Automatically adjusted during stretch mode state transitions (Section 11.3).
Peering: Uses stretch-aware acting set calculation with parameters adjusted by the monitor during state transitions.
Failure: Automatic min_size reduction, degraded/recovery/healthy stretch mode transitions — all managed by OSDMonitor.

Prerequisites for ``zones > 1`` Pool Creation

Requirement	Reason
All OSDs have the `zones > 1` feature bit	Ensures all OSDs understand extended acting-set semantics (Section 15)
Stretch mode must be enabled on the cluster	`zones > 1` pools require the stretch mode state machine for `min_size` management and zone failover
`zone_failure_domain` must match the stretch mode failure domain	The CRUSH rule must align with the stretch cluster topology

11.2 Minimum PG Size Semantics

The user defines a min_size for a single zone, which implicitly specifies the number of failures the pool will tolerate. For an EC pool, this is in the range K to K+M, defining a tolerance of 0 to M failures. For a replica pool, this is in the range 1 to size, defining a tolerance of 0 to size - min_size failures.

Let the number of tolerated failures derived from this setup be denoted as F.

If the number of zones > 1, then this setting is dynamically interpreted according to the cluster’s stretch mode (Healthy, Degraded, Recovery). Both EC and Replica pools with multiple zones will interpret min_size this way, rather than actively modifying the min_size setting whenever a stretch mode transition occurs.

11.2.1 Per-Pool Min-Size Interpretations — R1

The pool statically tolerates up to F OSD failures, but that tolerance applies to different scopes based on the active stretch mode:

Healthy Mode: There may be up to F OSD failures across the OSDs in all zones combined.
Degraded Mode: There may be up to F OSD failures across the OSDs in the surviving zone(s). The failed zone is entirely ignored.
Recovery Mode: There may be up to F OSD failures across the OSDs in the surviving zone(s). The recovering zone is entirely ignored.

Note

A partial failure within a zone may drop a pool below its min_size requirement and cause I/O to stop. Manually removing the rest of the failed zone will cause a transition to Degraded Stretch Mode, which might be sufficient to bring the pool back online because the min_size requirement is now met by the surviving zone. There will be no automation to promote a partial zone failure to a whole zone failure.

11.2.2 Per-Zone Min-Size Interpretations — Later Release

A more sophisticated implementation will reinterpret this tolerance strictly on a per-zone basis:

There may be up to F OSD failures in each individual zone.
If any single zone experiences more than F OSD failures, it will automatically trigger a transition into Degraded Stretch Mode, effectively treating all OSDs in the zone as failed.

11.3 Stretch Mode State Machine

When stretch mode is enabled, the state machine behaves identically to the replica stretch mode state machine — leveraging the existing OSDMonitor infrastructure. As noted above, the explicit min_size setting is simply interpreted against this state machine, rather than actively mutated by the OSDMonitor upon transitions.

        stateDiagram-v2
    [*] --> Healthy: enable_stretch_mode
    Healthy --> Degraded: zone failure detected
    Degraded --> Recovery: failed zone returns
    Recovery --> Healthy: all PGs clean
    Healthy --> [*]: disable_stretch_mode
    Degraded --> Recovery: force_recovery_stretch_mode CLI
    Recovery --> Healthy: force_healthy_stretch_mode CLI

Two-Zone Transitions (zones=2) — R1

Stretch State	min_size	Reasoning
Healthy	`zones × (K+M) − M`	Full redundancy; tolerate up to M failures
Degraded (one zone down)	`K`	One zone lost (K+M shards gone). Surviving zone has K+M shards; can tolerate M more losses. `K+M − M = K`.
Recovery (zone returning)	`K` (same as degraded)	Keep reduced min_size until resync is complete
Healthy (resync complete)	`zones × (K+M) − M`	Full min_size restored

Concrete example — K=2, M=1, zones=2 (size=6):

Stretch State	min_size
Healthy	5
Degraded	2
Recovery	2
Healthy (restored)	5

The degraded-mode formula:

Healthy min_size: zones × (K+M) − M
Degraded min_size: (num_zones−1) × (K+M) − M = healthy min_size minus (K+M)
Rule: if a zone fails, reduce min_size by K+M. If a zone becomes healthy again, increase min_size by K+M.

Three-Zone Transitions (zones=3) — Later Release

Stretch State	min_size	Example (K=2, M=1)
Healthy (3 zones)	`3(K+M) − M`	8
One zone down	`2(K+M) − M`	5
Two zones down	`K`	2

11.4 OSDMonitor Changes

The existing OSDMonitor stretch mode code contains if (is_replicated()) { ... } else { /* not supported */ } patterns in several places. These gaps must be filled for EC pools with zones > 1.

11.4.1 Pool Stretch Set / Unset (prepare_command_pool_stretch_set, prepare_command_pool_stretch_unset)

stretch_set currently works for any pool type, setting peering_crush_bucket_*, crush_rule, size, min_size.

For EC pools with zones > 1, add validation:

Validate min_size ∈ [num_zones×(K+M)−M, num_zones×(K+M)]
If size is provided, validate it matches zones × (K+M)

stretch_unset clears all peering_crush_* fields. No EC-specific changes required.

11.4.2 Enable/Disable Stretch Mode (try_enable_stretch_mode_pools)

Currently rejects EC pools. Change to accept EC pools with zones > 1:

Set peering_crush_bucket_count, peering_crush_bucket_target, peering_crush_bucket_barrier (same values as replica)
Set crush_rule to the stretch CRUSH rule
Set size = r × (k + m) (should already be correct from pool creation)
Set min_size = r × (k + m) − m

Pool Creation Gate: Pool creation with zones > 1 must be rejected if stretch mode is not already enabled on the cluster. This is validated in OSDMonitor::prepare_new_pool.

11.4.3 Degraded Stretch Mode (trigger_degraded_stretch_mode)

Currently sets newp.min_size = pgi.second.min_size / 2 for replica pools. For EC pools, compute:

newp.min_size = p.min_size - (k + m)

For a K=2, M=1, zones=2 pool: min_size = 5 − 3 = 2.

Also set peering_crush_bucket_count and peering_crush_mandatory_member as for replicated pools.

11.4.4 Healthy Stretch Mode (trigger_healthy_stretch_mode)

Currently reads mon_stretch_pool_min_size config for replica pools. For EC pools, compute from the EC profile:

newp.min_size = r × (k + m) − m

11.4.5 Recovery Stretch Mode (trigger_recovery_stretch_mode)

No changes required — does not modify min_size.

11.5 Peering State Changes

11.5.1 Acting Set Calculation

calc_replicated_acting_stretch is pool-type agnostic at the CRUSH bucket level, but for EC pools the acting set has shard semantics — position i must serve shard i. A new calc_ec_acting_stretch function (or an extension of the existing calc_ec_acting) is required.

Algorithm for each position i (0 to num_zones×(K+M)−1):

Prefer up[i] if usable
Otherwise prefer acting[i] if usable
Otherwise search strays for an OSD with shard i
If no usable OSD found, leave as CRUSH_ITEM_NONE

CRUSH bucket_max constraints apply: no zone may contribute more than size / peering_crush_bucket_target OSDs.

Note

This is essentially calc_ec_acting with the additional CRUSH bucket awareness from the stretch path.

11.5.2 Async Recovery

choose_async_recovery_replicated checks stretch_set_can_peer() before removing an OSD from async recovery. An EC stretch variant must respect shard identity: removing an OSD must not cause any zone to drop below K shards.

11.5.3 Stretch Set Validation (stretch_set_can_peer)

For EC pools, additionally verify:

The acting set includes at least K shards in at least one surviving zone
In healthy mode, all zones have K+M shards

11.6 Relationship to `peering_crush_bucket_*` Fields

The existing pg_pool_t fields remain unchanged:

Field	Purpose	EC Stretch Usage
`peering_crush_bucket_count`	Min distinct CRUSH buckets in acting set	zones (zones) in healthy; reduced during degraded
`peering_crush_bucket_target`	Target CRUSH buckets for `bucket_max` calc	zones (zones) in healthy; reduced during degraded
`peering_crush_bucket_barrier`	CRUSH type level (e.g., datacenter)	Same as replica — the failure domain
`peering_crush_mandatory_member`	CRUSH bucket that must be represented	Set to surviving zone during degraded mode

Example values for zones=2, K=2, M=1 (size=6):

Stretch State	bucket_count	bucket_target	bucket_max	mandatory_member
Healthy	2	2	3	NONE
Degraded	1	1	6	surviving_site
Recovery	1	1	6	surviving_site
Healthy (restored)	2	2	3	NONE

bucket_max = ceil(size / bucket_target) — in healthy mode, 6 / 2 = 3 = K+M. This naturally prevents more than one copy of each shard per zone.

11.7 Network Partition Handling

The existing Ceph stretch cluster mechanism handles network partitions as follows:

MON-Based Zone Election: The MONs are distributed between zones (plus a tie-break zone). When the inter-zone link is severed, the MONs elect which zone continues to operate.
OSD Heartbeat Discovery: OSDs heartbeat the MONs and will discover if they are attached to the losing zone. OSDs on the losing zone stop processing IO.
Lease-Based Read Gating: A lease determines whether OSDs are permitted to process read IOs. The surviving zone must wait for the lease to expire before new writes can be processed, ensuring no stale reads are served by the losing zone during the transition.

This mechanism does not preclude more complex failure scenarios that may also need to be treated as zone failures:

Asymmetric Network Splits: OSDs at a remote zone may be able to communicate with a zone-local MON even though the MONs have determined a zone failover has occurred.
Partial Zone Failures: Loss of a rack or subset of OSDs within a zone may warrant treating the zone as failed (e.g., if the remaining shards fall below the per-zone min_size threshold defined in Section 11.2).

11.8 Online OSDs in Offline Zones

When a zone is marked as offline but individual OSDs within that zone remain reachable, they must be removed from the up set that is presented to the Objecter and peering state machine. The mechanism ties to the min_size logic described in Section 11.2:

Removal: If fewer than k OSDs from a zone are present in the up set, the remaining OSDs for that zone are also removed. This forces a zone failover and prevents partial-zone IO.
Reintegration: When enough OSDs return to meet the per-zone threshold, they are added back into the up set. Standard peering will handle resyncing data — no new recovery code is required for reintegration.

12. Scrub Behavior — R1

The only modification required to scrub is how it verifies the CRCs received from each shard during deep scrub. The deep scrub process must:

Verify the coding CRC, where the EC plugin supports it. This is current behavior but must be extended to cover the replicated shards (i.e., verifying the coding relationship within each zone’s stripe independently).
Verify cross-replica CRC equivalence: corresponding shards in each replica must have the same CRC. For example, SiteShard 0 at Zone A must match SiteShard 0 at Zone B.

Scrub never reads data to the Primary — it only collects and compares CRCs reported by each OSD. This means inter-zone bandwidth is not a significant concern, and a primary-centric scrub process continues to make sense for replicated EC stretch clusters.

13. Migration Path

13.1 Pool Migration — R1

Migration from an existing EC pool to a replicated EC stretch cluster pool will leverage the Pool Migration design, which is being implemented for the Umbrella release. This document does not define a separate migration mechanism.

13.2 In-Place Replica Count Modification — Later Release

In a later release, users will be able to modify zones for an existing pool by swapping in a new EC profile that is otherwise identical (same k, m, plugin, etc.) but specifies a different zones value. Upon profile change, the CRUSH rule will be updated to reflect the new replica count, and the standard recovery process will automatically perform all necessary expansion (or contraction) to match the new configuration — no manual data migration is required.

14. Implementation Order

This section defines the phased implementation plan, broken into single-sprint stories.

14.1 R1 — Read-Optimized Stretch Cluster

R1 delivers a functional EC stretch cluster with zone-local read optimization (comparable to what Replica stretch clusters provide today). All writes and recovery traverse the inter-zone link via the Primary.

CRUSH Rule Generation for Replicated EC Implement automatic CRUSH rule creation from the zones and zone_failure_domain pool parameters (Section 4). Validate that the acting set places k+m shards per zone.
EC Profile Extension: ``zones`` and ``zone_failure_domain`` Parameters Extend the EC profile and pool configuration to accept and validate zones and zone_failure_domain. Wire these into pool creation and ceph osd pool commands (Section 2).
Primary-Capable Shard Selection for Replicated EC Extend the non-primary shard selection algorithm to designate the first data shard and all parity shards per replica as primary-capable (Section 10.1).
Stretch Mode and Peering for Replicated EC Broken into the following sub-stories (see Sections 11.1–11.6):
1. Pool Stretch Set/Unset for EC (OSDMonitor): Allow osd pool stretch set/unset on EC pools with zones > 1. Add EC-specific min_size range validation (Section 11.4.1).
2. Enable/Disable Stretch Mode for EC (OSDMonitor): Allow mon enable_stretch_mode when the cluster has EC pools with zones > 1. Set min_size = r × (k+m) − m (Section 11.4.2).
3. Stretch Mode Transitions for EC (OSDMonitor): Implement degraded/recovery/healthy transitions. On zone failure, reduce min_size by k+m; on recovery, restore it (Sections 11.4.3–5).
4. EC Peering with Stretch Constraints (PeeringState): Create calc_ec_acting_stretch to respect both shard identity and CRUSH bucket_max constraints. Extend async recovery checks (Section 11.5).
5. is_recoverable / is_readable for Stretch EC: Verify and extend ECRecPred and ECReadPred to account for the larger shard set and per-zone constraints (Section 11.5.3).
Primary Write Fan-Out to All Shards (Direct-to-OSD Writes) Extend the Primary write path to encode and distribute zones × (k + m) shard writes directly to all OSDs in the acting set (Section 8.1).
Direct-to-OSD Reads with Primary Fallback Extend EC Direct Reads to the stretch topology: clients read locally and redirect all failures to the Primary (Sections 7.2, 7.3).

7a. Single-OSD Recovery (Within-Zone): Extend recovery so the Primary can recover a single missing OSD within any replica. The Primary reads shards from the affected replica, reconstructs the missing shard, and pushes it to the replacement OSD (Section 9.1).
7b. Full-Zone Recovery: Extend recovery to handle full-zone loss: the Primary reads from its local (surviving) replica, encodes the full stripe, and pushes all k+m shards to the recovering zone’s OSDs. Also covers multi-OSD failures within a single zone (Section 9.1).

Scrub CRC Verification for Replicated Shards Extend deep scrub to verify cross-replica CRC equivalence for corresponding shards (Section 12).
Network Partition and Zone Failover Integration Validate the existing MON-based zone election and OSD heartbeat mechanisms work correctly with the replicated EC acting set (Section 11.7).
End-to-End 2-Zone Integration Testing Full integration test suite covering read, write, recovery, scrub, and zone-failover scenarios for a 2-zone (zones=2) configuration.
Inter-Zone Statistics Add perf counters tracking cross-zone bytes, operation counts, latency histograms, and zone-local vs remote-zone-zone read ratios to give operators visibility into inter-zone link utilization (Section 17).

14.2 Later Release — Recovery Bandwidth Optimization

These stories reduce inter-zone link bandwidth consumed during recovery, in priority order.

Cost-Based Recovery Planning Implement the assessment and plan-selection algorithm that compares cross-zone transfer costs for each recovery strategy (Section 9.2).
Recovery Plan Message and Zone Primary Execution Define the “Recovery Plan” message format and implement Zone Primary execution of delegated recovery within its zone (Section 9.3).
Push Recovery Fallback Implement the path where the Primary reconstructs the full object and pushes to the Zone Primary with shard-write instructions (Section 9.4.1).
Primary Domain Fallback (Cross-Zone Read for Zone-local Recovery) Allow the Primary to read shards from remote zones when its own domain lacks sufficient chunks, only when cost-based planning selects it (Section 9.4.2).

14.3 Later Release — Write Bandwidth Optimization

These stories reduce inter-zone link bandwidth consumed during writes, in priority order.

Replicate Transaction Message Implement the replicate transaction message that sends raw write data to the Zone Primary instead of pre-encoded shards. The Zone Primary encodes locally (Section 8.2).
Independent PG Log Generation at Zone Primary Enable the Zone Primary to generate its own PG log entries from the replicate transaction, including OBC population and the debug equivalence mode (Sections 10.2, 10.3).
Client Writes Direct to Zone Primary Enable clients to write to their local Zone Primary, which stashes the data, replicates to the Primary, and fans out locally only after receiving write-permission. Includes the write-permission/completion message mechanism to maintain global consistency (Section 8.3).
Hybrid Write Path (Replicate + Direct-to-OSD) Support mixed write strategies within a single transaction when zones are in different health states (Section 8.4).

14.4 Later Release — Additional Enhancements

These stories improve resilience and flexibility but are lower priority than bandwidth optimizations.

Read from Zone Primary Enable clients to send reads to their local Zone Primary for local reconstruction, including the Synchronous Recovery Set mechanism (Sections 7.4, 7.4.1).
Direct-to-OSD Failure Redirection to Zone Primary Upgrade direct-read failure handling to redirect to the local Zone Primary instead of the global Primary (Section 7.5).
Per-Zone min_size with Zone Failover Implement per-zone minimum shard thresholds that automatically trigger zone failover (Section 11.2, later release).
Online OSDs in Offline Zones Handling Implement removal of residual online OSDs from the up set when their zone is offline (Section 11.8).
In-Place Replica Count Modification Allow zones to be changed on an existing pool via EC profile swap (Section 13.2).
3-Zone (``zones=3``) Full Integration Testing Full integration and real-world testing of 3-zone configurations (Section 5).

15. Upgrade & Backward Compatibility

Replicated EC pools with zones=1 are fully backward compatible. An zones=1 profile produces a standard (k+m) acting set with no additional replication, no new on-disk format, and no new wire messages. Existing OSDs and clients will handle these pools without modification, because the resulting behavior is identical to a conventional EC pool.

Pools with zones > 1 introduce a larger acting set (zones × (k + m) shards), new CRUSH rules, and — in later releases — new inter-OSD messages (Replicate Transaction, Read Permissions, etc.). To prevent mixed-version clusters from misinterpreting these pools:

A new OSD feature bit will gate the creation of zones > 1 profiles. The monitor will reject pool creation or EC profile changes that set zones > 1 unless all OSDs in the cluster advertise this feature bit.
This ensures that every OSD in the cluster understands the extended acting set semantics, shard numbering, and any new message types before an zones > 1 pool can be instantiated.
No data migration is required when upgrading: the feature bit is purely an admission control mechanism. Once all OSDs are upgraded and the bit is present, zones > 1 pools can be created normally.

Note

Upgrade Scenarios: We are still thinking through upgrade scenarios from older releases (e.g. can we infer zones = 2 pools automatically at upgrade time?).

Open design decisions that will be resolved at implementation:

What to do with stretched replica pools during upgrade. One option is to leave these as is (with num_zones = 1 and the current interpretation of min-size) and continue to support all the arguments on enable/disable stretch mode. An alternative option would be to try and automatically convert these pools to num_zones = 2 and the new interpretation of min-size). These pools should currently have a custom CRUSH rule which should be maintained.

16. Kernel Changes

The kernel RBD client (krbd) implements its own EC direct read path, independent of the userspace librados client. For replicated EC pools, the krbd direct read logic must be updated to understand the zones × (k + m) shard layout so that it can identify and target zone-local shards within the client’s zone. Without this change, krbd may attempt direct reads to shards in a remote zone, negating the locality benefit.

The required modification is small — the shard selection logic needs to account for the replica stride when choosing which shard to read from — but it is a kernel-side change and therefore follows the kernel release cycle independently of the Ceph userspace releases.

17. Inter-Zone Statistics — R1

Operators need visibility into inter-zone link utilization to validate that the stretch EC design is delivering its locality benefits and to plan capacity. The following statistics will be exposed per-pool and per-OSD:

Cross-zone bytes sent / received: Total bytes transferred to/from OSDs in remote zones, broken down by operation type:
- Write fan-out (shard data sent to remote zone)
- Recovery push (shard data sent during recovery)
- remote-zone read (shard data read from remote zone for reconstruction or medium-error recovery)
Cross-zone operation counts: Number of cross-zone sub-operations, broken down by type (sub-write, sub-read, recovery push).
Cross-zone latency: Histogram (p50 / p95 / p99) of round-trip latency for cross-zone sub-operations, useful for detecting link degradation.
local vs remote zone read ratio: For direct-read-enabled pools, the fraction of reads served locally versus those that fell back to the Primary across a zone boundary. A high remote-zone ratio indicates a locality problem.

These counters will be exposed through the existing ceph perf counter infrastructure and will be queryable via ceph tell osd.N perf dump and the Ceph Manager dashboard. They should also be available at the pool level via ceph osd pool stats.

Note

The inter-zone classification relies on comparing the CRUSH location of the source OSD with the destination OSD’s zone. This comparison is already performed during shard fan-out; the statistics layer adds only counter increments on the existing code path.

18. Non-Redundant Pool Zone Affinity

A stretch cluster topology may additionally host workloads that do not require multi-zone redundancy, or where redundancy is handled at a higher application layer. To accommodate this, a non-redundant pool can be configured with zone affinity.

This is achieved by setting up the pool to use a specific CRUSH root. When creating the pool, you set the zones to 1 (the default) and pass a crush_root parameter targeting a specific datacenter bucket or zone bucket within your CRUSH hierarchy.

Implementation Details

When configuring a pool with affinity to a specific zone, the system generates a CRUSH rule that performs a standard take <crush_root> operation on the designated zone bucket.

For instance, if a cluster has two datacenters defined in CRUSH as DC1 and DC2, configuring a pool with crush_root=DC1 and zones=1 will prompt the system to generate a CRUSH rule that starts with take DC1, ensuring all data for that pool resides completely within that datacenter. This allows non-redundant applications to leverage zone-local storage without incurring the latency or bandwidth costs of crossing the inter-zone link.

Brought to you by the Ceph Foundation

The Ceph Documentation is a community resource funded and hosted by the non-profit Ceph Foundation. If you would like to support this and our other efforts, please consider joining now.

Design Document: Ceph Erasure Coded (EC) Stretch Cluster

1. Intent & High-Level Summary

2. Proposed Configuration (CLI) Changes

2.1 ceph osd pool create

2.2.1 Full command syntax

2.2.2 Legacy positional syntax

2.2.1 Basic Parameters

2.2.2 Advanced

2.2.3 Deprecated

2.3 Examples

1. Approaches Considered but Rejected

3.1 Multi-layered Backends

3.2 LRC (Locally Repairable Codes) Plugin

4. Configuration Logic (Preliminary)

5. Multi-Zone & Topology Considerations

5.1 Single-Zone Pools

6. Topologies and Terminology

6.1 Logical Diagram (2-Zone HA)

6.2 Key Definitions

7. Read Path & Recovery Strategies

7.1 Read from Primary — R1

7.1.1 Zone-local Recovery

7.1.2 Zone-Local Recovery

7.2 Direct Reads — Primary Zone — R1

7.3 Direct Reads — Zone-Local — R1

7.4 Read from Zone Primary — Later Release

7.4.1 Safe Shard Identification & Synchronous Recovery — Later Release

8. Write Path & Transaction Handling

8.1 Direct-to-OSD Writes (Primary Encodes All Shards) — R1

8.2 Replicate Transaction (Preferred Path) — Later Release

8.3 Client Writes Direct to Zone Primary — Later Release

8.4 Hybrid Approach — Later Release

9. Recovery Logic

9.1 Primary-Centric Recovery — R1

9.2 Cost-Based Recovery Planning — Later Release

9.3 Delegated Remote-zone Recovery — Later Release

9.4 Fallback Strategies — Later Release

9.4.1 Remote-zone Fallback (Push Recovery)

9.4.2 Primary Domain Fallback

10. PG Log Handling — R1

10.1 Primary-Capable vs. Non-Primary Shards

10.2 Independent Log Generation at Zone Primary — Later Release

10.3 Log Entry Compatibility & Upgrade Safety — Later Release

11. Peering & Stretch Mode — R1

11.1 Pool Lifecycle

11.2 Minimum PG Size Semantics

11.2.1 Per-Pool Min-Size Interpretations — R1

11.2.2 Per-Zone Min-Size Interpretations — Later Release

11.3 Stretch Mode State Machine

11.4 OSDMonitor Changes

11.5 Peering State Changes

11.6 Relationship to peering_crush_bucket_* Fields

11.7 Network Partition Handling

11.8 Online OSDs in Offline Zones

12. Scrub Behavior — R1

13. Migration Path

13.1 Pool Migration — R1

13.2 In-Place Replica Count Modification — Later Release

14. Implementation Order

14.1 R1 — Read-Optimized Stretch Cluster

14.2 Later Release — Recovery Bandwidth Optimization

14.3 Later Release — Write Bandwidth Optimization

14.4 Later Release — Additional Enhancements

15. Upgrade & Backward Compatibility

16. Kernel Changes

17. Inter-Zone Statistics — R1

18. Non-Redundant Pool Zone Affinity

Implementation Details

11.6 Relationship to `peering_crush_bucket_*` Fields