Notice
This document is for a development version of Ceph.
Design Document: Ceph Erasure Coded (EC) Stretch Cluster
Note
Phased Delivery (R1, R2, …)
This document distinguishes between R1 and later release work. Sections and sub-sections are annotated accordingly. R1 delivers:
Zone-local direct reads (good path only; any failure redirects to the Primary)
Writes via Primary (all write IO traverses the inter-zone link; no Zone Primary encoding)
Recovery from Primary (Primary reads all data and writes all remote-zone shards; no inter-zone bandwidth optimization)
The labels R1, R2, etc. refer to incremental PR delivery milestones — they do not map one-to-one to Ceph named releases. Multiple delivery phases may land within a single Ceph release, or a single phase may span releases depending on development progress.
The implementation order is detailed in Section 14.
Important
Scope and Applicability
This design is an extension to Fast EC (Fast Erasure Coding) only. We make no attempt to support legacy EC implementations with this feature. All functionality described in this document requires Fast EC to be enabled.
1. Intent & High-Level Summary
The primary goal of this feature is to support a Replicated Erasure Coded (EC) configuration within a multi-zone Ceph cluster (Stretch Cluster), building upon the Fast EC infrastructure.
Note
Throughout this document, a zone refers to a group of OSDs within a single CRUSH failure domain — typically a data center, but it could be any CRUSH bucket type (e.g., rack, room). The key characteristic of a zone boundary is that traffic crossing it incurs a significant performance penalty: higher latency and/or lower bandwidth compared to intra-zone communication.
The term inter-zone link refers to the network connection between these failure domains (e.g., the dedicated link between data centers). This is typically the most bandwidth-constrained and latency-sensitive path in the cluster. There is a strong desire to minimize inter-zone link traffic, even to the extent of issuing additional zone-local reads to avoid sending data across this link.
Currently, Ceph Stretch clusters typically rely on Replica to ensure data availability across zones (e.g., data centers). While Erasure Coding provides storage efficiency, applying a standard EC profile naively across a stretch cluster introduces significant operational problems.
Consider a 2-zone stretch cluster using a standard EC profile of k=4, m=6.
While the high parity count provides sufficient fault tolerance to survive a
full zone loss, this configuration suffers from three fundamental
inefficiencies:
Write Bandwidth Waste: Every write must distribute all
k+mchunks across both zones. The coding (parity) shards are unnecessarily duplicated over the inter-zone link, consuming bandwidth that scales linearly with the number of shards.Reads Must Cross Zones: Serving a read requires retrieving at least
kchunks, which in general will span both zones. Every read operation incurs inter-zone link latency.Recovery is Inter-Zone Link Bound: After a zone failure, recovering the lost chunks requires reading surviving chunks across the inter-zone link, placing the entire recovery burden on the most constrained network path.
This feature introduces a hybrid approach to eliminate these problems: creating
Erasure Coded stripes (defined by k + m) and replicating those stripes
zones times across a specified topology level (e.g., Data Center). Each zone
holds a complete, independent copy of the EC stripe, meaning reads and recovery
can be performed entirely within a single zone. This combines the zone-local storage
efficiency of EC with the zone-level redundancy of a stretch cluster, while
minimizing inter-zone link usage to write replication only.
2. Proposed Configuration (CLI) Changes
To support this topology, the EC Pool configuration interface will be expanded directly within the pool creation command. We intend to address a number of issues with the current CLI design:
Create-then-set flow. A typical setup procedure of a pool requires multiple CLI commands (e.g. create replica, then set num copies, then set stretched, etc… )
Remove the need for an EC profile. Indeed, stretch mode EC pools will be mutually exclusive with the use of a profile.
The current CLI behaviour of accepting either positional or non-positional arguments will be maintained, however no positional arguments will be added for the new functionality.
Supporting stretch EC pools will require changes to several CLI commands. The design document will focus on just the changes that will be made to the pool create CLI to give an idea of how the new CLI will work. Similar changes will be made to the CLIs that allow modification of a pool. It also expected that changes will be made to the CLIs that control stretch mode.
2.1 ceph osd pool create
The ceph osd pool create command will be extended to become a parameterized command.
For backward compatibility, the positional arguments will be maintained and can be specified along side the new paramaters.
2.2.1 Full command syntax
The full command syntax is listed here. Refer to the later sections for pool-type specific details.
ceph osd pool create {pool_name}
[{pg_num} | --pg_num <pg_num>]
[{pgp_num} | --pgp_num <pgp_num>]
[{replicated | erasure} | --pool_type <replicated|erasure>]
[{expected_num_objects} | --expected_num_objects <expected_num_objects>]
# CRUSH Placement Options (Mutually Exclusive)
[ {crush_rule_name} | --crush_rule_name <crush_rule_name> |
[--crush_root <crush_root>] [--zone_failure_domain <zone_failure_domain>] [--osd_failure_domain <osd_failure_domain>] [--crush_device_class <class>] ]
# Replica-Specific Options
[--size <size>]
# Erasure-Specific Options
[--data_shards <num_data_shards>]
[--coding_shards <num_coding_shards>]
[--stripe_unit <stripe_unit>]
[ {erasure_code_profile} | --profile <profile_name> ]
# Topology and Redundancy
[--zones <num_zones>]
[--min_size <min_size>]
# Autoscaling and General Config
[--autoscale_mode=<on,off,warn>]
[--pg_num_min <pg_num_min>]
[--pg_num_max <pg_num_max>]
[--bulk]
[--target_size_bytes <target_size_bytes>]
[--target_size_ratio <target_size_ratio>]
[--crimson]
[--yes_i_really_mean_it]
# Legacy Options
[--stretch_mode]
2.2.2 Legacy positional syntax
The following syntax, taken from the current docs, will be maintained for backward compatibility:
ceph osd pool create {pool_name} [{pg_num} [{pgp_num}]] [replicated] \
[crush_rule_name] [expected_num_objects]
ceph osd pool create {pool_name} [{pg_num} [{pgp_num}]] erasure \
[erasure_code_profile] [crush_rule_name] [expected_num_objects] [--autoscale_mode=<on,off,warn>]
This syntax can be used with the new parameters which do not directly conflict. For example --pg_num cannot be used with its positional counterpart.
2.2.1 Basic Parameters
These are the primary parameters required for standard deployments.
- --pool_type (or positional equivalent)
Definition: The type of pool to create.
Values:
replicatedorerasure.Default Value: Derived from
osd_pool_default_typeif omitted, but highly recommended to specify.
- --size
Definition: The number of OSDs the data is striped over for each object. (Replica only)
Note: Parameter will be ignored, rather than rejected for EC pools, for backward compatibility.
- --data_shards (or --k)
Definition: Within a zone, the number of OSDs the data is striped over for each object. (EC only)
- --coding_shards (or --m)
Definition: Within a zone, the number of OSDs the coding shards are striped over. (EC only)
- --zones
Definition: For a stretched cluster configuration defines the number of zones, each which store a full replica of the pool. (EC or Replica)
Default Value:
1- Behavior: Setting this to >1 creates a stretched pool.
A non-stretched pool achieves redundancy accross OSDs. A stretched pool creates redundancy accross
zones.
Pool Size: The resulting pool
sizeiszones × (k + m).
2.2.2 Advanced
These parameters are intended for advanced users and offer finer control over the cluster layout.
- --stripe_unit
Definition: The amount of data in a shard, per stripe. Sometimes referred to as “chunk size”. (EC only)
Default Value: 16k for Fast EC, 4k for classic EC.
Purpose: Where beneficial for a well-known access pattern, the stripe unit can be tuned to any 4k-divisible value.
- --zone_failure_domain
Definition: The CRUSH bucket type over which zone-redundancy is achieved.
Default Value:
datacenterPurpose: This is the CRUSH bucket type that defines a zone.
Note: Mutually exclusive with
--crush_rule
- --osd_failure_domain
Definition: The CRUSH bucket type over which OSD-redundancy is achieved.
Default Value:
hostNote: Mutually exclusive with
--crush_rule
- --crush_device_class
Definition: Restrict placement to devices of a specific class (e.g.,
ssdorhdd), using the CRUSH device class names in the CRUSH map.Purpose: Only required if a cluster has a mixture of different classes of OSD.
Note: Mutually exclusive with
--crush_rule
- --crush_root
Definition: The root of the CRUSH tree to use. (R2 only)
Default Value: The cluster root (
default).Topology and Validation: Specifying the CRUSH root to use (defaults to
default), the CRUSH level for a zone (defaults todatacenter), and the number of zones (defaults to1) is sufficient to define the pool’s placement:Example 1: In a cluster with 2 datacenters, specifying
--zones 2will create a stretch pool across the 2 datacenters.Example 2: In a cluster with 2 datacenters, specifying
--crush_root DC1will create an pool completely contained in DC1.Validation: If there are N datacenters with the same root and you specify a number of zones M != N, the command will fail because the specified number of zones is different from the number of zones in the CRUSH hierarchy.
Custom Rules: If users want to use a subset of zones (e.g., a special 3-datacenter configuration), they must specify a custom CRUSH rule. A custom CRUSH rule is mutually exclusive with specifying the CRUSH root and/or CRUSH level.
Operational Restrictions: When there are stretch pools, adding or moving a CRUSH bucket that impacts the number of zones for the pool will require a
yes-i-really-mean-itflag, as this is liable to break things.Note: Mutually exclusive with
--crush_rule
- --min_size
Definition: The minimum number of shards required to serve I/O within a zone.
Note: The value should be within allowed specified ranges. For replica pools this range is
1-<num replicas>, for EC pools this is<num data shards>-<num data shards>+<num coding shards>.
- --crush_rule
Definition: Use this CRUSH rule, instead of an auto-generated rule.
Purpose: Create a bespoke CRUSH rule for advanced use cases not covered by the auto rule generation above.
Note: Mutually exclusive with
--crush_root,--osd_failure_domainand--zone_failure_domain
- --profile
Definition: The legacy EC Profile to use.
Note: Mutually exclusive with
--zones: Cannot be used with multi-zone configurations.
Note
The following parameters are existing, standard pool configuration options included here for completeness.
- --pg_num
Definition: The total number of placement groups for the pool.
- --pgp_num
Definition: The total number of placement groups for placement purposes.
Note: This should never be modified. Purely here for backward compaitbility.
- --expected_num_objects
Definition: The expected number of objects for this pool, used to pre-split placement groups at pool creation.
- --autoscale_mode
Definition: The auto-scaling mode for placement groups.
Values:
on,off, orwarn.
- --pg_num_min
Definition: The minimum number of placement groups when auto-scaling is active.
- --pg_num_max
Definition: The maximum number of placement groups when auto-scaling is active.
- --bulk
Definition: Flags the pool as a “bulk” pool to pre-allocate more placement groups automatically.
- --target_size_bytes
Definition: The expected total size of the pool in bytes, used to guide PG autoscaling.
- --target_size_ratio
Definition: The expected ratio of the cluster’s total capacity this pool will consume, used for PG autoscaling.
- --crimson
Definition: Flags the pool to run on Crimson OSD.
Note: Crimson OSD is experimental.
- --yes_i_really_mean_it
Definition: Internal safety override flag. In the context of pool creation, it is specifically used to allow the creation of hidden or system-reserved pools whose names begin with a dot (e.g.,
.rgw.root).
2.2.3 Deprecated
These are used for backward compatibility only.
- --stretch_mode
Definition: Deprecated parameter for Replica pools which configures two zones with half the specified number of replica copies in each zone. Not supported for EC pools.
Note: Mutually exclusive with
--zonesand erasure pools.
2.3 Examples
The following are examples of how the new parameterized ceph osd pool create command simplifies pool creation across different topologies.
Example 1: Basic Replicated Pool Create a standard 3-way replicated pool containing 128 placement groups:
ceph osd pool create my_rep_pool --pool_type replicated --size 3 --pg_num 128
Example 2: Standard Erasure Coded Pool
Create an erasure-coded pool using a k=4, m=2 configuration (yielding a size of 6 shards) with 64 placement groups:
ceph osd pool create my_ec_pool --pool_type erasure --data_shards 4 --coding_shards 2 --pg_num 64
Example 3: Stretched Replicated Pool Create a replicated pool that spans across two datacenters, achieving a total size of 4 (2 replicas in each datacenter):
ceph osd pool create stretch_rep --pool_type replicated --size 4 --zones 2 --zone_failure_domain datacenter
Example 4: Stretched Erasure Coded Pool
Create an erasure-coded pool stretched across two racks. Using a k=4, m=2 configuration per zone across 2 zones creates a total pool size of 12 shards (4 data and 2 coding per rack):
ceph osd pool create stretch_ec --pool_type erasure --data_shards 4 --coding_shards 2 --zones 2 --zone_failure_domain rack
Example 5: Single Datacenter Erasure Coded Pool
Create an EC pool confined entirely to a specific datacenter using the --crush_root parameter:
ceph osd pool create dc1_ec --pool_type erasure --data_shards 4 --coding_shards 2 --crush_root DC1
Example 6: Bulk Erasure Coded Pool with Autoscaling Limits Create an EC pool where the system automatically scales the PG count but enforces a minimum boundary, marking it as a bulk pool:
ceph osd pool create bulk_ec --pool_type erasure --data_shards 6 --coding_shards 3 --autoscale_mode on --pg_num_min 128 --bulk
1. Approaches Considered but Rejected
We evaluated and rejected the following two alternative designs:
3.1 Multi-layered Backends
This approach involved re-using the existing ReplicationBackend to manage
the top-level replication, with EC acting as a secondary layer.
Reason for Rejection: This would require complex, multi-layered peering logic where the replication layer interacts with the EC layer. Ensuring the correctness of these interactions is difficult. Additionally, the front-end and back-end interfaces of the two back ends differ significantly (e.g., support for synchronous reads), which would require substantial refactoring.
Advantage of Proposed Solution: Our chosen bespoke approach minimizes changes to the peering state machine and offers better potential for recovery during complex failure scenarios.
3.2 LRC (Locally Repairable Codes) Plugin
This approach involved configuring the existing LRC plugin to provide multi-zone EC semantics. While LRC is designed for locality-aware erasure coding, it does not address the core problems this design aims to solve:
Reason for Rejection:
Writes still cross zones: LRC distributes all chunks (data, local parity, and global parity) across the full CRUSH topology. Every write operation sends chunks over the inter-zone link, offering no bandwidth savings over standard EC.
Reads still cross zones: Reading the original data requires
kdata chunks, which are spread across all zones. LRC provides no mechanism for a single zone to independently serve reads.Recovery remains primary-centralized: In the current Ceph EC infrastructure, all recovery operations are coordinated by the Primary OSD. While LRC defines local repair groups at the coding level, the EC back end would still need to be extended to delegate recovery execution to remote zones — the same complexity required by this proposal.
Advantage of Proposed Solution: The replicated EC stripe approach places a complete
(k+m)set at each zone, which inherently enables zone-local reads, zone-local single-OSD recovery (without any special coding scheme), and limits inter-zone link usage to write replication. It achieves better locality than LRC with less infrastructure complexity.
4. Configuration Logic (Preliminary)
The interactions between these parameters imply a hierarchy in the CRUSH rule generation:
Top Level: The CRUSH rule selects
zonesbuckets of typezone(e.g., select 2 data centers).Lower Level: Inside each selected
zone, the CRUSH rule selectsk+mOSDs to store the chunks.
This leverages standard CRUSH mechanisms — no changes to CRUSH itself are required.
5. Multi-Zone & Topology Considerations
While this architecture is capable of supporting N-zone configurations, specific attention is given to the common 2-zone High Availability (HA) use case.
Reference Diagrams: For clarity, architectural diagrams and examples within this design documentation will primarily depict a 2-zone HA configuration (
zones=2, with 2 data centers).Logical Scalability: The design is logically N-way capable. The parameter
zonesis not limited to 2; the system supports any valid CRUSH topology wherezonesfailure domains exist.Testing Strategy: Testing will follow a phased approach:
Unit Tests (Initial): The unit test framework will include tests for both 2-zone (
zones=2) and 3-zone (zones=3) configurations, validating the core peering and recovery logic for N-way topologies.Full 2-Zone Testing (Initial Release): 2-zone configurations will be fully tested end-to-end for the initial release, covering all read, write, recovery, and failure scenarios.
Full 3-Zone Testing (Later Release): Full integration and real-world testing of 3-zone configurations will be deferred to a subsequent release.
5.1 Single-Zone Pools
The use case for “single-zone” pools is a Ceph cluster which is split across multiple data centers, but the redundancy is provided by the application. Here, the user can specify a pool which is restricted to a single data center. It should be noted that this means that loss of an inter-zone link will lead to an entire zone being lost (something that would not be the case for two independent clusters).
6. Topologies and Terminology
To illustrate the relationship between the replication layer, the EC layer, and the physical topology, we define the following terms and visualization.
6.1 Logical Diagram (2-Zone HA)
The following diagram illustrates a cluster configured with r=2, k=2,
m=1 spanning two Data Centers.
6.2 Key Definitions
- Primary
The primary OSD in the acting set for a Placement Group (PG). This OSD is responsible for coordinating writes and reads for the PG. (Standard Ceph terminology.)
- Shard
The globally unique position of a chunk within the pool’s acting set. Shards are numbered
0throughzones × (k + m) - 1. In the diagram above, Zone A holds Shards 0, 1, 2 and Zone B holds Shards 3, 4, 5.- Zone-local
An OSD or shard is zone-local if it shares the same
zoneas another OSD or shard.- Zone Primary
An internal EC convention. The Zone Primary is the first primary-capable shard in the acting set that resides in a given zone.
These zone primaries will be used in (post-R1) stretch clusters to perform local-to-zone erasure coding. The intent is to minimize inter-zone bandwidth requirements.
- Remote-zone Shard
A relative term referring to a shard in a different zone. For example: “When recovering data in Zone A, a remote-zone shard may be used to read data”
7. Read Path & Recovery Strategies
This architecture supports multiple read strategies to optimize for locality and handle failure scenarios. The strategies are presented in order of increasing complexity: starting with the baseline read-from-Primary path, then layering direct-read optimizations on top.
7.1 Read from Primary — R1
The standard read path involves the client contacting the Primary OSD.
Local Priority: The Primary will prioritize reading from local-to-zone OSDs (within its own
zone) to serve the request or recover data (reconstruct the stripe). This minimizes inter-zone link traffic.
7.1.1 Zone-local Recovery
When the Primary cannot serve a read from a single zone-local shard (e.g., because
a shard is missing or degraded), it reconstructs the data from the remaining
zone-local shards within its zone.
7.1.2 Zone-Local Recovery
If the Primary cannot reconstruct data using only zone-local shards, it may issue direct reads to remote-zone OSDs. This cross-zone read path serves two purposes:
Backfill / Recovery: When a zone-local OSD is down or being backfilled, the Primary can read the corresponding shard from the remote zone to recover the missing data. It is likely that a zone with insufficient EC chunks to reconstruct locally would be taken offline by the stretch cluster monitor logic, but this fallback ensures availability during transient states.
Medium Error Recovery: If a zone-local OSD returns a medium error (e.g., unreadable sector), the Primary can recover the affected data by reading from remote-zone shards and reconstructing locally.
7.2 Direct Reads — Primary Zone — R1
This feature extends the existing “EC Direct Reads” capability (documented separately) to the stretch cluster topology. A client co-located with the Primary can read data directly from a zone-local shard, bypassing the Primary entirely on the good path. No new code is required — this is the standard EC Direct Read path operating over stretch-cluster shard numbering. Any failure falls back to the Primary read path (Section 7.1).
sequenceDiagram
title Direct Read — Primary Zone
participant C as Client
participant P as Primary
participant ZS as Zone-local Shard
C->>ZS: Direct Read
activate ZS
ZS->>C: Complete
deactivate ZS
C->>ZS: Direct Read
activate ZS
ZS->>C: -EAGAIN
deactivate ZS
C->>P: Full Read
activate P
P->>ZS: Read
activate ZS
ZS->>P: Read Done
deactivate ZS
P->>C: Done
deactivate P
Failure Handling & Redirection (R1):
Any failure encountered during a direct read — whether a missing shard,
-EAGAINrejection, or medium error — results in the client redirecting the operation to the Primary (regardless of the Primary’s location).
7.3 Direct Reads — Zone-Local — R1
The client will read from a zone-local shard.
Zone-Local Shard Identification: The client determines which OSDs in the acting set reside within its own zone by comparing OSD CRUSH locations against its own. This identifies the
k+mshards of the local zone. The client then selects a suitable set of data shards for the direct read.Unavailable Zone-local OSDs: If the zone-local shards are not available (e.g., the zone-local OSDs are down or the client cannot identify any zone-local replica), the client does not attempt a direct read and instead directs the operation to the Primary. In later releases this will be improved to redirect to the local Zone Primary instead.
Direct Read Execution: If a zone-local shard is available, the client sends the read directly to that OSD. As detailed in the
ec_direct_reads.rstdesign, it is generally acceptable for reads to overtake in-flight writes. However, the OSD is aware if it has an “uncommitted” write (a write that could potentially be rolled back) for the requested object. If such a write exists, or if the client’s operation requires strict ordering (e.g. therworderedflag is set), the OSD rejects the operation with-EAGAIN. Data access errors (e.g., a media error on the underlying storage) also cause the OSD to reject the operation.Failure Handling & Redirection (R1):
An -EAGAIN will cause the client to retry the op. For R1, the op will be redriven to the primary. R2 will redrive the op to the zone primary.
7.4 Read from Zone Primary — Later Release
In the R1 release, the primary will handle all failures. In later releases, an op which cannot be processed as a direct read will be directed at the zone primary instead. The zone primary will attempt to reconstruct the read data from the zone-local shards directly.
A conflict due to an uncommitted write will be continue to be handled by the primary, as the zone primary must also rejected an op if an uncommitted write exists for that object.
Zone-local Recovery: A read directed to a Zone Primary will attempt to serve the request by recovering data using only OSDs within the same data center (the remote zone).
Zone Degradation: If a zone has insufficient redundancy to reconstruct data locally, that zone should be taken offline to clients rather than serving reads that would require inter-zone link access. (How? - Needs to be covered elsewhere)
``-EAGAIN`` Behavior: The Zone Primary will return
-EAGAINonly in short-lived transient conditions:
7.4.1 Safe Shard Identification & Synchronous Recovery — Later Release
By default, the Zone Primary does not inherently know which shards are consistent and safe to read. To manage this, the Peering process will be enhanced.
Synchronous Recovery Set: The existing peering Activate message will be extended to include a “Synchronous Recovery Set”. This set identifies shards that are currently undergoing recovery and may not yet be consistent.
Note
The peering state machine has a transition from Active+Clean to Recovery, which implies that recovery can start without a new peering interval. This needs further investigation to ensure the Synchronous Recovery Set is correctly maintained across such transitions.
Consistency Guarantee: Within any single epoch, the Synchronous Recovery Set can only reduce over time (shards are removed as they complete recovery). The set is treated as binary — either the full recovery set is active, or it is empty. This ensures the Zone Primary is conservatively stale: it may reject reads it could safely serve, but will never read from an inconsistent shard.
Initial Behavior:
Shards listed in the Synchronous Recovery Set will not be used for reads by the Zone Primary.
If the remaining available shards are insufficient to reconstruct the data, the Zone Primary will return
-EAGAIN.Recovery messages will signal the Zone Primary when recovery completes, clearing the set.
Subsequent Enhancement:
A mechanism will be added to allow the Zone Primary to request permission to read from a shard within the Synchronous Recovery Set.
This request will trigger the necessary recovery for that specific object (if not already complete) before the read is permitted.
8. Write Path & Transaction Handling
This section details how write operations are managed, specifically focusing on Read-Modify-Write (RMW) sequences and the replication of write transactions to remote zones.
8.1 Direct-to-OSD Writes (Primary Encodes All Shards) — R1
In R1, all writes are coordinated exclusively by the Primary.
Mechanism: The Primary generates all
zones × (k + m)coded shard writes and sends individual write operations directly to every OSD in the acting set, including those in remote zones.RMW Reads: The “read” portion of any Read-Modify-Write cycle is performed locally by the Primary, strictly adhering to the logic defined in Section 7 (Read Path & Recovery Strategies).
Inter-zone traffic: This sends all coded shard data (including parity) over the inter-zone link. While this is not bandwidth-optimal, it requires no Zone Primary involvement in write processing and therefore no new write-path code beyond extending the existing shard fan-out to the larger acting set.
8.2 Replicate Transaction (Preferred Path) — Later Release
Note
This section is deferred to a later release. R1 sends all coded shards directly from the Primary.
sequenceDiagram
title Write to Primary
participant C as Client
participant P as Primary
participant ZS as Zone-local Shard
participant ZP as Zone Primary
participant RS as Remote-zone Shard
C->>P: Submit Op
activate P
P->>RP: Replicate
activate ZP
note over P: Replicate message must contain<br/>data and which shards are to be updated.<br/>When OSDs are recovering, the primary<br/>may need to update remote-zone shards directly.
P->>ZS: SubRead (cache)
activate ZS
ZS->>P: SubReadReply
deactivate ZS
P->>ZS: SubWrite
activate ZS
ZS->>P: SubWriteReply
deactivate ZS
ZP->>RS: SubRead (cache)
activate RS
RS->>RP: SubReadReply
deactivate RS
ZP->>RS: SubWrite
activate RS
RS->>RP: SubWriteReply
deactivate RS
ZP->>P: ReplicateDone
deactivate ZP
P->>C: Complete
deactivate P
This mechanism is designed to minimize inter-zone link bandwidth usage, which is often the most constrained resource in stretch clusters.
Mechanism: The replicate message is an extension of the existing sub-write message. Instead of sending individually coded shard data to every OSD in the remote zone, the Primary sends a copy of the PGBackend transaction (i.e., the raw write data and metadata, prior to EC encoding) to the Zone Primary.
Role of Zone Primary: Upon receipt, the Zone Primary processes the transaction through its local ECTransaction pipeline — largely unmodified code — to generate the coded shards for its zone. For writes smaller than a full stripe, this includes the Zone Primary issuing reads to its zone-local shards so that the parity update can be calculated. It then fans out the shard writes to the zone-local OSDs within its
zone. This means coding (parity) data is never sent over the inter-zone link; it is computed independently at each zone.Completion: The Zone Primary responds using the existing sub-write-reply message once all zone-local shard writes are durable.
Crash Handling: If the Zone Primary crashes mid-fan-out, any partial writes on remote-zone shards are rolled back by the existing peering process. No new crash-recovery mechanisms are required.
Benefit: This reduces inter-zone link traffic to a single transaction message per remote zone, carrying only the raw data — not the
k+mcoded chunks.
8.3 Client Writes Direct to Zone Primary — Later Release
Note
This section is deferred to a later release. It requires Replicate Transaction (Section 8.2) as a prerequisite, since the Zone Primary must be able to process transactions and fan out shard writes locally.
sequenceDiagram
title Write to Zone Primary
participant C as Client
participant P as Primary
participant ZS as Zone-local Shard
participant RC as Remote Client
participant ZP as Zone Primary
participant RS as Remote-zone Shard
RC->>RP: Write op
activate ZP
ZP->>P: Replicate
activate P
P->>RP: Permission to write
note over ZP: There is potential for pre-emptive caching here,<br/>but it adds complexity and mostly does not actually<br/>help performance, as the limiting factor is actually<br/>the Remote-zone write.
ZP->>RS: SubRead (cache)
activate RS
RS->>RP: SubReadReply
deactivate RS
ZP->>RS: SubWrite
activate RS
RS->>RP: SubWriteReply
deactivate RS
P->>ZS: SubRead (cache)
activate ZS
ZS->>P: SubReadReply
deactivate ZS
P->>ZS: SubWrite
activate ZS
ZS->>P: SubWriteReply
deactivate ZS
P->>RP: write done
ZP->>RC: Complete
note over ZP: The write complete message is permitted<br/>as soon as all SubWriteReply messages have<br/>been received by the remote. It is shown<br/>here to demonstrate that it is required to<br/>complete processing on the Primary.
ZP->>P: Remote-zone write complete
deactivate P
deactivate ZP
P->>P: PG Idle.
P->>RP: DummyOp
activate P
activate ZP
P->>ZS: DummyOp
activate ZS
ZS->>P: Done
deactivate ZS
ZP->>P: Done
deactivate P
note over ZP: Message ordering means that remote<br/>primary does not need to wait for<br/>dummy ops to complete
ZP->>RS: DummyOp
activate RS
RS->>RP: Done
deactivate RS
deactivate ZP
This mechanism allows a client to write to its local Zone Primary rather than sending data over the inter-zone link to the Primary. The key benefit is that the write data crosses the inter-zone link only once (Zone Primary → Primary) rather than twice (client → Primary → Zone Primary). Unlike Section 8.2 where the Primary initiates the transaction and replicates it outward, here the Zone Primary receives the client write directly and coordinates with the Primary to maintain global ordering while avoiding redundant data transfer.
Mechanism:
The client sends the write to its local Zone Primary. The Zone Primary stashes a local copy of the write data but does not yet fan out to zone-local shards.
The Zone Primary replicates the PGBackend transaction (raw write data, prior to EC encoding) to the Primary.
The Primary processes the transaction as normal through its local ECTransaction pipeline, performing cache reads, encoding, and fanning out shard writes to its zone-local OSDs. When the Primary reaches the step where it would normally replicate data to the originating Zone Primary (as in Section 8.2), it instead sends a write-permission message to that Zone Primary, since the Zone Primary already holds the data. Writes to any other Zone Primaries (in an
zones > 2configuration) proceed via the normal Replicate Transaction path (Section 8.2).Upon receiving write-permission, the Zone Primary processes the transaction through its local ECTransaction pipeline, generating and fanning out the coded shard writes to the zone-local OSDs within its
zone.Once the Primary’s own local writes are durable and all other zones have confirmed durability, the Primary sends a completion message to the originating Zone Primary. The Zone Primary then responds to the client once the completion message is received and its own local writes are also durable.
Write Ordering: Strict write ordering must be maintained across all zones. Since the Primary remains the single authority for PG log sequencing:
The Primary sequences the write as part of its normal processing in step 3. The write-permission message carries the assigned sequence number, ensuring that writes arriving at the Primary directly and writes arriving via a Zone Primary are globally ordered.
Writes from multiple Zone Primaries (in an
zones > 2configuration) are serialized through the Primary’s normal sequencing mechanism — no special handling is required beyond the existing PG log ordering.The Zone Primary does not fan out to zone-local shards until it receives the write-permission message, guaranteeing that zone-local shard writes occur in the correct global order.
Benefit: For workloads where the client is co-located with a remote zone, write data traverses the inter-zone link exactly once (Zone Primary → Primary transaction replication). This halves the inter-zone bandwidth compared to R1 (client → Primary → Zone Primary), and is equivalent to Section 8.2 but with the added advantage that the client experiences local write latency for the initial acknowledgement. Crucially, the data is never sent back to the originating Zone Primary — the Primary only sends lightweight permission and completion messages.
Failure Handling: If the Zone Primary is unavailable, the client falls back to writing directly to the Primary (Section 8.1 or 8.2, depending on what is available). If the Primary is unreachable from the Zone Primary mid-transaction, the Zone Primary returns an error to the client and any stashed data is discarded (no zone-local shard writes have been issued, since write-permission was never received).
R3 Enhancement: Forward to Other Zone Primaries (``zones > 2``):
Note
This enhancement is a potential R3 feature and may not be implemented.
In topologies with more than two zones, the originating Zone Primary could forward the transaction to the other Zone Primaries in parallel with sending it to the Primary. This would improve write latency for those zones by allowing them to receive the data directly from a peer Zone Primary rather than waiting for the Primary to replicate outward. The Primary would still issue write-permission messages to all Zone Primaries to maintain global ordering, but the data transfer to additional zones would already be complete, reducing the critical path.
Complexity / Review Note: There is a known race condition here. The write-permission message originating from the Primary might overtake the data transfer sent by the originating Zone Primary. To handle this, a unique transaction ID will be required to definitively tie these two independent messages together at the receiving Zone Primary.
8.4 Hybrid Approach — Later Release
Note
Requires Replicate Transaction (Section 8.2). Deferred to a later release.
It is acknowledged that complex failure scenarios may require a combination of Replicate Transaction, Client-via-Remote-Primary, and Direct-to-OSD approaches.
Example: In a partial failure where one remote zone is healthy and another is degraded, the Primary may use Replicate Transaction for the healthy zone while simultaneously using Direct-to-OSD for the degraded zone within the same transaction lifecycle.
9. Recovery Logic
All recovery operations are centralized and coordinated by the Primary OSD.
9.1 Primary-Centric Recovery — R1
In R1, the Primary performs all recovery without delegating to Zone Primaries and without attempting to minimize inter-zone bandwidth.
Zone-local Recovery: The Primary recovers missing shards in its own
zoneusing available zone-local chunks, exactly as standard EC recovery.Remote-zone Recovery: For every remote
zone, the Primary reads all shards required to reconstruct any missing data, encodes the missing shards locally, and pushes them directly to the target remote-zone OSDs.No Zone Primary Coordination: The Zone Primary plays no role in recovery in R1. All cross-zone data flows from or to the Primary.
Trade-off: This approach may send more data over the inter-zone link than necessary, but it avoids the complexity of remote-delegated recovery and uses existing recovery infrastructure with minimal modification.
9.2 Cost-Based Recovery Planning — Later Release
Note
This section is deferred to a later release. R1 uses Primary-centric recovery (Section 9.1) for all scenarios.
When the system identifies an individual object (including its clones) requiring recovery, the Primary executes a per-object assessment and planning phase. This leverages the existing, reactive, object-by-object recovery mechanism rather than introducing a broad distributed orchestration layer. The overriding goal is to minimize inter-zone link bandwidth — every cross-zone transfer is expensive, so the Primary must dynamically choose the recovery strategy for each object that results in the fewest bytes crossing zone boundaries.
Assess Capabilities: For each
zone, the Primary determines how many consistent chunks are available locally and how many are missing.Cost-Based Plan Selection: The Primary evaluates the inter-zone bandwidth cost of each viable recovery strategy and selects the cheapest option. The key trade-off is between:
Zone Primary performs zone-local recovery with remote-zone reads — the Zone Primary reads a small number of missing shards from the Primary’s zone and reconstructs locally.
Primary reconstructs and pushes — the Primary reconstructs the full object and pushes the missing shards to the remote zone.
The optimal choice depends on how many shards each zone is missing.
Example: Consider a
6+2(k=6, m=2) configuration where Zone A has 8 good shards and Zone B has only 5 good shards (3 missing). Two options:Option A (Zone Primary reads remotely): The Zone Primary at Zone B needs only 1 remote-zone read (it has 5 of the 6 required chunks locally and fetches 1 from Zone A) to reconstruct the 3 missing shards locally. Cost: 1 shard across the inter-zone link.
Option B (Primary pushes): The Primary reconstructs the 3 missing shards at Zone A and pushes them to Zone B. Cost: 3 shards across the inter-zone link.
Option A is clearly cheaper. A general algorithm should compare the cross-zone transfer cost for each strategy and choose the minimum. Where inter-zone bandwidth consumption is equal, the algorithm should tie-break by optimizing for zone-local read bandwidth or the total number of operations.
(In the future, the system may adapt these priorities — e.g. explicitly optimizing for read count rather than inter-zone bandwidth on high-bandwidth links to slower rotational media — whether through auto-tuning or operator configuration. However, exploring alternative optimization modes is beyond the scope of this design.)
Note
The detailed algorithm for optimal recovery planning requires further design work. The general principle is: count the number of cross-zone shard transfers each strategy would require and choose the minimum.
Plan Construction: For each domain, the Primary constructs a “Recovery Plan” message containing specific instructions detailing which OSDs must be used to read the available chunks and which OSDs are the targets to write the recovered chunks. Where cross-zone reads are part of the plan, the message specifies which remote-zone shards to fetch.
9.3 Delegated Remote-zone Recovery — Later Release
Note
Requires Cost-Based Recovery Planning (Section 9.2). Deferred to a later release.
Remote-zone Recovery (Plan-Based): The Primary sends a per-object “Recovery Plan” to the Zone Primary (or appropriate peers), instructing them to execute the recovery for that object (and clones) locally within their domain using the provided read/write set. If the plan includes cross-zone reads, the Zone Primary fetches the specified shards before reconstructing. Because this ties directly into the existing object recovery lifecycle, any failure of a remote-zone peer during the plan simply drops the recovery op. The Primary detects the failure via standard peering or timeout mechanisms and will naturally re-assess and generate a new plan for the object.
9.4 Fallback Strategies — Later Release
Note
These strategies are part of the Cost-Based Recovery system (Section 9.2) and are deferred to a later release.
9.4.1 Remote-zone Fallback (Push Recovery)
If a remote zone cannot perform zone-local recovery, and the cost-based analysis determines that Primary-side reconstruction and push is the cheapest option:
The Primary reconstructs the full object locally (using whatever valid shards are available across the cluster).
The Primary pushes the full object data to the Zone Primary.
This push is accompanied by a set of instructions specifying which shards in the remote zone must be written.
9.4.2 Primary Domain Fallback
If the Primary’s own domain lacks sufficient chunks to recover locally:
The Primary is permitted to read required shards from remote zones to perform the reconstruction.
While this crosses zone boundaries, the cost-based planning ensures it is only chosen when it results in fewer cross-zone transfers than any alternative.
10. PG Log Handling — R1
This section describes how PG log entries are managed across replicated EC stripes, building on the FastEC design for primary-capable and non-primary shards.
10.1 Primary-Capable vs. Non-Primary Shards
The existing FastEC design distinguishes between two classes of shard:
Primary-capable shards: Maintain a full copy of the PG log, enabling them to take over as Primary if needed.
Non-primary shards: Maintain a reduced PG log, omitting entries where no write was performed to that specific shard.
For replicated EC stretch clusters, the non-primary shard selection algorithm will be extended. The following shards will be classified as primary-capable (i.e., they maintain a full PG log):
The first data shard in each replica (per
zone).All coding (parity) shards in each replica.
All remaining data shards will be non-primary shards with reduced logs.
Note
Implementation detail: it needs to be determined whether pg_temp should
be used to arrange all primary-capable shards (local, then remote) ahead of
non-primary-capable shards in the acting set ordering.
10.2 Independent Log Generation at Zone Primary — Later Release
Note
Log generation at the Zone Primary is only needed when Replicate Transaction writes (Section 8.2) are implemented. For R1, where the Primary sends pre-encoded shards directly, PG log entries are generated solely by the Primary and distributed with the sub-write messages as per existing behavior.
For latency optimization, the replicate transaction message (see Section 8.2) will be sent to the Zone Primary before the cache read has completed on the Primary. This means the Zone Primary cannot receive a pre-built log entry from the Primary — it must generate its own PG log entry independently from the transaction data.
Both the Primary and Zone Primary will produce equivalent log entries from the same transaction, but they are generated independently at each zone.
10.3 Log Entry Compatibility & Upgrade Safety — Later Release
The osdmap.requires_osd_release field dictates the minimum code level that
OSDs can run, and therefore determines the log entry format that must be used.
No new versioning mechanism is required — the existing release gating already
constrains log entry compatibility across mixed-version clusters.
Because the Zone Primary generates its own log entry from the transaction
data, it must populate the Object-Based Context (OBC) to obtain the old
version of attributes (including object_info_t for old size). This is an
additional reason the Zone Primary must be a primary-capable shard — only
primary-capable shards maintain the full PG log and OBC state needed for
correct log entry construction.
Debug Mode for Log Entry Equivalence
A debug mode will be implemented that compares the log entries generated by the Primary and Zone Primary for each transaction. When enabled, the Primary will include its generated log entry in the replicate transaction message, and the Zone Primary will assert that its independently generated entry is identical. This mode will be enabled by default in teuthology testing to catch any divergence early. It will be available as a runtime configuration option for production debugging but disabled by default in production due to the additional message overhead.
Warning
If a future release changes how PG log entries are derived from transactions, the debug equivalence mode provides an automated safety net. Any divergence between Primary and Zone Primary log entries will be caught immediately in CI.
11. Peering & Stretch Mode — R1
This section covers the peering, min_size, and stretch-mode integration
required for replicated EC pools. The design follows the existing replica
stretch cluster pattern as closely as possible, with EC-specific adaptations.
Important
R1 scope is simple two-zone (zones=2). Three-zone (zones=3) details are
described for design completeness but will be implemented in a later
release.
11.1 Pool Lifecycle
Note
CLI Under Review: We are reviewing the CLIs for enabling stretch mode and
setting stretch mode on a pool. We aim to either get rid of it completely
(i.e., infer stretch mode automatically when you create a pool with zones > 1)
or just have an enable/disable stretch mode CLI with no arguments. We also
will ensure the new CLI works with both stretch-EC and stretch-replica pools for R1.
A replicated EC pool implicitly operates in stretch mode. Creation and configuration will be simplified based on the final CLI iteration as noted above.
CRUSH rule: Set to the stretch CRUSH rule when stretch mode is active.
min_size: Managed by the monitor. Automatically adjusted during stretch mode state transitions (Section 11.3).
Peering: Uses stretch-aware acting set calculation with parameters adjusted by the monitor during state transitions.
Failure: Automatic
min_sizereduction, degraded/recovery/healthy stretch mode transitions — all managed by OSDMonitor.
Prerequisites for ``zones > 1`` Pool Creation
Requirement |
Reason |
|---|---|
All OSDs have the |
Ensures all OSDs understand extended acting-set semantics (Section 15) |
Stretch mode must be enabled on the cluster |
|
|
The CRUSH rule must align with the stretch cluster topology |
11.2 Minimum PG Size Semantics
The user defines a min_size for a single zone, which implicitly specifies
the number of failures the pool will tolerate. For an EC pool, this is in the
range K to K+M, defining a tolerance of 0 to M failures. For a
replica pool, this is in the range 1 to size, defining a tolerance of
0 to size - min_size failures.
Let the number of tolerated failures derived from this setup be denoted as F.
If the number of zones > 1, then this setting is dynamically interpreted
according to the cluster’s stretch mode (Healthy, Degraded, Recovery). Both
EC and Replica pools with multiple zones will interpret min_size this way,
rather than actively modifying the min_size setting whenever a stretch
mode transition occurs.
11.2.1 Per-Pool Min-Size Interpretations — R1
The pool statically tolerates up to F OSD failures, but that tolerance
applies to different scopes based on the active stretch mode:
Healthy Mode: There may be up to
FOSD failures across the OSDs in all zones combined.Degraded Mode: There may be up to
FOSD failures across the OSDs in the surviving zone(s). The failed zone is entirely ignored.Recovery Mode: There may be up to
FOSD failures across the OSDs in the surviving zone(s). The recovering zone is entirely ignored.
Note
A partial failure within a zone may drop a pool below its min_size
requirement and cause I/O to stop. Manually removing the rest of the failed
zone will cause a transition to Degraded Stretch Mode, which might be
sufficient to bring the pool back online because the min_size requirement
is now met by the surviving zone. There will be no automation to promote a
partial zone failure to a whole zone failure.
11.2.2 Per-Zone Min-Size Interpretations — Later Release
A more sophisticated implementation will reinterpret this tolerance strictly on a per-zone basis:
There may be up to
FOSD failures in each individual zone.If any single zone experiences more than
FOSD failures, it will automatically trigger a transition into Degraded Stretch Mode, effectively treating all OSDs in the zone as failed.
11.3 Stretch Mode State Machine
When stretch mode is enabled, the state machine behaves identically to the
replica stretch mode state machine — leveraging the existing OSDMonitor
infrastructure. As noted above, the explicit min_size setting is simply
interpreted against this state machine, rather than actively mutated by the
OSDMonitor upon transitions.
stateDiagram-v2
[*] --> Healthy: enable_stretch_mode
Healthy --> Degraded: zone failure detected
Degraded --> Recovery: failed zone returns
Recovery --> Healthy: all PGs clean
Healthy --> [*]: disable_stretch_mode
Degraded --> Recovery: force_recovery_stretch_mode CLI
Recovery --> Healthy: force_healthy_stretch_mode CLI
Two-Zone Transitions (zones=2) — R1
Stretch State |
min_size |
Reasoning |
|---|---|---|
Healthy |
|
Full redundancy; tolerate up to M failures |
Degraded (one zone down) |
|
One zone lost (K+M shards gone). Surviving zone has K+M shards;
can tolerate M more losses. |
Recovery (zone returning) |
|
Keep reduced min_size until resync is complete |
Healthy (resync complete) |
|
Full min_size restored |
Concrete example — K=2, M=1, zones=2 (size=6):
Stretch State |
min_size |
|---|---|
Healthy |
5 |
Degraded |
2 |
Recovery |
2 |
Healthy (restored) |
5 |
The degraded-mode formula:
Healthy min_size:
zones × (K+M) − MDegraded min_size:
(num_zones−1) × (K+M) − M= healthy min_size minus(K+M)Rule: if a zone fails, reduce min_size by K+M. If a zone becomes healthy again, increase min_size by K+M.
Three-Zone Transitions (zones=3) — Later Release
Stretch State |
min_size |
Example (K=2, M=1) |
|---|---|---|
Healthy (3 zones) |
|
8 |
One zone down |
|
5 |
Two zones down |
|
2 |
11.4 OSDMonitor Changes
The existing OSDMonitor stretch mode code contains
if (is_replicated()) { ... } else { /* not supported */ } patterns in
several places. These gaps must be filled for EC pools with zones > 1.
11.4.1 Pool Stretch Set / Unset (prepare_command_pool_stretch_set,
prepare_command_pool_stretch_unset)
stretch_set currently works for any pool type, setting
peering_crush_bucket_*, crush_rule, size, min_size.
For EC pools with zones > 1, add validation:
Validate
min_size ∈ [num_zones×(K+M)−M, num_zones×(K+M)]If
sizeis provided, validate it matcheszones × (K+M)
stretch_unset clears all peering_crush_* fields. No EC-specific
changes required.
11.4.2 Enable/Disable Stretch Mode (try_enable_stretch_mode_pools)
Currently rejects EC pools. Change to accept EC pools with zones > 1:
Set
peering_crush_bucket_count,peering_crush_bucket_target,peering_crush_bucket_barrier(same values as replica)Set
crush_ruleto the stretch CRUSH ruleSet
size = r × (k + m)(should already be correct from pool creation)Set
min_size = r × (k + m) − m
Pool Creation Gate: Pool creation with zones > 1 must be rejected if
stretch mode is not already enabled on the cluster. This is validated in
OSDMonitor::prepare_new_pool.
11.4.3 Degraded Stretch Mode (trigger_degraded_stretch_mode)
Currently sets newp.min_size = pgi.second.min_size / 2 for replica
pools. For EC pools, compute:
newp.min_size = p.min_size - (k + m)
For a K=2, M=1, zones=2 pool: min_size = 5 − 3 = 2.
Also set peering_crush_bucket_count and
peering_crush_mandatory_member as for replicated pools.
11.4.4 Healthy Stretch Mode (trigger_healthy_stretch_mode)
Currently reads mon_stretch_pool_min_size config for replica pools.
For EC pools, compute from the EC profile:
newp.min_size = r × (k + m) − m
11.4.5 Recovery Stretch Mode (trigger_recovery_stretch_mode)
No changes required — does not modify min_size.
11.5 Peering State Changes
11.5.1 Acting Set Calculation
calc_replicated_acting_stretch is pool-type agnostic at the CRUSH bucket
level, but for EC pools the acting set has shard semantics — position
i must serve shard i. A new calc_ec_acting_stretch function (or
an extension of the existing calc_ec_acting) is required.
Algorithm for each position i (0 to num_zones×(K+M)−1):
Prefer
up[i]if usableOtherwise prefer
acting[i]if usableOtherwise search strays for an OSD with shard
iIf no usable OSD found, leave as
CRUSH_ITEM_NONE
CRUSH bucket_max constraints apply: no zone may contribute more than
size / peering_crush_bucket_target OSDs.
Note
This is essentially calc_ec_acting with the additional CRUSH bucket
awareness from the stretch path.
11.5.2 Async Recovery
choose_async_recovery_replicated checks stretch_set_can_peer() before
removing an OSD from async recovery. An EC stretch variant must respect shard
identity: removing an OSD must not cause any zone to drop below K shards.
11.5.3 Stretch Set Validation (stretch_set_can_peer)
For EC pools, additionally verify:
The acting set includes at least K shards in at least one surviving zone
In healthy mode, all zones have
K+Mshards
11.6 Relationship to peering_crush_bucket_* Fields
The existing pg_pool_t fields remain unchanged:
Field |
Purpose |
EC Stretch Usage |
|---|---|---|
|
Min distinct CRUSH buckets in acting set |
zones (zones) in healthy; reduced during degraded |
|
Target CRUSH buckets for |
zones (zones) in healthy; reduced during degraded |
|
CRUSH type level (e.g., datacenter) |
Same as replica — the failure domain |
|
CRUSH bucket that must be represented |
Set to surviving zone during degraded mode |
Example values for zones=2, K=2, M=1 (size=6):
Stretch State |
bucket_count |
bucket_target |
bucket_max |
mandatory_member |
|---|---|---|---|---|
Healthy |
2 |
2 |
3 |
NONE |
Degraded |
1 |
1 |
6 |
surviving_site |
Recovery |
1 |
1 |
6 |
surviving_site |
Healthy (restored) |
2 |
2 |
3 |
NONE |
bucket_max = ceil(size / bucket_target) — in healthy mode, 6 / 2 = 3 =
K+M. This naturally prevents more than one copy of each shard per zone.
11.7 Network Partition Handling
The existing Ceph stretch cluster mechanism handles network partitions as follows:
MON-Based Zone Election: The MONs are distributed between zones (plus a tie-break zone). When the inter-zone link is severed, the MONs elect which zone continues to operate.
OSD Heartbeat Discovery: OSDs heartbeat the MONs and will discover if they are attached to the losing zone. OSDs on the losing zone stop processing IO.
Lease-Based Read Gating: A lease determines whether OSDs are permitted to process read IOs. The surviving zone must wait for the lease to expire before new writes can be processed, ensuring no stale reads are served by the losing zone during the transition.
This mechanism does not preclude more complex failure scenarios that may also need to be treated as zone failures:
Asymmetric Network Splits: OSDs at a remote zone may be able to communicate with a zone-local MON even though the MONs have determined a zone failover has occurred.
Partial Zone Failures: Loss of a rack or subset of OSDs within a zone may warrant treating the zone as failed (e.g., if the remaining shards fall below the per-zone
min_sizethreshold defined in Section 11.2).
11.8 Online OSDs in Offline Zones
When a zone is marked as offline but individual OSDs within that zone remain
reachable, they must be removed from the up set that is presented to the
Objecter and peering state machine. The mechanism ties to the min_size
logic described in Section 11.2:
Removal: If fewer than
kOSDs from a zone are present in the up set, the remaining OSDs for that zone are also removed. This forces a zone failover and prevents partial-zone IO.Reintegration: When enough OSDs return to meet the per-zone threshold, they are added back into the up set. Standard peering will handle resyncing data — no new recovery code is required for reintegration.
12. Scrub Behavior — R1
The only modification required to scrub is how it verifies the CRCs received from each shard during deep scrub. The deep scrub process must:
Verify the coding CRC, where the EC plugin supports it. This is current behavior but must be extended to cover the replicated shards (i.e., verifying the coding relationship within each zone’s stripe independently).
Verify cross-replica CRC equivalence: corresponding shards in each replica must have the same CRC. For example, SiteShard 0 at Zone A must match SiteShard 0 at Zone B.
Scrub never reads data to the Primary — it only collects and compares CRCs reported by each OSD. This means inter-zone bandwidth is not a significant concern, and a primary-centric scrub process continues to make sense for replicated EC stretch clusters.
13. Migration Path
13.1 Pool Migration — R1
Migration from an existing EC pool to a replicated EC stretch cluster pool will leverage the Pool Migration design, which is being implemented for the Umbrella release. This document does not define a separate migration mechanism.
13.2 In-Place Replica Count Modification — Later Release
In a later release, users will be able to modify zones for an existing pool by
swapping in a new EC profile that is otherwise identical (same k, m,
plugin, etc.) but specifies a different zones value. Upon profile change, the
CRUSH rule will be updated to reflect the new replica count, and the standard
recovery process will automatically perform all necessary expansion (or
contraction) to match the new configuration — no manual data migration is
required.
14. Implementation Order
This section defines the phased implementation plan, broken into single-sprint stories.
14.1 R1 — Read-Optimized Stretch Cluster
R1 delivers a functional EC stretch cluster with zone-local read optimization (comparable to what Replica stretch clusters provide today). All writes and recovery traverse the inter-zone link via the Primary.
CRUSH Rule Generation for Replicated EC Implement automatic CRUSH rule creation from the
zonesandzone_failure_domainpool parameters (Section 4). Validate that the acting set placesk+mshards perzone.EC Profile Extension: ``zones`` and ``zone_failure_domain`` Parameters Extend the EC profile and pool configuration to accept and validate
zonesandzone_failure_domain. Wire these into pool creation andceph osd poolcommands (Section 2).Primary-Capable Shard Selection for Replicated EC Extend the non-primary shard selection algorithm to designate the first data shard and all parity shards per replica as primary-capable (Section 10.1).
Stretch Mode and Peering for Replicated EC Broken into the following sub-stories (see Sections 11.1–11.6):
Pool Stretch Set/Unset for EC (OSDMonitor): Allow
osd pool stretch set/unseton EC pools withzones > 1. Add EC-specificmin_sizerange validation (Section 11.4.1).Enable/Disable Stretch Mode for EC (OSDMonitor): Allow
mon enable_stretch_modewhen the cluster has EC pools withzones > 1. Setmin_size = r × (k+m) − m(Section 11.4.2).Stretch Mode Transitions for EC (OSDMonitor): Implement degraded/recovery/healthy transitions. On zone failure, reduce
min_sizebyk+m; on recovery, restore it (Sections 11.4.3–5).EC Peering with Stretch Constraints (PeeringState): Create
calc_ec_acting_stretchto respect both shard identity and CRUSHbucket_maxconstraints. Extend async recovery checks (Section 11.5).is_recoverable / is_readable for Stretch EC: Verify and extend
ECRecPredandECReadPredto account for the larger shard set and per-zone constraints (Section 11.5.3).
Primary Write Fan-Out to All Shards (Direct-to-OSD Writes) Extend the Primary write path to encode and distribute
zones × (k + m)shard writes directly to all OSDs in the acting set (Section 8.1).Direct-to-OSD Reads with Primary Fallback Extend EC Direct Reads to the stretch topology: clients read locally and redirect all failures to the Primary (Sections 7.2, 7.3).
- 7a. Single-OSD Recovery (Within-Zone)
Extend recovery so the Primary can recover a single missing OSD within any replica. The Primary reads shards from the affected replica, reconstructs the missing shard, and pushes it to the replacement OSD (Section 9.1).
- 7b. Full-Zone Recovery
Extend recovery to handle full-zone loss: the Primary reads from its local (surviving) replica, encodes the full stripe, and pushes all
k+mshards to the recovering zone’s OSDs. Also covers multi-OSD failures within a single zone (Section 9.1).
Scrub CRC Verification for Replicated Shards Extend deep scrub to verify cross-replica CRC equivalence for corresponding shards (Section 12).
Network Partition and Zone Failover Integration Validate the existing MON-based zone election and OSD heartbeat mechanisms work correctly with the replicated EC acting set (Section 11.7).
End-to-End 2-Zone Integration Testing Full integration test suite covering read, write, recovery, scrub, and zone-failover scenarios for a 2-zone (
zones=2) configuration.Inter-Zone Statistics Add perf counters tracking cross-zone bytes, operation counts, latency histograms, and zone-local vs remote-zone-zone read ratios to give operators visibility into inter-zone link utilization (Section 17).
14.2 Later Release — Recovery Bandwidth Optimization
These stories reduce inter-zone link bandwidth consumed during recovery, in priority order.
Cost-Based Recovery Planning Implement the assessment and plan-selection algorithm that compares cross-zone transfer costs for each recovery strategy (Section 9.2).
Recovery Plan Message and Zone Primary Execution Define the “Recovery Plan” message format and implement Zone Primary execution of delegated recovery within its
zone(Section 9.3).Push Recovery Fallback Implement the path where the Primary reconstructs the full object and pushes to the Zone Primary with shard-write instructions (Section 9.4.1).
Primary Domain Fallback (Cross-Zone Read for Zone-local Recovery) Allow the Primary to read shards from remote zones when its own domain lacks sufficient chunks, only when cost-based planning selects it (Section 9.4.2).
14.3 Later Release — Write Bandwidth Optimization
These stories reduce inter-zone link bandwidth consumed during writes, in priority order.
Replicate Transaction Message Implement the replicate transaction message that sends raw write data to the Zone Primary instead of pre-encoded shards. The Zone Primary encodes locally (Section 8.2).
Independent PG Log Generation at Zone Primary Enable the Zone Primary to generate its own PG log entries from the replicate transaction, including OBC population and the debug equivalence mode (Sections 10.2, 10.3).
Client Writes Direct to Zone Primary Enable clients to write to their local Zone Primary, which stashes the data, replicates to the Primary, and fans out locally only after receiving write-permission. Includes the write-permission/completion message mechanism to maintain global consistency (Section 8.3).
Hybrid Write Path (Replicate + Direct-to-OSD) Support mixed write strategies within a single transaction when zones are in different health states (Section 8.4).
14.4 Later Release — Additional Enhancements
These stories improve resilience and flexibility but are lower priority than bandwidth optimizations.
Read from Zone Primary Enable clients to send reads to their local Zone Primary for local reconstruction, including the Synchronous Recovery Set mechanism (Sections 7.4, 7.4.1).
Direct-to-OSD Failure Redirection to Zone Primary Upgrade direct-read failure handling to redirect to the local Zone Primary instead of the global Primary (Section 7.5).
Per-Zone min_size with Zone Failover Implement per-zone minimum shard thresholds that automatically trigger zone failover (Section 11.2, later release).
Online OSDs in Offline Zones Handling Implement removal of residual online OSDs from the up set when their zone is offline (Section 11.8).
In-Place Replica Count Modification Allow
zonesto be changed on an existing pool via EC profile swap (Section 13.2).3-Zone (``zones=3``) Full Integration Testing Full integration and real-world testing of 3-zone configurations (Section 5).
15. Upgrade & Backward Compatibility
Replicated EC pools with zones=1 are fully backward compatible. An zones=1
profile produces a standard (k+m) acting set with no additional replication,
no new on-disk format, and no new wire messages. Existing OSDs and clients will
handle these pools without modification, because the resulting behavior is
identical to a conventional EC pool.
Pools with zones > 1 introduce a larger acting set (zones × (k + m) shards),
new CRUSH rules, and — in later releases — new inter-OSD messages
(Replicate Transaction, Read Permissions, etc.). To prevent mixed-version
clusters from misinterpreting these pools:
A new OSD feature bit will gate the creation of
zones > 1profiles. The monitor will reject pool creation or EC profile changes that setzones > 1unless all OSDs in the cluster advertise this feature bit.This ensures that every OSD in the cluster understands the extended acting set semantics, shard numbering, and any new message types before an
zones > 1pool can be instantiated.No data migration is required when upgrading: the feature bit is purely an admission control mechanism. Once all OSDs are upgraded and the bit is present,
zones > 1pools can be created normally.
Note
Upgrade Scenarios: We are still thinking through upgrade scenarios from older
releases (e.g. can we infer zones = 2 pools automatically at upgrade time?).
Open design decisions that will be resolved at implementation:
What to do with stretched replica pools during upgrade. One option is to leave these as is (with num_zones = 1 and the current interpretation of min-size) and continue to support all the arguments on enable/disable stretch mode. An alternative option would be to try and automatically convert these pools to num_zones = 2 and the new interpretation of min-size). These pools should currently have a custom CRUSH rule which should be maintained.
16. Kernel Changes
The kernel RBD client (krbd) implements its own EC direct read path,
independent of the userspace librados client. For replicated EC pools, the
krbd direct read logic must be updated to understand the zones × (k + m)
shard layout so that it can identify and target zone-local shards within the
client’s zone. Without this change, krbd may attempt direct
reads to shards in a remote zone, negating the locality benefit.
The required modification is small — the shard selection logic needs to account for the replica stride when choosing which shard to read from — but it is a kernel-side change and therefore follows the kernel release cycle independently of the Ceph userspace releases.
17. Inter-Zone Statistics — R1
Operators need visibility into inter-zone link utilization to validate that the stretch EC design is delivering its locality benefits and to plan capacity. The following statistics will be exposed per-pool and per-OSD:
Cross-zone bytes sent / received: Total bytes transferred to/from OSDs in remote zones, broken down by operation type:
Write fan-out (shard data sent to remote zone)
Recovery push (shard data sent during recovery)
remote-zone read (shard data read from remote zone for reconstruction or medium-error recovery)
Cross-zone operation counts: Number of cross-zone sub-operations, broken down by type (sub-write, sub-read, recovery push).
Cross-zone latency: Histogram (p50 / p95 / p99) of round-trip latency for cross-zone sub-operations, useful for detecting link degradation.
local vs remote zone read ratio: For direct-read-enabled pools, the fraction of reads served locally versus those that fell back to the Primary across a zone boundary. A high remote-zone ratio indicates a locality problem.
These counters will be exposed through the existing ceph perf counter
infrastructure and will be queryable via ceph tell osd.N perf dump and
the Ceph Manager dashboard. They should also be available at the pool level
via ceph osd pool stats.
Note
The inter-zone classification relies on comparing the CRUSH location of the
source OSD with the destination OSD’s zone. This comparison is
already performed during shard fan-out; the statistics layer adds only
counter increments on the existing code path.
18. Non-Redundant Pool Zone Affinity
A stretch cluster topology may additionally host workloads that do not require multi-zone redundancy, or where redundancy is handled at a higher application layer. To accommodate this, a non-redundant pool can be configured with zone affinity.
This is achieved by setting up the pool to use a specific CRUSH root. When creating the pool, you set the zones to 1 (the default) and pass a crush_root parameter targeting a specific datacenter bucket or zone bucket within your CRUSH hierarchy.
Implementation Details
When configuring a pool with affinity to a specific zone, the system generates a
CRUSH rule that performs a standard take <crush_root> operation on the
designated zone bucket.
For instance, if a cluster has two datacenters defined in CRUSH as DC1 and
DC2, configuring a pool with crush_root=DC1 and zones=1 will prompt the
system to generate a CRUSH rule that starts with take DC1, ensuring all
data for that pool resides completely within that datacenter. This allows non-redundant
applications to leverage zone-local storage without incurring the latency or bandwidth
costs of crossing the inter-zone link.
Brought to you by the Ceph Foundation
The Ceph Documentation is a community resource funded and hosted by the non-profit Ceph Foundation. If you would like to support this and our other efforts, please consider joining now.