Notice

This document is for a development version of Ceph.

Rados Bucket Index

Buckets in RGW store a list of objects and associated metadata in each bucket’s bucket index. Each bucket index entry stores metadata (size, etag, mtime, etc.) to serve API requests to list objects along with some internal bookkeeping. These APIs are ListObjectsV2 and ListObjectVersions in S3, and GET Container in Swift.

The entries are stored in the index object’s (or index objects’, see sharding below) RADOS omap entries.

Buckets can also be created as ‘indexless’. Such buckets have no index and cannot be listed.

For non-versioned buckets there is one entry in the bucket index for every object. For S3 versioned buckets it’s a little more complex (see S3 Object Versioning for details).

S3 Object Versioning

For versioned buckets the bucket index contains an entry for each object version and delete marker. In addition to sorting index entries by object name, it also has entries that sort object versions of the same name from newest to oldest, which are used for versioned listings.

RGW stores a head object in the rgw.buckets.data pool for each object version. This rados object’s oid is a combination of the object name and its version id.

In S3 a GET/HEAD request for an object name will give you that object’s “current” version. To support this RGW stores an extra ‘object logical head’ (olh) object whose oid includes the object name only and that acts as an indirection to the head object of its current version. This indirection logic is implemented in src/rgw/driver/rados/rgw_rados.cc as RGWRados::follow_olh().

To maintain the consistency between this olh object and the bucket index, the index keeps a separate ‘olh’ entry for each object name. This entry stores a log of all writes/deletes to its versions. In src/rgw/driver/rados/rgw_rados.cc, RGWRados::apply_olh_log() replays this log to guarantee that this olh object converges on the same “current” version as the bucket index.

Consistency Guarantee

RGW guarantees read-after-write consistency on object operations. This means that once a client receives a successful response to a write request, then the effects of that write must be visible to subsequent read requests.

For example, if an S3 client sends a PutObject request to overwrite an existing object followed by a GetObject request to read it back, RGW must not return the previous object’s contents. It must either respond with the new object’s contents or with the result of a later object write or delete.

This consistency guarantee applies to all object write requests (PutObject, DeleteObject, PutObjectAcl, etc) and all object read requests (HeadObject, GetObject, ListObjectsV2, etc).

Rados Object Model

S3/Swift objects, or ‘API objects’, are stored as rados objects in the rgw.buckets.data pool. Each API object is comprised of a head object and zero or more tail objects. Bucket index objects are stored in the rgw.buckets.index pool.

When writing an object, its head object is written last. This acts as an atomic ‘commit’ to make it visible to read requests.

Sharding and Resharding

A bucket’s index is generally split into several RADOS objects that are called bucket index shards. In RADOS multiple writes to the same object (e.g., bucket index objects) cannot run in parallel. By spreading the index over more RADOS objects we increase its write parallelism. For a given object upload, the corresponding bucket index shard is selected based on a hash of the object’s name. This guarantees that all entries for a given object (and its versions, if it’s an S3 versioned bucket) are located on the same shard. This also tends to distribute objects evenly across index shards, although this distribution can break down with objects with lots of versions.

The default shard count for new buckets is 11, but can be overridden in the zonegroup’s bucket_index_max_shards or ceph.conf’s rgw_override_bucket_index_max_shards.

Because there’s a practical limit of 100,000 entries for any RADOS object’s omap, we need to make sure that we use enough shards such that we do not exceed this limit. As the number of objects in a bucket grows, we may have to reshard a bucket index, so there are a sufficient number of shards. The radosgw-admin administration command allows a bucket to be resharded manually, but dynamic resharding allows resharding to occur automatically without administrative intervention.

Dynamic resharding can also reduce the number of shards for a bucket. Because some buckets may experience rapid increases and decreases in the number of objects and because we don’t want to chase those changes with multiple reshards, in order to reduce the number of shards there’s a delay between when the need to reduce the number of shards is noted and when it will occur. At the time the resharding would occur, the bucket is checked again and only proceeds if the bucket still needs its number of bucket index shards reduced. This delay is by default 5 days but can be configured with ceph.conf’s rgw_dynamic_resharding_reduction_wait.

Information about the bucket’s index object layout is stored in RGWBucketInfo as struct rgw::BucketLayout from src/rgw/rgw_bucket_layout.h. The resharding logic is in src/rgw/driver/rados/rgw_reshard.cc.

Index Transaction

To keep the bucket index consistent, all object writes or deletes must also update the index accordingly. Because the head objects are stored in different rados objects than the bucket indices, we can’t update both atomically with a single rados operation. In order to satisfy the Consistency Guarantee for listing operations, we have to coordinate these two object writes using a three-step bucket index transaction:

  1. Prepare a transaction on its bucket index object.

  2. Write or delete the head object.

  3. Commit the transaction on the bucket index object (or cancel the transaction if step 2 fails).

Object writes and deletes may race with each other, so a given object may have more than one prepared transaction at a time. RGW considers an object entry to be ‘pending’ if there are any outstanding transactions, or ‘completed’ otherwise.

This transaction is implemented in src/rgw/driver/rados/rgw_rados.cc as RGWRados::Object::Write::write_meta() for object writes and RGWRados::Object::Delete::delete_obj() for object deletes. The bucket index operations are implemented in src/cls/rgw/cls_rgw.cc as rgw_bucket_prepare_op() and rgw_bucket_complete_op().

Listing

When listing objects, RGW will read all entries (pending and completed) from the bucket index. For any pending entries, it must check whether the head object exists before including that entry in the final listing.

If an RGW crashes in the middle of an Index Transaction, an index entry may get stuck in this ‘pending’ state. When bucket listing encounters these pending entries, it also sends information from the head object back to the bucket index so it can update the entry and resolve its stale transactions. This message is called ‘dir suggest’, because the bucket index treats it as a hint or suggestion.

Bucket listing is implemented in src/rgw/driver/rados/rgw_rados.cc as RGWRados::Bucket::List::list_objects_ordered() and RGWRados::Bucket::List::list_objects_unordered(). RGWRados::check_disk_state() is the part that reads the head object and encodes suggested changes. The corresponding bucket index operations are implemented in src/cls/rgw/cls_rgw.cc as rgw_bucket_list() and rgw_dir_suggest_changes().

Because RGW objects are distributed across their bucket’s index shards based on a hash, there is no lexical ordering across index shards, only within each shard. Therefore an ordered listing is a complex and I/O intensive operation. Batches of entries are retrieved in parallel from each shard, and a selection sort is used to produced a portion of an ordered listing. Once the batch from any one shard is exhausted, another batch is read from each shard and the process continues.

Brought to you by the Ceph Foundation

The Ceph Documentation is a community resource funded and hosted by the non-profit Ceph Foundation. If you would like to support this and our other efforts, please consider joining now.