Notice
This document is for a development version of Ceph.
Rados Bucket Index
Buckets in RGW store a list of objects and associated metadata in each bucket’s bucket index. Each bucket index entry stores metadata (size, etag, mtime, etc.) to serve API requests to list objects along with some internal bookkeeping. These APIs are ListObjectsV2 and ListObjectVersions in S3, and GET Container in Swift.
The entries are stored as RADOS omap entries on the bucket's index object, or across multiple index objects when the index is sharded (see sharding below).
Buckets can also be created as ‘indexless’. Such buckets have no index and cannot be listed.
For non-versioned buckets there is one entry in the bucket index for every object. For S3 versioned buckets it’s a little more complex (see S3 Object Versioning for details).
S3 Object Versioning
For versioned buckets the bucket index contains an entry for each object version and delete marker. In addition to sorting index entries by object name, it also has entries that sort object versions of the same name from newest to oldest, which are used for versioned listings.
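The ordering used for versioned listings can be illustrated with a small sketch. This is only a toy model: the `VersionEntry` type, its `mtime` field used as the "newest first" criterion, and the in-memory sort are assumptions made for illustration. The real index achieves its ordering through the way omap keys are encoded, which this page does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionEntry:
    # Hypothetical, simplified index entry for one object version.
    name: str
    version_id: str
    mtime: int                 # assumed logical timestamp; larger means newer
    delete_marker: bool = False


def versioned_listing(entries):
    """Order entries by object name, then from newest to oldest version."""
    return sorted(entries, key=lambda e: (e.name, -e.mtime))


entries = [
    VersionEntry("a.txt", "v1", mtime=1),
    VersionEntry("b.txt", "v1", mtime=2),
    VersionEntry("a.txt", "v2", mtime=3),
    VersionEntry("a.txt", "v3", mtime=4, delete_marker=True),
]
listing = versioned_listing(entries)
```

In this example the delete marker `v3` of `a.txt` sorts first because it is the newest entry for that name, followed by `v2`, `v1`, and then the single version of `b.txt`.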
RGW stores a head object in the rgw.buckets.data pool for each object version. This rados object’s oid is a combination of the object name and its version id.
In S3 a GET/HEAD request for an object name will give you that
object’s “current” version. To support this RGW stores an extra
‘object logical head’ (olh) object whose oid includes the object name
only and that acts as an indirection to the head object of its current
version. This indirection logic is implemented in
src/rgw/driver/rados/rgw_rados.cc as RGWRados::follow_olh().
To maintain the consistency between this olh object and the bucket
index, the index keeps a separate ‘olh’ entry for each object
name. This entry stores a log of all writes/deletes to its
versions. In src/rgw/driver/rados/rgw_rados.cc,
RGWRados::apply_olh_log() replays this log to guarantee that this
olh object converges on the same “current” version as the bucket
index.
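The indirection and log replay described above can be pictured with a toy model. Everything in this sketch is an assumption made for illustration: the `(epoch, op, version_id)` log format, the `link` and `delete_marker` operation names, and the dictionaries standing in for rados objects. It does not reproduce the logic of RGWRados::follow_olh() or RGWRados::apply_olh_log(); it only shows why replaying an ordered log lets the olh converge on one "current" version.

```python
# Toy stand-ins for rados objects (hypothetical layout).
heads = {
    ("photo.jpg", "v1"): b"old",
    ("photo.jpg", "v2"): b"new",
}
olh = {}  # object name -> current version id, or None for a delete marker


def apply_log(olh, name, log):
    """Replay hypothetical (epoch, op, version_id) entries in epoch order."""
    for _epoch, op, version_id in sorted(log, key=lambda entry: entry[0]):
        if op == "link":
            olh[name] = version_id      # this version becomes current
        elif op == "delete_marker":
            olh[name] = None            # current version is a delete marker


def get_current(name):
    """Follow the olh indirection from an object name to its head object."""
    version_id = olh.get(name)
    if version_id is None:
        return None
    return heads[(name, version_id)]
```

Because the log is applied in epoch order regardless of the order in which entries arrive, replaying the same log twice leaves the olh pointing at the same version.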
Consistency Guarantee
RGW guarantees read-after-write consistency on object operations. This means that once a client receives a successful response to a write request, the effects of that write must be visible to all subsequent read requests.
For example, if an S3 client sends a PutObject request to overwrite an existing object followed by a GetObject request to read it back, RGW must not return the previous object’s contents. It must either respond with the new object’s contents or with the result of a later object write or delete.
This consistency guarantee applies to all object write requests (PutObject, DeleteObject, PutObjectAcl, etc.) and all object read requests (HeadObject, GetObject, ListObjectsV2, etc.).
Rados Object Model
S3/Swift objects, or ‘API objects’, are stored as rados objects in the rgw.buckets.data pool. Each API object consists of a head object and zero or more tail objects. Bucket index objects are stored in the rgw.buckets.index pool.
When writing an object, its head object is written last. This acts as an atomic ‘commit’ to make it visible to read requests.
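A minimal sketch of why writing the head last acts as a commit point. The `ToyPool` class, the `<name>.tail.<i>` naming, and the idea that the head stores the first chunk plus a list of tail oids are all assumptions for illustration, not the actual RGW layout.

```python
class ToyPool:
    """In-memory stand-in for a rados pool (hypothetical)."""

    def __init__(self):
        self.objects = {}

    def write(self, oid, data):
        self.objects[oid] = data

    def read(self, oid):
        return self.objects.get(oid)


def put_api_object(pool, name, chunks):
    """Write the tail objects first, then the head as the commit point."""
    head_data, tail_chunks = chunks[0], chunks[1:]
    tail_oids = []
    for i, chunk in enumerate(tail_chunks):
        oid = f"{name}.tail.{i}"          # hypothetical tail naming
        pool.write(oid, chunk)
        tail_oids.append(oid)
    # Only this final write makes the object visible to readers.
    pool.write(name, {"data": head_data, "tails": tail_oids})


def get_api_object(pool, name):
    """Readers start from the head; without a head there is no object."""
    head = pool.read(name)
    if head is None:
        return None
    return head["data"] + b"".join(pool.read(oid) for oid in head["tails"])
```

If a writer stops after writing some tails but before writing the head, `get_api_object()` still returns `None`, so readers never observe a partially written object.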
Index Transaction
To keep the bucket index consistent, all object writes or deletes must also update the index accordingly. Because the head objects are stored in different rados objects than the bucket indices, we can’t update both atomically with a single rados operation. In order to satisfy the Consistency Guarantee for listing operations, we have to coordinate these two object writes using a three-step bucket index transaction:
1. Prepare a transaction on its bucket index object.
2. Write or delete the head object.
3. Commit the transaction on the bucket index object (or cancel the transaction if step 2 fails).
Object writes and deletes may race with each other, so a given object may have more than one prepared transaction at a time. RGW considers an object entry to be ‘pending’ if there are any outstanding transactions, or ‘completed’ otherwise.
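The three steps and the ‘pending’/‘completed’ states can be sketched as follows. This is a simplified model covering writes only; the `ToyIndex` class, the per-operation `tag`, and the `fail` flag used to simulate a failed head write are hypothetical and do not mirror the signatures of rgw_bucket_prepare_op() or rgw_bucket_complete_op().

```python
class ToyIndex:
    """In-memory stand-in for one bucket index object (hypothetical)."""

    def __init__(self):
        self.entries = {}   # name -> {"pending": set of tags, "meta": dict or None}

    def prepare(self, name, tag):
        entry = self.entries.setdefault(name, {"pending": set(), "meta": None})
        entry["pending"].add(tag)

    def complete(self, name, tag, meta):
        entry = self.entries[name]
        entry["pending"].discard(tag)
        entry["meta"] = meta

    def cancel(self, name, tag):
        entry = self.entries[name]
        entry["pending"].discard(tag)
        # Drop entries that never had a completed write.
        if not entry["pending"] and entry["meta"] is None:
            del self.entries[name]

    def state(self, name):
        return "pending" if self.entries[name]["pending"] else "completed"


def write_object(index, heads, name, tag, data, fail=False):
    index.prepare(name, tag)                          # step 1: prepare
    try:
        if fail:
            raise IOError("simulated head write failure")
        heads[name] = data                            # step 2: write the head
    except IOError:
        index.cancel(name, tag)                       # step 3: cancel
        return False
    index.complete(name, tag, {"size": len(data)})    # step 3: commit
    return True
```

Tracking a set of tags per entry is one way to model racing operations: the entry stays ‘pending’ until every prepared transaction has been committed or cancelled.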
This transaction is implemented in
src/rgw/driver/rados/rgw_rados.cc as
RGWRados::Object::Write::write_meta() for object writes and
RGWRados::Object::Delete::delete_obj() for object deletes. The
bucket index operations are implemented in src/cls/rgw/cls_rgw.cc
as rgw_bucket_prepare_op() and
rgw_bucket_complete_op().
Listing
When listing objects, RGW will read all entries (pending and completed) from the bucket index. For any pending entries, it must check whether the head object exists before including that entry in the final listing.
If an RGW instance crashes in the middle of an Index Transaction, an index entry may get stuck in this ‘pending’ state. When bucket listing encounters these pending entries, it also sends information from the head object back to the bucket index so it can update the entry and resolve its stale transactions. This message is called ‘dir suggest’, because the bucket index treats it as a hint or suggestion.
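The handling of pending entries during listing can be sketched like this. The dictionary layout, the `"update"`/`"remove"` suggestion tuples, and the unconditional way the index applies them are assumptions for illustration; the actual encoding and the conditions under which the index accepts a suggestion are not described here.

```python
def list_bucket(index_entries, heads):
    """Build a listing and collect 'dir suggest' hints for pending entries.

    index_entries: name -> {"pending": bool, "size": int or None}
    heads:         name -> bytes (stand-in for head objects)
    """
    listing, suggestions = [], []
    for name in sorted(index_entries):
        entry = index_entries[name]
        if not entry["pending"]:
            listing.append((name, entry["size"]))
            continue
        # Pending entry: the head object decides whether it is listed.
        head = heads.get(name)
        if head is None:
            suggestions.append(("remove", name, None))
        else:
            size = len(head)
            listing.append((name, size))
            suggestions.append(("update", name, size))
    return listing, suggestions


def apply_suggestions(index_entries, suggestions):
    """Toy index-side handler that resolves stale pending entries."""
    for op, name, size in suggestions:
        if op == "remove":
            index_entries.pop(name, None)
        else:
            index_entries[name] = {"pending": False, "size": size}
```

After the suggestions are applied, a second listing finds only completed entries and no longer needs to read those head objects.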
Bucket listing is implemented in src/rgw/driver/rados/rgw_rados.cc as RGWRados::Bucket::List::list_objects_ordered() and RGWRados::Bucket::List::list_objects_unordered(). RGWRados::check_disk_state() is the part that reads the head object and encodes suggested changes. The corresponding bucket index operations are implemented in src/cls/rgw/cls_rgw.cc as rgw_bucket_list() and rgw_dir_suggest_changes().
Because RGW objects are distributed across their bucket’s index shards based on a hash, there is no lexical ordering across index shards, only within each shard. Therefore an ordered listing is a complex and I/O-intensive operation. Batches of entries are retrieved in parallel from each shard, and a selection sort is used to produce a portion of an ordered listing. Once the batch from any one shard is exhausted, another batch is read from each shard and the process continues.
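The batch-and-select process can be sketched as below. This is a sequential toy model under stated assumptions: each shard is a sorted Python list, `read_batch()` and the `marker` argument are hypothetical helpers, and the batch reads that RGW issues in parallel are done one after another. The example shards are hard-coded rather than derived from a hash so that the result is deterministic.

```python
from bisect import bisect_right


def read_batch(shard, marker, batch_size):
    """Return up to batch_size keys from a sorted shard that sort after marker."""
    start = 0 if marker is None else bisect_right(shard, marker)
    return shard[start:start + batch_size]


def ordered_listing(shards, batch_size, limit):
    """Merge per-shard batches into one lexically ordered listing."""
    result, marker = [], None
    while len(result) < limit:
        # One batch per shard, each starting after the last key emitted.
        batches = [read_batch(shard, marker, batch_size) for shard in shards]
        if not any(batches):
            break                       # every shard is exhausted
        positions = [0] * len(batches)
        while len(result) < limit:
            candidates = [i for i, b in enumerate(batches) if positions[i] < len(b)]
            if not candidates:
                break
            # Selection step: take the smallest key at the front of any batch.
            best = min(candidates, key=lambda i: batches[i][positions[i]])
            result.append(batches[best][positions[best]])
            positions[best] += 1
            marker = result[-1]
            # A full batch that runs out may hide smaller keys still in its
            # shard, so stop selecting and read fresh batches from every shard.
            if positions[best] == len(batches[best]) == batch_size:
                break
    return result


shards = [["a", "d", "g"], ["b", "e", "h"], ["c", "f", "i"]]
first_page = ordered_listing(shards, batch_size=2, limit=4)
```

Stopping as soon as one full batch is exhausted is what keeps the output correct: until that shard is read again, there is no way to know whether its next key sorts before the keys remaining in the other batches.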