This document is for a development version of Ceph.

Partial Object Recovery

Partial Object Recovery improves the efficiency of log-based recovery (vs backfill). Original log-based recovery calculates missing_set based on pg_log differences.

The whole object should be recovery from one OSD to another if the object is indicated modified by pg_log regardless of how much content in the object is really modified. That means a 4M object, which is just modified 4k inside, should recovery the whole 4M object rather than the modified 4k content. In addition, object map should be also recovered even if it is not modified at all.

Partial Object Recovery is designed to solve the problem mentioned above. In order to achieve the goals, two things should be done:

  1. logging where the object is modified is necessary

  2. logging whether the object_map of an object is modified is also necessary

class ObjectCleanRegion is introduced to do what we want. clean_offsets is a variable of interval_set<uint64_t> and is used to indicate the unmodified content in an object. clean_omap is a variable of bool indicating whether object_map is modified. new_object means that osd does not exist for an object max_num_intervals is an upbound of the number of intervals in clean_offsets so that the memory cost of clean_offsets is always bounded.

The shortest clean interval will be trimmed if the number of intervals in clean_offsets exceeds the boundary.

etc. max_num_intervals=2, clean_offsets:{[5~10], [20~5]}

then new interval [30~10] will evict out the shortest one [20~5]

finally, clean_offsets becomes {[5~10], [30~10]}

Procedures for Partial Object Recovery

Firstly, OpContext and pg_log_entry_t should contain ObjectCleanRegion. In do_osd_ops(), finish_copyfrom(), finish_promote(), corresponding content in ObjectCleanRegion should mark dirty so that trace the modification of an object. Also update ObjectCleanRegion in OpContext to its pg_log_entry_t.

Secondly, pg_missing_set can build and rebuild correctly. when calculating pg_missing_set during peering process, also merge ObjectCleanRegion in each pg_log_entry_t.

etc. object aa has pg_log:

26’101 {[0~4096, 8192~MAX], false}

26’104 {0~8192, 12288~MAX, false}

28’108 {[0~12288, 16384~MAX], true}

missing_set for object aa: merge pg_log above –> {[0~4096, 16384~MAX], true}. which means 4096~16384 is modified and object_map is also modified on version 28’108

Also, OSD may be crash after merge log. Therefore, we need to read_log and rebuild pg_missing_set. For example, pg_log is:

object aa: 26’101 {[0~4096, 8192~MAX], false}

object bb: 26’102 {[0~4096, 8192~MAX], false}

object cc: 26’103 {[0~4096, 8192~MAX], false}

object aa: 26’104 {0~8192, 12288~MAX, false}

object dd: 26’105 {[0~4096, 8192~MAX], false}

object aa: 28’108 {[0~12288, 16384~MAX], true}

Originally, if bb,cc,dd is recovered, and aa is not. So we need to rebuild pg_missing_set for object aa, and find aa is modified on version 28’108. If version in object_info is 26’96 < 28’108, we don’t need to consider 26’104 and 26’101 because the whole object will be recovered. However, Partial Object Recovery should also require us to rebuild ObjectCleanRegion.

Knowing whether the object is modified is not enough.

Therefore, we also need to traverse the pg_log before, that says 26’104 and 26’101 also > object_info(26’96) and rebuild pg_missing_set for object aa based on those three logs: 28’108, 26’104, 26’101. The way how to merge logs is the same as mentioned above

Finally, finish the push and pull process based on pg_missing_set. Updating copy_subset in recovery_info based on ObjectCleanRegion in pg_missing_set. copy_subset indicates the intervals of content need to pull and push.

The complicated part here is submit_push_data and serval cases should be considered separately. what we need to consider is how to deal with the object data, object data makes up of omap_header, xattrs, omap, data:

case 1: first && complete: since object recovering is finished in a single PushOp, we would like to preserve the original object and overwrite on the object directly. Object will not be removed and touch a new one.

issue 1: As object is not removed, old xattrs remain in the old object but maybe updated in new object. Overwriting for the same key or adding new keys is correct, but removing keys will be wrong. In order to solve this issue, We need to remove the all original xattrs in the object, and then update new xattrs.

issue 2: As object is not removed, object_map may be recovered depending on the clean_omap. Therefore, if recovering clean_omap, we need to remove old omap of the object for the same reason since omap updating may also be a deletion. Thus, in this case, we should do:

  1. clear xattrs of the object

  2. clear omap of the object if omap recovery is needed

  3. truncate the object into recovery_info.size

  4. recovery omap_header

  5. recovery xattrs, and recover omap if needed

  6. punch zeros for original object if fiemap tells nothing there

  7. overwrite object content which is modified

  8. finish recovery

case 2: first && !complete: object recovering should be done in multiple times. Here, target_oid will indicate a new temp_object in pgid_TEMP, so the issues are a bit difference.

issue 1: As object is newly created, there is no need to deal with xattrs

issue 2: As object is newly created, and object_map may not be transmitted depending on clean_omap. Therefore, if clean_omap is true, we need to clone object_map from original object. issue 3: As object is newly created, and unmodified data will not be transmitted. Therefore, we need to clone unmodified data from the original object. Thus, in this case, we should do:

  1. remove the temp object

  2. create a new temp object

  3. set alloc_hint for the new temp object

  4. truncate new temp object to recovery_info.size

  5. recovery omap_header

  6. clone object_map from original object if omap is clean

  7. clone unmodified object_data from original object

  8. punch zeros for the new temp object

  9. recovery xattrs, and recover omap if needed

  10. overwrite object content which is modified

  11. remove the original object

  12. move and rename the new temp object to replace the original object

  13. finish recovery