MDS Quiesce Protocol
The MDS quiesce protocol is a mechanism for “quiescing” (quieting) a tree in a file system, stopping all write (and sometimes incidentally read) I/O.
The purpose of this API is to prevent multiple clients from interleaving reads and writes across an eventually consistent snapshot barrier where out-of-band communication exists between clients. This communication can lead clients to wrongly believe they have reached a checkpoint that is mutually recoverable via a snapshot.
Note
This is documentation for the low-level mechanism in the MDS for quiescing a tree of files. The higher-level QuiesceDb is the intended API for clients to effect a quiesce.
Mechanism
The MDS quiesces I/O using a new quiesce_path
internal request that obtains
appropriate locks on the root of a tree and then launches a series of
sub-requests for locking other inodes in the tree. The locks obtained will
force clients to release caps and in-progress client/MDS requests to complete.
The sub-requests launched are quiesce_inode internal requests. These will obtain “cap-related” locks which control capability state, including the filelock, authlock, linklock, and xattrlock. Additionally, the new local lock quiescelock is acquired. More information on that lock is given in the next section.
Locks that are not cap-related are skipped because they do not control typical and durable metadata state. Additionally, only Capabilities can give a client local control of a file’s metadata or data.
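For orientation only, here is a tiny C++ sketch of that classification; the enum and helper below are invented stand-ins, not the MDS's real lock type definitions:

    // Hypothetical sketch: which inode locks a quiesce_inode request targets.
    // The names mirror the lock names in the text, but the enum and helper
    // are illustrative only.
    #include <array>
    #include <cstdio>

    enum class LockType { filelock, authlock, linklock, xattrlock,
                          snaplock, policylock, quiescelock };

    // Cap-related locks control capability state, so they are the ones a
    // quiesce_inode request must take; the others are skipped.
    constexpr bool is_cap_related(LockType t) {
      return t == LockType::filelock || t == LockType::authlock ||
             t == LockType::linklock || t == LockType::xattrlock;
    }

    int main() {
      constexpr std::array all{LockType::filelock, LockType::authlock,
                               LockType::linklock, LockType::xattrlock,
                               LockType::snaplock, LockType::policylock};
      for (auto t : all)
        std::printf("lock %d cap-related: %s\n", static_cast<int>(t),
                    is_cap_related(t) ? "yes" : "no");
    }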
Once all locks have been acquired, the cap-related locks are released and the quiescelock is relied on to prevent issuing Capabilities to clients for the cap-related locks. This is controlled primarily by the CInode::get_caps_* methods. Releasing these locks is necessary to allow other ranks with the replicated inode to quiesce without lock state transitions resulting in deadlock. For example, a client wanting Xx on an inode will cause an xattrlock in the LOCK_SYNC state to transition to LOCK_SYNC_EXCL. That state would not allow another rank to acquire the xattrlock for reading, thereby creating a deadlock that persists until the quiesce times out or expires. (A quiesce cannot complete until all ranks have quiesced the tree.)
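The effect of relying on the quiescelock for cap gating can be pictured with a minimal model like the one below. The InodeModel class and get_caps_allowed method are invented for illustration; in the MDS the real checks live in the CInode::get_caps_* methods:

    // Illustrative model only: capability issuance gated on quiescelock
    // state. The idea is that an xlocked quiescelock masks all caps even
    // though the cap-related locks themselves have been released.
    #include <cstdint>
    #include <cassert>

    struct InodeModel {
      bool quiescelock_xlocked = false;
      uint32_t caps_allowed_by_locks = 0xff;  // pretend the lock states allow these

      uint32_t get_caps_allowed() const {
        // While the tree is quiesced, no caps may be issued regardless of
        // what the (now released) cap-related locks would permit.
        return quiescelock_xlocked ? 0 : caps_allowed_by_locks;
      }
    };

    int main() {
      InodeModel in;
      assert(in.get_caps_allowed() == 0xff);
      in.quiescelock_xlocked = true;       // quiesce_inode holds the xlock
      assert(in.get_caps_allowed() == 0);  // clients get nothing until release
    }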
Finally, if the inode is a directory, the quiesce_inode
operation traverses
all directory fragments and issues new quiesce_inode
requests for any child
inodes.
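Putting the pieces together, a deliberately simplified, synchronous sketch of the traversal follows. The real mechanism is a set of asynchronous quiesce_path/quiesce_inode internal requests; the types and function names here are hypothetical:

    // Hypothetical, synchronous sketch of the quiesce traversal. This only
    // shows the shape of the walk, not the MDS's request machinery.
    #include <memory>
    #include <string>
    #include <vector>
    #include <iostream>

    struct Inode {
      std::string name;
      bool is_dir = false;
      bool quiescelock_xlocked = false;
      std::vector<std::shared_ptr<Inode>> children;  // stand-in for dirfrags
    };

    void quiesce_inode(const std::shared_ptr<Inode>& in) {
      // 1. take the cap-related locks (forcing clients to release caps),
      // 2. xlock the quiescelock,
      // 3. drop the cap-related locks again, keeping only the quiescelock.
      in->quiescelock_xlocked = true;
      std::cout << "quiesced " << in->name << "\n";

      // 4. if this is a directory, issue quiesce_inode for every child.
      if (in->is_dir)
        for (auto& child : in->children)
          quiesce_inode(child);
    }

    void quiesce_path(const std::shared_ptr<Inode>& root) {
      // lock the tree root, then fan out to the rest of the tree
      quiesce_inode(root);
    }

    int main() {
      auto root = std::make_shared<Inode>(Inode{"/volumes/grp/subvol", true});
      root->children.push_back(std::make_shared<Inode>(Inode{"file.bin"}));
      quiesce_path(root);
    }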
Inode Quiescelock
The quiescelock is a new local lock for inodes which supports quiescing I/O. It is a type of superlock where every client or MDS operation which requires a wrlock or xlock on a “cap-related” inode lock will also implicitly acquire a wrlock on the quiescelock.
Note
A local lock supports multiple writers and only one exclusive locker. No read locks.
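Those semantics can be modeled with a small counter-based sketch; this is a toy model of a local lock, not the MDS's lock classes:

    // Toy model of a "local lock": many concurrent wrlockers OR a single
    // xlocker, and no rdlocks at all. Purely illustrative.
    #include <cassert>

    struct LocalLockModel {
      int num_wrlocks = 0;
      bool xlocked = false;

      bool try_wrlock() {
        if (xlocked) return false;   // exclusive holder blocks writers
        ++num_wrlocks;
        return true;
      }
      void put_wrlock() { --num_wrlocks; }

      bool try_xlock() {
        if (xlocked || num_wrlocks > 0) return false;
        xlocked = true;              // only one exclusive locker at a time
        return true;
      }
      void put_xlock() { xlocked = false; }
    };

    int main() {
      LocalLockModel q;                 // think: an inode's quiescelock
      assert(q.try_wrlock());           // normal operations take wrlocks...
      assert(q.try_wrlock());           // ...and many can coexist
      assert(!q.try_xlock());           // quiesce must wait for the writers
      q.put_wrlock(); q.put_wrlock();
      assert(q.try_xlock());            // now quiesce_inode holds it exclusively
      assert(!q.try_wrlock());          // new cap-related lockers are denied
    }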
During normal operation in the MDS, the quiescelock
is never held except
for writing. However, when a subtree is quiesced, the quiesce_inode
internal operation will hold quiescelock
exclusively for the entire
lifetime of the quiesce_inode
operation. This denies any new acquisition of the other cap-related inode locks. The quiescelock
must be ordered
before all other locks (see src/include/ceph_fs.h
for ordering) in order to
act as this superlock.
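As a hedged illustration of the superlock behavior, the sketch below expands a requested lock set so that any cap-related wrlock/xlock implicitly pulls in a quiescelock wrlock that orders first. The enum values and the build_lock_set helper are invented; the real logic is in Locker::acquire_locks and the ordering is defined in src/include/ceph_fs.h:

    // Hypothetical sketch: building the lock set for an operation. If a
    // wrlock/xlock on any cap-related lock is requested, a wrlock on the
    // quiescelock is implicitly added, and quiescelock sorts before all
    // other locks so it is acquired first.
    #include <algorithm>
    #include <set>
    #include <iostream>

    enum LockType { quiescelock = 0,   // lowest value => ordered first
                    filelock, authlock, linklock, xattrlock, snaplock };

    constexpr bool is_cap_related(LockType t) {
      return t == filelock || t == authlock || t == linklock || t == xattrlock;
    }

    // Expand a requested wrlock/xlock set the way the text describes.
    std::set<LockType> build_lock_set(std::set<LockType> wanted) {
      bool touches_caps = std::any_of(wanted.begin(), wanted.end(), is_cap_related);
      if (touches_caps)
        wanted.insert(quiescelock);  // mandatory, and sorts to the front
      return wanted;                 // std::set keeps the acquisition order
    }

    int main() {
      for (LockType t : build_lock_set({filelock, xattrlock}))
        std::cout << t << " ";       // prints 0 (quiescelock) first
      std::cout << "\n";
    }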
One primary reason for this quiescelock
is to prevent a client request from
blocking on acquiring locks held by quiesce_inode (e.g. filelock or quiescelock) while still holding locks obtained during normal path
traversal. Notably, the important locks are the snaplock
and policylock
obtained via Locker::try_rdlock_snap_layout
on all parents of the root
inode of the request (the ino
in the filepath
struct). If that
operation waits with those locks held, then a future mksnap
on the root
inode will be impossible.
Note
The mksnap
RPC only acquires a wrlock (write lock) on the
snaplock
for the inode to be snapshotted.
The way quiescelock helps prevent this is by being the first mandatory lock acquired when acquiring a wrlock or xlock on a cap-related lock. Additionally, there is special handling when it cannot be acquired: all locks held by the operation are dropped and the operation waits for the quiescelock to be available. The lock is mandatory in that a call to Locker::acquire_locks with a wrlock/xlock on a cap-related lock will automatically include (add) the quiescelock.
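A rough sketch of that drop-and-wait handling, with invented types standing in for the operation and its held locks:

    // Illustrative only: if the implicitly added quiescelock wrlock cannot
    // be taken, the operation releases every lock it already holds and
    // parks on the quiescelock's wait list instead of blocking while
    // holding locks.
    #include <string>
    #include <vector>
    #include <iostream>

    struct OpModel {
      std::vector<std::string> held;   // locks taken during path traversal
      bool waiting = false;
    };

    bool acquire_locks(OpModel& op, bool quiescelock_available) {
      if (!quiescelock_available) {
        std::cout << "quiescelock unavailable; dropping " << op.held.size()
                  << " held locks and waiting\n";
        op.held.clear();     // drop snaplock/policylock/dentry locks, etc.
        op.waiting = true;   // re-dispatched when the quiescelock is released
        return false;
      }
      op.held.push_back("quiescelock(wr)");
      // ... the cap-related wrlocks/xlocks would follow here ...
      return true;
    }

    int main() {
      OpModel mkdir_op{{"snaplock(rd)", "policylock(rd)", "dn(xlock)"}};
      acquire_locks(mkdir_op, /*quiescelock_available=*/false);
    }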
So, the expected normal flow is that an operation like mkdir
will perform
its path traversal, acquiring parent and dentry locks, then attempt to acquire
locks on the parent inode necessary for the creation of a dentry. The operation
will fail to acquire a wrlock on the automatically included quiescelock, add itself to the quiescelock wait list, and then drop all held locks.
Lookups and Exports
Quiescing a tree results in a number of quiesce_inode operations, one for each inode under the tree. Those operations have a shared lifetime tied to the parent quiesce_path operation. So, once the operations complete the quiesce (but do not finish and release locks), they sit with locks held and do not monitor the state of the tree. This means we need to handle cases where new metadata is imported.
If an inode is fetched via a directory lookup or readdir, the MDS will check whether its parent is quiesced (i.e. is the parent directory's quiescelock xlocked?). If so, the MDS will immediately dispatch a quiesce_inode operation for that inode. Because it's a fresh inode, the operation will immediately succeed and prevent the client from being issued inappropriate capabilities.
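A hedged sketch of that fetch-time check follows; the struct and function names are invented and only show the control flow:

    // Hypothetical sketch: when a lookup/readdir brings a new inode into
    // cache under a quiesced directory, a quiesce_inode operation is
    // dispatched for it right away so no caps can be issued on it.
    #include <iostream>
    #include <string>

    struct DirModel {
      bool quiescelock_xlocked = false;  // "is the parent quiesced?"
    };

    void dispatch_quiesce_inode(const std::string& name) {
      // Fresh inode: no caps outstanding, so this succeeds immediately.
      std::cout << "quiesce_inode dispatched for " << name << "\n";
    }

    void on_inode_fetched(const DirModel& parent, const std::string& name) {
      if (parent.quiescelock_xlocked)
        dispatch_quiesce_inode(name);
    }

    int main() {
      DirModel parent{true};                 // the tree is quiesced
      on_inode_fetched(parent, "newfile");   // found via lookup or readdir
    }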
The second case is handling subtree imports from another rank. This is problematic since the subtree import may have inodes with inappropriate state that would invalidate the guarantees of the reportedly “quiesced” tree. To avoid this, the importer MDS will skip discovery of the root inode for an import if it encounters a directory inode that is quiesced. If skipped, the rank will send a NAK message back to the exporter, which will abort the export.
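Similarly, the import-side check might be pictured as below; the discovery helper and NAK handling are invented stand-ins for the actual export/import message flow:

    // Illustrative only: an importer refusing a subtree import when it runs
    // into a quiesced directory inode while discovering the import root.
    #include <iostream>
    #include <vector>

    struct DirInodeModel {
      bool quiescelock_xlocked = false;
    };

    // Returns true if the import can proceed; false means "send a NAK to
    // the exporter", which then aborts the export.
    bool discover_import_root(const std::vector<DirInodeModel>& path_to_root) {
      for (const auto& dir : path_to_root) {
        if (dir.quiescelock_xlocked) {
          std::cout << "quiesced ancestor found: NAK export\n";
          return false;
        }
      }
      return true;
    }

    int main() {
      std::vector<DirInodeModel> path{{false}, {true}};  // one quiesced dir
      if (!discover_import_root(path))
        std::cout << "export aborted by exporter\n";
    }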