MDS Journaling

CephFS Metadata Pool

CephFS uses a separate (metadata) pool for managing file metadata (inodes and dentries) in a Ceph File System. The metadata pool has all the information about files in a Ceph File System including the File System hierarchy. Additionally, CephFS maintains meta information related to other entities in a file system such as file system journals, open file table, session map, etc.

This document describes how Ceph Metadata Servers use and rely on journaling.

CephFS MDS Journaling

CephFS metadata servers stream a journal of metadata events into RADOS in the metadata pool prior to executing a file system operation. Active MDS daemon(s) manage metadata for files and directories in CephFS.

CephFS uses journaling for a couple of reasons:

Consistency: On an MDS failover, the journal events can be replayed to reach a consistent file system state. Also, metadata operations that require multiple updates to the backing store need to be journaled for crash consistency (along with other consistency mechanisms such as locking).
Performance: Journal updates are (mostly) sequential, hence updates to journals are fast. Furthermore, updates can be batched into a single write, thereby saving the disk seek time involved in updates to different parts of a file. Having a large journal also helps a standby MDS to warm its cache, which helps indirectly during MDS failover.

Each active metadata server maintains its own journal in the metadata pool. Journals are striped over multiple objects. Journal entries which are not required (deemed as old) are trimmed by the metadata server.
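
The journal-then-execute pattern and replay after failover can be illustrated with a toy model. This is only a sketch: ToyJournal, apply_event and replay are hypothetical names, a Python list stands in for the journal objects in the metadata pool, and replay assumes that no entries have been trimmed yet. It is not how the MDS is implemented.

```python
class ToyJournal:
    """Hypothetical stand-in for an MDS journal stored in the metadata pool."""

    def __init__(self):
        self.events = []

    def append(self, event):
        # In a real system this would be a sequential write to RADOS.
        self.events.append(event)


def apply_event(store, event):
    """Apply one metadata event to an in-memory 'backing store' (a dict)."""
    op, path = event
    if op == "create":
        store[path] = {}
    elif op == "unlink":
        store.pop(path, None)  # tolerate replaying an unlink twice


def replay(journal):
    """Rebuild state from the journal, as a replacement MDS would after failover."""
    store = {}
    for event in journal.events:
        apply_event(store, event)
    return store


journal = ToyJournal()
live_state = {}
for event in [("create", "/a"), ("create", "/b"), ("unlink", "/a")]:
    journal.append(event)           # journal the event first
    apply_event(live_state, event)  # then execute the operation

recovered_state = replay(journal)
print(sorted(recovered_state))
```

Because every operation is journaled before it is executed, replaying the journal reproduces the same state that the failed daemon had reached.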

Journal Events

Apart from file system metadata updates, CephFS journals various other events, such as client session information and directory import/export state. The metadata server uses these events to reestablish correct state as required. For example, when journal events are replayed on restart, an event recording that a client had a session with the MDS before the restart causes the MDS to try to reconnect that client.

To examine the list of such events recorded in the journal, CephFS provides a command line utility cephfs-journal-tool which can be used as follows:

cephfs-journal-tool --rank=<fs>:<rank> event get list

cephfs-journal-tool is also used to discover and repair a damaged Ceph File System. (See cephfs-journal-tool for more details.)

Journal Event Types

The following event types are journaled by the MDS:

EVENT_COMMITTED: Mark a request (id) as committed.
EVENT_EXPORT: Maps directories to an MDS rank.
EVENT_FRAGMENT: Tracks various stages of directory fragmentation (split/merge).
EVENT_IMPORTSTART: Logged when an MDS rank starts importing directory fragments.
EVENT_IMPORTFINISH: Logged when an MDS rank finishes importing directory fragments.
EVENT_NOOP: No operation event type for skipping over a journal region.
EVENT_OPEN: Tracks which inodes have open file handles.
EVENT_RESETJOURNAL: Used to mark a journal as reset post truncation.
EVENT_SESSION: Tracks open client sessions.
EVENT_SLAVEUPDATE: Logs various stages of an operation that has been forwarded to a (slave) MDS.
EVENT_SUBTREEMAP: Map of directory inodes to directory contents (subtree partition).
EVENT_TABLECLIENT: Log transition states of the MDS's view of client tables (snap/anchor).
EVENT_TABLESERVER: Log transition states of the MDS's view of server tables (snap/anchor).
EVENT_UPDATE: Log file operations on an inode.
EVENT_SEGMENT: Log a new journal segment boundary.
EVENT_LID: Mark the beginning of a journal without a logical subtree map.

Journal Segments

The MDS journal is composed of logical segments, called LogSegments in the code. These segments are used to collect metadata updates by multiple events into one logical unit for the purposes of trimming. Whenever the journal tries to commit metadata operations (e.g. flush a file create out as an omap update to a dirfrag object), it does so in a replayable batch of updates from the LogSegment. The updates must be replayable in case the MDS fails during the series of updates to various metadata objects. The reason the updates are performed in batch is to group updates to the same metadata object (a dirfrag) where multiple omap entries are probably updated in the same time period.

Once a segment is trimmed, it is considered “expired”. An expired segment is eligible for deletion by the journaler as all of its updates are flushed to the backing RADOS objects. This is done by updating the “expire position” of the journaler to advance past the end of the expired segment. Some expired segments may be kept in the journal to improve cache locality when the MDS restarts.

For most of CephFS’s history (up to 2023), the journal segments were delineated by subtree maps, the ESubtreeMap event. The major reason for this is that journal recovery must start with a copy of the subtree map before replaying any other events.

Now, log segments can be delineated by any event that is a SegmentBoundary. These include ESubtreeMap, EResetJournal, ESegment (2023), and ELid (2023). ESegment is a lightweight segment boundary that allows the MDS to journal the subtree map less frequently while still keeping journal segments small, which keeps trimming events short. To maintain the constraint that the first event journal replay sees is an ESubtreeMap, segments beginning with that event are considered “major segments”, and a new constraint was added to the deletion of expired segments: the first segment of the journal must always be a major segment.
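
The effect of the major-segment constraint on deletion can be sketched with a simplified model. The Segment class and the deletable_prefix function are hypothetical and exist only for illustration. The model assumes that segments expire in order from the front of the journal, and it does not reflect the actual MDS code.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int
    major: bool    # True if the segment begins with an ESubtreeMap
    expired: bool  # True if all of its updates are flushed to RADOS


def deletable_prefix(segments):
    """Return how many leading segments may be deleted.

    Only expired segments may be deleted, and the first remaining
    segment must be a major segment.
    """
    cut = 0
    for index, segment in enumerate(segments):
        if segment.major:
            # Every segment before this one is expired, so the journal
            # could legally start here.
            cut = index
        if not segment.expired:
            break
    return cut


journal = [
    Segment(0, major=True, expired=True),
    Segment(1, major=False, expired=True),
    Segment(2, major=False, expired=True),
    Segment(3, major=True, expired=True),
    Segment(4, major=False, expired=True),
    Segment(5, major=False, expired=False),
    Segment(6, major=True, expired=False),
]

count = deletable_prefix(journal)
remaining = journal[count:]
print(count, remaining[0].seq)
```

In this example, segment 4 is already expired but is retained, because deleting it would leave the journal starting with a segment that is not a major segment.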

The ELid event exists to mark the MDS journal as “new” where a logical LogSegment and log sequence number is required for other operations to proceed, in particular the MDSTable operations. The MDS uses this event when creating a rank or shutting it down. No subtree map is required when replaying the rank from this initial state.

Configurations

The targeted size of a log segment, in terms of number of events, is controlled by:

mds_log_events_per_segment

maximum number of events in an MDS journal segment

type: uint
default: 1024
min: 1

The frequency of major segments (noted by the journaling of the latest ESubtreeMap) is controlled by:

mds_log_major_segment_event_ratio

multiple of mds_log_events_per_segment between major segments

type: uint
default: 12
min: 1
see also: mds_log_events_per_segment

When mds_log_events_per_segment * mds_log_major_segment_event_ratio non-ESubtreeMap events are logged, the MDS will journal a new ESubtreeMap. This is necessary to allow the journal to shrink in size during the trimming of expired segments.
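
As a worked example of the product described above, the following sketch computes the number of non-ESubtreeMap events logged between major segments. The function name is hypothetical; the default arguments are the documented defaults of the two options.

```python
def events_between_major_segments(events_per_segment=1024,
                                  major_segment_event_ratio=12):
    """Number of non-ESubtreeMap events logged before a new ESubtreeMap is journaled."""
    return events_per_segment * major_segment_event_ratio


default_interval = events_between_major_segments()
print(default_interval)  # 12288 with the default settings
```

Raising either option lengthens the interval between subtree maps, at the cost of retaining more expired segments before the journal can shrink.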

The target maximum number of segments is controlled by:

mds_log_max_segments

The maximum number of segments (objects) in the journal before we initiate trimming. Set to -1 to disable limits.

type: uint
default: 128
min: 8

The MDS will often sit a little above this number due to non-major segments awaiting trimming up to the next major segment.
