Notice

This document is for a development version of Ceph.

Troubleshooting

Slow/stuck operations

Sometimes CephFS operations hang. The first step in troubleshooting them is to locate the problem causing the operations to hang. Problems present in three places:

in the client
in the MDS
in the network that connects the client to the MDS

First, use the procedure in Slow requests (MDS) to determine if the client has stuck operations or the MDS has stuck operations.

Dump the MDS cache. The contents of the MDS cache will be used to diagnose the nature of the problem. Run the following command to dump the MDS cache:

ceph daemon mds.<name> dump cache /tmp/dump.txt

Note

MDS services that are not controlled by systemd dump the file dump.txt to the machine that runs the MDS. MDS services that are controlled by systemd dump the file dump.txt to a tmpfs in the MDS container. Use nsenter(1) to locate dump.txt or specify another system-wide path.

If high logging levels have been set on the MDS, dump.txt can be expected to hold the information needed to diagnose and solve the issue causing the CephFS operations to hang.

Slow requests (MDS)

List current operations via the admin socket by running the following command from the MDS host:

ceph daemon mds.<name> dump_ops_in_flight

Identify the stuck commands and examine why they are stuck. Usually the last “event” will have been an attempt to gather locks, or sending the operation off to the MDS log. If it is waiting on the OSDs, fix them.

If operations are stuck on a specific inode, then a client is likely holding capabilities, preventing its use by other clients. This situation can be caused by a client trying to flush dirty data, but it might be caused because you have encountered a bug in the distributed file lock code (the file “capabilities” [“caps”] system) of CephFS.

If you have determined that the commands are stuck because of a bug in the capabilities code, restart the MDS. Restarting the MDS is likely to resolve the problem.

If there are no slow requests reported on the MDS, and there is no indication that clients are misbehaving, then either there is a problem with the client or the client’s requests are not reaching the MDS.

Stuck during recovery

Stuck in up:replay

If your MDS is stuck in the up:replay state, then the journal is probably very long. The presence of MDS_HEALTH_TRIM cluster warnings can indicate that the MDS has not yet caught up while trimming its journal. Very large journals can take hours to process. There is no working around this, but there are things you can do to speed up the process:

Temporarily disable MDS debug logs by reducing MDS debugging to 0. Even with the default settings, the MDS logs a few messages to memory for dumping in case a fatal error is encountered. You can turn off all logging by running the following commands:

ceph config set mds debug_mds 0
ceph config set mds debug_ms 0
ceph config set mds debug_monc 0

Remember that when you set debug_mds, debug_ms, and debug_monc to 0, if the MDS fails then there will be no debugging information that can be used to determine why fatal errors occurred. If you can calculate when up:replay will complete, restore these configurations just prior to entering the next state:

ceph config rm mds debug_mds
ceph config rm mds debug_ms
ceph config rm mds debug_monc

After replay has been expedited, calculate when the MDS will complete the replay. Examine the journal replay status:

$ ceph tell mds.<fs_name>:0 status | jq .replay_status
{
  "journal_read_pos": 4195244,
  "journal_write_pos": 4195244,
  "journal_expire_pos": 4194304,
  "num_events": 2,
  "num_segments": 2
}

Replay completes when the journal_read_pos reaches the journal_write_pos. The write position does not change during replay. Track the progression of the read position to compute the expected time to complete. The MDS emits an MDS_ESTIMATED_REPLAY_TIME warning when the act of replaying the journal takes more than 30 seconds. The warning message includes an estimated time to the completion of journal replay:

mds.a(mds.0): replay: 50.0446% complete - elapsed time: 582s, estimated time remaining: 581s

Avoiding recovery roadblocks

Do the following when restoring your file system:

Deny all reconnection to clients. Blocklist all existing CephFS sessions, causing all mounts to hang or become unavailable:
```
ceph config set mds mds_deny_all_reconnect true
```
Remember to undo this after the MDS becomes active.

Note

This does not prevent new sessions from connecting. Use the refuse_client_session file-system setting to prevent new sessions from connecting to the CephFS.
Extend the MDS heartbeat grace period. Doing this causes the system to avoid replacing an MDS that becomes “stuck” during an operation. Sometimes recovery of an MDS may involve operations that take longer than expected (from the programmer’s perspective). This is more likely when recovery has already taken longer than normal to complete (which, if you’re reading this document, is likely the situation you find yourself in). Avoid unnecessary replacement loops by running the following command and extending the heartbeat grace period:
ceph config set mds mds_heartbeat_grace 3600
Note

This causes the MDS to continue to send beacons to the Monitors even when its internal “heartbeat” mechanism has not been reset (it has not beaten) in one hour. In the past, this was achieved with the mds_beacon_grace Monitor setting.
Disable open-file-table prefetch. Under normal circumstances, the MDS prefetches directory contents during recovery as a way of heating up its cache. During a long recovery, the cache is probably already hot and large. If the cache is already hot and large, this prefetching is unnecessary and can be undesirable. Disable open-file-table prefetching by running the following command:
```
ceph config set mds mds_oft_prefetch_dirfrags false
```
Turn off clients. Clients that reconnect to the newly up:active MDS can create new load on the file system just as it is becoming operational. This is often undesirable. Maintenance is often necessary before allowing clients to connect to the file system and before resuming a regular workload. For example, expediting the trimming of journals may be advisable if the recovery took a long time due to the amount of time replay spent in reading a very large journal.

Client sessions can be refused manually, or by using the refuse_client_session tunable as in the following command:
```
ceph fs set <fs_name> refuse_client_session true
```
This command has the effect of preventing clients from establishing new sessions with the MDS.
Do not tweak max_mds. Modifying the file-system setting variable max_mds may seem like a good idea during troubleshooting and recovery, but it probably isn’t. Modifying max_mds might have the effect of further destabilizing the cluster. If max_mds must be changed in such circumstances, run the command to change max_mds with the confirmation flag (--yes-i-really-mean-it).

Turn off async purge threads. The volumes plugin spawns threads that asynchronously purge trashed or deleted subvolumes. During troubleshooting or recovery, these purge threads can be disabled by running the following command:
```
ceph config set mgr mgr/volumes/pause_purging true
```
To resume purging, run the following command:
```
ceph config set mgr mgr/volumes/pause_purging false
```

Turn off async cloner threads. The volumes plugin spawns threads that asynchronously clone subvolume snapshots. During troubleshooting or recovery, these cloner threads can be disabled by running the following command:
```
ceph config set mgr mgr/volumes/pause_cloning true
```
To resume cloning, run the following command:
```
ceph config set mgr mgr/volumes/pause_cloning false
```

Expediting MDS journal trim

MDS_HEALTH_TRIM warnings indicate that the MDS journal has grown too large. When the MDS journal has grown too large, use the mds_tick_interval tunable to modify the “MDS tick interval”. The “tick” interval drives various upkeep activities in the MDS, and modifying the interval will decrease the size of the MDS journal by ensuring that it is trimmed more frequently.

Make sure that there is no significant file-system load present when modifying mds_tick_interval. See Avoiding recovery roadblocks for ways to reduce load on the CephFS.

This setting affects only MDSes in the up:active state. The MDS does not trim its journal during recovery.

Run the following command to modify the mds_tick_interval tunable:

ceph config set mds mds_tick_interval 2

RADOS Health

If part of the CephFS metadata or data pools is unavailable and CephFS is not responding, it could indicate that RADOS itself is unhealthy.

Resolve problems with RADOS before attempting to locate any problems in CephFS. See the RADOS troubleshooting documentation.

The MDS

Run the ceph health command. Any operation that is hung in the MDS is indicated by the slow requests are blocked message.

Messages that read failing to respond indicate that a client is failing to respond.

The following list details potential causes of hung operations:

The system is overloaded. The most likely cause of system overload is an active file set that is larger than the MDS cache.

If you have extra RAM, increase the mds_cache_memory_limit. The specific tunable mds_cache_memory_limit is discussed in the MDS Cache Size. Read the MDS Cache Configuration section in full before making any alterations to the mds_cache_memory_limit tunable.
There is an older (misbehaving) client.
There are underlying RADOS issues. See The RADOS troubleshooting documentation.

Otherwise, you have probably discovered a new bug and should report it to the developers!

ceph-fuse debugging

ceph-fuse is an alternative to the CephFS kernel driver that mounts CephFS file systems in user space. ceph-fuse supports dump_ops_in_flight. Use the following command to dump in-flight ceph-fuse operations for examination:

See the Mount CephFS using FUSE documentation.

Debug output

To get more debugging information from ceph-fuse, list current operations in the foreground while logging to the console (-d), enabling client debug (--debug-client=20), and enabling prints for each message sent (--debug-ms=1).

ceph daemon -d mds.<name> dump_ops_in_flight --debug-client=20 --debug-ms=1

If you suspect a potential Monitor issue, enable Monitor debugging as well (--debug-monc=20) by running a command of the following form:

ceph daemon -d mds.<name> dump_ops_in_flight --debug-client=20 --debug-ms=1 --debug-monc=20

Kernel mount debugging

The first step in diagnosing and repairing an issue with the kernel client is determining whether the problem is in the kernel client or in the MDS. If the kernel client itself is broken, evidence of its breakage will be in the kernel ring buffer, which can be examined by running the following command:

dmesg

Find the relevant kernel state.

Slow requests

Unfortunately, the kernel client does not provide an admin socket. However, if the kernel on the client has debugfs enabled, interfaces similar to the admin socket are available.

Find a folder in /sys/kernel/debug/ceph/ with a name like 28f7427e-5558-4ffd-ae1a-51ec3042759a.client25386880. That folder contains files that can be used to diagnose the causes of slow requests. Use cat to see their contents.

These files are described below. The files most useful for diagnosis of slow requests are the mdsc (current requests to the MDS) and the osdc (current operations in-flight to OSDs) files.

bdi: BDI info about the Ceph system (blocks dirtied, written, etc)
caps: counts of file “caps” structures in-memory and used
client_options: dumps the options provided to the CephFS mount
dentry_lru: Dumps the CephFS dentries currently in-memory
mdsc: Dumps current requests to the MDS
mdsmap: Dumps the current MDSMap epoch and MDSes
mds_sessions: Dumps the current sessions to MDSes
monc: Dumps the current maps from the Monitor, and any “subscriptions” held
monmap: Dumps the current monitor map epoch and Monitors
osdc: Dumps the current ops in-flight to OSDs (ie, file data IO)
osdmap: Dumps the current OSDMap epoch, pools, and OSDs

If the data pool is in a NEARFULL condition, then the kernel CephFS client will switch to doing writes synchronously. Synchronous writes are quite slow.

Disconnected+Remounted FS

Because CephFS has a “consistent cache”, the MDS will forcibly evict (and blocklist) clients from the cluster when the network connection has been disrupted for a long time. When this happens, the kernel client cannot safely write back dirty (buffered) data and this results in data loss. However: note that this behavior is appropriate and also follows POSIX semantics. The client has to be remounted to be able to access the file system again. This is the default behavior but it can be overridden by the recover_session mount option. See the “options” section of the “mount.ceph” man page

You are in this situation if the output of dmesg contains something like the following:

[Fri Aug 15 02:38:10 2025] ceph: mds0 caps stale
[Fri Aug 15 02:38:28 2025] libceph: mds0 (2)XXX.XX.XX.XX :6800 socket closed (con state OPEN)
[Fri Aug 15 02:38:28 2025] libceph: mds0 (2)XXX.XX.XX.XX:6800 session reset
[Fri Aug 15 02:38:28 2025] ceph: mds0 closed our session
[Fri Aug 15 02:38:28 2025] ceph: mds0 reconnect start
[Fri Aug 15 02:38:28 2025] ceph: mds0 reconnect denied

Mounting

Mount 5 Error

A mount 5 error indicates a lagging MDS server or a crashed MDS server.

Ensure that at least one MDS is up and running, and the cluster is active + healthy.

Mount 12 Error

A mount 12 error with a message reading cannot allocate memory indicates a version mismatch between the Ceph Client version and the Ceph Storage Cluster version. Check the versions using the following command:

ceph -v

If the Ceph Client is of an older version than the Ceph cluster, upgrade the Client:

sudo apt-get update && sudo apt-get install ceph-common

If this fails to resolve the problem, uninstall, autoclean, and autoremove the ceph-common package and then reinstall it to ensure that you have the latest version of it.

Dynamic Debugging

Dynamic debugging for CephFS kernel driver allows to enable or disable debug logging. The kernel driver logs are written to the kernel ring buffer and can be examined using dmesg(1) utility. Debug logging is disabled by default because enabling debug logging can result in system slowness and a drop in I/O throughput.

Enable dynamic debug against the CephFS module.

See: https://github.com/ceph/ceph/blob/master/src/script/kcon_all.sh

Note: Running the above script enables debug logging for the CephFS kernel driver, libceph, and the kernel RBD module. To enable debug logging for a specific component (for example, for the CephFS kernel driver), run a command of the following form:

echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control

To disable debug logging, run a command of the following form:

echo 'module ceph -p' > /sys/kernel/debug/dynamic_debug/control

In-memory Log Dump

In-memory logs can be dumped by setting mds_extraordinary_events_dump_interval when the log level is set to less than 10. mds_extraordinary_events_dump_interval is the interval in seconds for dumping the recent in-memory logs when there is an extraordinary event.

Extraordinary events include the following:

Client Eviction
Missed Beacon ACK from the Monitors
Missed Internal Heartbeats

In-memory log dump is disabled by default. This prevents production environments from experiencing log file bloat by default.

Run the following two commands in order to enable in-memory log dumping:

```
ceph config set mds debug_mds <log_level>/<gather_level>
```
Set log_level to a value of less than 10. Set gather_level to a value greater than 10. When those two values have been set, in-memory log dump is enabled.
```
ceph config set mds mds_extraordinary_events_dump_interval <seconds>
```
When in-memory log dumping is enabled, the MDS checks for extraordinary events every mds_extraordinary_events_dump_interval seconds. If any extraordinary event occurs, the MDS dumps the in-memory logs that contain relevant event details to the Ceph MDS log.

Note

When higher log levels are set (log_level greater than or equal to 10) there is no reason to dump the in-memory logs. A lower gather level (gather_level less than 10) is insufficient to gather in-memory logs. This means that a log level of greater than or equal to 10 or a gather level of less than 10 in debug_mds prevents enabling in-memory-log dumping. In such cases, if there is a failure, you must reset the value of mds_extraordinary_events_dump_interval to 0 before enabling the use of the above commands.

Disable in-memory log dumping by running the following command:

ceph config set mds mds_extraordinary_events_dump_interval 0

Filesystems Become Inaccessible After an Upgrade

Note

You can avoid operation not permitted errors by running this procedure before an upgrade. As of May 2023, it seems that operation not permitted errors of the kind discussed here occur after upgrades after Nautilus (inclusive).

IF

you have CephFS file systems that have data and metadata pools that were created by a ceph fs new command (meaning that they were not created with the defaults)

OR

you have an existing CephFS file system and are upgrading to a new post-Nautilus major version of Ceph

THEN

in order for the documented ceph fs authorize... commands to function as documented (and to avoid ‘operation not permitted’ errors when doing file I/O or similar security-related problems for all users except the client.admin user), you must first run:

ceph osd pool application set <your metadata pool name> cephfs metadata <your ceph fs filesystem name>

and

ceph osd pool application set <your data pool name> cephfs data <your ceph fs filesystem name>

Otherwise, when the OSDs receive a request to read or write data (not the directory info, but file data) they will not know which Ceph file system name to look up. This is true also of pool names, because the ‘defaults’ themselves changed in the major releases, from:

data pool=fsname
metadata pool=fsname_metadata

to:

data pool=fsname.data and
metadata pool=fsname.meta

Any setup that used client.admin for all mounts did not run into this problem, because the admin key gave blanket permissions.

A temporary fix involves changing mount requests to the ‘client.admin’ user and its associated key. A less drastic but half-fix is to change the osd cap for your user to just caps osd = "allow rw" and delete tag cephfs data=....

Disabling the Volumes Plugin

In certain scenarios, the Volumes plugin may need to be disabled to prevent compromise for the rest of the Ceph cluster. For details see: Disabling Volumes Plugin

Reporting Issues

If you have identified a specific issue, please report it with as much information as possible. Especially important information:

Ceph versions installed on client and server
Whether you are using the kernel or FUSE client
If you are using the kernel client, what kernel version?
How many clients are in play, doing what kind of workload?
If a system is ‘stuck’, is that affecting all clients or just one?
Any ceph health messages
Any backtraces in the ceph logs from crashes

If you are satisfied that you have found a bug, please file it on the bug tracker. For more general queries, please write to the ceph-users mailing list.

Brought to you by the Ceph Foundation

The Ceph Documentation is a community resource funded and hosted by the non-profit Ceph Foundation. If you would like to support this and our other efforts, please consider joining now.