Troubleshooting OSDs

Before troubleshooting the cluster’s OSDs, check the monitors and the network.

First, determine whether the monitors have a quorum. Run the ceph health command or the ceph -s command: if Ceph shows HEALTH_OK, then there is a monitor quorum.
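
If more detail about the monitor quorum is needed, the quorum_status command reports which monitors are in quorum; for example:

ceph quorum_status --format json-pretty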

If the monitors don’t have a quorum or if there are errors with the monitor status, address the monitor issues before proceeding by consulting the material in Troubleshooting Monitors.

Next, check your networks to make sure that they are running properly. Networks can have a significant impact on OSD operation and performance. Look for dropped packets on the host side and CRC errors on the switch side.
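
For example, interface error and drop counters can be inspected on each OSD host with commands like the following (eth0 is a placeholder for your interface name, and the exact counter names vary by NIC driver):

ip -s link show eth0
ethtool -S eth0 | grep -iE 'err|drop|crc'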

Obtaining Data About OSDs

When troubleshooting OSDs, it is useful to collect different kinds of information about the OSDs. Some information comes from the practice of monitoring OSDs (for example, by running the ceph osd tree command). Additional information concerns the topology of your cluster, and is discussed in the following sections.

Ceph Logs

Ceph log files are stored under /var/log/ceph. Unless the path has been changed (or you are in a containerized environment that stores logs in a different location), the log files can be listed by running the following command:

ls /var/log/ceph

If there is not enough log detail, change the logging level. To ensure that Ceph performs adequately under high logging volume, see Logging and Debugging.
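
For example, the debug level of a single OSD can be raised at runtime with a command like the following (osd.0 and the 0/20 level are examples; see Logging and Debugging for recommended settings):

ceph tell osd.0 config set debug_osd 0/20   # raise the level
ceph tell osd.0 config set debug_osd 1/5    # revert to the default when finished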

Admin Socket

Use the admin socket tool to retrieve runtime information. First, list the sockets of Ceph’s daemons by running the following command:

ls /var/run/ceph

Next, run a command of the following form (replacing {daemon-name} with the name of a specific daemon: for example, osd.0):

ceph daemon {daemon-name} help

Alternatively, run the command with a {socket-file} specified (a “socket file” is a specific file in /var/run/ceph):

ceph daemon {socket-file} help

The admin socket makes many tasks possible, including:

  • Listing Ceph configuration at runtime

  • Dumping historic operations

  • Dumping the operation priority queue state

  • Dumping operations in flight

  • Dumping perfcounters
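
For example, the operations listed above map to admin socket commands like the following (osd.0 is an example; run ceph daemon osd.0 help to confirm which commands are available in your release):

ceph daemon osd.0 config show            # configuration at runtime
ceph daemon osd.0 dump_historic_ops      # historic operations
ceph daemon osd.0 dump_op_pq_state       # operation priority queue state
ceph daemon osd.0 dump_ops_in_flight     # operations in flight
ceph daemon osd.0 perf dump              # perfcounters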

Display Free Space

Filesystem issues may arise. To display your filesystems’ free space, run the following command:

df -h

To see this command’s supported syntax and options, run df --help.

I/O Statistics

The iostat tool can be used to identify I/O-related issues. Run the following command:

iostat -x

Diagnostic Messages

To retrieve diagnostic messages from the kernel, run the dmesg command and specify the output with less, more, grep, or tail. For example:

dmesg | grep scsi

Stopping without Rebalancing

It might be occasionally necessary to perform maintenance on a subset of your cluster or to resolve a problem that affects a failure domain (for example, a rack). However, when you stop OSDs for maintenance, you might want to prevent CRUSH from automatically rebalancing the cluster. To avert this rebalancing behavior, set the cluster to noout by running the following command:

ceph osd set noout

Warning

This is more a thought exercise offered for the purpose of giving the reader a sense of failure domains and CRUSH behavior than a suggestion that anyone in the post-Luminous world run ceph osd set noout. When the OSDs return to an up state, recovery will proceed; the cluster-wide noout flag is not cleared automatically and must be unset as described below.

In Luminous and later releases, however, it is a safer approach to flag only the affected OSDs. To set or clear the noout flag on a specific OSD, run commands like the following:

ceph osd add-noout osd.0
ceph osd rm-noout  osd.0

It is also possible to flag an entire CRUSH bucket. For example, if you plan to take down prod-ceph-data1701 in order to add RAM, you might run the following command:

ceph osd set-group noout prod-ceph-data1701

After the flag is set, stop the OSDs and any other colocated Ceph services within the failure domain that requires maintenance work:

systemctl stop ceph\*.service ceph\*.target

Note

When an OSD is stopped, any placement groups within the OSD are marked as degraded.

After the maintenance is complete, it will be necessary to restart the OSDs and any other daemons that were stopped. However, if the host was rebooted as part of the maintenance, the daemons will come back up automatically and do not need to be restarted manually. To restart OSDs or other daemons, use a command of the following form:

sudo systemctl start ceph.target

Finally, unset the noout flag as needed by running commands like the following:

ceph osd unset noout
ceph osd unset-group noout prod-ceph-data1701

Many contemporary Linux distributions employ systemd for service management. However, for certain operating systems (especially older ones) it might be necessary to issue equivalent service or start/stop commands.

OSD Not Running

Under normal conditions, restarting a ceph-osd daemon will allow it to rejoin the cluster and recover.

An OSD Won’t Start

If the cluster has started but an OSD isn’t starting, check the following:

  • Configuration File: If you were not able to get OSDs running from a new installation, check your configuration file to ensure it conforms to the standard (for example, make sure that it says host and not hostname, etc.).

  • Check Paths: Ensure that the paths specified in the configuration correspond to the paths for data and metadata that actually exist (for example, the paths to the journals, the WAL, and the DB). If you have separated the OSD data from the metadata, check both the configuration file and the actual mounts for errors; such errors might explain why OSDs are not starting. To store the metadata on a separate block device, partition or LVM the drive and assign one partition per OSD.

  • Check Max Threadcount: If the cluster has a node with an especially high number of OSDs, it might be hitting the default maximum number of threads (usually 32,000). This is especially likely to happen during recovery. Increasing the maximum number of threads to the maximum possible number of threads allowed (4194303) might help with the problem. To increase the number of threads to the maximum, run the following command:

    sysctl -w kernel.pid_max=4194303
    

    If this increase resolves the issue, you must make the increase permanent by including a kernel.pid_max setting either in a file under /etc/sysctl.d or within the master /etc/sysctl.conf file. For example:

    kernel.pid_max = 4194303
    
  • Check nf_conntrack: This connection-tracking and connection-limiting system causes problems for many production Ceph clusters. The problems often emerge slowly and subtly. As cluster topology and client workload grow, mysterious and intermittent connection failures and performance glitches occur more and more, especially at certain times of the day. To begin taking the measure of your problem, check the syslog history for “table full” events. One way to address this kind of problem is as follows: First, use the sysctl utility to assign nf_conntrack_max a much higher value. Next, raise the value of nf_conntrack_buckets so that nf_conntrack_buckets × 8 = nf_conntrack_max; this action might require running commands outside of sysctl (for example, echo 131072 > /sys/module/nf_conntrack/parameters/hashsize); see the example commands after this list. Another way to address the problem is to blacklist the associated kernel modules in order to disable processing altogether. This approach is powerful, but fragile. The modules and the order in which the modules must be listed can vary among kernel versions. Even when blacklisted, iptables and docker might sometimes activate connection tracking anyway, so we advise a “set and forget” strategy for the tunables. On modern systems, this approach will not consume appreciable resources.

  • Kernel Version: Identify the kernel version and distribution that are in use. By default, Ceph uses third-party tools that might be buggy or come into conflict with certain distributions or kernel versions (for example, Google’s gperftools and TCMalloc). Check the OS recommendations and the release notes for each Ceph version in order to make sure that you have addressed any issues related to your kernel.

  • Segment Fault: If there is a segment fault, increase log levels and restart the problematic daemon(s). If segment faults recur, search the Ceph bug tracker https://tracker.ceph.com/projects/ceph and the dev and ceph-users mailing list archives https://ceph.io/resources to see if others have experienced and reported these issues. If this truly is a new and unique failure, post to the dev email list and provide the following information: the specific Ceph release being run, ceph.conf (with secrets XXX’d out), your monitor status output, and excerpts from your log file(s).
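
As referenced in the nf_conntrack item above, the tuning approach might look like the following sketch. The values 1048576 and 131072 are illustrative only (they preserve the buckets × 8 = max relationship) and must be sized for your cluster, and the sysctl.d file name is an example:

sudo sysctl -w net.netfilter.nf_conntrack_max=1048576
echo 131072 | sudo tee /sys/module/nf_conntrack/parameters/hashsize
echo "net.netfilter.nf_conntrack_max = 1048576" | sudo tee /etc/sysctl.d/90-conntrack.conf   # persist across reboots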

An OSD Failed

When an OSD fails, this means that a ceph-osd process is unresponsive or has died and that the corresponding OSD has been marked down. Surviving ceph-osd daemons will report to the monitors that the OSD appears to be down, and a new status will be visible in the output of the ceph health command, as in the following example:

ceph health
HEALTH_WARN 1/3 in osds are down

This health alert is raised whenever there are one or more OSDs marked in and down. To see which OSDs are down, add detail to the command as in the following example:

ceph health detail
HEALTH_WARN 1/3 in osds are down
osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080

Alternatively, run the following command:

ceph osd tree down

If there is a drive failure or another fault that is preventing a given ceph-osd daemon from functioning or restarting, then there should be an error message present in its log file under /var/log/ceph.

If the ceph-osd daemon stopped because of a heartbeat failure or a suicide timeout error, then the underlying drive or filesystem might be unresponsive. Check dmesg output and syslog output for drive errors or kernel errors. It might be necessary to specify certain flags (for example, dmesg -T to see human-readable timestamps) in order to avoid mistaking old errors for new errors.
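
For example (the grep pattern is only a starting point and will not match every possible drive or kernel error):

dmesg -T | grep -iE 'i/o error|ata|scsi|nvme'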

If an entire host’s OSDs are down, check to see if there is a network error or a hardware issue with the host.

If the OSD problem is the result of a software error (for example, a failed assertion or another unexpected error), search for reports of the issue in the bug tracker, the dev mailing list archives, and the ceph-users mailing list archives. If there is no clear fix or existing bug, then report the problem to the ceph-devel email list.

No Free Drive Space

If an OSD is full, Ceph prevents data loss by ensuring that no new data is written to the OSD. In a properly running cluster, health checks are raised when the cluster’s OSDs and pools approach certain “fullness” ratios. The mon_osd_full_ratio threshold defaults to 0.95 (or 95% of capacity): this is the point above which clients are prevented from writing data. The mon_osd_backfillfull_ratio threshold defaults to 0.90 (or 90% of capacity): this is the point above which backfills will not start. The mon_osd_nearfull_ratio threshold defaults to 0.85 (or 85% of capacity): this is the point above which the OSD_NEARFULL health check is raised.

OSDs within a cluster will vary in how much data is allocated to them by Ceph. To check “fullness” by displaying data utilization for every OSD, run the following command:

ceph osd df

To check “fullness” by displaying a cluster’s overall data usage and data distribution among pools, run the following command:

ceph df

When examining the output of the ceph df command, pay special attention to the most full OSDs, as opposed to the percentage of raw space used. If a single outlier OSD becomes full, all writes to this OSD’s pool might fail as a result. When ceph df reports the space available to a pool, it considers the ratio settings relative to the most full OSD that is part of the pool. To flatten the distribution, two approaches are available: (1) Using the reweight-by-utilization command to progressively move data from excessively full OSDs or move data to insufficiently full OSDs, and (2) in later revisions of Luminous and subsequent releases, exploiting the ceph-mgr balancer module to perform the same task automatically.
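
For example, either approach might be applied with commands like the following (the overload threshold 110 and the upmap mode are illustrative; review the balancer documentation before enabling it):

ceph osd test-reweight-by-utilization 110   # dry run
ceph osd reweight-by-utilization 110
ceph balancer status
ceph balancer mode upmap
ceph balancer on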

To adjust the “fullness” ratios, run a command or commands of the following form:

ceph osd set-nearfull-ratio <float[0.0-1.0]>
ceph osd set-full-ratio <float[0.0-1.0]>
ceph osd set-backfillfull-ratio <float[0.0-1.0]>

Sometimes full cluster issues arise because an OSD has failed. This can happen either because of a test or because the cluster is small, very full, or unbalanced. When an OSD or node holds an excessive percentage of the cluster’s data, component failures or natural growth can result in the nearfull and full ratios being exceeded. When testing Ceph’s resilience to OSD failures on a small cluster, it is advised to leave ample free disk space and to consider temporarily lowering the OSD full ratio, OSD backfillfull ratio, and OSD nearfull ratio.

The “fullness” status of OSDs is visible in the output of the ceph health command, as in the following example:

ceph health
HEALTH_WARN 1 nearfull osd(s)

For details, run ceph health detail as in the following example:

ceph health detail
HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
osd.3 is full at 97%
osd.4 is backfill full at 91%
osd.2 is near full at 87%

To address full cluster issues, it is recommended to add capacity by adding OSDs. Adding new OSDs allows the cluster to redistribute data to newly available storage. Also search for orphaned objects left behind by rados bench runs, as these can waste significant space.
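
For example, objects left behind by rados bench can often be found and removed with commands like the following (replace {pool-name}; verify the exact cleanup syntax with rados --help for your release):

rados -p {pool-name} ls | grep benchmark_data
rados -p {pool-name} cleanup --prefix benchmark_data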

If a legacy Filestore OSD cannot be started because it is full, it is possible to reclaim space by deleting a small number of placement group directories in the full OSD.

Important

If you choose to delete a placement group directory on a full OSD, DO NOT delete the same placement group directory on another full OSD. OTHERWISE YOU WILL LOSE DATA. You MUST maintain at least one copy of your data on at least one OSD. Deleting placement group directories is a rare and extreme intervention. It is not to be undertaken lightly.

See Monitor Config Reference for more information.

OSDs are Slow/Unresponsive

OSDs are sometimes slow or unresponsive. When troubleshooting this common problem, it is advised to eliminate other possibilities before investigating OSD performance issues. For example, be sure to confirm that your network(s) are working properly, to verify that your OSDs are running, and to check whether OSDs are throttling recovery traffic.

Tip

In pre-Luminous releases of Ceph, up and in OSDs were sometimes not available or were otherwise slow because recovering OSDs were consuming system resources. Newer releases provide better recovery handling by preventing this phenomenon.

Networking Issues

As a distributed storage system, Ceph relies upon networks for OSD peering and replication, recovery from faults, and periodic heartbeats. Networking issues can cause OSD latency and flapping OSDs. For more information, see Flapping OSDs.

To make sure that Ceph processes and Ceph-dependent processes are connected and listening, run the following commands:

netstat -a | grep ceph
netstat -l | grep ceph
sudo netstat -p | grep ceph

To check network statistics, run the following command:

netstat -s
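
On systems where netstat is not installed, the ss utility provides similar information; for example:

sudo ss -tlnp | grep ceph
ss -s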

Drive Configuration

A SAS or SATA storage drive should house only one OSD, but an NVMe drive can easily house two or more. However, read and write throughput can bottleneck if other processes share the drive. Such processes include: journals / metadata, operating systems, Ceph monitors, syslog logs, other OSDs, and non-Ceph processes.

Because Ceph acknowledges writes after journaling, fast SSDs are an attractive option for accelerating response time -- particularly when using the XFS or ext4 filesystems for legacy FileStore OSDs. By contrast, the Btrfs file system can write and journal simultaneously. (However, use of Btrfs is not recommended for production deployments.)

Note

Partitioning a drive does not change its total throughput or sequential read/write limits. Throughput might be improved somewhat by running a journal in a separate partition, but it is better still to run such a journal in a separate physical drive.

Warning

Reef does not support FileStore. Releases after Reef do not support FileStore. Any information that mentions FileStore is pertinent only to the Quincy release of Ceph and to releases prior to Quincy.

Bad Sectors / Fragmented Disk

Check your drives for bad blocks, fragmentation, and other errors that can cause significantly degraded performance. Tools that are useful in checking for drive errors include dmesg, syslog logs, and smartctl (found in the smartmontools package).

Note

smartmontools 7.0 and later provides NVMe stat passthrough and JSON output.
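
For example (the device names are placeholders, and --json requires smartmontools 7.0 or later as noted above):

sudo smartctl -a /dev/sdX
sudo smartctl -a --json /dev/nvme0n1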

Co-resident Monitors/OSDs

Although monitors are relatively lightweight processes, performance issues can result when monitors are run on the same host machine as an OSD. Monitors issue many fsync() calls and this can interfere with other workloads. The danger of performance issues is especially acute when the monitors are co-resident on the same storage drive as an OSD. In addition, if the monitors are running an older kernel (pre-3.0) or a kernel with no syncfs(2) syscall, then multiple OSDs running on the same host might make so many commits as to undermine each other’s performance. This problem sometimes results in what are called “bursty writes”.

Co-resident Processes

Significant OSD latency can result from processes that write data to Ceph (for example, cloud-based solutions and virtual machines) while operating on the same hardware as OSDs. For this reason, making such processes co-resident with OSDs is not generally recommended. Instead, the recommended practice is to optimize certain hosts for use with Ceph and use other hosts for other processes. This practice of separating Ceph operations from other applications might help improve performance and might also streamline troubleshooting and maintenance.

Running co-resident processes on the same hardware is sometimes called “convergence”. When using Ceph, engage in convergence only with expertise and after consideration.

Logging Levels

Performance issues can result from high logging levels. Operators sometimes raise logging levels in order to track an issue and then forget to lower them afterwards. In such a situation, OSDs might consume valuable system resources to write needlessly verbose logs onto the disk. Anyone who does want to use high logging levels is advised to consider mounting a drive to the default path for logging (for example, /var/log/ceph/$cluster-$name.log).
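
To check for logging levels that were raised and then forgotten, commands like the following may help (osd.0 is an example):

ceph config dump | grep debug
ceph daemon osd.0 config show | grep debug_osd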

Recovery Throttling

Depending upon your configuration, Ceph may reduce recovery rates to maintain client or OSD performance, or it may increase recovery rates to the point that recovery impacts client or OSD performance. Check whether the cluster is currently recovering and whether that recovery traffic is affecting the client or OSD in question.

Kernel Version

Check the kernel version that you are running. Older kernels may lack updates that improve Ceph performance.

Kernel Issues with SyncFS

If you have kernel issues with SyncFS, try running one OSD per host to see if performance improves. Old kernels might not have a recent enough version of glibc to support syncfs(2).

Filesystem Issues

In post-Luminous releases, we recommend deploying clusters with the BlueStore back end. When running a pre-Luminous release, or if you have a specific reason to deploy OSDs with the previous Filestore backend, we recommend XFS.

We recommend against using Btrfs or ext4. The Btrfs filesystem has many attractive features, but bugs may lead to performance issues and spurious ENOSPC errors. We do not recommend ext4 for Filestore OSDs because xattr limitations break support for long object names, which are needed for RGW.

For more information, see Filesystem Recommendations.

Insufficient RAM

We recommend a minimum of 4GB of RAM per OSD daemon and we suggest rounding up from 6GB to 8GB. During normal operations, you may notice that ceph-osd processes use only a fraction of that amount. You might be tempted to use the excess RAM for co-resident applications or to skimp on each node’s memory capacity. However, when OSDs experience recovery their memory utilization spikes. If there is insufficient RAM available during recovery, OSD performance will slow considerably and the daemons may even crash or be killed by the Linux OOM Killer.
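
For example, to check the per-OSD memory target and to look for evidence of the OOM killer (the grep pattern is approximate):

ceph config get osd osd_memory_target
dmesg -T | grep -i 'out of memory'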

Blocked Requests or Slow Requests

When a ceph-osd daemon is slow to respond to a request, the cluster log receives messages reporting ops that are taking too long. The warning threshold defaults to 30 seconds and is configurable via the osd_op_complaint_time setting.
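
For example, to inspect or adjust this threshold (the value 30 shown here is the default and is given only for illustration):

ceph config get osd osd_op_complaint_time
ceph config set osd osd_op_complaint_time 30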

Legacy versions of Ceph complain about old requests:

osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops

Newer versions of Ceph complain about slow requests:

{date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
{date} {osd.num}  [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]

Possible causes include:

  • A failing drive (check dmesg output)

  • A bug in the kernel file system (check dmesg output)

  • An overloaded cluster (check system load, iostat, etc.)

  • A bug in the ceph-osd daemon.

  • Suboptimal OSD shard configuration (on HDD-based clusters with the mClock scheduler)

Possible solutions:

  • Remove VMs from Ceph hosts

  • Upgrade the kernel

  • Upgrade Ceph

  • Restart OSDs

  • Replace failed or failing components

  • Override the OSD shard configuration (on HDD-based clusters with the mClock scheduler; see Slow Requests or Slow Recovery With mClock Scheduler below)

Debugging Slow Requests

If you run ceph daemon osd.<id> dump_historic_ops or ceph daemon osd.<id> dump_ops_in_flight, you will see a set of operations and a list of events each operation went through. These are briefly described below.

Events from the Messenger layer:

  • header_read: The time that the messenger first started reading the message off the wire.

  • throttled: The time that the messenger tried to acquire memory throttle space to read the message into memory.

  • all_read: The time that the messenger finished reading the message off the wire.

  • dispatched: The time that the messenger gave the message to the OSD.

  • initiated: This is identical to header_read. The existence of both is a historical oddity.

Events from the OSD as it processes ops:

  • queued_for_pg: The op has been put into the queue for processing by its PG.

  • reached_pg: The PG has started performing the op.

  • waiting for *: The op is waiting for some other work to complete before it can proceed (for example, a new OSDMap; the scrubbing of its object target; the completion of a PG’s peering; all as specified in the message).

  • started: The op has been accepted as something the OSD should do and is now being performed.

  • waiting for subops from: The op has been sent to replica OSDs.

Events from Filestore:

  • commit_queued_for_journal_write: The op has been given to the FileStore.

  • write_thread_in_journal_buffer: The op is in the journal’s buffer and is waiting to be persisted (as the next disk write).

  • journaled_completion_queued: The op was journaled to disk and its callback has been queued for invocation.

Events from the OSD after data has been given to underlying storage:

  • op_commit: The op has been committed (that is, written to journal) by the primary OSD.

  • op_applied: The op has been written with write() to the backing FS (that is, applied in memory but not flushed out to disk) on the primary.

  • sub_op_applied: op_applied, but for a replica’s “subop”.

  • sub_op_committed: op_commit, but for a replica’s subop (only for EC pools).

  • sub_op_commit_rec/sub_op_apply_rec from <X>: The primary marks this when it hears about the above, but for a particular replica (i.e. <X>).

  • commit_sent: We sent a reply back to the client (or primary OSD, for sub ops).

Although some of these events may appear redundant, they cross important boundaries in the internal code (such as passing data across locks into new threads).

Slow Requests or Slow Recovery With mClock Scheduler

Note

This troubleshooting is applicable only for HDD based clusters running mClock scheduler and with the following OSD shard configuration: osd_op_num_shards_hdd = 5 and osd_op_num_threads_per_shard_hdd = 1. Also, see OSD Shard Configuration For HDD Based Clusters With mClock for details around the reason for the change made to the default OSD HDD shard configuration for mClock.

On scaled HDD based clusters with mClock scheduler enabled and under multiple OSD node failure condition, the following could be reported or observed:

  • slow requests: This also manifests as degraded client I/O performance.

  • slow background recoveries: Lower than expected recovery throughput.

Troubleshooting Steps:

  1. Verify from OSD events that the slow requests are predominantly of type queued_for_pg.

  2. Verify if the reported recovery rate is significantly lower than the expected rate considering the QoS allocations for background recovery service.
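
For example, the checks in steps 1 and 2 might be performed with commands like the following (osd.0 is an example):

ceph daemon osd.0 dump_historic_slow_ops   # inspect the event lists for queued_for_pg
ceph status                                # compare the reported recovery rate with expectations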

If either of the above conditions is true, then the following resolution may be applied. Note that this is disruptive as it involves OSD restarts. Run the following commands to change the default OSD shard configuration for HDDs:

ceph config set osd osd_op_num_shards_hdd 1
ceph config set osd osd_op_num_threads_per_shard_hdd 5

The above configuration does not take effect immediately; it requires a restart of the OSDs in the environment. To make this process as non-disruptive as possible, the OSDs may be restarted in a carefully staggered manner.
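
For example, on a host with systemd-managed, non-containerized OSDs, a staggered restart might look like the following (the OSD id 0 is an example; for cephadm deployments, ceph orch daemon restart osd.0 is the equivalent):

sudo systemctl restart ceph-osd@0
ceph status   # wait for PGs to return to active+clean before restarting the next OSD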

Flapping OSDs

“Flapping” is the term for the phenomenon of an OSD being repeatedly marked up and then down in rapid succession. This section explains how to recognize flapping, and how to mitigate it.

When OSDs peer and check heartbeats, they use the cluster (back-end) network when it is available. See Monitor/OSD Interaction for details.

The upstream Ceph community has traditionally recommended separate public (front-end) and private (cluster / back-end / replication) networks. This provides the following benefits:

  1. Segregation of (1) heartbeat traffic and replication/recovery traffic (private) from (2) traffic from clients and between OSDs and monitors (public). This helps keep one stream of traffic from DoS-ing the other, which could in turn result in a cascading failure.

  2. Additional throughput for both public and private traffic.

In the past, when common networking technologies were measured in a range encompassing 100Mb/s and 1Gb/s, this separation was often critical. But with today’s 10Gb/s, 40Gb/s, and 25/50/100Gb/s networks, the above capacity concerns are often diminished or even obviated. For example, if your OSD nodes have two network ports, dedicating one to the public and the other to the private network means that you have no path redundancy. This degrades your ability to endure network maintenance and network failures without significant cluster or client impact. In situations like this, consider instead using both links for only a public network: with bonding (LACP) or equal-cost routing (for example, FRR) you reap the benefits of increased throughput headroom, fault tolerance, and reduced OSD flapping.

When a private network (or even a single host link) fails or degrades while the public network continues operating normally, OSDs may not handle this situation well. In such situations, OSDs use the public network to report each other down to the monitors, while marking themselves up. The monitors then send an updated cluster map, again over the public network, with the affected OSDs marked down. These OSDs reply to the monitors “I’m not dead yet!”, and the cycle repeats. We call this scenario “flapping”, and it can be difficult to isolate and remediate. Without a private network, this irksome dynamic is avoided: OSDs are generally either up or down without flapping.

If something does cause OSDs to ‘flap’ (repeatedly being marked down and then up again), you can force the monitors to halt the flapping by temporarily freezing their states:

ceph osd set noup      # prevent OSDs from getting marked up
ceph osd set nodown    # prevent OSDs from getting marked down

These flags are recorded in the osdmap:

ceph osd dump | grep flags
flags no-up,no-down

You can clear these flags with:

ceph osd unset noup
ceph osd unset nodown

Two other flags are available, noin and noout, which prevent booting OSDs from being marked in (allocated data) or protect OSDs from eventually being marked out (regardless of the current value of mon_osd_down_out_interval).

Note

noup, noout, and nodown are temporary in the sense that after the flags are cleared, the action that they were blocking should be possible shortly thereafter. But the noin flag prevents OSDs from being marked in on boot, and any daemons that started while the flag was set will remain that way.

Note

The causes and effects of flapping can be mitigated somewhat by making careful adjustments to mon_osd_down_out_subtree_limit, mon_osd_reporter_subtree_level, and mon_osd_min_down_reporters. Derivation of optimal settings depends on cluster size, topology, and the Ceph release in use. The interaction of all of these factors is subtle and is beyond the scope of this document.
