Device Management
Device management allows Ceph to address hardware failure. Ceph tracks hardware storage devices (HDDs, SSDs) to see which devices are managed by which daemons. Ceph also collects health metrics about these devices. By doing so, Ceph can provide tools that predict hardware failure and can automatically respond to hardware failure.
Device tracking
To see a list of the storage devices that are in use, run the following command:
ceph device ls
Alternatively, to list devices by daemon or by host, run a command of one of the following forms:
ceph device ls-by-daemon <daemon>
ceph device ls-by-host <host>
To see information about the location of a specific device and about how the device is being consumed, run a command of the following form:
ceph device info <devid>
Identifying physical devices
To make the replacement of failed disks easier and less error-prone, you can (in some cases) “blink” the drive’s LEDs on hardware enclosures by running a command of the following form:
ceph device light on|off <devid> [ident|fault] [--force]
Note
Using this command to blink the lights might not work. Whether it works will depend upon such factors as your kernel revision, your SES firmware, or the setup of your HBA.
The <devid> parameter is the device identification. To retrieve this information, run the following command:
ceph device ls
The [ident|fault] parameter determines which kind of light will blink. By default, the identification light is used.
Note
This command works only if the Cephadm or the Rook orchestrator module is enabled. To see which orchestrator module is enabled, run the following command:
ceph orch status
The command that makes the drive’s LEDs blink is lsmcli. To customize this command, configure it via a Jinja2 template by running commands of the following forms:
ceph config-key set mgr/cephadm/blink_device_light_cmd "<template>"
ceph config-key set mgr/cephadm/<host>/blink_device_light_cmd "lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'"
The following arguments can be used to customize the Jinja2 template:
on
A boolean value.
ident_fault
A string that contains ident or fault.
dev
A string that contains the device ID: for example, SanDisk_X400_M.2_2280_512GB_162924424784.
path
A string that contains the device path: for example, /dev/sda.
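As a rough illustration of how the four template arguments combine, the following Python sketch renders the per-host template shown above with the third-party jinja2 package. The helper name render_blink_cmd is hypothetical, and the sketch only imitates the rendering step; it is not the orchestrator's own code.

```python
from jinja2 import Template  # third-party package, assumed to be installed

# Template text copied from the per-host example above.
TEMPLATE = (
    "lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} "
    "--path '{{ path or dev }}'"
)


def render_blink_cmd(on, ident_fault, dev, path=None):
    """Render the command line for one device (hypothetical helper)."""
    return Template(TEMPLATE).render(
        on=on, ident_fault=ident_fault, dev=dev, path=path
    )


DEV_ID = "SanDisk_X400_M.2_2280_512GB_162924424784"

# When a path is known, "path or dev" resolves to the path.
with_path = render_blink_cmd(True, "ident", DEV_ID, "/dev/sda")
# When no path is given, the expression falls back to the device ID.
without_path = render_blink_cmd(False, "fault", DEV_ID)

print(with_path)
print(without_path)
```

The "path or dev" expression is what lets a single template serve devices with and without a known path.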
Enabling monitoring
Ceph can also monitor the health metrics associated with your device. For example, SATA drives implement a standard called SMART that provides a wide range of internal metrics about the device’s usage and health (for example: the number of hours powered on, the number of power cycles, the number of unrecoverable read errors). Other device types such as SAS and NVMe present a similar set of metrics (via slightly different standards). All of these metrics can be collected by Ceph via the smartctl tool.
You can enable or disable health monitoring by running one of the following commands:
ceph device monitoring on
ceph device monitoring off
Scraping
If monitoring is enabled, device metrics will be scraped automatically at regular intervals. To configure that interval, run a command of the following form:
ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>
By default, device metrics are scraped once every 24 hours.
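Because the interval is given in seconds, it can help to derive the value from a more readable duration. The following Python sketch builds the argument list for the command above; the helper name scrape_frequency_cmd is hypothetical, and the sketch only constructs the command without running it.

```python
from datetime import timedelta
from typing import List


def scrape_frequency_cmd(interval: timedelta) -> List[str]:
    """Build the 'ceph config set' argument list for a scrape interval."""
    seconds = int(interval.total_seconds())
    if seconds <= 0:
        raise ValueError("scrape interval must be positive")
    return [
        "ceph", "config", "set", "mgr",
        "mgr/devicehealth/scrape_frequency", str(seconds),
    ]


# The default of once every 24 hours corresponds to 86400 seconds.
cmd = scrape_frequency_cmd(timedelta(hours=24))
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run on a host that has the ceph CLI and suitable credentials.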
To manually scrape all devices, run the following command:
ceph device scrape-health-metrics
To scrape a single device, run a command of the following form:
ceph device scrape-health-metrics <device-id>
To scrape a single daemon’s devices, run a command of the following form:
ceph device scrape-daemon-health-metrics <who>
To retrieve the stored health metrics for a device (optionally for a specific timestamp), run a command of the following form:
ceph device get-health-metrics <devid> [sample-timestamp]
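The stored metrics can be post-processed with a short script. The following Python sketch assumes that the command prints a JSON object keyed by sample timestamps of the form YYYYMMDD-HHMMSS and that each sample resembles smartctl JSON output. Both the layout and the values in SAMPLE are illustrative assumptions, not captured output.

```python
import json
from datetime import datetime

# Hypothetical, heavily trimmed sample; real output contains many more fields.
SAMPLE = """
{
  "20240101-000000": {"power_on_time": {"hours": 1200}, "power_cycle_count": 40},
  "20240102-000000": {"power_on_time": {"hours": 1224}, "power_cycle_count": 41}
}
"""


def latest_sample(raw):
    """Return (timestamp, metrics) for the most recent stored sample."""
    samples = json.loads(raw)
    if not samples:
        raise ValueError("no health metrics stored for this device")
    # Parse the keys as dates so ordering does not rely on string comparison.
    newest = max(samples, key=lambda ts: datetime.strptime(ts, "%Y%m%d-%H%M%S"))
    return newest, samples[newest]


timestamp, metrics = latest_sample(SAMPLE)
print(timestamp, metrics["power_on_time"]["hours"])
```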
Failure prediction
Ceph can predict drive life expectancy and device failures by analyzing the health metrics that it collects. The prediction modes are as follows:
none: disable device failure prediction.
local: use a pre-trained prediction model from the ceph-mgr daemon.
To configure the prediction mode, run a command of the following form:
ceph config set global device_failure_prediction_mode <mode>
Under normal conditions, failure prediction runs periodically in the background. For this reason, life expectancy values might be populated only after a significant amount of time has passed. The life expectancy of all devices is displayed in the output of the following command:
ceph device ls
To see the metadata of a specific device, run a command of the following form:
ceph device info <devid>
To explicitly force prediction of a specific device’s life expectancy, run a command of the following form:
ceph device predict-life-expectancy <devid>
In addition to Ceph’s internal device failure prediction, you might have an external source of information about device failures. To inform Ceph of a specific device’s life expectancy, run a command of the following form:
ceph device set-life-expectancy <devid> <from> [<to>]
Life expectancies are expressed as a time interval. This means that the uncertainty of the life expectancy can be expressed in the form of a range of time, and perhaps a wide range of time. The interval’s end can be left unspecified.
Health alerts
The mgr/devicehealth/warn_threshold configuration option controls the health check for an expected device failure. If the device is expected to fail within the specified time interval, an alert is raised.
To check the stored life expectancy of all devices and generate any appropriate health alert, run the following command:
ceph device check-health
Automatic migration
The mgr/devicehealth/self_heal option (enabled by default) automatically migrates data away from devices that are expected to fail soon. If this option is enabled, the module marks such devices out so that automatic migration will occur.
Note
The mon_osd_min_up_ratio configuration option can help prevent this process from cascading to total failure. If the “self heal” module marks out so many OSDs that the ratio value of mon_osd_min_up_ratio is exceeded, then the cluster raises the DEVICE_HEALTH_TOOMANY health check. For instructions on what to do in this situation, see DEVICE_HEALTH_TOOMANY.
The mgr/devicehealth/mark_out_threshold configuration option specifies the time interval for automatic migration. If a device is expected to fail within the specified time interval, it will be automatically marked out.
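The interplay between warn_threshold, mark_out_threshold, and self_heal can be pictured with a simplified decision function. The Python sketch below is a model of the behavior described in this section, not the module's actual code. The function name plan_action, the single expected_failure timestamp (which bound of the life expectancy interval is compared is left open here), and the threshold values are all illustrative assumptions.

```python
from datetime import datetime, timedelta


def plan_action(expected_failure, now, warn_threshold,
                mark_out_threshold, self_heal=True):
    """Return the most severe action for one device: 'mark_out', 'warn', or 'ok'."""
    remaining = expected_failure - now
    # Marking out happens only when self_heal is enabled.
    if self_heal and remaining <= mark_out_threshold:
        return "mark_out"
    if remaining <= warn_threshold:
        return "warn"
    return "ok"


# Illustrative thresholds only; check the values configured on your cluster.
WARN = timedelta(weeks=12)
MARK_OUT = timedelta(weeks=4)
now = datetime(2024, 1, 1)

for weeks in (2, 8, 20):
    action = plan_action(now + timedelta(weeks=weeks), now, WARN, MARK_OUT)
    print(weeks, action)
```

With self_heal disabled, a device inside the mark-out window would still fall under the warning branch in this model.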