Prometheus Module
Provides a Prometheus exporter to pass on Ceph performance counters from the collection point in ceph-mgr. Ceph-mgr receives MMgrReport messages from all MgrClient processes (mons and OSDs, for instance) with performance counter schema data and actual counter data, and keeps a circular buffer of the last N samples. This module creates an HTTP endpoint (like all Prometheus exporters) and retrieves the latest sample of every counter when polled (or “scraped” in Prometheus terminology). The HTTP path and query parameters are ignored; all extant counters for all reporting entities are returned in text exposition format. (See the Prometheus documentation.)
Enabling prometheus output
The prometheus module is enabled with:
ceph mgr module enable prometheus
Configuration
Note
The Prometheus manager module needs to be restarted for configuration changes to be applied.
- server_addr
the IPv4 or IPv6 address on which the module listens for HTTP requests
- type
str
- default
::
- server_port
the port on which the module listens for HTTP requests
- type
int
- default
9283
- scrape_interval
- type
float
- default
15.0
- cache
- type
bool
- default
true
- stale_cache_strategy
- type
str
- default
log
- rbd_stats_pools
- type
str
- default
<empty string>
- rbd_stats_pools_refresh_interval
- type
int
- default
300
- standby_behaviour
- type
str
- default
default
- standby_error_status_code
- type
int
- default
500
- allowed range
[400, 599]
By default the module will accept HTTP requests on port 9283
on all IPv4
and IPv6 addresses on the host. The port and listen address are both
configurable with ceph config set
, with keys
mgr/prometheus/server_addr
and mgr/prometheus/server_port
. This port
is registered with Prometheus’s registry.
ceph config set mgr mgr/prometheus/server_addr 0.0.0.
ceph config set mgr mgr/prometheus/server_port 9283
Warning
The mgr/prometheus/scrape_interval
of this module should always be set to match
Prometheus’ scrape interval to work properly and not cause any issues.
The scrape interval in the module is used for caching purposes and to determine when a cache is stale.
It is not recommended to use a scrape interval below 10 seconds. It is recommended to use 15 seconds as scrape interval, though, in some cases it might be useful to increase the scrape interval.
To set a different scrape interval in the Prometheus module, set
scrape_interval
to the desired value:
ceph config set mgr mgr/prometheus/scrape_interval 20
On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant. Without the cache, the Prometheus manager module could, especially
in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured mgr/prometheus/scrape_interval
.
If that is the case, a warning will be logged and the module will either
respond with a 503 HTTP status code (service unavailable) or,
it will return the content of the cache, even though it might be stale.
This behavior can be configured. By default, it will return a 503 HTTP status
code (service unavailable). You can set other options using the ceph config
set
commands.
To tell the module to respond with possibly stale data, set it to return
:
ceph config set mgr mgr/prometheus/stale_cache_strategy return
To tell the module to respond with “service unavailable”, set it to fail
:
ceph config set mgr mgr/prometheus/stale_cache_strategy fail
If you are confident that you don’t require the cache, you can disable it:
ceph config set mgr mgr/prometheus/cache false
If you are using the prometheus module behind some kind of reverse proxy or
loadbalancer, you can simplify discovering the active instance by switching
to error
-mode:
ceph config set mgr mgr/prometheus/standby_behaviour error
If set, the prometheus module will repond with a HTTP error when requesting /
from the standby instance. The default error code is 500, but you can configure
the HTTP response code with:
ceph config set mgr mgr/prometheus/standby_error_status_code 503
Valid error codes are between 400-599.
To switch back to the default behaviour, simply set the config key to default
:
ceph config set mgr mgr/prometheus/standby_behaviour default
Ceph Health Checks
The mgr/prometheus module also tracks and maintains a history of Ceph health checks, exposing them to the Prometheus server as discrete metrics. This allows Prometheus alert rules to be configured for specific health check events.
The metrics take the following form;
# HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
# TYPE ceph_health_detail gauge
ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
ceph_health_detail{name="OSD_DOWN",severity="HEALTH_WARN"} 1.0
ceph_health_detail{name="PG_DEGRADED",severity="HEALTH_WARN"} 1.0
The health check history is made available through the following commands;
healthcheck history ls [--format {plain|json|json-pretty}]
healthcheck history clear
The ls
command provides an overview of the health checks that the cluster has
encountered, or since the last clear
command was issued. The example below;
[ceph: root@c8-node1 /]# ceph healthcheck history ls
Healthcheck Name First Seen (UTC) Last seen (UTC) Count Active
OSDMAP_FLAGS 2021/09/16 03:17:47 2021/09/16 22:07:40 2 No
OSD_DOWN 2021/09/17 00:11:59 2021/09/17 00:11:59 1 Yes
PG_DEGRADED 2021/09/17 00:11:59 2021/09/17 00:11:59 1 Yes
3 health check(s) listed
RBD IO statistics
The module can optionally collect RBD per-image IO statistics by enabling
dynamic OSD performance counters. The statistics are gathered for all images
in the pools that are specified in the mgr/prometheus/rbd_stats_pools
configuration parameter. The parameter is a comma or space separated list
of pool[/namespace]
entries. If the namespace is not specified the
statistics are collected for all namespaces in the pool.
Example to activate the RBD-enabled pools pool1
, pool2
and poolN
:
ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"
The wildcard can be used to indicate all pools or namespaces:
ceph config set mgr mgr/prometheus/rbd_stats_pools "*"
The module makes the list of all available images scanning the specified
pools and namespaces and refreshes it periodically. The period is
configurable via the mgr/prometheus/rbd_stats_pools_refresh_interval
parameter (in sec) and is 300 sec (5 minutes) by default. The module will
force refresh earlier if it detects statistics from a previously unknown
RBD image.
Example to turn up the sync interval to 10 minutes:
ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600
Statistic names and labels
The names of the stats are exactly as Ceph names them, with
illegal characters .
, -
and ::
translated to _
,
and ceph_
prefixed to all names.
All daemon statistics have a ceph_daemon
label such as “osd.123”
that identifies the type and ID of the daemon they come from. Some
statistics can come from different types of daemon, so when querying
e.g. an OSD’s RocksDB stats, you would probably want to filter
on ceph_daemon starting with “osd” to avoid mixing in the monitor
rocksdb stats.
The cluster statistics (i.e. those global to the Ceph cluster)
have labels appropriate to what they report on. For example,
metrics relating to pools have a pool_id
label.
The long running averages that represent the histograms from core Ceph
are represented by a pair of <name>_sum
and <name>_count
metrics.
This is similar to how histograms are represented in Prometheus
and they can also be treated similarly.
Pool and OSD metadata series
Special series are output to enable displaying and querying on certain metadata fields.
Pools have a ceph_pool_metadata
field like this:
ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0
OSDs have a ceph_osd_metadata
field like this:
ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0
Correlating drive statistics with node_exporter
The prometheus output from Ceph is designed to be used in conjunction with the generic host monitoring from the Prometheus node_exporter.
To enable correlation of Ceph OSD statistics with node_exporter’s drive statistics, special series are output like this:
ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}
To use this to get disk statistics by OSD ID, use either the and
operator or
the *
operator in your prometheus query. All metadata metrics (like ``
ceph_disk_occupation_human`` have the value 1 so they act neutral with *
. Using *
allows to use group_left
and group_right
grouping modifiers, so that
the resulting metric has additional labels from one side of the query.
See the prometheus documentation for more information about constructing queries.
The goal is to run a query like
rate(node_disk_bytes_written[30s]) and
on (device,instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}
Out of the box the above query will not return any metrics since the instance
labels of
both metrics don’t match. The instance
label of ceph_disk_occupation_human
will be the currently active MGR node.
The following two section outline two approaches to remedy this.
Note
If you need to group on the ceph_daemon label instead of device and instance labels, using ceph_disk_occupation_human may not work reliably. It is advised that you use ceph_disk_occupation instead.
The difference is that ceph_disk_occupation_human may group several OSDs into the value of a single ceph_daemon label in cases where multiple OSDs share a disk.
Use label_replace
The label_replace
function (cp.
label_replace documentation)
can add a label to, or alter a label of, a metric within a query.
To correlate an OSD and its disks write rate, the following query can be used:
label_replace(
rate(node_disk_bytes_written[30s]),
"exported_instance",
"$1",
"instance",
"(.*):.*"
) and on (device, exported_instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}
Configuring Prometheus server
honor_labels
To enable Ceph to output properly-labeled data relating to any host,
use the honor_labels
setting when adding the ceph-mgr endpoints
to your prometheus configuration.
This allows Ceph to export the proper instance
label without prometheus
overwriting it. Without this setting, Prometheus applies an instance
label
that includes the hostname and port of the endpoint that the series came from.
Because Ceph clusters have multiple manager daemons, this results in an
instance
label that changes spuriously when the active manager daemon
changes.
If this is undesirable a custom instance
label can be set in the
Prometheus target configuration: you might wish to set it to the hostname
of your first mgr daemon, or something completely arbitrary like “ceph_cluster”.
node_exporter hostname labels
Set your instance
labels to match what appears in Ceph’s OSD metadata
in the instance
field. This is generally the short hostname of the node.
This is only necessary if you want to correlate Ceph stats with host stats, but you may find it useful to do it in all cases in case you want to do the correlation in the future.
Example configuration
This example shows a single node configuration running ceph-mgr and
node_exporter on a server called senta04
. Note that this requires one
to add an appropriate and unique instance
label to each node_exporter
target.
This is just an example: there are other ways to configure prometheus scrape targets and label rewrite rules.
prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'node'
file_sd_configs:
- files:
- node_targets.yml
- job_name: 'ceph'
honor_labels: true
file_sd_configs:
- files:
- ceph_targets.yml
ceph_targets.yml
[
{
"targets": [ "senta04.mydomain.com:9283" ],
"labels": {}
}
]
node_targets.yml
[
{
"targets": [ "senta04.mydomain.com:9100" ],
"labels": {
"instance": "senta04"
}
}
]
Notes
Counters and gauges are exported; currently histograms and long-running averages are not. It’s possible that Ceph’s 2-D histograms could be reduced to two separate 1-D histograms, and that long-running averages could be exported as Prometheus’ Summary type.
Timestamps, as with many Prometheus exporters, are established by the server’s scrape time (Prometheus expects that it is polling the actual counter process synchronously). It is possible to supply a timestamp along with the stat report, but the Prometheus team strongly advises against this. This means that timestamps will be delayed by an unpredictable amount; it’s not clear if this will be problematic, but it’s worth knowing about.