Monitoring a Cluster
After you have a running cluster, you can use the ceph tool to monitor your
cluster. Monitoring a cluster typically involves checking OSD status, monitor
status, placement group status, and metadata server status.
Using the command line
Interactive mode
To run the ceph tool in interactive mode, type ceph at the command line
with no arguments. For example:
ceph
health
status
quorum_status
mon stat
Non-default paths
If you specified non-default locations for your configuration or keyring when
you installed the cluster, you can pass their locations to the ceph tool
by running the following command:
ceph -c /path/to/conf -k /path/to/keyring health
Checking a Cluster’s Status
After you start your cluster, and before you start reading and/or writing data, you should check your cluster’s status.
To check a cluster’s status, run the following command:
ceph status
Alternatively, you can run the following command:
ceph -s
In interactive mode, this operation is performed by typing status and
pressing Enter:
status
Ceph will print the cluster status. For example, a small Ceph “demonstration cluster” that is running three monitors, one manager, and three OSDs might print the following:
  cluster:
    id:     477e46f1-ae41-4e43-9c8f-72c918ab0a20
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c
    mgr: x(active)
    mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   2 pools, 16 pgs
    objects: 21 objects, 2.19K
    usage:   546 GB used, 384 GB / 931 GB avail
    pgs:     16 active+clean
How Ceph Calculates Data Usage
The usage value reflects the actual amount of raw storage used. The xxx GB / xxx GB
value shows the amount of storage available (the lesser number) out of the
overall storage capacity of the cluster (the greater number). The notional
number reflects the size of the stored data before it is replicated, cloned, or
snapshotted. Therefore, the amount of data actually stored typically exceeds
the notional amount stored, because Ceph creates replicas of the data and may
also use storage capacity for cloning and snapshotting.
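The relationship between the notional and raw figures can be sketched in a few lines of Python. This is a minimal illustration, not part of the ceph tool: the function names are hypothetical, the parser assumes the usage value has exactly the shape shown in the status example above, and the estimate assumes plain replication with no clones, snapshots, or other overhead.

```python
import re

# Matches a usage value shaped like "546 GB used, 384 GB / 931 GB avail".
USAGE_RE = re.compile(
    r"(?P<used>[\d.]+) (?P<u1>\w+) used, "
    r"(?P<avail>[\d.]+) (?P<u2>\w+) / (?P<total>[\d.]+) (?P<u3>\w+) avail"
)


def parse_usage(value):
    """Split a usage value into used, available, and total amounts."""
    m = USAGE_RE.fullmatch(value.strip())
    if m is None:
        raise ValueError(f"unrecognized usage value: {value!r}")
    if not (m["u1"] == m["u2"] == m["u3"]):
        raise ValueError("this sketch only handles a single unit per line")
    return {
        "used": float(m["used"]),
        "avail": float(m["avail"]),
        "total": float(m["total"]),
        "unit": m["u1"],
    }


def estimated_raw_usage(notional, replicas):
    """Assumption: raw usage is roughly notional size times replica count."""
    return notional * replicas


if __name__ == "__main__":
    print(parse_usage("546 GB used, 384 GB / 931 GB avail"))
    print(estimated_raw_usage(1, 3))
```

Real clusters will report more raw usage than this estimate whenever clones, snapshots, or allocation overhead are present.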
Watching a Cluster
Each daemon in the Ceph cluster maintains a log of events, and the Ceph cluster
itself maintains a cluster log that records high-level events about the
entire Ceph cluster. These events are logged to disk on monitor servers (in
the default location /var/log/ceph/ceph.log), and they can be monitored via
the command line.
To follow the cluster log, run the following command:
ceph -w
Ceph will print the status of the system, followed by each log message as it is added. For example:
  cluster:
    id:     477e46f1-ae41-4e43-9c8f-72c918ab0a20
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c
    mgr: x(active)
    mds: cephfs_a-1/1/1 up {0=a=up:active}, 2 up:standby
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   2 pools, 16 pgs
    objects: 21 objects, 2.19K
    usage:   546 GB used, 384 GB / 931 GB avail
    pgs:     16 active+clean
2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : cluster [INF] osd.0 172.21.9.34:6806/20527 boot
2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x
2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available
Instead of printing log lines as they are added, you might want to print only
the most recent lines. Run ceph log last [n] to see the most recent n
lines from the cluster log.
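When cluster log lines are post-processed by a script, it can help to split each line into fields. The following Python sketch does so for lines shaped like the examples above. The field names (timestamp, source, rank, addr, seq, channel, level) are labels chosen for this illustration, not official Ceph terminology, and the pattern is only known to fit the sample lines shown in this section.

```python
import re

# Pattern inferred from the sample lines above; the group names are
# illustrative labels, not names defined by Ceph.
LOG_RE = re.compile(
    r"(?P<timestamp>\S+ \S+) (?P<source>\S+) (?P<rank>\S+) (?P<addr>\S+) "
    r"(?P<seq>\d+) : (?P<channel>\S+) \[(?P<level>[A-Z]+)\] (?P<message>.*)"
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_RE.fullmatch(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    fields["seq"] = int(fields["seq"])
    return fields


def only_level(lines, level):
    """Keep the parsed entries whose severity tag equals `level`."""
    parsed = (parse_log_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] == level]


if __name__ == "__main__":
    sample = (
        "2017-07-24 08:15:11.329298 mon.a mon.0 172.21.9.34:6789/0 23 : "
        "cluster [INF] osd.0 172.21.9.34:6806/20527 boot"
    )
    print(parse_log_line(sample))
```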
Monitoring Health Checks
Ceph continuously runs various health checks. When a health check fails, this
failure is reflected in the output of ceph status and ceph health. The cluster
log receives messages that indicate when a check has failed and when the
cluster has recovered.
For example, when an OSD goes down, the health section of the status
output is updated as follows:
health: HEALTH_WARN
1 osds down
Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded
At the same time, cluster log messages are emitted to record the failure of the health checks:
2017-07-25 10:08:58.265945 mon.a mon.0 172.21.9.34:6789/0 91 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED)
When the OSD comes back online, the cluster log records the cluster’s return to a healthy state:
2017-07-25 10:11:11.526841 mon.a mon.0 172.21.9.34:6789/0 109 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized (PG_DEGRADED)
2017-07-25 10:11:13.535493 mon.a mon.0 172.21.9.34:6789/0 110 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean, 2 pgs degraded, 2 pgs undersized)
2017-07-25 10:11:13.535577 mon.a mon.0 172.21.9.34:6789/0 111 : cluster [INF] Cluster is now healthy
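The sequence of "failed", "update", "cleared", and "now healthy" messages can be replayed to work out which health checks are outstanding at any point. The Python sketch below does this for message texts shaped like the examples above. The function name is hypothetical, and the patterns are inferred only from these sample messages, so other message variants may need additional handling.

```python
import re

# "Health check failed: ... (CODE)" and "Health check update: ... (CODE)"
RAISED_RE = re.compile(r"Health check (?:failed|update): .* \((?P<code>[A-Z_]+)\)$")
# "Health check cleared: CODE (was: ...)"
CLEARED_RE = re.compile(r"Health check cleared: (?P<code>[A-Z_]+) ")


def active_checks(messages):
    """Replay log messages in order and return the set of outstanding codes."""
    active = set()
    for message in messages:
        raised = RAISED_RE.search(message)
        if raised:
            active.add(raised["code"])
            continue
        cleared = CLEARED_RE.search(message)
        if cleared:
            active.discard(cleared["code"])
            continue
        if message.endswith("Cluster is now healthy"):
            # Assumption: this message means no check remains outstanding.
            active.clear()
    return active


if __name__ == "__main__":
    print(active_checks(["Health check failed: 1 osds down (OSD_DOWN)"]))
```

Feeding the function only the message part of each log line (the text after the severity tag) keeps the patterns simple.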
Network Performance Checks
Ceph OSDs send heartbeat ping messages to each other in order to monitor daemon availability and network performance. If a single delayed response is detected, this might indicate nothing more than a busy OSD. But if multiple delays between distinct pairs of OSDs are detected, this might indicate a failed network switch, a NIC failure, or a layer 1 failure.
By default, a heartbeat time that exceeds 1 second (1000 milliseconds) raises a
health check (a HEALTH_WARN). For example:
HEALTH_WARN Slow OSD heartbeats on back (longest 1118.001ms)
In the output of the ceph health detail command, you can see which OSDs are
experiencing delays and how long the delays are. The output of ceph health
detail is limited to ten lines. Here is an example of the output you can
expect from the ceph health detail command:
[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1118.001ms)
Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.1 [dc1,rack1] 1118.001 msec possibly improving
Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.2 [dc1,rack2] 1030.123 msec
Slow OSD heartbeats on back from osd.2 [dc1,rack2] to osd.1 [dc1,rack1] 1015.321 msec
Slow OSD heartbeats on back from osd.1 [dc1,rack1] to osd.0 [dc1,rack1] 1010.456 msec
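Because delays between several distinct pairs of OSDs point more strongly to a network problem than a single slow response does, it can be useful to count the distinct pairs in this output. The following Python sketch parses detail lines shaped like the ones above. The function names are hypothetical, and the pattern is inferred from this sample output only.

```python
import re

# Matches detail lines such as:
#   Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.1 [dc1,rack1] 1118.001 msec
SLOW_RE = re.compile(
    r"Slow OSD heartbeats on (?P<iface>\w+) from osd\.(?P<src>\d+) \[[^\]]*\] "
    r"to osd\.(?P<dst>\d+) \[[^\]]*\] (?P<ms>[\d.]+) msec"
)


def parse_slow_heartbeats(lines):
    """Return (interface, from_osd, to_osd, milliseconds) for each detail line."""
    results = []
    for line in lines:
        m = SLOW_RE.search(line)
        if m:
            results.append((m["iface"], int(m["src"]), int(m["dst"]), float(m["ms"])))
    return results


def distinct_pairs(records):
    """Count unordered OSD pairs, so that 0->1 and 1->0 count once."""
    return {frozenset((src, dst)) for _, src, dst, _ in records}


if __name__ == "__main__":
    sample = [
        "Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.1 [dc1,rack1] 1118.001 msec possibly improving",
        "Slow OSD heartbeats on back from osd.1 [dc1,rack1] to osd.0 [dc1,rack1] 1010.456 msec",
    ]
    print(len(distinct_pairs(parse_slow_heartbeats(sample))))
```

The summary line that begins with [WRN] does not match the pattern and is skipped.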
To see more detail and to collect a complete dump of network performance
information, use the dump_osd_network command. This command is usually sent
to a Ceph Manager Daemon, but it can be used to collect information about a
specific OSD’s interactions by sending it to that OSD. The default threshold
for a slow heartbeat is 1 second (1000 milliseconds), but this can be
overridden by providing a number of milliseconds as an argument.
To show all network performance data with a specified threshold of 0, send the following command to the mgr:
ceph daemon /var/run/ceph/ceph-mgr.x.asok dump_osd_network 0
{
    "threshold": 0,
    "entries": [
        {
            "last update": "Wed Sep 4 17:04:49 2019",
            "stale": false,
            "from osd": 2,
            "to osd": 0,
            "interface": "front",
            "average": {
                "1min": 1.023,
                "5min": 0.860,
                "15min": 0.883
            },
            "min": {
                "1min": 0.818,
                "5min": 0.607,
                "15min": 0.607
            },
            "max": {
                "1min": 1.164,
                "5min": 1.173,
                "15min": 1.544
            },
            "last": 0.924
        },
        {
            "last update": "Wed Sep 4 17:04:49 2019",
            "stale": false,
            "from osd": 2,
            "to osd": 0,
            "interface": "back",
            "average": {
                "1min": 0.968,
                "5min": 0.897,
                "15min": 0.830
            },
            "min": {
                "1min": 0.860,
                "5min": 0.563,
                "15min": 0.502
            },
            "max": {
                "1min": 1.171,
                "5min": 1.216,
                "15min": 1.456
            },
            "last": 0.845
        },
        {
            "last update": "Wed Sep 4 17:04:48 2019",
            "stale": false,
            "from osd": 0,
            "to osd": 1,
            "interface": "front",
            "average": {
                "1min": 0.965,
                "5min": 0.811,
                "15min": 0.850
            },
            "min": {
                "1min": 0.650,
                "5min": 0.488,
                "15min": 0.466
            },
            "max": {
                "1min": 1.252,
                "5min": 1.252,
                "15min": 1.362
            },
            "last": 0.791
        },
        ...
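Once the dump has been loaded as JSON, the entries can be filtered by their averaged ping times. The Python sketch below is a minimal illustration with a hypothetical function name. It assumes the key names shown in the output above and assumes that the averages are expressed in the same unit (milliseconds) as the threshold.

```python
def slow_entries(dump, threshold_ms, window="1min"):
    """Return entries whose average over `window` exceeds `threshold_ms`.

    Assumes the structure shown above: a top-level "entries" list whose
    items carry an "average" mapping keyed by "1min", "5min", and "15min".
    """
    return [
        entry
        for entry in dump["entries"]
        if entry["average"][window] > threshold_ms
    ]


if __name__ == "__main__":
    # Two entries trimmed from the example output above.
    dump = {
        "threshold": 0,
        "entries": [
            {"from osd": 2, "to osd": 0, "interface": "front",
             "average": {"1min": 1.023, "5min": 0.860, "15min": 0.883}},
            {"from osd": 2, "to osd": 0, "interface": "back",
             "average": {"1min": 0.968, "5min": 0.897, "15min": 0.830}},
        ],
    }
    for entry in slow_entries(dump, 1.0):
        print(entry["from osd"], entry["to osd"], entry["interface"])
```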
Muting Health Checks
Health checks can be muted so that they have no effect on the overall
reported status of the cluster. For example, if the cluster has raised a
single health check and then you mute that health check, then the cluster will
report a status of HEALTH_OK.
To mute a specific health check, use the health check code that corresponds to that health check (see Health checks), and
run the following command:
ceph health mute <code>
For example, to mute an OSD_DOWN health check, run the following command:
ceph health mute OSD_DOWN
Mutes are reported in both the short and long forms of the ceph health
command’s output.
For example, in the above scenario, the cluster would report:
ceph health
HEALTH_OK (muted: OSD_DOWN)
ceph health detail
HEALTH_OK (muted: OSD_DOWN)
(MUTED) OSD_DOWN 1 osds down
osd.1 is down
A mute can be removed by running the following command:
ceph health unmute <code>
For example:
ceph health unmute OSD_DOWN
A “health mute” can have a TTL (Time To Live) associated with it: this means that the mute will automatically expire after a specified period of time. The TTL is specified as an optional duration argument, as seen in the following examples:
ceph health mute OSD_DOWN 4h # mute for 4 hours
ceph health mute MON_DOWN 15m # mute for 15 minutes
Normally, if a muted health check is resolved (for example, if the OSD that
raised the OSD_DOWN health check in the example above has come back up), the
mute goes away. If the health check comes back later, it will be reported in
the usual way.
It is possible to make a health mute “sticky”: this means that the mute will remain even if the health check clears. For example, to make a health mute “sticky”, you might run the following command:
ceph health mute OSD_DOWN 1h --sticky # ignore any/all down OSDs for next hour
Most health mutes disappear if the unhealthy condition that triggered the health check gets worse. For example, suppose that there is one OSD down and the health check is muted. In that case, if one or more additional OSDs go down, then the health mute disappears. This behavior occurs in any health check with a threshold value.
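The rules above for when a mute goes away can be summarized as a small decision function. The Python sketch below is a model of the behavior as described in this section, not Ceph's implementation. The class and function names are hypothetical, TTL expiry is left out, and the sketch assumes that a worsening condition drops the mute even when the mute is sticky, which this section does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Mute:
    code: str               # health check code, for example "OSD_DOWN"
    count_when_muted: int   # for example, the number of OSDs down at mute time
    sticky: bool = False


def mute_survives(mute, current_count):
    """Decide whether the mute is still in effect for the current count."""
    if current_count > mute.count_when_muted:
        # The condition got worse. Assumption: this applies to sticky mutes too.
        return False
    if current_count == 0 and not mute.sticky:
        # The health check resolved, and only sticky mutes outlive that.
        return False
    return True


if __name__ == "__main__":
    mute = Mute("OSD_DOWN", count_when_muted=1)
    print(mute_survives(mute, 1))  # unchanged condition
    print(mute_survives(mute, 2))  # one more OSD went down
```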
Checking a Cluster’s Usage Stats
To check a cluster’s data usage and data distribution among pools, use the
df command. This command is similar to Linux’s df command. Run the
following command:
ceph df
The output of ceph df resembles the following:
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
ssd    202 GiB  200 GiB  2.0 GiB   2.0 GiB       1.00
TOTAL  202 GiB  200 GiB  2.0 GiB   2.0 GiB       1.00

--- POOLS ---
POOL                   ID  PGS   STORED   (DATA)   (OMAP)  OBJECTS     USED   (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
device_health_metrics   1    1  242 KiB   15 KiB  227 KiB        4  251 KiB   24 KiB  227 KiB      0    297 GiB            N/A          N/A      4         0 B          0 B
cephfs.a.meta           2   32  6.8 KiB  6.8 KiB      0 B       22   96 KiB   96 KiB      0 B      0    297 GiB            N/A          N/A     22         0 B          0 B
cephfs.a.data           3   32      0 B      0 B      0 B        0      0 B      0 B      0 B      0     99 GiB            N/A          N/A      0         0 B          0 B
test                    4   32   22 MiB   22 MiB   50 KiB      248   19 MiB   19 MiB   50 KiB      0    297 GiB            N/A          N/A    248         0 B          0 B
CLASS: For example, “ssd” or “hdd”.
SIZE: The amount of storage capacity managed by the cluster.
AVAIL: The amount of free space available in the cluster.
USED: The amount of raw storage consumed by user data (excluding BlueStore’s database).
RAW USED: The amount of raw storage consumed by user data, internal overhead, and reserved capacity.
%RAW USED: The percentage of raw storage used. Watch this number in conjunction with full ratio and near full ratio to be forewarned when your cluster approaches the fullness thresholds. See Storage Capacity.
POOLS:
The POOLS section of the output provides a list of pools and the notional usage of each pool. This section of the output DOES NOT reflect replicas, clones, or snapshots. For example, if you store an object with 1MB of data, then the notional usage will be 1MB, but the actual usage might be 2MB or more depending on the number of replicas, clones, and snapshots.
ID: The numeric identifier of the pool.
STORED: The actual amount of data that the user has stored in a pool. This is similar to the USED column in earlier versions of Ceph, but the calculations (for BlueStore!) are more precise (in that gaps are properly handled).
(DATA): Usage for RBD (RADOS Block Device), CephFS file data, and RGW (RADOS Gateway) object data.
(OMAP): Key-value pairs. Used primarily by CephFS and RGW (RADOS Gateway) for metadata storage.
OBJECTS: The notional number of objects stored per pool (that is, the number of objects other than replicas, clones, or snapshots).
USED: The space allocated for a pool over all OSDs. This includes space for replication, space for allocation granularity, and space for the overhead associated with erasure-coding. Compression savings and object-content gaps are also taken into account. However, BlueStore’s database is not included in the amount reported under USED.
(DATA): Object usage for RBD (RADOS Block Device), CephFS file data, and RGW (RADOS Gateway) object data.
(OMAP): Object key-value pairs. Used primarily by CephFS and RGW (RADOS Gateway) for metadata storage.
%USED: The notional percentage of storage used per pool.
MAX AVAIL: An estimate of the notional amount of data that can be written to this pool.
QUOTA OBJECTS: The quota on the number of objects in the pool (N/A if no quota is set).
QUOTA BYTES: The quota on the number of bytes in the pool (N/A if no quota is set).
DIRTY: The number of objects in the cache pool that have been written to the cache pool but have not yet been flushed to the base pool. This field is available only when cache tiering is in use.
USED COMPR: The amount of space allocated for compressed data. This includes compressed data in addition to all of the space required for replication, allocation granularity, and erasure-coding overhead.
UNDER COMPR: The amount of data that has passed through compression (summed over all replicas) and that is worth storing in a compressed form.
Note
The numbers in the POOLS section are notional. They do not include the number of replicas, clones, or snapshots. As a result, the sum of the USED and %USED amounts in the POOLS section of the output will not be equal to the sum of the USED and %USED amounts in the RAW section of the output.
Note
The MAX AVAIL value is a complicated function of the replication or
the kind of erasure coding used, the CRUSH rule that maps storage to
devices, the utilization of those devices, and the configured
mon_osd_full_ratio setting.
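The %RAW USED figure can be reproduced from the RAW USED and SIZE columns and then compared against the fullness thresholds. The Python sketch below is a minimal illustration: the function names are hypothetical, and the ratio values passed in the usage example are placeholders, so substitute the ratios configured on your cluster.

```python
# Binary unit suffixes as they appear in the ceph df output above.
UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40, "PiB": 2**50}


def to_bytes(text):
    """Convert a value such as '202 GiB' to a number of bytes."""
    number, unit = text.split()
    return float(number) * UNITS[unit]


def raw_used_percent(raw_used, size):
    """Percentage of raw capacity consumed, from two size strings."""
    return 100.0 * to_bytes(raw_used) / to_bytes(size)


def fullness_state(percent, nearfull_ratio, full_ratio):
    """Classify usage against two thresholds given as fractions (0 to 1)."""
    used_fraction = percent / 100.0
    if used_fraction >= full_ratio:
        return "full"
    if used_fraction >= nearfull_ratio:
        return "nearfull"
    return "ok"


if __name__ == "__main__":
    percent = raw_used_percent("2.0 GiB", "202 GiB")
    # 0.85 and 0.95 are placeholder thresholds for this example.
    print(round(percent, 2), fullness_state(percent, 0.85, 0.95))
```

Because the table prints rounded sizes, a percentage recomputed from them can differ slightly from the %RAW USED column, which is presumably derived from exact byte counts.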
Checking OSD Status
To check if OSDs are up and in, run the following command:
ceph osd stat
Alternatively, you can run the following command:
ceph osd dump
To view OSDs according to their position in the CRUSH map, run the following command:
ceph osd tree
This command prints a CRUSH tree that displays each host, its OSDs, whether the
OSDs are up, and the weight of the OSDs. For example:
#ID CLASS WEIGHT  TYPE NAME             STATUS REWEIGHT PRI-AFF
 -1       3.00000 pool default
 -3       3.00000 rack mainrack
 -2       3.00000 host osd-host
  0   ssd 1.00000         osd.0             up  1.00000 1.00000
  1   ssd 1.00000         osd.1             up  1.00000 1.00000
  2   ssd 1.00000         osd.2             up  1.00000 1.00000
Checking Monitor Status
If your cluster has multiple monitors, then you need to perform certain “monitor status” checks. After starting the cluster and before reading or writing data, you should check quorum status. A quorum must be present when multiple monitors are running to ensure proper functioning of your Ceph cluster. Check monitor status regularly in order to ensure that all of the monitors are running.
To display the monitor map, run the following command:
ceph mon stat
Alternatively, you can run the following command:
ceph mon dump
To check the quorum status for the monitor cluster, run the following command:
ceph quorum_status
Ceph returns the quorum status. For example, a Ceph cluster that consists of three monitors might return the following:
{
    "election_epoch": 10,
    "quorum": [0, 1, 2],
    "quorum_names": ["a", "b", "c"],
    "quorum_leader_name": "a",
    "monmap": {
        "epoch": 1,
        "fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
        "modified": "2011-12-12 13:28:27.505520",
        "created": "2011-12-12 13:28:27.505520",
        "features": {
            "persistent": ["kraken", "luminous", "mimic"],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "a",
                "addr": "127.0.0.1:6789/0",
                "public_addr": "127.0.0.1:6789/0"
            },
            {
                "rank": 1,
                "name": "b",
                "addr": "127.0.0.1:6790/0",
                "public_addr": "127.0.0.1:6790/0"
            },
            {
                "rank": 2,
                "name": "c",
                "addr": "127.0.0.1:6791/0",
                "public_addr": "127.0.0.1:6791/0"
            }
        ]
    }
}
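A script can compare the monitors listed in the monitor map against the quorum members to spot a monitor that has dropped out. The Python sketch below is a minimal illustration with a hypothetical function name. It assumes the key names shown in the example output above (quorum_names, monmap, mons, name).

```python
def monitors_out_of_quorum(quorum_status):
    """Return the names of monitors in the monmap that are not in quorum."""
    in_quorum = set(quorum_status["quorum_names"])
    return [
        mon["name"]
        for mon in quorum_status["monmap"]["mons"]
        if mon["name"] not in in_quorum
    ]


if __name__ == "__main__":
    # Trimmed from the example above, with monitor "c" removed from the
    # quorum to show what a degraded result looks like.
    status = {
        "quorum_names": ["a", "b"],
        "monmap": {
            "mons": [
                {"rank": 0, "name": "a"},
                {"rank": 1, "name": "b"},
                {"rank": 2, "name": "c"},
            ]
        },
    }
    print(monitors_out_of_quorum(status))
```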
Checking MDS Status
Metadata servers provide metadata services for CephFS. Metadata servers have
two sets of states: up | down and active | inactive. To check if your
metadata servers are up and active, run the following command:
ceph mds stat
To display details of the metadata servers, run the following command:
ceph fs dump
Checking Placement Group States
Placement groups (PGs) map objects to OSDs. PGs are monitored in order to
ensure that they are active and clean. See Monitoring OSDs and
Placement Groups.
Using the Admin Socket
The Ceph admin socket allows you to query a daemon via a socket interface. By
default, Ceph sockets reside under /var/run/ceph. To access a daemon via
the admin socket, log in to the host that is running the daemon and run one of
the two following commands:
ceph daemon {daemon-name}
ceph daemon {path-to-socket-file}
For example, the following commands are equivalent to each other:
ceph daemon osd.0 foo
ceph daemon /var/run/ceph/ceph-osd.0.asok foo
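The equivalence between a daemon name and its socket path follows a visible naming pattern. The Python sketch below derives the path from the name, based only on the examples in this document (ceph-osd.0.asok and ceph-mgr.x.asok). The function name is hypothetical, and the cluster prefix "ceph" and the run directory are assumptions that hold only for default settings.

```python
def default_socket_path(daemon_name, run_dir="/var/run/ceph", cluster="ceph"):
    """Build the admin socket path for a daemon name such as 'osd.0'.

    Assumes the default layout <run_dir>/<cluster>-<type>.<id>.asok.
    """
    daemon_type, daemon_id = daemon_name.split(".", 1)
    return f"{run_dir}/{cluster}-{daemon_type}.{daemon_id}.asok"


if __name__ == "__main__":
    print(default_socket_path("osd.0"))
```

If the socket location has been changed in the configuration, the actual path will differ from what this helper returns.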
There are two methods of running admin socket commands: (1) using ceph daemon
as described above, which bypasses the monitor and assumes a direct login to
the daemon’s host, and (2) using the ceph tell {daemon-type}.{id} command,
which is relayed by monitors and does not require access to the daemon’s host.
Use the raise command to send a signal to a daemon, as if by running
kill -X {daemon.pid}. When run via ceph tell, it allows you to signal a daemon
without access to its host:
ceph daemon {daemon-name} raise HUP
ceph tell {daemon-type}.{id} raise -9
To view the available admin-socket commands, run the following command:
ceph daemon {daemon-name} help
Admin-socket commands enable you to view and set your configuration at runtime. For more on viewing your configuration, see Viewing a Configuration at Runtime.