Monitoring Stack with Cephadm

Ceph Dashboard uses Prometheus, Grafana, and related tools to store and visualize detailed metrics on cluster utilization and performance. Ceph users have three options:

  1. Have cephadm deploy and configure these services. This is the default when bootstrapping a new cluster unless the --skip-monitoring-stack option is used.

  2. Deploy and configure these services manually. This is recommended for users who already run Prometheus in their environment, and for clusters where Ceph runs in Kubernetes with Rook.

  3. Skip the monitoring stack completely. Some Ceph dashboard graphs will not be available.

The monitoring stack consists of Prometheus, Prometheus exporters (Prometheus Module, Node exporter), Prometheus Alert Manager and Grafana.

Note

Prometheus’ security model presumes that untrusted users have access to the Prometheus HTTP endpoint and logs. Untrusted users have access to all the (meta)data Prometheus collects that is contained in the database, plus a variety of operational and debugging information.

However, Prometheus’ HTTP API is limited to read-only operations. Configurations cannot be changed using the API, and secrets are not exposed. Moreover, Prometheus has some built-in measures to mitigate the impact of denial of service attacks.

Please see Prometheus’ Security model <https://prometheus.io/docs/operating/security/> for more detailed information.

By default, bootstrap deploys a basic monitoring stack. If you skipped this (by passing --skip-monitoring-stack), or if you converted an existing cluster to cephadm management, you can set up monitoring by following the steps below.

  1. Enable the prometheus module in the ceph-mgr daemon. This exposes the internal Ceph metrics so that prometheus can scrape them.

    ceph mgr module enable prometheus
    
  2. Deploy a node-exporter service on every node of the cluster. The node-exporter provides host-level metrics like CPU and memory utilization.

    ceph orch apply node-exporter '*'
    
  3. Deploy alertmanager

    ceph orch apply alertmanager 1
    
  4. Deploy Prometheus. A single Prometheus instance is sufficient, but for high availability (HA) you may want to deploy two.

    ceph orch apply prometheus 1    # or 2
    
  5. Deploy grafana

    ceph orch apply grafana 1
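
The five steps above can be collected into one shell function. This is only a sketch: the function name deploy_monitoring_stack is hypothetical, the placement counts are the ones used in the steps, and it assumes the ceph CLI is on the PATH with admin access to the cluster.

```shell
# Hypothetical helper: run the five deployment steps in order and
# stop at the first command that fails.
deploy_monitoring_stack() {
    ceph mgr module enable prometheus &&
    ceph orch apply node-exporter '*' &&
    ceph orch apply alertmanager 1 &&
    ceph orch apply prometheus 1 &&
    ceph orch apply grafana 1
}
```

Calling deploy_monitoring_stack returns a non-zero status as soon as one step fails, so later services are not applied on top of a failed earlier step.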
    

Cephadm takes care of the configuration of Prometheus, Grafana, and Alertmanager automatically.

However, there is one exception to this rule. In some setups, the Dashboard user’s browser might not be able to access the Grafana URL configured in Ceph Dashboard. One such scenario is when the cluster and the accessing user are in different DNS zones.

For this case, Ceph Dashboard has an extra configuration option that sets the URL the user’s browser uses to reach Grafana. cephadm never alters this value. To set this option, issue the following command:

$ ceph dashboard set-grafana-frontend-api-url <grafana-server-api>

It may take a minute or two for the services to be deployed. Once deployment has completed, ceph orch ls should show something like this:

$ ceph orch ls
NAME           RUNNING  REFRESHED  IMAGE NAME                                      IMAGE ID        SPEC
alertmanager       1/1  6s ago     docker.io/prom/alertmanager:latest              0881eb8f169f  present
crash              2/2  6s ago     docker.io/ceph/daemon-base:latest-master-devel  mix           present
grafana            1/1  0s ago     docker.io/pcuzner/ceph-grafana-el8:latest       f77afcf0bcf6   absent
node-exporter      2/2  6s ago     docker.io/prom/node-exporter:latest             e5a616e4b9cf  present
prometheus         1/1  6s ago     docker.io/prom/prometheus:latest                e935122ab143  present
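
To check the result from a script rather than by eye, the RUNNING column can be parsed. The sketch below assumes the column layout shown in the sample output (service name first, "running/expected" second), which may differ between releases; the function name monitoring_not_ready is hypothetical.

```shell
# Hypothetical helper: print each monitoring service whose RUNNING column
# (for example "1/2") shows fewer running daemons than expected.
# Assumes the `ceph orch ls` column layout shown in the sample output.
monitoring_not_ready() {
    ceph orch ls | awk '
        $1 ~ /^(alertmanager|grafana|node-exporter|prometheus)$/ {
            split($2, n, "/")
            if (n[1] != n[2]) print $1
        }'
}
```

An empty result means every listed monitoring service reports all of its daemons as running.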

Configuring SSL/TLS for Grafana

cephadm will deploy Grafana using the certificate defined in the ceph key/value store. If a certificate is not specified, cephadm will generate a self-signed certificate during deployment of the Grafana service.

A custom certificate can be configured using the following commands.

ceph config-key set mgr/cephadm/grafana_key -i $PWD/key.pem
ceph config-key set mgr/cephadm/grafana_crt -i $PWD/certificate.pem

The cephadm manager module needs to be restarted to be able to read updates to these keys.

ceph orch restart mgr

If you already deployed Grafana, you need to redeploy the service for the configuration to be updated.

ceph orch redeploy grafana

The redeploy command also takes care of setting the right URL for Ceph Dashboard.
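
The certificate steps above can be wrapped in a small function that first checks the files are readable. This is a sketch under assumptions: the function name apply_grafana_cert is hypothetical, and it resolves both paths to absolute ones, mirroring the $PWD usage in the commands above.

```shell
# Hypothetical helper: store a Grafana key/certificate pair and roll it out.
# Usage: apply_grafana_cert KEY_FILE CERT_FILE
# Note: plain POSIX sh has no `local`, so these variables are global.
apply_grafana_cert() {
    key=${1:-}
    crt=${2:-}
    for f in "$key" "$crt"; do
        if [ ! -r "$f" ]; then
            echo "cannot read file: $f" >&2
            return 1
        fi
    done
    # Resolve absolute paths before handing the files to `-i`.
    key_abs="$(cd "$(dirname "$key")" && pwd)/$(basename "$key")"
    crt_abs="$(cd "$(dirname "$crt")" && pwd)/$(basename "$crt")"
    ceph config-key set mgr/cephadm/grafana_key -i "$key_abs" &&
    ceph config-key set mgr/cephadm/grafana_crt -i "$crt_abs" &&
    ceph orch restart mgr &&
    ceph orch redeploy grafana
}
```

For example, apply_grafana_cert key.pem certificate.pem stores both files, restarts the manager, and redeploys Grafana, stopping at the first step that fails.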

Using custom images

Monitoring components can be installed or upgraded from other images. To do so, first store the name of the image in the configuration. The following configuration options are available:

  • container_image_prometheus

  • container_image_grafana

  • container_image_alertmanager

  • container_image_node_exporter

Custom images can be set with the ceph config command:

ceph config set mgr mgr/cephadm/<option_name> <value>

For example:

ceph config set mgr mgr/cephadm/container_image_prometheus prom/prometheus:v1.4.1

Note

Setting a custom image overrides the default value but does not overwrite it. The default value changes when updates become available, but a component with a custom image is no longer updated automatically. To install updates for it, you must update the configuration (image name and tag) manually.

If you later decide to follow the defaults again, remove the custom image you set. The default value is then used again. Use ceph config rm to reset the configuration option:

ceph config rm mgr mgr/cephadm/<option_name>

For example:

ceph config rm mgr mgr/cephadm/container_image_prometheus
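
Because the option name is easy to mistype, the set and reset commands can be wrapped in a function that only accepts the four components listed above. This is a sketch; the function name monitoring_image and its argument order are hypothetical.

```shell
# Hypothetical helper: set or reset the custom image of one monitoring
# component. Only the four option suffixes listed above are accepted.
# Usage: monitoring_image set COMPONENT IMAGE
#        monitoring_image reset COMPONENT
monitoring_image() {
    action=${1:-}
    component=${2:-}
    image=${3:-}
    case "$component" in
        prometheus|grafana|alertmanager|node_exporter) ;;
        *) echo "unknown component: $component" >&2; return 1 ;;
    esac
    option="mgr/cephadm/container_image_$component"
    case "$action" in
        set)
            [ -n "$image" ] || { echo "image name required" >&2; return 1; }
            ceph config set mgr "$option" "$image" ;;
        reset)
            ceph config rm mgr "$option" ;;
        *)
            echo "usage: monitoring_image set|reset COMPONENT [IMAGE]" >&2
            return 1 ;;
    esac
}
```

For example, monitoring_image set prometheus prom/prometheus:v1.4.1 runs the same command as the example above, and monitoring_image reset prometheus restores the default.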

Using custom configuration files

By overriding cephadm templates, it is possible to completely customize the configuration files for monitoring services.

Internally, cephadm already uses Jinja2 templates to generate the configuration files for all monitoring components. To customize the configuration of Prometheus, Grafana, or Alertmanager, you can store a Jinja2 template for each service; cephadm then uses it instead of the built-in template when generating the configuration. The template is evaluated every time a service of that kind is deployed or reconfigured, so the custom configuration is preserved and applied automatically on future deployments of these services.

Note

A custom template is also kept when the default configuration of cephadm changes. To pick up the updated defaults, you must migrate the custom template manually.

Option names

The following templates for files that will be generated by cephadm can be overridden. These are the names to be used when storing with ceph config-key set:

  • alertmanager_alertmanager.yml

  • grafana_ceph-dashboard.yml

  • grafana_grafana.ini

  • prometheus_prometheus.yml

You can look up the file templates that are currently used by cephadm in src/pybind/mgr/cephadm/templates:

  • services/alertmanager/alertmanager.yml.j2

  • services/grafana/ceph-dashboard.yml.j2

  • services/grafana/grafana.ini.j2

  • services/prometheus/prometheus.yml.j2

Usage

The following command applies a single line value:

ceph config-key set mgr/cephadm/<option_name> <value>

To use the contents of a file as the template, pass the file with the -i argument:

ceph config-key set mgr/cephadm/<option_name> -i $PWD/<filename>

Note

When using files as input to config-key, an absolute path to the file must be used.

After setting a configuration option, restart the cephadm mgr module. Then recreate the configuration file for the service with redeploy. The following example shows the details.

Example

# set the contents of ./prometheus.yml.j2 as template
ceph config-key set mgr/cephadm/services_prometheus_prometheus.yml \
  -i $PWD/prometheus.yml.j2

# restart cephadm mgr module
ceph orch restart mgr

# redeploy the prometheus service
ceph orch redeploy prometheus
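
Going back to the built-in template can be sketched the same way. The function below is hypothetical and rests on an assumption not stated above: that once the stored key is removed with ceph config-key rm, cephadm uses its own template again after the restart and redeploy.

```shell
# Hypothetical helper: drop a stored custom template and regenerate the
# service configuration. Assumption: with no stored template, cephadm
# falls back to its built-in one.
# Usage: reset_monitoring_template OPTION_NAME SERVICE
reset_monitoring_template() {
    option=${1:-}
    service=${2:-}
    if [ -z "$option" ] || [ -z "$service" ]; then
        echo "usage: reset_monitoring_template OPTION_NAME SERVICE" >&2
        return 1
    fi
    ceph config-key rm "mgr/cephadm/$option" &&
    ceph orch restart mgr &&
    ceph orch redeploy "$service"
}
```

OPTION_NAME must be exactly the key name that was used with ceph config-key set when the template was stored.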

Disabling monitoring

If you have deployed monitoring and would like to remove it, you can do so with:

ceph orch rm grafana
ceph orch rm prometheus --force   # this will delete metrics data collected so far
ceph orch rm node-exporter
ceph orch rm alertmanager
ceph mgr module disable prometheus

Deploying monitoring manually

If you have an existing Prometheus monitoring infrastructure, or would like to manage it yourself, you need to configure it to integrate with your Ceph cluster.

  • Enable the prometheus module in the ceph-mgr daemon

    ceph mgr module enable prometheus
    

    By default, ceph-mgr presents Prometheus metrics on port 9283 on each host running a ceph-mgr daemon. Configure Prometheus to scrape these endpoints.

  • To enable the dashboard’s prometheus-based alerting, see Enabling Prometheus Alerting.

  • To enable dashboard integration with Grafana, see Enabling the Embedding of Grafana Dashboards.
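
For the scrape configuration mentioned in the first item, a minimal sketch follows. It writes a Prometheus scrape job that targets port 9283 on each ceph-mgr host passed as an argument. The function name, the job name 'ceph', and any host names you pass are placeholders; honor_labels: true is included on the assumption that you want to keep the labels ceph-mgr attaches to its metrics, as the prometheus manager module documentation recommends.

```shell
# Hypothetical helper: write a Prometheus scrape job for the ceph-mgr
# prometheus module (default port 9283).
# Usage: write_ceph_scrape_job OUTFILE MGR_HOST...
write_ceph_scrape_job() {
    if [ "$#" -lt 2 ]; then
        echo "usage: write_ceph_scrape_job OUTFILE MGR_HOST..." >&2
        return 1
    fi
    outfile=$1
    shift
    {
        echo "scrape_configs:"
        echo "  - job_name: 'ceph'"
        echo "    honor_labels: true"
        echo "    static_configs:"
        echo "      - targets:"
        for host in "$@"; do
            echo "          - '$host:9283'"
        done
    } > "$outfile"
}
```

The generated fragment would be merged into the scrape_configs section of your existing prometheus.yml rather than replacing that file.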

Enabling RBD-Image monitoring

For performance reasons, monitoring of RBD images is disabled by default. For more information, see RBD IO statistics. While it is disabled, the overview and details dashboards stay empty in Grafana and the metrics are not visible in Prometheus.
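
As a sketch of what enabling it can look like: the prometheus manager module reads a comma-separated pool list from the mgr/prometheus/rbd_stats_pools option. The function name below is hypothetical, the pool names you pass are your own, and the option name should be checked against the RBD IO statistics documentation for your release.

```shell
# Hypothetical helper: enable per-image RBD statistics for the given pools
# by joining the pool names with commas.
# Usage: enable_rbd_stats POOL...
enable_rbd_stats() {
    if [ "$#" -eq 0 ]; then
        echo "usage: enable_rbd_stats POOL..." >&2
        return 1
    fi
    # "$*" joins the arguments with the first character of IFS.
    pools=$(IFS=,; echo "$*")
    ceph config set mgr mgr/prometheus/rbd_stats_pools "$pools"
}
```

Only list the pools you actually need, since collecting per-image statistics is the performance cost that keeps this feature off by default.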