Compliance Check

The stability and reliability of a Ceph cluster is dependent not just upon the Ceph daemons, but also the OS and hardware that Ceph is installed on. This document is intended to promote a design discussion for providing a “compliance” feature within mgr/cephadm, which would be responsible for identifying common platform-related issues that could impact Ceph stability and operation.

The ultimate goal of these checks is to identify issues early and raise a healthcheck WARN event, to alert the Administrator to the issue.

Prerequisites

In order to effectively analyse the hosts that Ceph is deployed to, this feature requires a cache of host-related metadata. The metadata is already available from cephadm’s HostFacts class and the gather-facts cephadm command. For the purposes of this document, we will assume that this data is available within the mgr/cephadm “cache” structure.

Some checks will require that the host status is also populated e.g. ONLINE, OFFLINE, MAINTENANCE

Administrator Interaction

Not all users will require this feature, and must be able to ‘opt out’. For this reason, mgr/cephadm must provide controls, such as the following;

ceph cephadm compliance enable | disable | status [--format json]
ceph cephadm compliance ls [--format json]
ceph cephadm compliance enable-check <name>
ceph cephadm compliance disable-check <name>
ceph cephadm compliance set-check-interval <int>
ceph cephadm compliance get-check-interval

The status option would show the enabled/disabled state of the feature, along with the check-interval.

The ls subcommand would show all checks in the following format;

check-name status description

Proposed Integration

The compliance checks are not required to run all the time, but instead should run at discrete intervals. The interval would be configurable under via the set-check-interval subcommand (default would be every 12 hours)

mgr/cephadm currently executes an event driven (time based) serve loop to act on deploy/remove and reconcile activity. In order to execute the compliance checks, the compliance check code would be called from this main serve loop - when the set-check-interval is met.

Proposed Checks

All checks would push any errors to a list, so multiple issues can be escalated to the Admin at the same time. The list below provides a description of each check, with the text following the name indicating a shortname version (the shortname is the reference for command Interaction when enabling or disabling a check)

OS Consistency (OS)

  • all hosts must use same vendor

  • all hosts must be on the same major release (this check would only be applicable to distributions that offer a long-term-support strategy (RHEL, CentOS, SLES, Ubuntu etc)

src: gather-facts output

Linux Kernel Security Mode (LSM)

  • All hosts should have a consistent SELINUX/AppArmor configuration

src: gather-facts output

Services Check (SERVICES)

Hosts that are in an ONLINE state should adhere to the following;

  • all daemons (systemd units) should be enabled

  • all daemons should be running (not dead)

src: list_daemons output

Support Status (SUPPORT)

If support status has been detected, it should be consistent across all hosts. At this point support status is available only for Red Hat machines.

src: gather-facts output

Network : MTU (MTU)

All network interfaces on the same Ceph network (public/cluster) should have the same MTU

src: gather-facts output

Network : LinkSpeed (LINKSPEED)

All network interfaces on the same Ceph network (public/cluster) should have the same Linkspeed

src: gather-facts output

Network : Consistency (INTERFACE)

All hosts with OSDs should have consistent network configuration - eg. if some hosts do not separate cluster/public traffic but others do, that is an anomaly that would generate a compliance check warning.

src: gather-facts output

Notification Strategy

If any of the checks fail, mgr/cephadm would raise a WARN level alert

Futures

The checks highlighted here serve only as a starting point, and we should expect to expand on the checks over time.