crimson
Crimson is the code name of crimson-osd
, which is the next
generation ceph-osd
. It improves performance when using fast network
and storage devices, employing state-of-the-art technologies including
DPDK and SPDK. BlueStore continues to support HDDs and slower SSDs.
Crimson aims to be backward compatible with the classic ceph-osd
.
Building Crimson
Crimson is not enabled by default. Enable it at build time by running:
$ WITH_SEASTAR=true ./install-deps.sh
$ mkdir build && cd build
$ cmake -DWITH_SEASTAR=ON ..
Please note, ASan is enabled by default if Crimson is built from a source
cloned using git
.
Testing crimson with cephadm
The Ceph CI/CD pipeline builds containers with
crimson-osd
subsitituted for ceph-osd
.
Once a branch at commit <sha1> has been built and is available in
shaman
, you can deploy it using the cephadm instructions outlined
in Cephadm with the following adaptations.
First, while performing the initial bootstrap, use the --image
flag to
use a Crimson build:
cephadm --image quay.ceph.io/ceph-ci/ceph:<sha1>-crimson --allow-mismatched-release bootstrap ...
You’ll likely need to supply the --allow-mismatched-release
flag to
use a non-release branch.
Additionally, prior to deploying OSDs, you’ll need enable Crimson to direct the default pools to be created as Crimson pools. From the cephadm shell run:
ceph config set global 'enable_experimental_unrecoverable_data_corrupting_features' crimson
ceph osd set-allow-crimson --yes-i-really-mean-it
ceph config set mon osd_pool_default_crimson true
The first command enables the crimson
experimental feature. Crimson
is highly experimental, and malfunctions including crashes
and data loss are to be expected.
The second enables the allow_crimson
OSDMap flag. The monitor will
not allow crimson-osd
to boot without that flag.
The last causes pools to be created by default with the crimson
flag.
Crimson pools are restricted to operations supported by Crimson.
Crimson-osd
won’t instantiate PGs from non-Crimson pools.
Running Crimson
As you might expect, Crimson does not yet have as extensive a feature set as does ceph-osd
.
object store backend
At the moment, crimson-osd
offers both native and alienized object store
backends. The native object store backends perform IO using the SeaStar reactor.
They are:
- cyanstore
CyanStore is modeled after memstore in the classic OSD.
- seastore
Seastore is still under active development.
The alienized object store backends are backed by a thread pool, which is a proxy of the alienstore adaptor running in Seastar. The proxy issues requests to object stores running in alien threads, i.e., worker threads not managed by the Seastar framework. They are:
- memstore
The memory backed object store
- bluestore
The object store used by the classic
ceph-osd
daemonize
Unlike ceph-osd
, crimson-osd
does not daemonize itself even if the
daemonize
option is enabled. In order to read this option, crimson-osd
needs to ready its config sharded service, but this sharded service lives
in the Seastar reactor. If we fork a child process and exit the parent after
starting the Seastar engine, that will leave us with a single thread which is
a replica of the thread that called fork(). Tackling this problem in Crimson
would unnecessarily complicate the code.
Since supported GNU/Linux distributions use systemd
, which is able to
daemonize the application, there is no need to daemonize ourselves.
Those using sysvinit can use start-stop-daemon
to daemonize crimson-osd
.
If this is does not work out, a helper utility may be devised.
logging
Crimson-osd
currently uses the logging utility offered by Seastar. See
src/common/dout.h
for the mapping between Ceph logging levels to
the severity levels in Seastar. For instance, messages sent to derr
will be issued using logger::error()
, and the messages with a debug level
greater than 20
will be issued using logger::trace()
.
ceph |
seastar |
< 0 |
error |
0 |
warn |
[1, 6) |
info |
[6, 20] |
debug |
> 20 |
trace |
Note that crimson-osd
does not send log messages directly to a specified log_file
. It writes
the logging messages to stdout and/or syslog. This behavior can be
changed using --log-to-stdout
and --log-to-syslog
command line
options. By default, log-to-stdout
is enabled, and --log-to-syslog
is disabled.
vstart.sh
The following options can be used with vstart.sh
.
--crimson
Start
crimson-osd
instead ofceph-osd
.--nodaemon
Do not daemonize the service.
--redirect-output
Redirect the
stdout
andstderr
toout/$type.$num.stdout
.--osd-args
Pass extra command line options to
crimson-osd
orceph-osd
. This is useful for passing Seastar options tocrimson-osd
. For example, one can supply--osd-args "--memory 2G"
to set the amount of memory to use. Please refer to the output of:crimson-osd --help-seastar
for additional Seastar-specific command line options.
--cyanstore
Use CyanStore as the object store backend.
--bluestore
Use the alienized BlueStore as the object store backend. This is the default.
--memstore
Use the alienized MemStore as the object store backend.
--seastore
Use SeaStore as the back end object store.
--seastore-devs
Specify the block device used by SeaStore.
--seastore-secondary-devs
Optional. SeaStore supports multiple devices. Enable this feature by passing the block device to this option.
--seastore-secondary-devs-type
Optional. Specify the type of secondary devices. When the secondary device is slower than main device passed to
--seastore-devs
, the cold data in faster device will be evicted to the slower devices over time. Valid types includeHDD
,SSD``(default), ``ZNS
, andRANDOM_BLOCK_SSD
Note secondary devices should not be faster than the main device.
To start a cluster with a single Crimson node, run:
$ MGR=1 MON=1 OSD=1 MDS=0 RGW=0 ../src/vstart.sh -n -x \
--without-dashboard --cyanstore \
--crimson --redirect-output \
--osd-args "--memory 4G"
Here we assign 4 GiB memory and a single thread running on core-0 to crimson-osd
.
Another SeaStore example:
$ MGR=1 MON=1 OSD=1 MDS=0 RGW=0 ../src/vstart.sh -n -x \
--without-dashboard --seastore \
--crimson --redirect-output \
--seastore-devs /dev/sda \
--seastore-secondary-devs /dev/sdb \
--seastore-secondary-devs-type HDD
Stop this vstart
cluster by running:
$ ../src/stop.sh --crimson
Metrics and Tracing
Crimson offers three ways to report stats and metrics.
pg stats reported to mgr
Crimson collects the per-pg, per-pool, and per-osd stats in a MPGStats message which is sent to the Ceph Managers. Manager modules can query them using the MgrModule.get() method.
asock command
An admin socket command is offered for dumping metrics:
$ ceph tell osd.0 dump_metrics
$ ceph tell osd.0 dump_metrics reactor_utilization
Here reactor_utilization is an optional string allowing us to filter the dumped metrics by prefix.
Prometheus text protocol
The listening port and address can be configured using the command line options of --prometheus_port see Prometheus for more details.
Profiling Crimson
fio
crimson-store-nbd
exposes configurable FuturizedStore
internals as an
NBD server for use with fio
.
In order to use fio
to test crimson-store-nbd
, perform the below steps.
You will need to install
libnbd
, and compile it intofio
apt-get install libnbd-dev git clone git://git.kernel.dk/fio.git cd fio ./configure --enable-libnbd make
Build
crimson-store-nbd
cd build ninja crimson-store-nbd
Run the
crimson-store-nbd
server with a block device. Specify the path to the raw device, for example/dev/nvme1n1
, in place of the created file for testing with a block device.export disk_img=/tmp/disk.img export unix_socket=/tmp/store_nbd_socket.sock rm -f $disk_img $unix_socket truncate -s 512M $disk_img ./bin/crimson-store-nbd \ --device-path $disk_img \ --smp 1 \ --mkfs true \ --type transaction_manager \ --uds-path ${unix_socket} &
Below are descriptions of these command line arguments:
--smp
The number of CPU cores to use (Symmetric MultiProcessor)
--mkfs
Initialize the device first.
--type
The back end to use. If
transaction_manager
is specified, SeaStore’sTransactionManager
andBlockSegmentManager
are used to emulate a block device. Otherwise, this option is used to choose a backend ofFuturizedStore
, where the whole “device” is divided into multiple fixed-size objects whose size is specified by--object-size
. So, if you are only interested in testing the lower-level implementation of SeaStore like logical address translation layer and garbage collection without the object store semantics,transaction_manager
would be a better choice.
Create a
fio
job file namednbd.fio
[global] ioengine=nbd uri=nbd+unix:///?socket=${unix_socket} rw=randrw time_based runtime=120 group_reporting iodepth=1 size=512M [job0] offset=0
Test the Crimson object store, using the custom
fio
built just now./fio nbd.fio
CBT
We can use cbt for performance tests:
$ git checkout main
$ make crimson-osd
$ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/baseline ../src/test/crimson/cbt/radosbench_4K_read.yaml
$ git checkout yet-another-pr
$ make crimson-osd
$ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/yap ../src/test/crimson/cbt/radosbench_4K_read.yaml
$ ~/dev/cbt/compare.py -b /tmp/baseline -a /tmp/yap -v
19:48:23 - INFO - cbt - prefill/gen8/0: bandwidth: (or (greater) (near 0.05)):: 0.183165/0.186155 => accepted
19:48:23 - INFO - cbt - prefill/gen8/0: iops_avg: (or (greater) (near 0.05)):: 46.0/47.0 => accepted
19:48:23 - WARNING - cbt - prefill/gen8/0: iops_stddev: (or (less) (near 0.05)):: 10.4403/6.65833 => rejected
19:48:23 - INFO - cbt - prefill/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.340868/0.333712 => accepted
19:48:23 - INFO - cbt - prefill/gen8/1: bandwidth: (or (greater) (near 0.05)):: 0.190447/0.177619 => accepted
19:48:23 - INFO - cbt - prefill/gen8/1: iops_avg: (or (greater) (near 0.05)):: 48.0/45.0 => accepted
19:48:23 - INFO - cbt - prefill/gen8/1: iops_stddev: (or (less) (near 0.05)):: 6.1101/9.81495 => accepted
19:48:23 - INFO - cbt - prefill/gen8/1: latency_avg: (or (less) (near 0.05)):: 0.325163/0.350251 => accepted
19:48:23 - INFO - cbt - seq/gen8/0: bandwidth: (or (greater) (near 0.05)):: 1.24654/1.22336 => accepted
19:48:23 - INFO - cbt - seq/gen8/0: iops_avg: (or (greater) (near 0.05)):: 319.0/313.0 => accepted
19:48:23 - INFO - cbt - seq/gen8/0: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0 => accepted
19:48:23 - INFO - cbt - seq/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.0497733/0.0509029 => accepted
19:48:23 - INFO - cbt - seq/gen8/1: bandwidth: (or (greater) (near 0.05)):: 1.22717/1.11372 => accepted
19:48:23 - INFO - cbt - seq/gen8/1: iops_avg: (or (greater) (near 0.05)):: 314.0/285.0 => accepted
19:48:23 - INFO - cbt - seq/gen8/1: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0 => accepted
19:48:23 - INFO - cbt - seq/gen8/1: latency_avg: (or (less) (near 0.05)):: 0.0508262/0.0557337 => accepted
19:48:23 - WARNING - cbt - 1 tests failed out of 16
Here we compile and run the same test against two branches: main
and yet-another-pr
.
We then compare the results. Along with every test case, a set of rules is defined to check for
performance regressions when comparing the sets of test results. If a possible regression is found, the rule and
corresponding test results are highlighted.
Hacking Crimson
Seastar Documents
See Seastar Tutorial . Or build a browsable version and start an HTTP server:
$ cd seastar
$ ./configure.py --mode debug
$ ninja -C build/debug docs
$ python3 -m http.server -d build/debug/doc/html
You might want to install pandoc
and other dependencies beforehand.
Debugging Crimson
Debugging with GDB
The tips for debugging Scylla also apply to Crimson.
Human-readable backtraces with addr2line
When a Seastar application crashes, it leaves us with a backtrace of addresses, like:
Segmentation fault.
Backtrace:
0x00000000108254aa
0x00000000107f74b9
0x00000000105366cc
0x000000001053682c
0x00000000105d2c2e
0x0000000010629b96
0x0000000010629c31
0x00002a02ebd8272f
0x00000000105d93ee
0x00000000103eff59
0x000000000d9c1d0a
/lib/x86_64-linux-gnu/libc.so.6+0x000000000002409a
0x000000000d833ac9
Segmentation fault
The seastar-addr2line
utility provided by Seastar can be used to map these
addresses to functions. The script expects input on stdin
,
so we need to copy and paste the above addresses, then send EOF by inputting
control-D
in the terminal. One might use echo
or cat
instead`:
$ ../src/seastar/scripts/seastar-addr2line -e bin/crimson-osd
0x00000000108254aa
0x00000000107f74b9
0x00000000105366cc
0x000000001053682c
0x00000000105d2c2e
0x0000000010629b96
0x0000000010629c31
0x00002a02ebd8272f
0x00000000105d93ee
0x00000000103eff59
0x000000000d9c1d0a
0x00000000108254aa
[Backtrace #0]
seastar::backtrace_buffer::append_backtrace() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1136
seastar::print_with_backtrace(seastar::backtrace_buffer&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1157
seastar::print_with_backtrace(char const*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1164
seastar::sigsegv_action() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5119
seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5105
seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5101
?? ??:0
seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5418
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/app-template.cc:173 (discriminator 5)
main at /home/kefu/dev/ceph/build/../src/crimson/osd/main.cc:131 (discriminator 1)
Note that seastar-addr2line
is able to extract addresses from
its input, so you can also paste the log messages as below:
2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr:Backtrace:
2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e78dbc
2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e3e7f0
2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e3e8b8
2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e3e985
2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: /lib64/libpthread.so.0+0x0000000000012dbf
Unlike the classic ceph-osd
, Crimson does not print a human-readable backtrace when it
handles fatal signals like SIGSEGV or SIGABRT. It is also more complicated
with a stripped binary. So instead of planting a signal handler for
those signals into Crimson, we can use script/ceph-debug-docker.sh to map
addresses in the backtrace:
# assuming you are under the source tree of ceph
$ ./src/script/ceph-debug-docker.sh --flavor crimson master:27e237c137c330ebb82627166927b7681b20d0aa centos:8
....
[root@3deb50a8ad51 ~]# wget -q https://raw.githubusercontent.com/scylladb/seastar/master/scripts/seastar-addr2line
[root@3deb50a8ad51 ~]# dnf install -q -y file
[root@3deb50a8ad51 ~]# python3 seastar-addr2line -e /usr/bin/crimson-osd
# paste the backtrace here