ceph-mgr orchestrator modules¶
Warning
This is developer documentation, describing Ceph internals that are only relevant to people writing ceph-mgr orchestrator modules.
In this context, orchestrator refers to some external service that provides the ability to discover devices and create Ceph services. This includes external projects such as ceph-ansible, DeepSea, and Rook.
An orchestrator module is a ceph-mgr module (ceph-mgr module developer’s guide) which implements common management operations using a particular orchestrator.
Orchestrator modules subclass the Orchestrator class. This class is an interface: it only provides method definitions to be implemented by subclasses. The purpose of defining this common interface for different orchestrators is to enable common UI code, such as the dashboard, to work with various different backends.
Behind all the abstraction, the purpose of orchestrator modules is simple: enable Ceph to do things like discover available hardware, create and destroy OSDs, and run MDS and RGW services.
A tutorial is not included here: for full and concrete examples, see the existing implemented orchestrator modules in the Ceph source tree.
Glossary¶
- Stateful service
a daemon that uses local storage, such as OSD or mon.
- Stateless service
a daemon that doesn’t use any local storage, such as an MDS, RGW, nfs-ganesha, iSCSI gateway.
- Label
arbitrary string tags that may be applied by administrators to nodes. Typically administrators use labels to indicate which nodes should run which kinds of service. Labels are advisory (from human input) and do not guarantee that nodes have particular physical capabilities.
- Drive group
collection of block devices with common/shared OSD formatting (typically one or more SSDs acting as journals/dbs for a group of HDDs).
- Placement
choice of which node is used to run a service.
Key Concepts¶
The underlying orchestrator remains the source of truth for information about whether a service is running, what is running where, which nodes are available, etc. Orchestrator modules should avoid taking any internal copies of this information, and read it directly from the orchestrator backend as much as possible.
Bootstrapping nodes and adding them to the underlying orchestration system is outside the scope of Ceph’s orchestrator interface. Ceph can only work on nodes when the orchestrator is already aware of them.
Calls to orchestrator modules are all asynchronous, and return completion objects (see below) rather than returning values immediately.
Where possible, placement of stateless services should be left up to the orchestrator.
Completions and batching¶
All methods that read or modify the state of the system can potentially be long running. To handle that, all such methods return a completion object (a ReadCompletion or a WriteCompletion). Orchestrator modules must implement the wait method: this takes a list of completions, and is responsible for checking if they’re finished, and advancing the underlying operations as needed.
Each orchestrator module implements its own underlying mechanisms for completions. This might involve running the underlying operations in threads, or batching the operations up before later executing in one go in the background. If implementing such a batching pattern, the module would do no work on any operation until it appeared in a list of completions passed into wait.
WriteCompletion objects have a two-stage execution. First they become persistent, meaning that the write has made it to the orchestrator itself, and been persisted there (e.g. a manifest file has been updated). If ceph-mgr crashed at this point, the operation would still eventually take effect. Second, the completion becomes effective, meaning that the operation has really happened (e.g. a service has actually been started).
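The two-stage behaviour and the batching pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code of any existing orchestrator module: `SketchWriteCompletion`, its `advance()` method, and the `persist` / `check_effective` callables are hypothetical names, and the free function `wait` only mimics the role of `Orchestrator.wait`.

```python
from typing import Callable, List, Optional


class SketchWriteCompletion:
    """Hypothetical stand-in for a module's own WriteCompletion subclass."""

    def __init__(self, persist: Callable[[], None],
                 check_effective: Callable[[], bool]):
        self._persist = persist                  # pushes the write to the orchestrator
        self._check_effective = check_effective  # polls whether it has taken effect
        self.is_persistent = False
        self.is_effective = False
        self.exception: Optional[Exception] = None

    @property
    def is_errored(self) -> bool:
        return self.exception is not None

    @property
    def should_wait(self) -> bool:
        # A write needs further waiting only while it is not yet persistent.
        return not self.is_persistent and not self.is_errored

    def advance(self) -> None:
        """Do one step of work. Nothing happens until wait() calls this."""
        try:
            if not self.is_persistent:
                self._persist()
                self.is_persistent = True
            elif not self.is_effective:
                self.is_effective = self._check_effective()
        except Exception as e:
            # Record the failure on the completion instead of raising.
            self.exception = e


def wait(completions: List[SketchWriteCompletion]) -> bool:
    """Progress incomplete completions; return True if none still need waiting."""
    for c in completions:
        if not c.is_errored and not c.is_effective:
            c.advance()
    return not any(c.should_wait for c in completions)
```

In this sketch the first call to `wait` makes a completion persistent, and later calls poll until it becomes effective. A real module would more likely run the underlying operations in threads or submit a whole batch at once.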
- Orchestrator.wait(completions)
Given a list of Completion instances, progress any that are incomplete. Return True if everything is done.
Callers should inspect the detail of each completion to identify partial completion/progress information, and present that information to the user.
For fast operations (e.g. reading from a database), implementations may choose to do blocking IO in this call.
- Return type
bool
- class orchestrator._Completion
- property exception
Holds an exception object.
- property is_errored
Has the completion failed? The default implementation looks for self.exception. Can be overridden.
- property result
Returns the result of the operation that we waited for. Only valid after calling Orchestrator.wait() on this completion.
- class orchestrator.ReadCompletion
Orchestrator implementations should inherit from this class to implement their own handles to operations in progress, and return an instance of their subclass from calls into methods.
- property should_wait
Could the external operation be deemed as complete, or should we wait? We must wait for a read operation only if it is not complete.
- class orchestrator.WriteCompletion
Orchestrator implementations should inherit from this class to implement their own handles to operations in progress, and return an instance of their subclass from calls into methods.
- property is_effective
Has the operation taken effect on the cluster? For example, if we were adding a service, has it come up and appeared in Ceph’s cluster maps?
- property is_persistent
Has the operation updated the orchestrator’s configuration persistently? Typically this would indicate that an update had been written to a manifest, but that the update had not necessarily been pushed out to the cluster.
- progress = None
If an orchestrator module can provide more detailed progress information, it also needs to call progress.update().
- property should_wait
Could the external operation be deemed as complete, or should we wait? We must wait for a write operation only if we know it is not persistent yet.
Placement¶
In general, stateless services do not require any specific placement rules, as they can run anywhere that sufficient system resources are available. However, some orchestrators may not include the functionality to choose a location in this way, so we can optionally specify a location when creating a stateless service.
OSD services generally require a specific placement choice, as this will determine which storage devices are used.
Error Handling¶
The main goal of error handling within orchestrator modules is to provide debug information to assist users when dealing with deployment errors.
- class orchestrator.OrchestratorError
General orchestrator-specific error.
Used for deployment, configuration or user errors.
It’s not intended for programming errors or orchestrator internal errors.
- class orchestrator.NoOrchestrator(msg='No orchestrator configured (try `ceph orchestrator set backend`)')
No orchestrator is configured.
- class orchestrator.OrchestratorValidationError
Raised when an orchestrator doesn’t support a specific feature.
In detail, orchestrators need to explicitly deal with different kinds of errors:
1. No orchestrator configured.
See NoOrchestrator.
2. An orchestrator doesn’t implement a specific method.
For example, an Orchestrator doesn’t support add_host. In this case, a NotImplementedError is raised.
3. Missing features within implemented methods.
E.g. optional parameters to a command that are not supported by the backend (e.g. the hosts field in the Orchestrator.update_mons() command with the rook backend).
4. Input validation errors.
The orchestrator_cli module and other calling modules are supposed to provide meaningful error messages.
5. Errors when actually executing commands.
The resulting Completion should contain an error string that assists in understanding the problem. In addition, _Completion.is_errored() is set to True.
6. Invalid configuration in the orchestrator modules.
This can be tackled similarly to 5.
All other errors are unexpected orchestrator issues and thus should raise an exception that is then logged into the mgr log file. If there is a completion object at that point, _Completion.result() may contain an error message.
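As a rough illustration of how a caller might tell some of these error kinds apart, consider the sketch below. It is self-contained and does not import the real orchestrator module: the two exception classes are local stand-ins that only reuse the documented names, their inheritance relationship is an assumption, and `SketchBackend`, its `update_mons(num, hosts)` signature, and `describe_failure` are hypothetical.

```python
class OrchestratorError(Exception):
    """Local stand-in that mirrors the documented class name."""


class OrchestratorValidationError(OrchestratorError):
    """Local stand-in; assumed here to derive from OrchestratorError."""


class SketchBackend:
    """Hypothetical backend that cannot place mons on explicit hosts."""

    def add_host(self, host):
        # The backend does not implement this method at all.
        raise NotImplementedError()

    def update_mons(self, num, hosts=None):
        # The method exists, but an optional parameter is unsupported.
        if hosts:
            raise OrchestratorValidationError(
                "hosts is not supported by this backend")
        # Input validation with a meaningful message for the user.
        if num < 1:
            raise OrchestratorValidationError(
                "num must be at least 1, got {}".format(num))
        return "scheduled {} mons".format(num)


def describe_failure(call):
    """Run call() and turn the expected error kinds into a user-facing string."""
    try:
        return call()
    except NotImplementedError:
        return "not implemented by the configured orchestrator"
    except OrchestratorError as e:
        return "error: {}".format(e)
```

Any other exception is deliberately left uncaught here, matching the guidance that unexpected issues should surface as exceptions and end up in the mgr log.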
Excluded functionality¶
Ceph’s orchestrator interface is not a general purpose framework for managing Linux servers: it is deliberately constrained to manage the Ceph cluster’s services only.
Multipathed storage is not handled (multipathing is unnecessary for Ceph clusters). Each drive is assumed to be visible only on a single node.
Host management¶
- Orchestrator.add_host(host)
Add a host to the orchestrator inventory.
- Parameters
host – hostname
- Orchestrator.remove_host(host)
Remove a host from the orchestrator inventory.
- Parameters
host – hostname
- Orchestrator.get_hosts()
Report the hosts in the cluster.
The default implementation is extra slow.
- Returns
list of InventoryNodes
Inventory and status¶
- Orchestrator.get_inventory(node_filter=None, refresh=False)
Returns the data produced by ceph-volume inventory.
- Returns
list of InventoryNode
- class orchestrator.InventoryFilter(labels=None, nodes=None)
When fetching inventory, use this filter to avoid unnecessarily scanning the whole estate.
Typical use:
- filter by node when presenting a UI workflow for configuring a particular server.
- filter by label when not all of the estate is Ceph servers, and we want to learn only about the Ceph servers.
- filter by label when we are particularly interested in e.g. OSD servers.
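A backend might apply such a filter along the following lines. This is only a sketch under stated assumptions: `SketchInventoryFilter` is a local stand-in with the same two fields, `select_nodes` and the `all_nodes` mapping (node name to a set of labels) are hypothetical, and treating a node as matching when it carries any of the requested labels is an assumption rather than documented behaviour.

```python
class SketchInventoryFilter:
    """Local stand-in with the same fields as InventoryFilter."""

    def __init__(self, labels=None, nodes=None):
        self.labels = labels  # list of label strings, or None for "any"
        self.nodes = nodes    # list of node names, or None for "any"


def select_nodes(all_nodes, node_filter=None):
    """Return the sorted node names that pass the filter.

    all_nodes is a hypothetical dict mapping node name -> set of labels.
    """
    if node_filter is None:
        return sorted(all_nodes)
    selected = []
    for name, labels in all_nodes.items():
        if node_filter.nodes is not None and name not in node_filter.nodes:
            continue
        # Assumption: a node matches if it has at least one requested label.
        if node_filter.labels is not None and not set(node_filter.labels) & set(labels):
            continue
        selected.append(name)
    return sorted(selected)
```

Filtering before the scan, rather than after it, is what avoids touching servers that are not of interest.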
- class orchestrator.InventoryNode(name, devices)
When fetching inventory, all Devices are grouped inside of an InventoryNode.
- class orchestrator.InventoryDevice(blank=False, type=None, id=None, size=None, rotates=False, available=False, dev_id=None, extended=None, metadata_space_free=None)
When fetching inventory, block devices are reported in this format.
Note on device identifiers: the format of this is up to the orchestrator, but the same identifier must also work when passed into StatefulServiceSpec. The identifier should be something meaningful like a device WWID or stable device node path – not something made up by the orchestrator.
“Extended” is for reporting any special configuration that may have already been done out of band on the block device. For example, if the device has already been configured for encryption, report that here so that it can be indicated to the user. The set of extended properties may differ between orchestrators. An orchestrator is permitted to support no extended properties (only normal block devices).
- available = None
can be used to create a new OSD?
- dev_id = None
vendor/model
- extended = None
arbitrary JSON-serializable object
- id = None
unique within a node (or globally if you like).
- pretty_print(only_header=False)
Print a human friendly line with the information of the device
- Parameters
only_header – Print only the name of the device attributes
Ex:
Device Path  Type  Size      Rotates  Available  Model
/dev/sdc     hdd   50.00 GB  True     True       ATA/QEMU
- rotates = None
indicates if it is a spinning disk
- size = None
byte integer.
- type = None
‘ssd’, ‘hdd’, ‘nvme’
Orchestrator.
describe_service
(service_type=None, service_id=None, node_name=None, refresh=False)¶ Describe a service (of any kind) that is already configured in the orchestrator. For example, when viewing an OSD in the dashboard we might like to also display information about the orchestrator’s view of the service (like the kubernetes pod ID).
When viewing a CephFS filesystem in the dashboard, we would use this to display the pods being currently run for MDS daemons.
- Returns
list of ServiceDescription objects.
- class orchestrator.ServiceDescription(nodename=None, container_id=None, service=None, service_instance=None, service_type=None, version=None, rados_config_location=None, service_url=None, status=None, status_desc=None)
For responding to queries about the status of a particular service, stateful or stateless.
This is not about health or performance monitoring of services: it’s about letting the orchestrator tell Ceph whether and where a service is scheduled in the cluster. When an orchestrator tells Ceph “it’s running on node123”, that’s not a promise that the process is literally up this second, it’s a description of where the orchestrator has decided the service should run.
Service Actions¶
- Orchestrator.service_action(action, service_type, service_name=None, service_id=None)
Perform an action (start/stop/reload) on a service.
Either service_name or service_id must be specified:
- If using service_name, perform the action on that entire logical service (i.e. all daemons providing that named service).
- If using service_id, perform the action on a single specific daemon instance.
- Parameters
action – one of “start”, “stop”, “reload”
service_type – e.g. “mds”, “rgw”, …
service_name – name of logical service (“cephfs”, “us-east”, …)
service_id – service daemon instance (usually a short hostname)
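The argument rules for service_action can be sketched as a small validation helper. This is illustrative only: `resolve_target` is a hypothetical function, the returned tuples are an invented representation, and giving service_name precedence when both identifiers are passed is an assumption that the interface description does not settle.

```python
VALID_ACTIONS = ("start", "stop", "reload")


def resolve_target(action, service_type, service_name=None, service_id=None):
    """Validate service_action-style arguments and say what would be acted on.

    Returns ("service", type, name) for a whole logical service, or
    ("daemon", type, id) for a single daemon instance.
    """
    if action not in VALID_ACTIONS:
        raise ValueError("unknown action: {!r}".format(action))
    if service_name is None and service_id is None:
        raise ValueError("either service_name or service_id must be specified")
    if service_name is not None:
        # Assumption: service_name wins if both identifiers are given.
        return ("service", service_type, service_name)
    return ("daemon", service_type, service_id)
```

For example, `resolve_target("stop", "mds", service_name="cephfs")` targets every MDS daemon of that filesystem, while passing `service_id` instead targets one daemon.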
OSD management¶
- Orchestrator.create_osds(drive_group, all_hosts)
Create one or more OSDs within a single Drive Group.
The principal argument here is the drive_group member of OsdSpec: other fields are advisory/extensible for any finer-grained OSD feature enablement (choice of backing store, compression/encryption, etc).
- Parameters
drive_group – DriveGroupSpec
all_hosts – TODO, this is required because the orchestrator methods are not composable. Probably this parameter can be easily removed, because each orchestrator can use the “get_inventory” method and the “drive_group.host_pattern” attribute to obtain the list of hosts on which to apply the operation.
- Orchestrator.replace_osds(drive_group)
Like create_osds, but the osd_id_claims must be fully populated.
- Orchestrator.remove_osds(osd_ids)
- Parameters
osd_ids – list of OSD IDs
Note that this can only remove OSDs that were successfully created (i.e. got an OSD ID).
- class orchestrator.DeviceSelection(paths=None, id_model=None, size=None, rotates=None, count=None)
Used within myclass.DriveGroupSpec to specify the devices used by the Drive Group.
Any attributes (even none) can be included in the device specification structure.
- count = None
If this is present, limit the number of drives to this number.
- id_model = None
A wildcard string. e.g: “SDD*”
- paths = None
List of absolute paths to the devices.
- rotates = None
is the drive rotating or not
- size = None
Size specification of format LOW:HIGH. Can also take the form :HIGH, LOW: or an exact value (as ceph-volume inventory reports)
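To make the four accepted shapes of the size specification concrete, here is a small parsing sketch. `parse_size_spec` is a hypothetical helper and not part of the orchestrator interface; it leaves the size values as uninterpreted strings, and the “10.00 GB” style values in the comments are placeholders modelled on the pretty_print example.

```python
def parse_size_spec(spec):
    """Split a size specification into a (low, high) pair of strings.

    None in either position means "no bound on that side".
    """
    if ":" not in spec:
        # An exact value: both bounds are the same.
        return (spec, spec)
    low, high = spec.split(":", 1)
    return (low.strip() or None, high.strip() or None)


# The four shapes named in the description:
#   "10.00 GB:40.00 GB"  LOW:HIGH
#   ":40.00 GB"          :HIGH
#   "10.00 GB:"          LOW:
#   "50.00 GB"           exact value
```

Converting the strings into byte counts for comparison is left out, since the exact unit syntax depends on what ceph-volume inventory reports.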
class
orchestrator.
DriveGroupSpec
(host_pattern, data_devices=None, db_devices=None, wal_devices=None, journal_devices=None, data_directories=None, osds_per_device=None, objectstore='bluestore', encrypted=False, db_slots=None, wal_slots=None)¶ Describe a drive group in the same form that ceph-volume understands.
-
data_devices
= None¶
-
data_directories
= None¶ A list of strings, containing paths which should back OSDs
-
db_devices
= None¶
-
db_slots
= None¶ How many OSDs per DB device
-
encrypted
= None¶ true
orfalse
-
host_pattern
= None¶ An fnmatch pattern to select hosts. Can also be a single host.
-
journal_devices
= None¶
-
objectstore
= None¶ filestore
orbluestore
-
osd_id_claims
= None¶ Optional: mapping of drive to OSD ID, used when the created OSDs are meant to replace previous OSDs on the same node.
-
osds_per_device
= None¶ Number of osd daemons per “DATA” device. To fully utilize nvme devices multiple osds are required.
-
wal_devices
= None¶
-
wal_slots
= None¶ How many OSDs per WAL device
-
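Since host_pattern is described as an fnmatch pattern, host selection could look roughly like the sketch below, which uses Python’s standard fnmatch module. `match_hosts` and the host names are hypothetical, and the choice of the case-sensitive `fnmatchcase` is an assumption made so the result does not depend on the platform.

```python
import fnmatch


def match_hosts(host_pattern, hostnames):
    """Return the hostnames selected by an fnmatch-style pattern, in order."""
    # fnmatchcase compares case-sensitively on every platform.
    return [h for h in hostnames if fnmatch.fnmatchcase(h, host_pattern)]


hosts = ["osd-1", "osd-2", "mon-1"]
osd_hosts = match_hosts("osd-*", hosts)   # wildcard pattern
single = match_hosts("mon-1", hosts)      # a plain hostname also works
```

A pattern without wildcard characters matches only the identical name, which is why a single host can be given in the same field.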
Stateless Services¶
- Orchestrator.add_stateless_service(service_type, spec)
Installs and adds a completely new service to the cluster.
This is not about starting services.
- Orchestrator.update_stateless_service(service_type, spec)
This is about changing or redeploying existing services, for example changing the number of service instances.
- Orchestrator.remove_stateless_service(service_type, id_)
Uninstalls an existing service from the cluster.
This is not about stopping services.
Upgrades¶
- Orchestrator.upgrade_available()
Report on what versions are available to upgrade to.
- Returns
List of strings
- Orchestrator.upgrade_start(upgrade_spec)
- Orchestrator.upgrade_status()
If an upgrade is currently underway, report on where we are in the process, or if some error has occurred.
- Returns
UpgradeStatusSpec instance
- class orchestrator.UpgradeSpec
- class orchestrator.UpgradeStatusSpec
Utility¶
- Orchestrator.available()
Report whether we can talk to the orchestrator. This is the place to give the user a meaningful message if the orchestrator isn’t running or can’t be contacted.
This method may be called frequently (e.g. every page load to conditionally display a warning banner), so make sure it’s not too expensive. It’s okay to give a slightly stale status (e.g. based on a periodic background ping of the orchestrator) if that’s necessary to make this method fast.
Note: True doesn’t mean that the desired functionality is actually available in the orchestrator. I.e. this won’t work as expected:
>>> if OrchestratorClientMixin().available()[0]:  # wrong.
...     OrchestratorClientMixin().get_hosts()
- Returns
two-tuple of boolean, string
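One way to keep available() cheap while tolerating a slightly stale answer is to cache the last status, as in the sketch below. This is not taken from any existing module: `SketchStatusCache`, the `ping` callable, `max_age` and the injectable `clock` are all hypothetical. The sketch refreshes lazily when the cached value is too old; a real module might instead call `refresh()` from a periodic background thread, as the text suggests.

```python
import time


class SketchStatusCache:
    """Caches the (bool, str) result that available() would return."""

    def __init__(self, ping, max_age=30.0, clock=time.monotonic):
        self._ping = ping        # callable that raises if the orchestrator is unreachable
        self._max_age = max_age  # seconds a cached status stays acceptable
        self._clock = clock
        self._cached = None
        self._stamp = None

    def refresh(self):
        """Contact the orchestrator once and remember the outcome."""
        try:
            self._ping()
            self._cached = (True, "")
        except Exception as e:
            self._cached = (False, "orchestrator unreachable: {}".format(e))
        self._stamp = self._clock()

    def available(self):
        """Return the cached two-tuple, refreshing only if it is missing or stale."""
        if self._stamp is None or self._clock() - self._stamp > self._max_age:
            self.refresh()
        return self._cached
```

The string in the second position is where a meaningful message for the user would go when the orchestrator cannot be contacted.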
Client Modules¶
- class orchestrator.OrchestratorClientMixin
A module that inherits from OrchestratorClientMixin can directly call all Orchestrator methods without manually calling remote.
Every interface method from Orchestrator is converted into a stub method that internally calls OrchestratorClientMixin._oremote()
>>> class MyModule(OrchestratorClientMixin):
...     def func(self):
...         completion = self.add_host('somehost')  # calls `_oremote()`
...         self._orchestrator_wait([completion])
...         self.log.debug(completion.result)
- set_mgr(mgr)
Usable in the Dashboard, which uses a global mgr.