DISKPREDICTION PLUGIN

The diskprediction plugin supports two modes: cloud mode and local mode. In cloud mode, disk and Ceph operating status information is collected from the Ceph cluster and sent to a cloud-based DiskPrediction server over the Internet. The DiskPrediction server analyzes the data and returns analytics and predictions of performance and disk health for the Ceph cluster.

Local mode does not require any external server for data analysis. In local mode, the diskprediction plugin uses an internal predictor module to generate the disk prediction and returns the result to the Ceph system.

Enabling

Run one of the following commands to enable the desired diskprediction module in the Ceph environment:

ceph mgr module enable diskprediction_cloud
ceph mgr module enable diskprediction_local
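
To confirm that the intended module is active, you can list the manager modules (grep is used here only to filter the output):

ceph mgr module ls | grep diskprediction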

Select the prediction mode:

ceph config set global device_failure_prediction_mode local

or:

ceph config set global device_failure_prediction_mode cloud

To disable prediction:

ceph config set global device_failure_prediction_mode none
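
To verify which mode is currently active, you can dump the cluster configuration and filter for the option:

ceph config dump | grep device_failure_prediction_mode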

Connection settings

The connection settings are used for the connection between Ceph and the DiskPrediction server.

Local Mode

The diskprediction plugin leverages the Ceph device health check to collect disk health metrics and uses an internal predictor module to produce the disk failure prediction, which is returned to Ceph. No connection settings are therefore required in local mode. The local predictor module requires at least six datasets of device health metrics to make a prediction.
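
To see how many device health metric samples have already been collected for a device, you can dump the metrics stored by the devicehealth module:

ceph device get-health-metrics <device id>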

Run the following command to have the local predictor estimate the device life expectancy:

ceph device predict-life-expectancy <device id>
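
The device id can be obtained by listing the devices known to the cluster, along with the daemons that use them:

ceph device ls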

Cloud Mode

User registration is required in cloud mode. Users have to sign up for an account at https://www.diskprophet.com/#/ to receive the following DiskPrediction server information needed for the connection settings.

Certificate file path: After user registration is confirmed, the system will send a confirmation email including a certificate file download link. Download the certificate file and save it to the Ceph system. The certificate file is passed to the connection setup command below; without it, the connection settings cannot be completed.

DiskPrediction server: The DiskPrediction server name. An IP address may be used instead if required.

Connection account: An account name used to set up the connection between Ceph and the DiskPrediction server.

Connection password: The password used to set up the connection between Ceph and the DiskPrediction server.

Run the following command to complete the connection setup:

ceph device set-cloud-prediction-config <diskprediction_server> <connection_account> <connection_password> <certificate file path>
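
As an illustration only, an invocation might look like the following; the server name, account name, password, and certificate path are placeholder values and must be replaced with the information received during registration:

ceph device set-cloud-prediction-config api.example.com myaccount mypassword /etc/ceph/diskprediction-cert.pem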

You can use the following command to display the connection settings:

ceph device show-prediction-config

Additional optional configuration settings are the following:

diskprediction_upload_metrics_interval:
 How often Ceph performance metrics are sent to the DiskPrediction server. Default is 10 minutes.
diskprediction_upload_smart_interval:
 How often Ceph physical device information is sent to the DiskPrediction server. Default is 12 hours.
diskprediction_retrieve_prediction_interval:
 How often Ceph retrieves physical device prediction data from the DiskPrediction server. Default is 12 hours.
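
These are manager module options; assuming they follow the usual mgr/<module>/<option> key format, they can be adjusted with ceph config set. The key name below and the value unit (seconds) are assumptions and should be checked against the module's option definitions:

ceph config set mgr mgr/diskprediction_cloud/diskprediction_upload_metrics_interval 600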

Diskprediction Data

The diskprediction plugin actively sends/retrieves the following data to/from the DiskPrediction server.

Metrics Data

  • Ceph cluster status
key                    Description
cluster_health         Ceph health check status
num_mon                Number of monitor nodes
num_mon_quorum         Number of monitors in quorum
num_osd                Total number of OSDs
num_osd_up             Number of OSDs that are up
num_osd_in             Number of OSDs that are in the cluster
osd_epoch              Current epoch of the OSD map
osd_bytes              Total capacity of the cluster in bytes
osd_bytes_used         Number of used bytes in the cluster
osd_bytes_avail        Number of available bytes in the cluster
num_pool               Number of pools
num_pg                 Total number of placement groups
num_pg_active_clean    Number of placement groups in active+clean state
num_pg_active          Number of placement groups in active state
num_pg_peering         Number of placement groups in peering state
num_object             Total number of objects in the cluster
num_object_degraded    Number of degraded (missing replicas) objects
num_object_misplaced   Number of misplaced (wrong location in the cluster) objects
num_object_unfound     Number of unfound objects
num_bytes              Total number of bytes of all objects
num_mds_up             Number of MDSs that are up
num_mds_in             Number of MDSs that are in the cluster
num_mds_failed         Number of failed MDSs
mds_epoch              Current epoch of the MDS map
  • Ceph mon/osd performance counts

Mon:

key            Description
num_sessions   Current number of open monitor sessions
session_add    Number of created monitor sessions
session_rm     Number of remove_session calls in the monitor
session_trim   Number of trimmed monitor sessions
num_elections  Number of elections the monitor took part in
election_call  Number of elections started by the monitor
election_win   Number of elections won by the monitor
election_lose  Number of elections lost by the monitor

Osd:

key                    Description
op_wip                 Replication operations currently being processed (primary)
op_in_bytes            Client operations total write size
op_r                   Client read operations
op_out_bytes           Client operations total read size
op_w                   Client write operations
op_latency             Latency of client operations (including queue time)
op_process_latency     Latency of client operations (excluding queue time)
op_r_latency           Latency of read operations (including queue time)
op_r_process_latency   Latency of read operations (excluding queue time)
op_w_in_bytes          Client data written
op_w_latency           Latency of write operations (including queue time)
op_w_process_latency   Latency of write operations (excluding queue time)
op_rw                  Client read-modify-write operations
op_rw_in_bytes         Client read-modify-write operations write in
op_rw_out_bytes        Client read-modify-write operations read out
op_rw_latency          Latency of read-modify-write operations (including queue time)
op_rw_process_latency  Latency of read-modify-write operations (excluding queue time)
  • Ceph pool statistics
key         Description
bytes_used  Per pool bytes used
max_avail   Max available number of bytes in the pool
objects     Number of objects in the pool
wr_bytes    Number of bytes written in the pool
dirty       Number of bytes dirty in the pool
rd_bytes    Number of bytes read in the pool
stored_raw  Bytes used in the pool including copies made
  • Ceph physical device metadata
key             Description
disk_domain_id  Physical device identity ID
disk_name       Device attachment name
disk_wwn        Device WWN
model           Device model name
serial_number   Device serial number
size            Device size
vendor          Device vendor name
  • Correlation information for each Ceph object
  • The plugin agent information
  • The plugin agent cluster information
  • The plugin agent host information

SMART Data

  • Ceph physical device SMART data (provided by Ceph devicehealth plugin)

Prediction Data

  • Ceph physical device prediction data

Receiving predicted health status from a Ceph OSD disk drive

You can receive the predicted health status of a Ceph OSD disk drive by using the following command:

ceph device get-predicted-status <device id>

The get-predicted-status command returns:

{
    "near_failure": "Good",
    "disk_wwn": "5000011111111111",
    "serial_number": "111111111",
    "predicted": "2018-05-30 18:33:12",
    "attachment": "sdb"
}
Attribute      Description
near_failure   The disk failure prediction state: Good/Warning/Bad/Unknown
disk_wwn       Disk WWN number
serial_number  Disk serial number
predicted      Predicted date
attachment     Device name on the local system

The near_failure attribute indicates the predicted disk life expectancy, as shown in the following table.

near_failure  Life expectancy (weeks)
Good          > 6 weeks
Warning       2 weeks ~ 6 weeks
Bad           < 2 weeks
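
To check the prediction for every device in the cluster at once, you can loop over the device list. This is a minimal sketch that assumes jq is available and that ceph device ls --format json reports each device's devid field:

for dev in $(ceph device ls --format json | jq -r '.[].devid'); do
    ceph device get-predicted-status "$dev"
done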

Debugging

The DiskPrediction module's debug output follows the Ceph manager logging level. To enable debug logging, add the following to the Ceph configuration file:

[mgr]
    debug mgr = 20

With logging set to debug for the manager, the plugin prints log messages with the prefix mgr[diskprediction] for easy filtering.
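
For example, assuming the default log location, the plugin's messages can be pulled out of the manager log like this:

grep 'mgr\[diskprediction\]' /var/log/ceph/ceph-mgr.*.log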