What to Monitor in a Ceph Cluster


Once you have a running Ceph cluster, you must keep it running by monitoring it, troubleshooting issues, and profiling its CPU and memory usage. In this article I’ll describe some useful standard operating system tools and Ceph’s built-in functions that can be used to diagnose issues and handle common errors.

Ceph Health

Ceph monitoring and troubleshooting starts with the ‘ceph health’ command. You can execute this command on any node of a Ceph cluster, or even on a standalone machine that is connected to the cluster. In the latter case, you need a valid Ceph configuration file (usually /etc/ceph/ceph.conf) and, of course, a Ceph client installed on that machine. A common use case:

$ ceph health

The ‘HEALTH_OK’ status means that your Ceph cluster is up and running, and that Ceph’s built-in self-diagnosis has detected no issues. The other possible statuses are ‘HEALTH_WARN’ and ‘HEALTH_ERR’. The ‘ceph health detail’ command shows more detailed information about your cluster’s health.
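If you want to feed the health status into an external monitoring system, a small wrapper can translate it into a numeric code. The following is only a sketch: the function name ‘health_to_code’ and the 0/1/2/3 mapping (a Nagios-style convention) are assumptions made for illustration, not something Ceph itself defines.

```shell
# Read `ceph health` output on stdin and print a numeric code for its
# leading status word. The 0/1/2/3 mapping is an assumed Nagios-style
# convention (OK / WARNING / CRITICAL / UNKNOWN), not part of Ceph.
health_to_code() {
    status=""
    read -r status _rest || true   # first word is the status, the rest is detail
    case "$status" in
        HEALTH_OK)   echo 0 ;;
        HEALTH_WARN) echo 1 ;;
        HEALTH_ERR)  echo 2 ;;
        *)           echo 3 ;;     # empty or unrecognized output
    esac
}
```

A typical invocation would be ‘ceph health | health_to_code’.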

Monitoring OSDs

The ‘ceph’ tool provides additional monitoring commands that include:

  • ‘ceph osd stat’ (shows the statuses of the Ceph object-based storage devices, or OSDs)
  • ‘ceph mon stat’ (shows the Ceph monitors’ statuses)
  • ‘ceph quorum_status’ (shows the quorum status)

An OSD typically represents a single disk. Although Ceph maintains redundant copies of objects across OSDs to provide data resiliency, it is still important to detect when one or more OSDs are not functional and to find out why. The ‘ceph osd stat’ command shows the status of the OSDs in the cluster. Each OSD has one of two statuses:

  • “up” (the OSD daemon is running and responsive)
  • “down” (the OSD daemon is either stopped or not responsive).

The OSD sends heartbeats to other OSDs and to the Ceph Monitors. If the OSD is marked as “down”, then it simply means that other OSDs or the Ceph Monitors have not received answers to their heartbeats from that specific OSD.

Therefore, the first step for such an OSD is to find out why it is marked as “down” (the node is down, the OSD daemon is not running, and so on). At the same time, each OSD also has one of these statuses:

  • “in” (OSD participates in data placement)
  • “out” (OSD does not participate in data placement)

The OSD can be “down” but still “in”.

There is a configurable delay between the time that an OSD is marked as “down” and the time it is marked as “out”. This delay avoids unnecessary rebalancing while the OSD is experiencing a short-lived failure. The delay is 5 minutes by default, but you can change it in the configuration file (‘mon osd down out interval’).

You can also disable or enable the “auto out” function cluster-wide, using the commands ‘ceph osd set noout’ and ‘ceph osd unset noout’. If you know that an OSD which is marked as “down” will never be functional again, for example, due to an unrecoverable disk error, you can mark it as “out” by executing the command ‘ceph osd out OSD_ID’. Ceph will immediately start a recovery operation.
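For planned maintenance, the ‘noout’ flag is usually set before the work and cleared afterwards. The following sketch wraps an arbitrary maintenance command between the two calls. The function name ‘with_noout’ and the ‘CEPH’ variable (an override that allows a dry run with ‘CEPH=echo’) are hypothetical helpers introduced here for illustration.

```shell
# Hypothetical override: set CEPH=echo to print the ceph calls instead of running them.
CEPH="${CEPH:-ceph}"

# Run a maintenance command with the cluster-wide "noout" flag set,
# and clear the flag afterwards even if the command fails.
with_noout() {
    "$CEPH" osd set noout || return 1
    rc=0
    "$@" || rc=$?               # remember the maintenance command's exit status
    "$CEPH" osd unset noout     # always clear the flag again
    return "$rc"
}
```

For example, ‘with_noout sleep 60’ would keep “down” OSDs from being marked “out” for one minute. The real maintenance command is whatever you need to run on the node.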

Monitoring Capacity

It is very important to monitor your cluster for available storage capacity. When your cluster gets close to its maximum capacity (set by the ‘mon osd full ratio’ parameter, 95% by default), Ceph stops accepting write requests. The ‘mon osd nearfull ratio’ parameter (85% by default) sets the threshold at which Ceph reports a corresponding warning.

To check data usage and data distribution among pools, you can use the ‘ceph df’ command. The ‘GLOBAL’ section of the output contains the overall storage capacity of the cluster, the amount of free space available in the cluster, the amount of raw storage used (including replicas, snapshots, and clones), and the percentage of raw storage used. The ‘POOLS’ section contains a list of pools and the storage usage of each pool. Note that the storage usage for pools does not contain storage used by replicas, snapshots, or clones:

$ ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    82799M     78594M        4204M          5.08
POOLS:
    NAME             ID     USED       %USED     MAX AVAIL     OBJECTS
    data             0         0           0        39294M           0
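As a rough illustration of how this output could be checked automatically, the sketch below extracts the ‘%RAW USED’ value and compares it with the default nearfull and full percentages mentioned earlier. The function names are hypothetical, the parsing assumes the output layout shown above, and the comparison is only an approximation, because Ceph applies these ratios to individual OSDs rather than to the cluster-wide figure.

```shell
# Print the %RAW USED value from `ceph df` output read on stdin.
# Assumes the value is the 4th field of the line following the header row.
raw_used_percent() {
    awk '/%RAW USED/ { if ((getline) > 0) print $4; exit }'
}

# Classify a usage percentage ($1) against nearfull ($2, default 85)
# and full ($3, default 95) thresholds.
usage_state() {
    awk -v used="$1" -v near="${2:-85}" -v full="${3:-95}" 'BEGIN {
        if (used == "")                print "unknown"
        else if (used + 0 >= full + 0) print "full"
        else if (used + 0 >= near + 0) print "nearfull"
        else                           print "ok"
    }'
}
```

The two can be combined as ‘usage_state "$(ceph df | raw_used_percent)"’.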

The ‘ceph osd df’ command allows you to see disk utilizations per OSD:

$ ceph osd df
ID WEIGHT  REWEIGHT SIZE USE  AVAIL %USE  VAR
0  0.90999 1.00000  931G 545G 385G  58.64 0.98
10 0.90999 1.00000  931G 725G 205G  77.95 1.31
11 0.90999 1.00000  931G 432G 498G  46.47 0.78

If some OSDs are near full and others are not, then you may have a problem with the CRUSH (Controlled Replication Under Scalable Hashing) weight of some OSDs. Uneven data distribution across OSDs is one of the most common Ceph problems.
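One way to spot uneven distribution is to look for OSDs whose utilization differs noticeably from the cluster average. The sketch below assumes the eight-column rows shown above, with the utilization percentage in field 7 and the variance relative to the average in field 8. Other Ceph versions may print additional columns, in which case the field numbers would need adjusting. The function name and the default tolerance of 0.20 are arbitrary choices for illustration.

```shell
# Read `ceph osd df` rows on stdin and print the OSDs whose variance
# (field 8) differs from 1.00 by more than the tolerance in $1 (default 0.20).
# Lines that do not start with a numeric OSD ID are ignored.
unbalanced_osds() {
    awk -v tol="${1:-0.20}" '
        $1 ~ /^[0-9]+$/ && NF >= 8 {
            dev = $8 - 1
            if (dev < 0) dev = -dev
            if (dev > tol + 0) print "osd." $1, $7 "%", "var=" $8
        }'
}
```

Run against the rows above (‘ceph osd df | unbalanced_osds’), this would report osd.10 as over-used and osd.11 as under-used.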

Standard system tools

Standard GNU/Linux system tools can be used to monitor Ceph nodes or troubleshoot a specific node. First of all, check the overall health of the node. For example, check the ‘dmesg’ output for any suspicious messages. For long-term operations it is a good idea to use a monitoring tool, such as Zabbix, to monitor the overall health of the Ceph nodes. Because Ceph is sensitive to clock drift, make sure that the ntp daemon is running on each Ceph node and that there are no issues with time synchronization.

Use standard tools, such as hdparm or smartmontools, to check disk health. The smartmontools package contains a daemon, smartd, which can be configured to send warnings of disk degradation and failures.

Other metrics that you may want to monitor for each node in the cluster include:

  • CPU and memory utilization (using Zabbix or atop)
  • Number of dropped RX/TX packets on the node’s NICs (using ifconfig; a large number of dropped packets may mean that you need to increase the corresponding queue size)
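
For scripted checks, the drop counters can also be read from /proc/net/dev, which is easier to parse than ifconfig output. Note that this swaps in a different data source than the ifconfig command mentioned above. The sketch assumes the standard Linux layout of that file (two header lines, then eight receive and eight transmit counters per interface), and the function name is hypothetical.

```shell
# Read /proc/net/dev-format text on stdin and print the RX and TX drop
# counters for each interface.
nic_drops() {
    awk 'NR > 2 {
        sub(/:/, " ")   # the name can be glued to the first counter ("eth0:1234")
        # After the name: 8 receive counters (drop is the 4th),
        # then 8 transmit counters (drop is the 4th).
        print $1, "rx_drop=" $5, "tx_drop=" $13
    }'
}
```

On a Ceph node this would be run as ‘nic_drops < /proc/net/dev’.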

Ceph admin socket

Each Ceph daemon, such as a Ceph OSD, Ceph Monitor, or Ceph Metadata Server, reads its configuration from a corresponding section in the Ceph configuration file (/etc/ceph/ceph.conf). Sometimes you may need to see the actual configuration of a specific daemon or even change it. The Ceph admin socket allows you to show and set the configuration at runtime. To get a daemon’s configuration at runtime via the admin socket, log in to the node running the daemon and execute the following command:

$ ceph daemon osd.0 config show

Because the ‘ceph daemon’ command talks to the daemon directly through the admin socket, you can also use it to change the configuration at runtime without running Ceph Monitors. For example, to set the log level to ‘0/5’ (read more about setting Ceph log levels) for a specific Ceph OSD, execute the following command on the node where that OSD is running:

$ sudo ceph daemon osd.0 config set debug_osd 0/5

You can use the same admin socket of a Ceph OSD to view:

  • performance counters (‘perf dump’)
  • recent slow operations (‘dump_historic_ops’)
  • current operations (‘dump_ops_in_flight’)
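
When investigating a slow OSD, it can be handy to capture all three views in one go. The following is a minimal sketch that runs the three admin socket commands listed above for one daemon. The function name ‘osd_snapshot’ and the ‘CEPH’ override variable are hypothetical, and the function has to run on the node that hosts the daemon.

```shell
# Hypothetical override: set CEPH=echo to print the ceph calls instead of running them.
CEPH="${CEPH:-ceph}"

# Print the output of the three admin socket views for one daemon (e.g. osd.0),
# each preceded by a separator line.
osd_snapshot() {
    daemon="$1"
    for cmd in "perf dump" "dump_historic_ops" "dump_ops_in_flight"; do
        echo "== $cmd =="
        # $cmd is intentionally unquoted so that "perf dump" becomes two arguments.
        "$CEPH" daemon "$daemon" $cmd
    done
}
```

For example, ‘osd_snapshot osd.0 > osd.0-snapshot.txt’ would save the three outputs to a file for later comparison.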

Ceph logging

The default location of the Ceph log files is /var/log/ceph, and as expected the logs contain a historical record of events. The ‘ceph -w’ command displays the current status of the cluster and major events in real time. Running this command can be very useful when troubleshooting a specific operation, or if you have issues when starting your cluster.

You can enable Ceph debug logging in the Ceph configuration file, or temporarily enable it for a specific component at runtime. Note that logging is resource intensive, so enabling debug logging permanently for a production cluster is not a good idea; it will slow down your system and can generate a significant amount of data (gigabytes per hour). Instead, if you are encountering a problem with a specific component of your cluster, enable logging for that component temporarily. You can use the admin socket (see the ‘Ceph admin socket’ section above), but you need to log in to the corresponding node to do this. To set a log level for a specific component from any Ceph node, you can use the ‘ceph tell’ command with ‘injectargs’ (you will need a running Ceph Monitor):

$ ceph tell osd.0 injectargs --debug-osd 0/5

Ceph scrubbing

Each Ceph OSD is responsible for checking its data integrity via a periodic operation called scrubbing. Light scrubbing usually runs daily and checks the object size and attributes. Deep scrubbing usually runs weekly; it reads the data, then recalculates and verifies checksums. Scrubbing is needed for data integrity, but it can also reduce the performance of the Ceph cluster, so it is important to know when light and deep scrubbing run and to adjust the corresponding parameters. In some cases, you may want to run scrubbing manually. The following commands schedule light and deep scrubbing, respectively:

$ ceph pg scrub PG_ID
$ ceph pg deep-scrub PG_ID
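
If you need to schedule a deep scrub for several placement groups, the command above can be wrapped in a loop. This is only a sketch: the function name and the ‘CEPH’ override variable are hypothetical, and it expects a list of PG IDs, one per line, on standard input, from whatever source you have.

```shell
# Hypothetical override: set CEPH=echo to print the ceph calls instead of running them.
CEPH="${CEPH:-ceph}"

# Schedule a deep scrub for every PG ID read from stdin (one per line).
deep_scrub_pgs() {
    while read -r pgid; do
        [ -n "$pgid" ] || continue                  # skip blank lines
        "$CEPH" pg deep-scrub "$pgid" </dev/null    # keep the loop's stdin untouched
    done
}
```

For example, ‘deep_scrub_pgs < pg_list.txt’, where pg_list.txt is a file you have prepared.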

In some cases, for example when you add new OSDs to the cluster or mark some OSDs as out for maintenance, you may want to temporarily disable scrubbing to make recovery or backfill (moving data to a new OSD) operations faster. To disable light scrubbing cluster-wide and then enable it again, use the following commands (deep scrubbing is controlled separately by the ‘nodeep-scrub’ flag):

$ ceph osd set noscrub
$ ceph osd unset noscrub

Limit the impact of backfill and recovery operations

If you have noticed that the overall performance of your Ceph cluster is low, or you are planning maintenance of one or more OSDs, then you may want to minimize the impact of backfill and recovery operations and preserve the performance of the cluster. There are two parameters you may want to temporarily change:

  • ‘osd max backfills’ (the number of concurrent backfills per OSD, 10 by default)
  • ‘osd recovery max active’ (the number of concurrent recovery operations per OSD, 15 by default)
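
One way to change both parameters temporarily on all OSDs is the ‘injectargs’ mechanism shown in the logging section. The sketch below is an illustration under that assumption. The function name ‘throttle_recovery’ and the ‘CEPH’ override variable are hypothetical, and values injected this way last only until the OSD daemons restart.

```shell
# Hypothetical override: set CEPH=echo to print the ceph call instead of running it.
CEPH="${CEPH:-ceph}"

# Set osd max backfills ($1, default 1) and osd recovery max active ($2, default 1)
# on all OSDs at runtime.
throttle_recovery() {
    backfills="${1:-1}"
    active="${2:-1}"
    case "$backfills$active" in
        *[!0-9]*) echo "usage: throttle_recovery [BACKFILLS] [ACTIVE]" >&2; return 1 ;;
    esac
    "$CEPH" tell 'osd.*' injectargs \
        "--osd-max-backfills $backfills --osd-recovery-max-active $active"
}
```

For example, ‘throttle_recovery 1 1’ before maintenance, and ‘throttle_recovery 10 15’ afterwards to return to the defaults listed above.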