ZFS health check in Grafana / Telegraf

I recently set up a ZFS mirror on my home server and found myself needing a way to be alerted if something went wrong. That same server runs Grafana and InfluxDB, and collects various metrics from my other machines (and itself) via Telegraf. Since I already have email alerts set up with that stack, it felt simplest to use it for this solution.

A really simple script

#!/bin/sh

# Compares the expected zpool status output with the actual status.
# Copy to a global location such as /usr/local/bin so it's accessible to Telegraf.
# Note: This can produce a false positive (an unhealthy reading for a healthy pool)
# if the wording of the command's output changes; that output is not guaranteed to be stable.

# Outputs 0 for "false" (not healthy), outputs 1 for "true" (healthy)
# Chose using integers over booleans due to how Grafana handles alerts.

OUTPUT="zfs_status,host=[HOSTNAME HERE] healthy="

if [ "$(zpool status -x)" != "all pools are healthy" ]; then
	OUTPUT=${OUTPUT}"0i"
else
	OUTPUT=${OUTPUT}"1i"
fi

echo "$OUTPUT"

There are similar scripts floating around on the Internet so I used those for inspiration. The only difference with mine is that it outputs the InfluxDB Line Protocol.

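host= is just a convenient tag for your box’s hostname: replace the [HOSTNAME HERE] placeholder with it (or call the hostname command and interpolate it). Tag values in the line protocol can't contain unescaped spaces or commas, so the placeholder won't parse if it is left as is.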
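
As a minimal sketch of the interpolated variant, the prefix could be built at run time like this. The HOST variable and the uname -n fallback are assumptions added for this example; the original script hard-codes the tag.

```shell
#!/bin/sh
# Look up the hostname at run time instead of hard-coding it.
# The `uname -n` fallback is an addition for this sketch, used only
# when the hostname command is missing.
HOST="$(hostname 2>/dev/null || uname -n)"

# Same prefix as in the script, with the host tag filled in.
OUTPUT="zfs_status,host=${HOST} healthy="
echo "$OUTPUT"
```

The rest of the script stays the same: it appends 0i or 1i to OUTPUT and echoes the result.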

Everything else should be explained by the script's comments, including the possibility of false positives. Feel free to rename the zfs_status measurement to anything you wish (in line protocol terms, zfs_status is the measurement and healthy is the field). On my own setup I use tws_zfs_status to set my custom measurements apart and hopefully avoid naming conflicts.
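
If the false-positive risk bothers you, one possible alternative is to read the health property from zpool list in scripted mode (-H prints tab-separated values with no header) instead of matching a human-readable sentence. This is only a sketch, not what my script does: pools_healthy is a made-up helper name, and as far as I know zpool status -x can also flag pools that are still ONLINE but have errors, so the two checks are not equivalent.

```shell
#!/bin/sh
# pools_healthy (hypothetical helper) reads "name<TAB>health" lines on stdin,
# as printed by `zpool list -H -o name,health`.
# Prints 1 only if at least one pool is listed and every pool is ONLINE;
# prints 0 otherwise (including when no pools are listed at all).
pools_healthy() {
	awk -F '\t' '
		{ seen = 1; if ($2 != "ONLINE") bad = 1 }
		END { print (seen && !bad) ? 1 : 0 }
	'
}

# Only emit a record on machines that actually have the zpool command.
if command -v zpool >/dev/null 2>&1; then
	echo "zfs_status,host=$(hostname) healthy=$(zpool list -H -o name,health | pools_healthy)i"
fi
```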

The Telegraf side

Telegraf has a super handy exec input where you can run arbitrary commands, so that’s what we use:

[[inputs.exec]]
  commands = ["sh /usr/local/bin/zfs_check.sh"]
  timeout = "5s"
  data_format = "influx"

Each time Telegraf collects from its inputs, it runs the script and writes a 1 or a 0 to the healthy field of the zfs_status measurement.
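
Before relying on Telegraf to parse the output, it can help to check by hand that the script prints exactly one well-formed record. The sketch below assumes the script lives at the path used in the config above; is_valid_line is a hypothetical helper, and its pattern only covers the specific record this script emits, not the line protocol in general.

```shell
#!/bin/sh
# is_valid_line (hypothetical helper) succeeds when its argument looks like
# the single record the check script is expected to print.
is_valid_line() {
	printf '%s\n' "$1" | grep -Eq '^zfs_status,host=[^ ,=]+ healthy=[01]i$'
}

# Assumed install path, matching the Telegraf config above.
SCRIPT=/usr/local/bin/zfs_check.sh
if [ -r "$SCRIPT" ]; then
	line="$(sh "$SCRIPT")"
	if is_valid_line "$line"; then
		echo "ok: $line"
	else
		echo "unexpected output: $line" >&2
	fi
fi
```

Telegraf's own --test flag goes one step further by running the inputs once and printing the parsed metrics, for example telegraf --config /etc/telegraf/telegraf.conf --input-filter exec --test (the config path here is an assumption; use wherever yours lives).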

And finally, Grafana

The basic setup in Grafana is:

  • Stat panel type
  • Grab the last value of the healthy field
  • Map 1 to “Healthy” / green, 0 to “Unhealthy” / red
  • Set up an alert for when the value is less than 1

That’s basically it! It’s a really good idea to test it by temporarily tweaking the script to output a 0 and waiting for an email to arrive. 🙂
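
Rather than editing the script each time, one way to make that test repeatable is a flag file that forces an unhealthy reading while it exists. This is a sketch under my own assumptions: the FORCE_FILE path and the health_value helper are invented for the example, and any path the Telegraf user can read would do.

```shell
#!/bin/sh
# FORCE_FILE is a hypothetical flag file: while it exists, the check
# reports unhealthy no matter what ZFS says.
FORCE_FILE="${FORCE_FILE:-/tmp/zfs_check_force_unhealthy}"

# health_value (hypothetical helper) takes the output of `zpool status -x`
# and prints 1i for healthy, 0i for anything else or when forced.
health_value() {
	if [ -e "$FORCE_FILE" ]; then
		echo "0i"   # test mode: pretend something is wrong
	elif [ "$1" = "all pools are healthy" ]; then
		echo "1i"
	else
		echo "0i"
	fi
}

# Only emit a record on machines that actually have the zpool command.
if command -v zpool >/dev/null 2>&1; then
	echo "zfs_status,host=$(hostname) healthy=$(health_value "$(zpool status -x)")"
fi
```

To run a test, touch the flag file, wait for the alert email, then remove the file and check that the panel goes back to green.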