Imagine you have one more special thing to monitor. While our Icinga 2 can observe infrastructure of almost any size, it still needs a plugin for each kind of check. Unfortunately, not every command conforms to the monitoring plugin API: exit codes 0-3 (OK, warning, critical, unknown), performance data, etc. For example, programs often exit with 1 on a fatal error, which Icinga treats as a mere warning. So you have to look for plugins for your specific use case, compare their features, judge the authors’ trustworthiness and hope for long-term maintenance. Ideally, you can read and understand the source code, which allows you to review a plugin’s security.
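To make the exit code mismatch concrete, here is a minimal shell sketch of the translation a hand-written wrapper would have to do. The helper name run_as_plugin is hypothetical, and rating every non-zero exit code as critical is just one possible policy, not a rule of the plugin API.

```shell
#!/bin/sh
# Sketch: run a command and translate its exit code into the monitoring
# plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
# "run_as_plugin" is a hypothetical helper name used for illustration only.
run_as_plugin() {
    status=0
    # Capture stdout and stderr together; remember the exit code on failure.
    output=$("$@" 2>&1) || status=$?

    if [ "$status" -eq 0 ]; then
        echo "OK - command succeeded"
        return 0
    elif [ "$status" -eq 126 ] || [ "$status" -eq 127 ]; then
        # Shell convention: 126 = not executable, 127 = command not found.
        echo "UNKNOWN - command could not be run: $output"
        return 3
    else
        # Any other failure, including the common exit code 1, is rated
        # critical here instead of being passed through as a warning.
        echo "CRITICAL - command exited with $status: $output"
        return 2
    fi
}

# Usage: "false" always exits with 1, which the wrapper rates as critical.
rc=0
run_as_plugin false || rc=$?
echo "plugin exit code: $rc"
```

A wrapper like this has to be written, reviewed and maintained once per command, and it does not yet look at the output at all.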
Or, if your OS already provides a utility which perfectly detects faults, you could write a Bash/Python script to parse its output and exit code. For each specific check! But if you aren’t a (good) programmer, my check_rungrep plugin is likely the better option. Its superpower is running commands and interpreting whatever they return as you wish – one plugin for all commands. Consider e.g. S.M.A.R.T. – the very first thing I personally start monitoring on a new machine. smartctl -H /dev/… clearly says whether a disk has failed or not:
    # smartctl -H /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_27R3ETNAS
    smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.75] (local build)
    Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
The final line makes it very easy for check_rungrep to rate the success of smartctl:
    # check_rungrep \
        stdout literal 'SMART overall-health self-assessment test result: PASSED' \
        '' 1: '' \
        command smartctl -H /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_27R3ETNAS
    ✅ Command's stdout matched the following pattern 1 times. Critical: 1:~. Literal string: SMART overall-health self-assessment test result: PASSED
The command parameter tells check_rungrep to run smartctl -H on the given device, and stdout literal instructs the plugin to count the occurrences of the given string in smartctl’s standard output. If that count falls outside the range “1:” (1 to infinity), the check result is critical. In general, the three arguments after the pattern (here two empty strings with “1:” in between) stand for WARN, CRIT and LABEL. In this example, WARN and LABEL are empty and hence no-ops, so there’s no warning threshold and the number of string occurrences isn’t included in the performance data.
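The range notation can be illustrated with a small shell sketch. It follows the usual monitoring plugin threshold convention as I understand it (“N” means 0 to N, “N:” means N to infinity, “~” means infinity on either side), handles integers only and ignores the “@” inversion prefix. The function name in_range is hypothetical, and this is not check_rungrep’s actual code.

```shell
#!/bin/sh
# Sketch: succeed if the integer VALUE lies inside the threshold RANGE.
# Assumed forms: "N" (0..N), "N:" (N..infinity), "N:~" (same), "~:N"
# (-infinity..N), "N:M" (N..M). An empty RANGE means "no threshold".
in_range() {
    value=$1
    range=$2

    # No threshold configured: every value passes.
    if [ -z "$range" ]; then
        return 0
    fi

    case $range in
        *:*) low=${range%%:*}; high=${range#*:} ;;
        *)   low=0; high=$range ;;
    esac
    # A missing lower bound defaults to 0.
    low=${low:-0}

    # "~" as lower bound stands for negative infinity: skip the check.
    if [ "$low" != "~" ] && [ "$value" -lt "$low" ]; then
        return 1
    fi
    # An empty or "~" upper bound stands for positive infinity: skip the check.
    if [ -n "$high" ] && [ "$high" != "~" ] && [ "$value" -gt "$high" ]; then
        return 1
    fi
    return 0
}

# Usage: apply the CRIT range "1:" from the example to a few match counts.
for count in 0 1 2; do
    if in_range "$count" "1:"; then
        echo "$count occurrence(s): OK"
    else
        echo "$count occurrence(s): CRITICAL"
    fi
done
```

With the range “1:”, only a count of zero is rated critical, which matches the intent of the smartctl example: the PASSED line has to appear at least once.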
Just to be sure, check_rungrep can also monitor standard error, exit code and the execution time:
    # check_rungrep \
        stdout literal 'SMART overall-health self-assessment test result: PASSED' '' 1: '' \
        stderr regex . 0 '' '' \
        exit 0 '' '' \
        time 1 10 time \
        command smartctl -H /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_27R3ETNAS
    ✅ Command's stdout matched the following pattern 1 times. Critical: 1:~. Literal string: SMART overall-health self-assessment test result: PASSED
    ✅ Command's stderr matched the following pattern 0 times. Warning: 0:0. Regular expression: .
    ✅ Command returned 0. Warning: 0:0.
    ✅ Command ran for 0.6975857 seconds (697ms 585us 700ns). Warning: 0:1. Critical: 0:10.
    | 'time'=0.6975857s;0:1;0:10;0;
In addition to the mandatory standard output check, this example warns if there’s anything on stderr or if the exit code is non-zero. The execution time also has warning and critical thresholds (1 and 10 seconds here) and is reported as performance data labeled “time”.
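For comparison, this is roughly what collecting the same four measurements by hand would look like in shell. It is only an approximation under stated assumptions: cmd_under_test is a hypothetical stand-in for the real command, grep -c counts matching lines rather than individual occurrences, and date +%s only gives whole seconds. It says nothing about how check_rungrep works internally.

```shell
#!/bin/sh
# Hand-rolled approximation of the four checks, NOT check_rungrep's code.
# "cmd_under_test" is a hypothetical stand-in for the monitored command.
cmd_under_test() {
    echo "SMART overall-health self-assessment test result: PASSED"
}

out_file=$(mktemp)
err_file=$(mktemp)

# Run the command once, keeping stdout, stderr, exit code and duration.
start=$(date +%s)
exit_code=0
cmd_under_test >"$out_file" 2>"$err_file" || exit_code=$?
end=$(date +%s)
elapsed=$((end - start))

# -F treats the pattern as a literal string; -c prints the number of
# matching lines. grep exits with 1 on zero matches, hence "|| true".
stdout_hits=$(grep -c -F 'SMART overall-health self-assessment test result: PASSED' "$out_file" || true)
# The regex "." matches every non-empty line on stderr.
stderr_hits=$(grep -c . "$err_file" || true)

rm -f "$out_file" "$err_file"

echo "stdout_hits=$stdout_hits stderr_hits=$stderr_hits exit=$exit_code time=${elapsed}s"
```

Each of these values would still need its own threshold comparison and output formatting, which is the part the plugin takes over.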
However, your Icinga (hopefully) doesn’t run as root and has no permission to inspect any disks. To grant as little privilege as possible, run only smartctl via doas(1) or sudo(8):
    $ check_rungrep \
        stdout literal 'SMART overall-health self-assessment test result: PASSED' '' 1: '' \
        command doas smartctl -H /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_27R3ETNAS
Apropos Icinga – the above example can be integrated like this:
    object CheckCommand "smartctl" {
        command = [
            PluginDir + "/check_rungrep",
            "stdout", "literal", "SMART overall-health self-assessment test result: PASSED",
            "", "1:", "",
            "command", "doas", "smartctl", "-H", "$smartctl_disk$"
        ]
    }

    apply Service "smartctl:" for (name => path in host.vars.smartctl_disks) {
        check_command = "smartctl"
        command_endpoint = host.name
        vars.smartctl_disk = path
    }
Anyway, the more you monitor, the more potential issues you catch before they escalate. If you need any help with integrating any kind of new checks, don’t hesitate to ask us for consultation!