When writing a custom check plugin for Icinga 2, it can be helpful to take the past into account in addition to observing the current state of a system. A common case is a data source that provides counter values, i.e. values that increase over time, where you are less interested in the current value than in how it changes. The network interface counters on Linux are an example of this: if you want to know the data rate on an interface, you need to read a byte counter at two different times and compute the rate from the difference.
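To make the rate computation concrete, here is a minimal sketch in Bash. The function name counter_rate and the sample numbers are made up for illustration. On a real system, the two readings would come from a counter file such as /sys/class/net/eth0/statistics/rx_bytes, assuming an interface named eth0.

```shell
#!/usr/bin/env bash

# counter_rate PREVIOUS CURRENT SECONDS
# Prints the average increase per second between two counter readings.
# Uses integer division, so the result is rounded down.
counter_rate() {
    local previous=$1 current=$2 seconds=$3
    if [ "$seconds" -le 0 ] || [ "$current" -lt "$previous" ]; then
        # No time has passed or the counter was reset: no meaningful rate.
        return 1
    fi
    echo $(( (current - previous) / seconds ))
}

# Example with made-up readings taken 10 seconds apart.
counter_rate 1000000 1500000 10   # prints 50000
```

The function refuses to compute a rate when the current reading is lower than the previous one, since that usually means the counter was reset in between.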
A related and very simple counter will serve as the example for this blog post: for each network interface, Linux provides a virtual file carrier_changes that counts how often the link state has changed between up and down. With this information, one can write a check that returns a critical state when this value increases, as this could be a sign of an unstable connection. If we assume that the previous value can be passed as an argument to the check command, the following Bash script could be used for this purpose. In lines 4 to 14, it simply reads the command line arguments -i and -p into the shell variables INTERFACE and PREVIOUS_CARRIER_CHANGES respectively. Line 16 reads the current counter value, and the rest of the script compares both values and generates a corresponding message and exit code.

    #!/usr/bin/env bash
    set -eu

    while getopts hi:p: OPT; do
        case "$OPT" in
            i)
                INTERFACE=$OPTARG ;;
            p)
                PREVIOUS_CARRIER_CHANGES=$OPTARG ;;
            h|*)
                echo "Usage: $0 -i <interface> -p <previous carrier_changes value>" >&2
                exit 3 ;;
        esac
    done

    CARRIER_CHANGES=$(<"/sys/class/net/$INTERFACE/carrier_changes") || exit 3

    if [ "$CARRIER_CHANGES" -gt "$PREVIOUS_CARRIER_CHANGES" ]; then
        msg="CRITICAL: $((CARRIER_CHANGES - PREVIOUS_CARRIER_CHANGES)) interface carrier changes on $INTERFACE since last check"
        result=2
    else
        msg="OK: no interface carrier changes on $INTERFACE since last check"
        result=0
    fi

    echo "$msg | carrier_changes=${CARRIER_CHANGES}c"
    exit "$result"

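One thing to be aware of: because of set -u, the script aborts with an "unbound variable" error when -i or -p is missing, and the resulting exit status is not the UNKNOWN code 3 (it is typically 1, which Icinga 2 would treat as WARNING). A possible guard is sketched below. The helper names is_uint and check_args are hypothetical and not part of the original script, and the sketch assumes it would be placed between the option parsing and the counter read.

```shell
#!/usr/bin/env bash

# is_uint VALUE: succeeds only if VALUE is a non-empty string of digits.
is_uint() {
    case "${1:-}" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# check_args INTERFACE PREVIOUS
# Returns 3 (UNKNOWN) with a message on stderr if either value is unusable.
check_args() {
    local interface=${1:-} previous=${2:-}
    if [ -z "$interface" ]; then
        echo "UNKNOWN: no interface given (-i)" >&2
        return 3
    fi
    if ! is_uint "$previous"; then
        echo "UNKNOWN: previous value (-p) must be a non-negative integer" >&2
        return 3
    fi
}

# In the check script this could be called as:
#   check_args "${INTERFACE:-}" "${PREVIOUS_CARRIER_CHANGES:-}" || exit 3
```

The ${VAR:-} expansions in the call are what keep set -u from aborting before the guard gets a chance to run.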
So when the current carrier_changes value is 4 and the script is called with a previous value of -p 1, for example, it reports a critical state. If it is called with a matching -p 4 instead, it reports OK:

    root@my-host:~# cat /sys/class/net/eth0/carrier_changes
    4
    root@my-host:~# ./check_linux_carrier_changes.sh -i eth0 -p 1
    CRITICAL: 3 interface carrier changes on eth0 since last check | carrier_changes=4c
    root@my-host:~# ./check_linux_carrier_changes.sh -i eth0 -p 4
    OK: no interface carrier changes on eth0 since last check | carrier_changes=4c

The only remaining question is how the -p argument gets set for the check command. As you may have noticed, the check script also returns the raw counter value as a performance data value. This can be combined with the Icinga 2 feature that allows check command arguments to be generated dynamically. The following CheckCommand definition makes use of this by defining a lambda function that extracts the corresponding performance data value from the last check result and falls back to 0 if there is none:

    object CheckCommand "linux-carrier-changes" {
        command = ["/path/to/check_linux_carrier_changes.sh"]

        arguments = {
            "-i" = "$linux_carrier_changes_interface$"
            "-p" = "$linux_carrier_changes_previous$"
        }

        vars.linux_carrier_changes_previous = {{
            var last = macro("$last_check_result$")
            if (last && last.performance_data) {
                for (var p in last.performance_data.map(parse_performance_data)) {
                    if (p.label == "carrier_changes") {
                        return p.value
                    }
                }
            }
            return 0
        }}
    }

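To illustrate what the lambda function looks up in the previous check result, here is a rough Bash equivalent that works directly on a plugin output line. It is only a sketch for simple label=value[unit] entries without quoted labels, not how Icinga 2 parses performance data internally. The function name perfdata_value is made up for this example.

```shell
#!/usr/bin/env bash

# perfdata_value LABEL OUTPUT
# Prints the numeric value of LABEL from the performance data part of a
# plugin output line (everything after the first "|"). Prints 0 if the
# label is not found, like the fallback in the CheckCommand definition.
perfdata_value() {
    local label=$1 output=$2 perfdata entry value
    case "$output" in
        *'|'*) perfdata=${output#*|} ;;
        *) echo 0; return ;;           # no performance data at all
    esac
    # Intentionally unquoted: split the performance data on whitespace.
    for entry in $perfdata; do
        if [ "${entry%%=*}" = "$label" ]; then
            value=${entry#*=}          # e.g. "4c" or "4c;;;0"
            value=${value%%;*}         # drop thresholds, if any
            echo "${value%%[!0-9.-]*}" # strip the trailing unit
            return
        fi
    done
    echo 0
}

perfdata_value carrier_changes \
    "OK: no interface carrier changes on eth0 since last check | carrier_changes=4c"   # prints 4
```

Feeding the printed value back in as -p on the next run is exactly the loop that Icinga 2 closes automatically with the definition above.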
This can then be used without much extra work in a Service object. The only somewhat unusual setting is volatile = true. It is added because this specific check script only reports critical for a single execution and then automatically resets to OK.

    object Service "carrier-changes-eth0" {
        host_name = "my-host"
        check_command = "linux-carrier-changes"

        volatile = true
        max_check_attempts = 1

        vars.linux_carrier_changes_interface = "eth0"
    }

And there we have it: a check that reports an error when a network link reconnects.
This mechanism is not limited to performance data: other information from the previous check can be accessed using macro strings as well. There are also some macro shortcuts, like $last_state$ for the previous state, $last_check$ for the time the last check was executed, and $output$ and $perfdata$ for the full output and performance data of the last check execution.