First of all, if you own a domain, this post is for you. In production you obviously want to minimize outages. And an outage of a DNS domain takes down every service under that domain, at least from the users’ perspective, no matter whether your LAMP components are all up and running.
As usual, roughly speaking, monitoring has to “play end user” to properly discover failures end-to-end. Ideally you have an Icinga satellite (e.g. from our NWS colleagues) located in another datacenter, so that it also notices outages of your uplink. Either way, your monitoring has to do at least roughly what the user would do. In the case of DNS that simply means resolving your domain:
$ /usr/lib/nagios/plugins/check_dns -H example.com
DNS OK: 0.214 seconds response time. example.com returns 2606:2800:21f:cb07:6820:80da:af6b:8b2c,93.184.215.14|time=0.214062s;;;0.000000
$
The above plugin is easy to install, e.g. on Ubuntu 24.04 via the monitoring-plugins package. But, again, the fact that it works on your machine doesn’t mean anything to anyone else (unless you ship your machine). E.g. dnssec-failed.org, a test domain whose DNSSEC is deliberately broken, also resolved in my test environment:
$ /usr/lib/nagios/plugins/check_dns -H dnssec-failed.org
DNS OK: 0.517 seconds response time. dnssec-failed.org returns 96.99.227.255|time=0.516514s;;;0.000000
$
In a nutshell, DNSSEC is cryptographic signing for DNS. There’s one so-called trust anchor which signs the DNS root zone keys. Those keys sign the root zone, which declares which keys sign which TLDs, and so on. If you’re not using DNSSEC: why aren’t you? It’s important for end users and domain-validating CAs to be able to verify the integrity of DNS answers.
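To make the chain idea a bit more tangible, here is a toy sketch in Python. It only lists the zones that would typically form the chain for a name and treats the chain as intact when every single link is intact. It performs no cryptography at all, the function names and the per-zone status dictionary are made up for illustration, and real zone cuts don't necessarily follow every label.

```python
def chain_of_trust(domain: str) -> list[str]:
    """List candidate zones from the root down to the domain."""
    labels = [label for label in domain.rstrip(".").split(".") if label]
    zones = ["."]  # the root zone, vouched for by the trust anchor
    # Build the suffixes from the TLD towards the full name.
    for i in range(len(labels) - 1, -1, -1):
        zones.append(".".join(labels[i:]) + ".")
    return zones


def chain_is_valid(domain: str, link_ok: dict[str, bool]) -> bool:
    """Hypothetical model: the chain holds only if every zone's link verified.

    A zone missing from link_ok counts as a broken link.
    """
    return all(link_ok.get(zone, False) for zone in chain_of_trust(domain))
```

For example.com the candidate zones would be the root, com. and example.com. itself, and a single zone with a False (or missing) status makes chain_is_valid() return False.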
But the flip side of the coin is: if anything in the chain of trust, from the anchor to your domain, is missing, mismatched or expired, the whole chain is invalid. In this case, DNSSEC-aware DNS resolvers such as BIND 9 or 9.9.9.9 fail to resolve the domain in question:
$ /usr/lib/nagios/plugins/check_dns -H example.com -s 9.9.9.9
DNS OK: 0.035 seconds response time. example.com returns 2606:2800:21f:cb07:6820:80da:af6b:8b2c,93.184.215.14|time=0.035473s;;;0.000000
$ /usr/lib/nagios/plugins/check_dns -H dnssec-failed.org -s 9.9.9.9
Domain 'dnssec-failed.org' was not found by the server
$
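The comparison above can be turned into a small canary test: a resolver is only useful for DNSSEC monitoring if it resolves a correctly signed domain and refuses a deliberately broken one. The following Python sketch is just one way to express that. The names check_dns_lookup and classify_resolver are hypothetical, and the wrapper assumes that check_dns follows the usual plugin convention of exit status 0 for OK.

```python
import subprocess
from typing import Callable

PLUGIN = "/usr/lib/nagios/plugins/check_dns"  # path as used in this post


def check_dns_lookup(server: str, plugin: str = PLUGIN) -> Callable[[str], bool]:
    """Build a lookup function that asks `server` via check_dns."""
    def lookup(name: str) -> bool:
        result = subprocess.run(
            [plugin, "-H", name, "-s", server],
            capture_output=True, text=True, check=False,
        )
        # Assumption: exit status 0 means OK, anything else is a failure.
        return result.returncode == 0
    return lookup


def classify_resolver(
    lookup: Callable[[str], bool],
    good: str = "example.com",
    broken: str = "dnssec-failed.org",
) -> str:
    """Return 'validating', 'non-validating' or 'unreachable'."""
    if not lookup(good):
        return "unreachable"      # can't even resolve a healthy domain
    if lookup(broken):
        return "non-validating"   # happily resolves a broken chain
    return "validating"
```

Going by the outputs shown above, I'd expect classify_resolver(check_dns_lookup("9.9.9.9")) to report a validating resolver. Injecting the lookup function keeps the decision logic testable without any network access.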
So if your default DNS server successfully resolves dnssec-failed.org, you should use a different one for domain monitoring. But there’s another DNS caveat. Sysadmins know it very well as a joke, but it could take you up to 24 hours to get it. 😉 However, this blog post shouldn’t take that long, so jokes aside: DNS resolvers also cache answers, at most for as long as the individual records’ TTLs permit.
E.g., currently all DNSSEC-related records for example.com in the .com zone have a TTL of 24 hours. Your monitoring could therefore take up to a day just to notice an outage, unless you run your own DNS resolver and instruct it not to cache that long. On Ubuntu 24.04 the bind9 package is your friend. With tshark -f 'port 53' -i eth0 (package: tshark) you can even watch when your local BIND asks authoritative nameservers and when it doesn’t.
$ /usr/lib/nagios/plugins/check_dns -H example.com -s 127.0.0.1
DNS OK: 0.023 seconds response time. example.com returns 2606:2800:21f:cb07:6820:80da:af6b:8b2c,93.184.215.14|time=0.023238s;;;0.000000
$ /usr/lib/nagios/plugins/check_dns -H dnssec-failed.org -s 127.0.0.1
Domain 'dnssec-failed.org' was not found by the server
$
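A rough back-of-the-envelope model shows how much the resolver's cache contributes to the time until an outage is noticed. It's a deliberately simplified sketch: it assumes the resolver cached the answer right before the outage, that the last regular check ran just before the cache entry expired, and that a hard state needs max_check_attempts failed checks. The parameter names mirror Icinga's service attributes, the function itself is hypothetical, and the interval values are merely examples.

```python
def worst_case_detection_seconds(
    cache_ttl: int,
    check_interval: int,
    retry_interval: int,
    max_check_attempts: int,
) -> int:
    """Upper bound (in seconds) until a hard state, under the stated assumptions."""
    stale_answers = cache_ttl                  # resolver keeps serving the cached answer
    first_failure = check_interval             # next regular check after the cache expired
    retries = (max_check_attempts - 1) * retry_interval  # soft states until hard
    return stale_answers + first_failure + retries


# Example values: 5 minute checks, 1 minute retries, 3 attempts.
day_long_cache = worst_case_detection_seconds(86400, 300, 60, 3)
short_cache = worst_case_detection_seconds(23, 300, 60, 3)
print(day_long_cache, short_cache)
```

With these example values the cache accounts for almost all of the delay in the first case and for next to nothing in the second.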
Admittedly, one can’t disable caching completely, i.e. set a TTL of 0, as resolving itself takes a little time. But this isn’t even necessary. Just keeping the maximum TTL below the smallest reasonable Service#retry_interval, say 30 seconds, is enough:
--- /etc/bind/named.conf.options
+++ /etc/bind/named.conf.options
@@ -1,2 +1,8 @@
 options {
+	lame-ttl 23;
+	servfail-ttl 23;
+	max-ncache-ttl 23;
+	max-cache-ttl 23;
+	max-stale-ttl 23;
+
 	directory "/var/cache/bind";
Don’t forget to restart the service. Now, after a sleep 23, tshark should indicate that the BIND at 127.0.0.1 is asking authoritative nameservers again for the same query. 👍
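If you'd like to double-check that the resolver's cache limits really stay below your retry interval, a small Python sketch along these lines could help. It's only an illustration: the function names are made up, the regular expression only understands plain integer seconds on their own line (no other duration syntax, no comment handling), and it doesn't replace BIND's own config validation.

```python
import re

# Matches lines such as "max-cache-ttl 23;" (plain integer seconds only).
TTL_OPTION = re.compile(r"^\s*([a-z-]+-ttl)\s+(\d+)\s*;", re.MULTILINE)


def cache_ttls(named_conf: str) -> dict[str, int]:
    """Extract all '*-ttl <seconds>;' options from the config text."""
    return {name: int(value) for name, value in TTL_OPTION.findall(named_conf)}


def ttls_too_long(named_conf: str, retry_interval: int) -> dict[str, int]:
    """Return the TTL options that are not below retry_interval (seconds)."""
    return {
        name: ttl
        for name, ttl in cache_ttls(named_conf).items()
        if ttl >= retry_interval
    }
```

Feeding it the contents of /etc/bind/named.conf.options together with a retry interval of 30 should return an empty dictionary for the configuration shown above, since every limit is set to 23.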
Icinga 2 sample config
apply Service "dnssec:" for (name in host.vars.dnssec) {
	check_command = "dns"
	vars.dns_lookup = name
	vars.dns_server = "127.0.0.1"
}

apply Service "dnssec-cached:" for (name in host.vars.dnssec) {
	check_command = "dns"
	vars.dns_lookup = name
	vars.dns_server = "9.9.9.9"
}

object Host "resolver" {
	check_command = "hostalive"
	address = "127.0.0.1"
	vars.dnssec = [ "example.com", "dnssec-failed.org" ]
}