Monitoring the Unknown in the Service Manager

by Alvar Penning | Apr 10, 2024

Nearly every operating system comes with at least one kind of service management. On Unix-based operating systems, this has historically been part of the init system. While the specific tools have matured over time and differ between operating systems, they are essentially used to orchestrate both operating system services and user services. For example, a service manager ensures that a web server is started once the network is configured and available. It then keeps track of these services and may report their status.

This is where monitoring comes in. Because a modern operating system may come with a multitude of services, it is easy to overlook some crucial ones. Overlooked services can be part of the operating system, a dependency of another, already monitored service, or simply a forgotten component. They may work for ages, but suddenly fail after an update or an unforeseen event.

In this way, monitoring the init or service management system can be seen as an additional layer in an operating system monitoring strategy. This blog post showcases some exemplary init systems and explains how to monitor them for failed services. It concludes with an outlook on how to automatically restart failed services with Icinga 2, virtually acting as a service manager.

systemd on Linux

Most Linux-based operating systems have switched to systemd within the last decade. Since systemd can be much more than a simple init system, there are many things to monitor. Furthermore, by linking timers – systemd’s answer to cron – to service units, a failed timer can be easily detected.

On the command line, systemctl --failed can be used to check for failed units.

Due to the ubiquity of systemd, there are several monitoring plugins to choose from. The check_systemd plugin was chosen as an example. It can be called without parameters to report a summary of systemd’s state, which is exactly the desired behavior for this use case.

While the plugin is part of the Icinga Template Library and ships with its own Check Command, a minimal Service using the by_ssh Check Command might look like the following.

apply Service "systemd" {
  import "generic-service"

  check_command = "by_ssh"
  vars.by_ssh_command = [ "check_systemd" ]

  assign where host.vars.os == "Linux"
}
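
Where the monitored host runs an Icinga 2 agent, the Check Command from the Icinga Template Library could be used directly instead of by_ssh. The following is only a sketch under assumptions not made in this post: that the ITL Check Command is named systemd, and that the agent's Endpoint name is stored in a custom variable host.vars.agent_endpoint.

```
apply Service "systemd-agent" {
  import "generic-service"

  // Assumption: the ITL's Check Command for check_systemd is named "systemd".
  check_command = "systemd"

  // Assumption: host.vars.agent_endpoint holds the name of the agent's Endpoint object.
  command_endpoint = host.vars.agent_endpoint

  // Only apply to Linux hosts that actually define an agent endpoint.
  assign where host.vars.os == "Linux" && host.vars.agent_endpoint
}
```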

OpenBSD

Staying in the Unix world, the opposite of systemd, with all its features and glory, might be OpenBSD’s simple rc.d. The daemon control scripts can be managed and monitored using rcctl, which allows listing of failed or rogue services.

The author of this post – that’s me, hi – wrote a monitoring check plugin for OpenBSD’s rc.d a while back. In the spirit of this post, it does the bare minimum and reports failed services.

As the plugin also comes with a Check Command, it can likewise be utilized with by_ssh.

apply Service "rc.d services" {
  import "generic-service"

  check_command = "by_ssh"
  vars.by_ssh_command = [ "check_openbsd_rcd.sh" ]

  assign where host.vars.os == "OpenBSD"
}

Windows

For completeness’ sake, and to point at least once outside the Unix world, there is also Windows. The Icinga Stack includes the Icinga PowerShell Framework, which comes with the Invoke-IcingaCheckService plugin. It can be used to monitor Windows services and to create the same kind of service overview described throughout this post.

Automatic Service Restart

Knowing which service has crashed allows for an automatic restart of the failed service. While ideally every failure should be resolved at its root, in reality, “turning it off and on again” is often a sufficient solution. Given that the service is already offline, attempting to restart it is usually not considered dangerous. However, it may not be advisable to automatically restart every service, as some may have startup scripts with side effects or behave in other undesirable ways. Only services that are known to crash frequently and are safe to restart should be configured to do so.

Some service managers allow for automatic service restart, such as systemd. Others do not offer this option, either to keep things simple or due to the philosophical stance that crashes should not simply be ignored by restarting.

This is where Icinga 2 steps in with its EventCommands, which can be configured for both Host and Service objects. The Icinga 2 manual shows its strength with numerous examples of how to use EventCommands, including how to restart a service.

In contrast to a simple restart, an EventCommand can perform a wide range of tasks. Additionally, preflight checks can be run to limit the cases in which a restart is performed, thus ruling out situations in which a restart would be futile.
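
To illustrate, a restart triggered through an EventCommand might be sketched as follows. It loosely follows the SSH-based pattern from the Icinga 2 manual and is not a definitive setup. The object names, the custom variables restart_ssh_address, restart_ssh_command and restart_unit, the unit example.service, the --unit option of check_systemd, and a passwordless sudo rule for the monitoring user are all assumptions made for illustration.

```
object EventCommand "restart_unit_by_ssh" {
  // Reuse the check_by_ssh plugin to run a command on the remote host.
  command = [ PluginDir + "/check_by_ssh" ]

  arguments = {
    "-H" = "$restart_ssh_address$"
    "-C" = "$restart_ssh_command$"
  }

  vars.restart_ssh_address = "$address$"

  // Preflight check: only restart if the service is not OK (state_id > 0).
  vars.restart_ssh_command = "test $service.state_id$ -gt 0 && sudo systemctl restart $restart_unit$"
}

apply Service "example unit" {
  import "generic-service"

  check_command = "by_ssh"
  // Assumption: check_systemd accepts --unit to check a single unit.
  vars.by_ssh_command = [ "check_systemd", "--unit", "example.service" ]

  // Hypothetical unit name; resolved by the EventCommand above.
  vars.restart_unit = "example.service"
  event_command = "restart_unit_by_ssh"

  assign where host.vars.os == "Linux"
}
```

The test guard acts as a very small preflight check: when the service recovers and its state returns to OK, the event fires again, but no second restart is issued. On OpenBSD, the restart command would presumably use rcctl restart instead of systemctl.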

Conclusion

In summary, it is beneficial to be aware of the processes running on one’s system. However, modern operating systems may consist of many services, some of which are deeply integrated into the operating system. These services may cause problems only after an update or other changes. Thus, monitoring the health of the init or service management system can help find errors before their effects are felt.