Icinga 1.9.3 with bugfix for Postgresql

Unfortunately, a fix for an ugly libdbi bug causing bad performance was applied not only for MySQL but also for Postgresql, where it caused wrong queries. It slipped through testing and silently made 1.9.2 a faulty IDOUtils Postgresql release. My bad, sorry for that (Sundays are always sleepy 😉 ).
Other than that, Eric approached me last week with an init script bug: it did not return the correct LSB exit codes when called with ‘status’. I’ve already fixed that for 1.10, and it’s now backported into 1.9.3 too.
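For readers wondering what "correct" means here: the LSB spec defines fixed exit codes for the ‘status’ action of an init script. The snippet below is only a minimal sketch of that mapping, not the actual Icinga init script (which is a shell script); the function name and its three boolean inputs are made up for illustration.

```python
# Exit codes defined by the LSB for the "status" action of an init script.
LSB_RUNNING = 0          # program is running or service is OK
LSB_DEAD_PIDFILE = 1     # program is dead and a pid file exists
LSB_DEAD_LOCKFILE = 2    # program is dead and a lock file exists
LSB_NOT_RUNNING = 3      # program is not running
LSB_UNKNOWN = 4          # program or service status is unknown


def lsb_status(process_alive, pid_file_exists, lock_file_exists):
    """Map the observed daemon state to an LSB 'status' exit code.

    Hypothetical helper: how the three inputs are detected is left out.
    """
    if process_alive:
        return LSB_RUNNING
    if pid_file_exists:
        return LSB_DEAD_PIDFILE
    if lock_file_exists:
        return LSB_DEAD_LOCKFILE
    return LSB_NOT_RUNNING
```

Monitoring and cluster tools often rely on these codes, which is why a script that always returns 0 for ‘status’ is more than a cosmetic problem.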
So, if you’re happy with 1.9.2 stick with it. If you are using Postgresql, fetch 1.9.3 from Github as usual. Target bug reports towards https://dev.icinga.com or join the various community channels at https://icinga.com/support 🙂

Icinga Core Reload Problems addressed in 1.9

There are many reports about the core reload/restart taking ages. This mostly happens when you have IDOUtils and a database backend enabled for Icinga Web and/or Reporting. You may ask “How about dropping the database and using something else?”. Well, that’s not really the point. It won’t solve the problem for everyone out there. Even Icinga 2 is not yet production-ready enough to act as a drop-in replacement.
So, what’s the problem at all? The core doesn’t know about config diffs – newly added or deleted objects. When idomod detects a core reload or restart, it dumps all the config information to the ido socket. ido2db reads from there and pushes the database inserts/updates for the configuration objects. This amount of data may get huge in large setups and takes a while to process.
The configuration dump needs to be finished before any other updates (status, check history) for data integrity reasons (check #1934 for some deeper thoughts). Rewriting the core for config diffs was an idea, but would cost too many resources right now (the configuration format and parsing is one of the major reasons to develop Icinga 2 from scratch).
During Icinga 2 development, we discussed an idomod connector (Compat IDO) and reusing ido2db from Icinga 1.x. That prototyping unveiled these bottlenecks even more, as Icinga 2 is designed for large-scale systems and may generate 100k service checks in a 5-minute interval – ido2db did not have fun back then.
We’ve decided to drop that idea (Icinga 2 will add its own ido compatible layer), but the prototyping added 2 nice enhancements for Icinga IDOUtils 1.9:

  • a socket queue (which does not use a kernel message queue, but a thread to proxy the socket data) #3533
  • transactions around large objects (e.g. a service with groups, contacts, dependencies, etc wrapped as single transaction) #3527
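To illustrate the socket queue idea from the first bullet: the sender hands data to an in-memory buffer and carries on, while a separate thread drains that buffer towards the slow database side. The following is only a rough Python sketch of that pattern under my own assumptions (ido2db itself is written in C); the names `proxy`, `incoming` and `write_to_db` are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel telling the worker that no more data will arrive


def proxy(incoming, write_to_db):
    """Accept every item right away and write it out from a worker thread.

    `incoming` stands in for data read from the socket, `write_to_db` for
    the slow database writer. Returns the number of accepted items.
    """
    buffer = queue.Queue()  # unbounded, which is why memory usage can grow

    def drain():
        while True:
            item = buffer.get()
            if item is _DONE:
                break
            write_to_db(item)

    worker = threading.Thread(target=drain)
    worker.start()

    accepted = 0
    for item in incoming:  # the sending side never waits for the database
        buffer.put(item)
        accepted += 1

    buffer.put(_DONE)
    worker.join()
    return accepted
```

The order of the data is preserved, so the configuration dump still reaches the database before later status updates; the sender just no longer has to wait for it.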

Check module/idoutils/config/updates/ido2db.cfg_added_1.8_to_1.9.cfg in Icinga 1.9 for details. These features were originally disabled by default (and tagged experimental) so as not to harm existing installations while still allowing everyone else to test and use them 🙂 As of the update of 4.5.2013 noted further down, they are enabled by default.
Known caveats:

  • ido2db requires more CPU and RAM in order to cache and process data (socket queue only)
  • your database must allow transactions for the database user (transactions only)
  • the insert/update performance still depends on your database – database tuning still required
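The transaction part can be pictured like this: instead of committing every single row, all rows belonging to one object are written inside one transaction. This is a self-contained sketch using Python's sqlite3 module and a made-up two-table schema; it is not the IDOUtils schema and not the ido2db code.

```python
import sqlite3


def create_schema(conn):
    """Create a tiny, hypothetical schema for the example."""
    conn.executescript(
        """
        CREATE TABLE services (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE service_members (
            service_id INTEGER NOT NULL,
            kind TEXT NOT NULL,
            member TEXT NOT NULL
        );
        """
    )


def store_service(conn, name, groups, contacts):
    """Write a service and all its related rows as one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute("INSERT INTO services (name) VALUES (?)", (name,))
        service_id = cur.lastrowid
        rows = [(service_id, "group", g) for g in groups]
        rows += [(service_id, "contact", c) for c in contacts]
        conn.executemany(
            "INSERT INTO service_members (service_id, kind, member) "
            "VALUES (?, ?, ?)",
            rows,
        )
    return service_id
```

Either the whole object ends up in the database or nothing does, and the database only has to commit once per object instead of once per row.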

Below is a small comparison using a 4k services test config on a Debian 6.0.7 VM (4 cores, 2GB RAM, MySQL 5.1.66 without tuning). Icinga writes “Event loop started…” to the logs, but there’s also a dedicated service check in your sample configuration.
Core Startup with pre 1.9, no options enabled (short log):

Apr 15 18:01:32 sol icinga: Icinga 1.9.0 starting... (PID=4699)
Apr 15 18:01:32 sol icinga: Event broker module '/usr/lib/idomod.so' initialized successfully.
Apr 15 18:01:32 sol ido2db: Client connected, data available.
Apr 15 18:04:22 sol icinga: Event loop started...

 
Core Startup with pre 1.9 and both options enabled (short log):

Apr 15 18:07:35 sol icinga: Icinga 1.9.0 starting... (PID=5336)
Apr 15 18:07:35 sol icinga: Event broker module 'IDOMOD' version '1.9.0' from '/usr/lib/idomod.so' initialized successfully.
Apr 15 18:07:35 sol ido2db: Client connected, data available.
Apr 15 18:07:38 sol icinga: Event loop started...
Apr 15 18:07:52 sol ido2db: IDO2DB buffer sizes: left=5946260, right=0
Apr 15 18:10:04 sol ido2db: IDO2DB buffer sizes: left=10586, right=0

Tip: The buffer size output is logged every ~15 seconds if there’s data waiting. Read it from left (queued socket input) to right (output towards the db). If there are no more log entries, the queue is idle and data is passing straight through.
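If you want to keep an eye on the queue while testing, the buffer size lines are easy to pick out of syslog. Here is a small sketch that extracts the two numbers from a line in the format shown above; the function name is my own, and the regular expression only assumes the exact wording seen in the log excerpt.

```python
import re

# Matches the message format shown in the log excerpt above.
BUFFER_LINE = re.compile(r"IDO2DB buffer sizes: left=(\d+), right=(\d+)")


def parse_buffer_sizes(line):
    """Return (left, right) as integers, or None if the line does not match."""
    match = BUFFER_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Fed with the two lines from the log above, it returns (5946260, 0) and (10586, 0), so you can watch the left side shrink while ido2db catches up.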
Memory and CPU consumption is pretty moderate in exchange for having the core checking hosts/services right after the event loop has started 🙂
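To put numbers on the two log excerpts: the time between "starting..." and "Event loop started..." can be computed straight from the syslog timestamps. A quick sketch follows; syslog lines carry no year, so 2013 is filled in as an assumption, and the month abbreviation parsing relies on an English/C locale.

```python
from datetime import datetime

ASSUMED_YEAR = "2013"  # assumption: syslog timestamps do not include a year


def seconds_between(start_line, end_line):
    """Seconds between the timestamps of two syslog lines of the same year."""
    fmt = "%Y %b %d %H:%M:%S"
    # The first 15 characters hold the timestamp, e.g. "Apr 15 18:01:32".
    start = datetime.strptime(ASSUMED_YEAR + " " + start_line[:15], fmt)
    end = datetime.strptime(ASSUMED_YEAR + " " + end_line[:15], fmt)
    return int((end - start).total_seconds())


without_options = seconds_between(
    "Apr 15 18:01:32 sol icinga: Icinga 1.9.0 starting... (PID=4699)",
    "Apr 15 18:04:22 sol icinga: Event loop started...",
)
with_options = seconds_between(
    "Apr 15 18:07:35 sol icinga: Icinga 1.9.0 starting... (PID=5336)",
    "Apr 15 18:07:38 sol icinga: Event loop started...",
)
print(without_options, with_options)  # 170 3
```

That is 170 seconds without the options versus 3 seconds with both enabled in this particular test setup; ido2db then keeps working off its queue in the background for a while.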
Please test those options in your setup (git next snapshot or wait til 1.9 on 25.4.2013), and provide feedback to our community support channels! Thanks in advance for helping make Icinga better 🙂
Update 4.5.2013: The core release team decided to mark another milestone with 1.9 and make those enhancements the default without any configuration. They’ve been running for months now on our test platforms and we do not want to miss the enhancements. The latest GIT release branch reflects those changes.

SLA Reporting with Added Precision

If you have upgraded to Icinga Web 1.6 you may already be familiar with the new SLA extension in IDOUtils. The optional module is our response to the old niggle from the community that data written to database could be better used. So we have taken the opportunity to add a table to the database model and fiddle with IDO2DB. The end result is ‘enable_sla’ in your IDOUtils configuration files, which takes events and identifies the periods of scheduled downtime and acknowledgment for more accurate SLA reporting.

You could say it improves SLA results too, by making it clearer to what extent a critical event is actually critical or already being resolved 😉
Coding and coordination aside, the concept behind the SLA extension is actually quite simple. We added an SLA history table to the database model, which organizes event start and end times, object id, state and state type as well as acknowledgement and scheduled downtime. Then in IDO2DB we added extra logic to write data from the core to the aforementioned table correctly.
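To give a feeling for what such history entries make possible, here is a minimal sketch of an availability calculation. The field names are made up for the example and are not the actual table columns, and treating acknowledged or downtime periods as not counting against the SLA is one possible policy, assumed here purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class SlaEntry:
    """One hypothetical SLA history entry (not the real column names)."""
    start: int          # event start, Unix timestamp
    end: int            # event end, Unix timestamp
    state: int          # 0 = OK, anything else = problem state
    acknowledged: bool  # problem was acknowledged during this period
    in_downtime: bool   # period falls into a scheduled downtime


def availability(entries, excuse_handled=True):
    """Percentage of time not spent in an unexcused problem state."""
    total = sum(e.end - e.start for e in entries)
    if total == 0:
        return None
    bad = sum(
        e.end - e.start
        for e in entries
        if e.state != 0
        and not (excuse_handled and (e.acknowledged or e.in_downtime))
    )
    return 100.0 * (total - bad) / total
```

With 800 seconds OK, 100 seconds critical during a scheduled downtime and 100 seconds critical and unhandled, this gives 90.0 when handled periods are excused and 80.0 when they are not.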

At the moment you can view SLA data in the Icinga Web interface’s new tackle cronk, in the form of a pie chart. This is just the beginning though. We hope to integrate SLA history into Icinga Reporting with even more refined metrics.
 
So perhaps in the (not too far) future, you may be able to open up Icinga Reporting and call up a diagram that shows service availability for the year, though only from Monday – Friday, 9 am – 5 pm, discounting scheduled downtimes and acknowledgement periods. That may be something worth waiting for – or even better, contributing to.

Icinga 1.5.1 released

As you may have noticed, the web developers already released a bugfixed 1.5.1 Icinga Web version (and 1.5.2 is to be announced soon). Now it’s time to fix some Core, Classic UI and IDOUtils related issues – so the core team is releasing 1.5.1 too 🙂
Changelog
* core: free memory allocated for notification macros right after sending the notification, not in next notification
* classic ui: fix Localization: Form validation message could be improved (thx Mario Rimann) #1849
* classic ui: fix wrong titles in list of scheduled downtimes (thx Mario Rimann) #1848
* classic ui: fix host and service names are not allowed to have a ‘+’ included #1843
* idoutils: idomod: change stacked memory allocation for broker_data IDO_MAX_BUFLEN #1879
* idoutils: fix idomod should log more verbose on errors, asking for a running ido2db process #1885
* spec file: re-add processing headers
As usual, please download from sourceforge and report any bugs or feature requests to our dev tracker and/or support channels.

Icinga Core, Classic UI & IDOUtils 1.4.2 released

Due to the recent fix for the XSS vulnerability in 1.4.1, the command expander in config.cgi did not work as expected. Alongside this bug, there were various other things to resolve while working on the 1.5 dev branches. All important fixes have been backported into the 1.4 tree and can now be found in a revamped 1.4.2 release of Core, Classic UI and IDOUtils.
Download 1.4.2 now or wait for your distribution to push updated packages 🙂 Special note: 1.4.2 does not require IDOUtils DB upgrading.
Changelog

  • core: fix freshness_threshold problem in host checks by using check_interval in HARD or OK state, else retry_interval (like service checks) #1331
  • classic ui: add a check for status data freshness into cgis #1667
  • classic ui: re-fix xss vulnerability and string escaping for command expansion #1605 #1624
  • classic ui: remove sidebar.html inclusion in index.html causing troubles on reload #1632
  • classic ui: fixed: User can execute host/servicegroup commands even if not authorized for (Sven Nierlein) #1679
  • classic ui: fixed: plugin_output_short didn’t get checked properly and caused segfault in status.cgi #1673
  • idoutils: do not update start_time of already started downtimes #1658
  • idoutils: fix started downtime update for table scheduleddowntime in oracle #1658
  • install: fix make install-idoutils overwrites sample – adding idoutils.cfg-sample instead #1625
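The first changelog entry (#1331) boils down to a simple selection rule for host checks: use check_interval when the host is in a HARD state or OK, otherwise retry_interval. The sketch below shows only that rule in Python; the function and parameter names are hypothetical, and the real freshness threshold calculation in the core involves more than picking the interval.

```python
def freshness_base_interval(state_is_ok, state_type_is_hard,
                            check_interval, retry_interval):
    """Pick the interval a freshness threshold is based on.

    Illustrative only: mirrors the rule described in the changelog entry,
    not the actual core code.
    """
    if state_type_is_hard or state_is_ok:
        return check_interval
    return retry_interval
```

So a host in a SOFT problem state is measured against its retry_interval, the same way the changelog says service checks already were.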

 
Please report any bugs/feature requests/etc to our development tracker and/or community channels! 🙂