[Nagios-devel] Instrumenting Nagios

Andreas Ericsson ae at op5.se
Wed May 20 08:12:27 UTC 2009

Steven D. Morrey wrote:
> Hi Everyone,
> We're trying to track down a high latency issue we're having with our
> Nagios system and I'm hoping to get some advice from folks. Here's
> what’s going on.
> We have a system running Nagios 2.12 and DNX 0.19 (latest). This setup
> is comprised of 1 main nagios server and 3 DNX "worker nodes".
> We have 29000+ service checks across about 2500 hosts. Over the last
> year we average about 250 or more services alarming at any given
> time. We also have on average about 10 hosts down at any given time.

Nagios 2 runs host checks serially, so with any number of hosts down
at a time, latency will increase slowly but steadily.
Is there any chance that you can upgrade to Nagios 3?

> My original thought was that perhaps DNX was slowing down, maybe a
> leak or something so I instrumented DNX, by timing from the moment
> it's handed a job until it posts the results into the circular
> results buffer. This figure holds steady at 3.5s.

Since you're a clever chap, I'll assume the DNX-module isn't waiting
for the result after it's posted it. If it does, Nagios does indeed
hang during that period. However, that would increase latency by the
3.5s you see *for each check*, which would quickly cause it to
skyrocket, so that really can't be it.

> I am pretty sure all checks are getting executed (at least, all the
> ones that are enabled) eventually. Just more and more slowly over
> time. Clearly, some checks are being delayed by something or even
> many things.  What I'd like to do is to perhaps extend nagiostats to
> gather information about why latency is at the level it is, to see if
> we can't determine why Nagios is waiting to run these checks.
> What should we be looking at, either in the event loop or outside of
> it, to get a good overview of what Nagios is doing, how, and why?
> We are thinking of adding counters to the different events (both high
> and low) in an attempt to determine the source of the latency in
> detail. For example, if the average check latency is 100 seconds,
> being able to show that 30 of that was spent doing notifications, and
> 20 seconds spent doing service reaping, etc. That way we can know
> where we need to make optimizations.

There are compile options in the Makefiles that allow you to create
profiling information suitable for consumption with gprof. You could
also try running it under oprofile to see if any system calls are
taking abnormally long. Be especially mindful of the ones that
deadlock in I/O calls.

> I'm thinking that if we can instrument the following events we should
> have most of our bases covered (note some of these may already be
> instrumented)...
> log file rotations, external command checks, service reaper events,
> program shutdown, program restart, orphan check, retention save,
> status save, service result freshness, host result freshness, expired
> downtime check, check rescheduling, expired comment check, host check,
> service check.
> Is there anything else that could or should be instrumented that
> could give us a good view into what Nagios is doing that's causing
> service checks to be executed further and further away from when they
> were scheduled?

I personally don't put much faith in manually adding profiling
timers to code (although I wrote a small library to do just that
quite a long time ago). It's (usually) far better to use the
information obtained from profiling tools, since they allow
you to see more than just the things you *think* are problematic.
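
For reference, manual timers of the kind described above amount to
something like the following minimal sketch. It is written in Python
rather than the C that Nagios uses, and every name in it (EventTimers,
timed, report, the "service_reaper" label) is hypothetical and not
part of Nagios or DNX.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class EventTimers:
    """Accumulates wall-clock time and call counts per event type."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock               # injectable so it can be faked
        self.total = defaultdict(float)   # event type -> seconds spent
        self.count = defaultdict(int)     # event type -> number of runs

    @contextmanager
    def timed(self, event_type):
        """Time the enclosed block and charge it to event_type."""
        start = self._clock()
        try:
            yield
        finally:
            self.total[event_type] += self._clock() - start
            self.count[event_type] += 1

    def report(self):
        """Return (event_type, total_seconds, calls), slowest first."""
        rows = [(name, self.total[name], self.count[name])
                for name in self.total]
        return sorted(rows, key=lambda row: row[1], reverse=True)


timers = EventTimers()
with timers.timed("service_reaper"):  # hypothetical event label
    time.sleep(0.01)                  # stand-in for the real event handler
for name, seconds, calls in timers.report():
    print(f"{name}: {seconds:.3f}s over {calls} call(s)")
```

Note that this only measures the spots you chose to wrap, which is
exactly the limitation mentioned above.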

> Are these complete? Do these make sense to instrument and would they
> be useful in determining what is contributing to check latency?

Well, seeing how long each type of event takes to run, combined with
the type of the event being executed, would almost certainly take you
a very long way indeed. However, I'm reasonably certain you'll find
that latency increases by 3 (if you're using check_host) to 5 (if you
are using check_ping) seconds for each host-check that gets executed
against a host that's down.
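
To put rough numbers on that, here is a back-of-the-envelope sketch.
It assumes each check against a down host blocks the scheduler for the
full 3 or 5 seconds and that nothing else runs in the meantime. The
function name is made up for illustration, and the figure of 10 down
hosts is taken from the original message.

```python
def blocked_seconds(down_host_checks, seconds_per_check):
    """Seconds the scheduler is blocked if every check of a down host
    runs serially and takes seconds_per_check to give up."""
    return down_host_checks * seconds_per_check


# 10 hosts down on average, per the original message.
for plugin, seconds in (("check_host", 3), ("check_ping", 5)):
    total = blocked_seconds(10, seconds)
    print(f"{plugin}: {total}s blocked per pass over 10 down hosts")
```

Every such pass pushes all pending checks back by that amount, which
would match latency creeping up over time.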

There's a fairly simple test to be done for this though. Just make
the hostcheck a really stupid program that always exits with OK state
immediately. If that cures your latency, the problem is indeed caused
by the host checks.
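
A minimal sketch of such a check, assuming Python is available on the
Nagios server. The status text and the status_line name are arbitrary;
the only thing Nagios cares about is the exit status, where 0 means OK.

```python
#!/usr/bin/env python3
# Dummy host check for testing only: performs no probe at all and
# always reports OK.

STATE_OK = 0  # Nagios plugin convention: exit status 0 means OK


def status_line():
    """First line of plugin output, shown by Nagios as status text."""
    return "OK - dummy host check, no probe performed"


print(status_line())
# Falling off the end of the script exits with status 0 (STATE_OK).
```

Interpreter start-up adds a little overhead per check, so a compiled
or shell equivalent would be quicker. The check_dummy plugin from the
standard plugins package, if you have it installed, should do the
same job when called with state 0.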

Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Register now for Nordic Meet on Nagios, June 3-4 in Stockholm

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
