[Nagios-devel] Instrumenting Nagios
eponymousalias at yahoo.com
Thu May 21 01:50:19 UTC 2009
To the extent that such delays may be partly
due to the general cost of computation, profiling the
entire nagios binary would not be a bad idea.
gprof is your friend.
--- On Tue, 5/19/09, Steven D. Morrey <smorrey at ldschurch.org> wrote:
> From: Steven D. Morrey <smorrey at ldschurch.org>
> Subject: [Nagios-devel] Instrumenting Nagios
> To: "nagios-devel at lists.sourceforge.net" <nagios-devel at lists.sourceforge.net>
> Date: Tuesday, May 19, 2009, 11:11 AM
> Hi Everyone,
> We're trying to track down a high latency issue we're
> having with our Nagios system and I'm hoping to get some
> advice from folks.
> Here's what’s going on.
> We have a system running Nagios 2.12 and DNX 0.19 (latest).
> This setup consists of 1 main Nagios server and 3 DNX
> "worker nodes".
> We have 29000+ service checks across about 2500 hosts. Over
> the last year we have averaged 250 or more services alarming
> at any given time. We also have, on average, about 10 hosts
> down at any given time.
> My original thought was that perhaps DNX was slowing down,
> maybe because of a leak or something, so I instrumented DNX by
> timing from the moment it's handed a job until it posts the
> results into the circular results buffer.
> This figure holds steady at 3.5s.
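A minimal sketch of the kind of hand-off-to-result timing described above, written in Python for brevity rather than in the C that DNX and Nagios use. The class and method names (JobTimer, handed_off, result_posted) are hypothetical and are not part of DNX; the clock is injectable only so the sketch can be exercised without real delays.

```python
import time


class JobTimer:
    """Track elapsed time between a job hand-off and its posted result."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable clock (assumption, for testing)
        self._started = {}       # job id -> timestamp at hand-off
        self.durations = []      # completed job durations, in seconds

    def handed_off(self, job_id):
        """Record the moment the worker is handed a job."""
        self._started[job_id] = self._clock()

    def result_posted(self, job_id):
        """Record the moment the result is posted; return elapsed seconds."""
        elapsed = self._clock() - self._started.pop(job_id)
        self.durations.append(elapsed)
        return elapsed

    def mean(self):
        """Average duration over all completed jobs (0.0 if none yet)."""
        if not self.durations:
            return 0.0
        return sum(self.durations) / len(self.durations)
```

A mean that stays flat while overall check latency keeps growing would point away from the workers and toward the scheduler, which is the conclusion drawn here.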
> I am pretty sure all checks are getting executed (at least,
> all the ones that are enabled) eventually, just more and
> more slowly over time.
> Clearly, some checks are being delayed by something or even
> many things. What I'd like to do is to perhaps extend
> nagiostats to gather information about why latency is at the
> level it is, to see if we can't determine why Nagios is
> waiting to run these checks.
> What should we be looking at, either in the event loop or
> outside of it, to get a good overview of what Nagios is
> doing, how, and why?
> We are thinking of adding counters to the different events
> (both high and low priority) in an attempt to determine the
> source of the latency in detail. For example, if the average
> check latency is 100 seconds, we would like to be able to show
> that 30 seconds of that was spent doing notifications, 20
> seconds doing service reaping, etc. That way we can know where
> we need to make optimizations.
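The per-event accounting proposed above could be sketched roughly as follows, again in Python for brevity rather than Nagios's C. EventTimers, timed and report are hypothetical names, not existing Nagios functions. Note the assumption built into this approach: it measures how much wall-clock time the event loop spends inside each event type, which is only a proxy for how much each one contributes to check latency.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class EventTimers:
    """Accumulate wall-clock time and run counts per event type."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock                 # injectable clock (assumption)
        self.total = defaultdict(float)     # event type -> seconds spent
        self.count = defaultdict(int)       # event type -> times handled

    @contextmanager
    def timed(self, event_type):
        """Wrap the handling of one event and charge its duration."""
        start = self._clock()
        try:
            yield
        finally:
            # Charge the time even if the handler raised.
            self.total[event_type] += self._clock() - start
            self.count[event_type] += 1

    def report(self):
        """Return (event_type, seconds, runs) rows, most expensive first."""
        rows = [(e, self.total[e], self.count[e]) for e in self.total]
        return sorted(rows, key=lambda row: row[1], reverse=True)
```

In use, each branch of the event handler would be wrapped, e.g. `with timers.timed("service reaper"): ...`, and report() dumped periodically so the most expensive event types show up at the top.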
> I'm thinking that if we can instrument the following events
> we should have most of our bases covered (note some of these
> may already be instrumented)...
> log file rotations,
> external command checks,
> service reaper events,
> program shutdown,
> program restart,
> orphan check,
> retention save,
> status save,
> service result freshness,
> host result freshness,
> expired downtime check,
> check rescheduling,
> expired comment check,
> host check,
> service check
> Is there anything else that could or should be instrumented
> that could give us a good view into what Nagios is doing that's
> causing service checks to be executed further and further
> away from when they were scheduled?
> Is this list complete? Do these events make sense to
> instrument, and would they be useful in determining what is
> contributing to check latency?
> Thanks in advance!