[Nagios-devel] [PATCH] Re: alternative scheduler

Fredrik Thulin ft at it.su.se
Wed Dec 1 11:23:52 UTC 2010


On Wed, 2010-11-24 at 12:52 +0100, Andreas Ericsson wrote:
> On 11/23/2010 01:59 PM, Fredrik Thulin wrote:
> > In case that was too long for y'all, the short story is that I got lousy
> > performance from the service check scheduler in Nagios (3.2.0, sorry -
> > forgot to mention that).
> > 
> > I was able to write a brand new scheduler that works MUCH better - 1160
> > checks per minute, compared to ~60. Any plans to do something drastic
> > about the Nagios service check scheduler?
> > 
> 
> I have no idea what you did to Nagios to make it run only 60 checks per
> minute.

I figured it out by reading the scheduler source code, which showed me
what actually mattered in the debug logs.

I was running a distributed check slave without active host checks,
based on my reading of docs/3_0/distributed.html. Crazy, eh? ;)

Why was this a problem?

Host checks were still being scheduled, and every time a host check was
found at the front of event_list_low, Nagios would log "We're not
executing host checks right now, so we'll skip this event." and then
sleep for sleep_time seconds (!!!). My sleep_time was 0.25, taken from
the Ubuntu defaults.

I made the attached minimal patch, which skips the sleep if the next
event in the event list is already due.
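
To illustrate the idea, here is a simplified sketch. It is not the
attached patch and not the Nagios code: run_queue, loop_stats and
skip_sleep_when_due are names made up for this example, the clock is
frozen at "now", and the sleep is only counted instead of performed.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified event types; not the Nagios definitions. */
enum event_type { EV_SERVICE_CHECK, EV_HOST_CHECK };

struct event {
    enum event_type type;
    time_t run_time;            /* when the event becomes due */
};

struct loop_stats {
    int executed;               /* events that were actually run */
    int skipped;                /* host checks dropped because they are disabled */
    int sleeps;                 /* number of times the loop would have idled */
};

/*
 * Walk a time-sorted queue and process every event that is due at `now`.
 * skip_sleep_when_due == 0: idle after every skipped host check
 *                           (the behaviour described in this mail).
 * skip_sleep_when_due == 1: idle only if the next event is not yet due
 *                           (the idea behind the patch).
 */
struct loop_stats run_queue(const struct event *queue, size_t n, time_t now,
                            int execute_host_checks, int skip_sleep_when_due)
{
    struct loop_stats st = {0, 0, 0};

    for (size_t i = 0; i < n && queue[i].run_time <= now; i++) {
        if (queue[i].type == EV_HOST_CHECK && !execute_host_checks) {
            int next_is_due = (i + 1 < n) && (queue[i + 1].run_time <= now);

            st.skipped++;
            /* A real loop would nanosleep() for sleep_time here. */
            if (!(skip_sleep_when_due && next_is_due))
                st.sleeps++;
            continue;
        }
        st.executed++;
    }
    return st;
}
```

With a queue of host check, service check, host check (all due), the
first variant idles twice and the second idles once, only after the
last event, when nothing else is waiting.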

This fixed the near-total lack of performance in my installation, but
service reaping is still killing me slowly on my virtual development
server. The dedicated production server is actually fast enough to
execute my ~6k service checks AND endure the painful reaping pause every
5 minutes.

The scheduler really needs much more work (like sub-second precision for
when to start checks - that gave me roughly 25% additional performance
in my Erlang-based scheduler), but this at least makes Nagios usable
again for me. Also, why not start a separate thread to do the reaping?
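
As a rough sketch of what sub-second precision could look like (an
illustration only, not code from Nagios or from my Erlang scheduler;
delay_until is a made-up helper name): compute the exact time left
until the next event as a struct timespec and sleep for that long,
instead of sleeping for a fixed sleep_time.

```c
#include <time.h>

/*
 * Hypothetical helper: time remaining from `now` until `due`,
 * clamped at zero when the event is already due.
 */
struct timespec delay_until(struct timespec now, struct timespec due)
{
    struct timespec d = {0, 0};

    if (due.tv_sec < now.tv_sec ||
        (due.tv_sec == now.tv_sec && due.tv_nsec <= now.tv_nsec))
        return d;                       /* already due: do not sleep */

    d.tv_sec = due.tv_sec - now.tv_sec;
    d.tv_nsec = due.tv_nsec - now.tv_nsec;
    if (d.tv_nsec < 0) {                /* borrow one second */
        d.tv_sec -= 1;
        d.tv_nsec += 1000000000L;
    }
    return d;
}
```

The result can be passed straight to nanosleep(), with "now" taken
from clock_gettime().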

For more evidence, see the graphs at

http://people.su.se/~ft/test/mrtg_nagios-dev-srv2_2010-12-01/nagios-e.html

additional graphs at

http://people.su.se/~ft/test/mrtg_nagios-dev-srv2_2010-12-01/index.html

Yesterday between 10:00 and 12:30 I was running with
execute_host_checks=1, followed by some iterations of testing and
patching. From about 17:00 onwards I have been running a patched version
of Nagios (3.2.0) with execute_host_checks disabled again. Notice the
slow decrease in performance overnight - I'm pretty sure reaping and
similar overhead is what makes my latency creep upwards. At 10:00 today
Nagios was restarted again, resulting in a small increase.

/Fredrik

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Don-t-idle-when-there-is-more-work-to-do.patch
Type: text/x-patch
Size: 1362 bytes
Desc: not available
URL: <http://lists.nagios.com/pipermail/nagios-devel/attachments/20101201/a20d3370/attachment.bin>

