[Nagios-devel] [PATCH] Re: alternative scheduler

Andreas Ericsson ae at op5.se
Wed Dec 1 14:14:54 UTC 2010


On 12/01/2010 12:23 PM, Fredrik Thulin wrote:
> On Wed, 2010-11-24 at 12:52 +0100, Andreas Ericsson wrote:
>> On 11/23/2010 01:59 PM, Fredrik Thulin wrote:
>>> In case that was too long for y'all, the short story is that I got lousy
>>> performance from the service check scheduler in Nagios (3.2.0, sorry -
>>> forgot to mention that).
>>>
>>> I was able to write a brand new scheduler that works MUCH better - 1160
>>> checks per minute, compared to ~60. Any plans to do something drastic
>>> about the Nagios service check scheduler?
>>>
>>
>> I have no idea what you did to Nagios to make it run only 60 checks per
>> minute.
> 
> I figured it out, by looking at the scheduler source code to actually
> realize what was important in the debug logs.
> 
> I was running a distributed check slave without active host checks,
> based on my reading of docs/3_0/distributed.html. Crazy, eh? ;)
> 
> Why was this a problem?
> 
> Host checks were still being scheduled, and every time a host check was
> found at the front of event_list_low, Nagios would log "We're not
> executing host checks right now, so we'll skip this event." and then
> sleep for sleep_time seconds (0.25 was my setting, based on (Ubuntu)
> defaults) (!!!).
> 

This should only happen if you've set a check_interval for hosts but
have disabled them globally, either via nagios.cfg or via an external
command. It seems weird that we run usleep() instead of just issuing
a sched_yield() or something though, which would be a virtual noop
unless other processes are waiting to run.

> I made the attached minimalistic patch to not sleep if the next event in
> the event list is already due.
> 

Seems sensible, but I think it can be improved, such as issuing either
a sched_yield() or, if sched_yield() is not available, running usleep(10)
every 100 skipped items or so. That would avoid pinning the cpu but would
still be a lot faster than what we have today.
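
Roughly what I have in mind, as an untested sketch and not a patch against
the actual event loop. It reads the suggestion as "yield on every skipped
event, or take a short usleep() once per batch where sched_yield() is
missing". SKIP_BATCH, FALLBACK_USLEEP, skip_batch_full() and
relax_after_skip() are made-up names for illustration; nothing by those
names exists in Nagios.

```c
#include <unistd.h>   /* usleep(), _POSIX_PRIORITY_SCHEDULING */
#if defined(_POSIX_PRIORITY_SCHEDULING) && _POSIX_PRIORITY_SCHEDULING > 0
#include <sched.h>    /* sched_yield() */
#define HAVE_SCHED_YIELD 1
#endif

#define SKIP_BATCH 100       /* hypothetical: skipped events between pauses */
#define FALLBACK_USLEEP 10   /* microseconds, as suggested above */

/* Returns 1 when the skip counter has just completed a full batch. */
static int skip_batch_full(unsigned long skipped)
{
	return skipped > 0 && skipped % SKIP_BATCH == 0;
}

/*
 * Call once for every event that is skipped instead of executed.
 * Returns 1 if the call yielded or slept, 0 if it returned immediately.
 */
static int relax_after_skip(unsigned long *skipped)
{
	(*skipped)++;
#ifdef HAVE_SCHED_YIELD
	/* Close to a no-op unless another process is waiting to run. */
	sched_yield();
	return 1;
#else
	/* No sched_yield(): nap briefly, but only once per batch. */
	if (skip_batch_full(*skipped)) {
		usleep(FALLBACK_USLEEP);
		return 1;
	}
	return 0;
#endif
}
```

The caller would keep one counter for the life of the event loop and call
relax_after_skip(&counter) at the point where it currently sleeps for
sleep_time.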

> This removed the total lack of performance in my installation, but
> service reaping is still killing me slowly on my virtual development
> server.

How come?

> The dedicated production server is actually fast enough to
> execute my ~6k service checks AND endure the painful reaping pause every
> 5 minutes.
> 

Reap more often. We use every 2 seconds and have no problems with it.
Also make sure you're stashing the checkresult files on a RAM-disk
instead of on physical media. This is especially true if your system
is memory-starved (strange as that may sound), since the check result
files would otherwise be purged from the disk cache (as it uses LRU to
evict old-timers). Still though, reaping more frequently means the cache
would more often be hot and reaping will run a lot faster.
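
As a nagios.cfg fragment that would look something like the following.
The two directive names are the standard Nagios 3 ones; the path is only
an assumption about where a tmpfs happens to be mounted on your system,
so adjust it and make sure the directory exists and is writable by the
nagios user.

```
# nagios.cfg (fragment)
# Reap check results every 2 seconds.
check_result_reaper_frequency=2
# Assumed tmpfs location; any RAM-backed directory will do.
check_result_path=/dev/shm/nagios/checkresults
```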

> The scheduler really needs much more work (like sub-second precision for
> when to start checks - that gave me roughly 25% additional performance
> in my Erlang based scheduler),

That's not possible. With subsecond precision the program has to do
more work, not less. You're looking at the wrong bottleneck here and
you most certainly botched the implementation the first time around if
adding subsecond precision made such a large improvement for you.

> but this at least makes Nagios usable
> again for me. Also, why not fork a separate thread to do the reaping?
> 

It used to be done in a separate thread. I'm not sure why Ethan changed
it. A much better solution would be to spawn workers to handle the checks
and let the master process just sit and receive results and update status
files. That's quite an invasive change though, so it'll have to wait a
bit. Getting it done would make experiments like yours a lot easier, and
we'd open ourselves up to evolution such that a flawed
scheduler/checker/reaper would quickly be replaced by something that
works well, including all the corner-cases.

> For more evidence, see the graphs at
> 
> http://people.su.se/~ft/test/mrtg_nagios-dev-srv2_2010-12-01/nagios-e.html
> 
> additional graphs at
> 
> http://people.su.se/~ft/test/mrtg_nagios-dev-srv2_2010-12-01/index.html
> 

Ooh, pretty!

> Yesterday between 10:00 and 12:30 I was running with
> execute_host_checks=1, then some iterations of testing and patching, and
> from about 17:00 onwards I'm running a patched version of Nagios (3.2.0)
> with execute_host_checks disabled again. Notice the slow decrease in
> performance over night - I'm pretty sure it's about reaping etc. which
> is making my latency creep upwards. At 10:00 today Nagios was restarted
> again, resulting in a small increase.
> 

Try removing check_interval and retry_interval from your hosts instead,
and set should_be_scheduled=0 in your retention file before restarting.
execute_host_checks is about actually running the checks, whereas you
want to skip even scheduling them.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.



