[Nagios-devel] [PATCH] Re: alternative scheduler
ae at op5.se
Fri Dec 3 13:28:18 UTC 2010
On 12/03/2010 02:05 PM, Fredrik Thulin wrote:
> On Fri, 2010-12-03 at 12:55 +0100, Andreas Ericsson wrote:
>>> No, actually not. Erlang is a soft real time system. My approach was to
>>> ask the Erlang VM to send me a tick every N ms (N = 300s * 1000 / number
>>> of checks). So if N is 50, the VM will signal me once every 50 ms, very
>>> precisely and without any drift.
>> If N is constant, it can't be the lvalue of the above expression.
> I meant to say that N is calculated when the list of checks is
> (re)loaded. As I don't even try to have retry_intervals and such, a
> steady tick interval works great as long as I can finish initiating
> another service check in between ticks.
Ah, right. And initiating a check is quite cheap until they start
piling up when the network goes bad, which you sort of avoid by using
a constant stream of executing checks, so you always know there'll be
constant load on the system you're monitoring from. I'm wondering if
that doesn't sort of solve the problem in the wrong direction though,
since the monitoring system is supposed to serve the other systems and
endure the inconveniences it suffers itself as best it can. Sort of.
> Note that I say initiate, not complete - I have more cores that can
> finish the job of starting the check.
Yes. Adding a separate reaper thread to Nagios would be candidate number 1
when it comes to revamping the scheduler. Hitting the disk instead of
just adding the results to a constantly emptied queue is outright
insane. I expect it was done that way because the performance-data handling
wasn't thread-safe earlier. With my recent changes to remedy that, I see
no real reason to hit the disk once for each finished check anymore.
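To illustrate the idea of a constantly emptied queue drained by a reaper thread, here is a minimal sketch in Python. It is not how Nagios does it (Nagios is C), and every name in it (reap, run_demo, the tuple layout of a result) is made up for the example.

```python
import queue
import threading


def reap(results, handled):
    """Reaper thread body: drain results until a None sentinel arrives."""
    while True:
        item = results.get()
        if item is None:
            break
        # Stand-in for real result processing (state changes, perfdata, ...).
        handled.append(item)


def run_demo(num_results):
    """Push num_results fake check results through an in-memory queue."""
    results = queue.Queue()
    handled = []
    reaper = threading.Thread(target=reap, args=(results, handled))
    reaper.start()
    for i in range(num_results):
        # A finished check just enqueues its result; nothing touches the disk.
        results.put(("host%d" % i, "OK"))
    results.put(None)  # tell the reaper to stop
    reaper.join()
    return handled


if __name__ == "__main__":
    print(len(run_demo(1000)))
```

The point is only that the producer side is a cheap in-memory put, and the consumer can run on another core.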
>>> I then just had to finish starting another check command in =< 49 ms,
>>> and go back to sleep. All handling of check results is done completely
>>> asynchronous to this starting of new checks.
>>> This is all in src/npers_spawner.erl if anyone is interested in the
>> That's still "doing more than you did before", on a system level, so the
>> previous implementation must have been buggy somehow. Perhaps erlang
>> blocked a few signals when the signal handler was already running, or
>> perhaps you didn't start enough checks per tick?
> I agree it is more work for the scheduler, but that is better than
> having under-utilized additional CPUs/cores, right?
So long as the net effect is that you can run more checks with it, yes, but
an exponentially growing cost will always overtake a non-exponential one, so
with a large enough number of checks you'll run into the reverse situation,
where the scheduler eats so much overhead that you no longer have CPU power
left to actually run any checks.
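To get a feel for when tick overhead alone would eat a whole core, here is a back-of-the-envelope sketch in Python. It assumes the simplest possible model, a fixed cost per tick, and the cost figure is invented; it says nothing about how the real cost grows in npers or Nagios.

```python
def scheduler_load(num_checks, tick_cost_ms, check_interval_s=300):
    """Fraction of one core spent on ticks alone.

    Assumptions (hypothetical): one tick per check per interval, and every
    tick costs the same tick_cost_ms regardless of the number of checks.
    """
    interval_ms = check_interval_s * 1000 / num_checks
    return tick_cost_ms / interval_ms


if __name__ == "__main__":
    for checks in (6000, 60000, 300000):
        print(checks, scheduler_load(checks, tick_cost_ms=1.0))
```

Under those assumptions a load of 1.0 means the ticking core has no time left for anything but ticks.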
>> If the above expression was correct (N is not constant), this algorithm
>> makes the cost for running a single check exponential with the number of
>> checks to run. I.e., the more checks you have, the more expensive each check
>> will become. The curve will converge on (infinity - 1) faster with a larger
>> exponent. In this case, the exponent is ticks/sec, so reducing the ticktime
>> means you're effectively reducing performance unless there are other
>> factors involved that shave enough cycles to make this change disappear
>> in the noise.
> Sorry, you lost me here. Perhaps I just failed to explain what N was?
What I mean is that each tick costs a couple of CPU cycles. I don't know how
many, but that's irrelevant. With your algorithm the cost per check is higher
the more checks one has, which is a Bad Thing(tm) indeed, since it inevitably
means that the runtime cost per check converges on infinity.
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war