[Nagios-devel] [PATCH] Re: alternative scheduler
Jochen.Bern at LINworks.de
Thu Dec 2 11:36:13 UTC 2010
On 12/02/2010 10:46 AM, Andreas Ericsson wrote:
> On 12/02/2010 10:03 AM, Jochen Bern wrote:
>> Unless I *really* need new glasses, there's only three different kinds
>> of such rescheduling code in the 3.2.x Nagios core:
>> 1. Reschedule *exactly* check_interval / retry_interval from last due
>> time (iff check_period allows this) - e.g., base/checks.c::1301ff :
> This could trivially be changed by the simple expedient of scheduling the
> checks with a random component and offsetting the check backwards in time
> by half the random flex component.
(Which is what I've hacked into the core right now - as I mentioned, a
random offset of -7..0 seconds, typically every check_interval = 5
minutes, takes ~6h to undo the peak-building of the nightly logfile
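
A rough sketch of the two jitter variants discussed here, in Python rather than the core's C and not taken from the actual patch. The function names, parameters, and the `rng` argument are made up for illustration; only the -7..0 second range and the "offset backwards by half the flex" idea come from the thread.

```python
import random

def next_check_early_jitter(last_due, check_interval, max_early=7, rng=random):
    """Variant described in this mail: pull the next check up to
    max_early seconds ahead of the exact interval (offset -7..0)."""
    return last_due + check_interval - rng.randint(0, max_early)

def next_check_centered_flex(last_due, check_interval, flex=60, rng=random):
    """Variant suggested in the quoted text: add a random flex component,
    then shift backwards by half of it so the mean stays near the interval."""
    return last_due + check_interval + rng.randint(0, flex) - flex // 2

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    print(next_check_early_jitter(0, 300, 7, rng))
    print(next_check_centered_flex(0, 300, 60, rng))
```

All times are plain seconds; the first variant only ever shortens the interval, the second keeps it centered on check_interval.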
>> 2. Reschedule to the *very first second* permitted by check_period -
>> e.g., base/checks.c::278ff :
> Here we could do a similar tweak, adding a random number between 0 and 60
> to the scheduler. It wouldn't be perfect, but it would be better than the
> current scheme, and with a half-decent PRNG it would mean checks would
> stay smoothed out for the duration of Nagios' lifespan.
Where "smoothed out" is defined as "randomly distributed in the first
minute of a valid timeframe, spreading further due to check_interval
randomization for as long as the timeframe runs, and losing all the
latter randomization as they skip over the next *in*valid timeframe".
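
To make that definition concrete, here is a minimal Python sketch of the "first permitted second plus 0..60" tweak. It assumes a single daily window given in seconds since midnight; the helper names and the simplified time model are hypothetical and far cruder than Nagios' real timeperiod handling.

```python
import random

DAY = 24 * 60 * 60  # seconds per day

def next_valid_start(t, period_begin, period_end):
    """Return t if it lies inside the daily window [period_begin, period_end),
    otherwise the first second of the next window."""
    day_start = t - (t % DAY)
    second_of_day = t % DAY
    if period_begin <= second_of_day < period_end:
        return t
    if second_of_day < period_begin:
        return day_start + period_begin
    return day_start + DAY + period_begin

def reschedule_case2(t, period_begin, period_end, spread=60, rng=random):
    """Push a check that falls outside the window to the window start
    plus a random 0..spread seconds; leave in-window checks untouched."""
    valid = next_valid_start(t, period_begin, period_end)
    if valid == t:
        return t
    return valid + rng.randint(0, spread)

if __name__ == "__main__":
    begin, end = 2 * 3600, 4 * 3600  # window 02:00-04:00
    rng = random.Random(0)
    # Checks that were hours apart all land within the same 60 seconds.
    for t in (5 * 3600, 10 * 3600, 23 * 3600):
        print(reschedule_case2(t, begin, end, 60, rng) - (DAY + begin))
```

Whatever spread the checks had accumulated before the invalid timeframe, they all end up inside the same `spread`-second slot at the start of the next window.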
>> Case 2: *Increase* next_check so as to stay within the check_period, but
>> determining a max increment which simultaneously smoothes out the
>> (potentially MANY) affected checks and avoids pushing the chain of
>> subsequent processing (retry_interval / max_check_attempts if found
>> non-OK, running event handlers, ...) *beyond* the valid timeframe is
>> definitely nontrivial.)
> Not really.
Let me play devil's advocate for a second and sketch my (so far)
worst-case thought scenario:
1. A *very* expensive check which should be done only once per day
during a low-load period, as long as the result is OK.
--> check_period approximately == low-load period, check_interval larger
than the length of the check_period's "valid" timeframe.
2. In cases where the test returns non-OK, a certain (low) number of
rechecks shall be done to guard against secondary influences (say, temp
--> max_check_attempts and retry_interval such that their product is
still reasonably lower than the length of the "valid" timeframe.
3. As soon as the service turns HARD non-OK (rather random choice, the
formulae would change if we'd instead use the last SOFT non-OK result,
but the problem stays pretty much the same), an event handler triggers
some corrective action (try to fix the problem within the low-load
period). This action needs some time to complete - let's assume it
doesn't agree well with the retry_interval. Once it's completed, we want
a last-ditch check.
Since we already set "too high" a check_interval in step 1, we need the
event handler to trigger the action, make an educated guess whether it
might succeed, and if yes, schedule the last-ditch check through the
external command interface (to be executed X seconds later).
4. Now let's do the math: In order to make sure that the last-ditch
check will still fall into the check_period, and not taking any
retry_interval randomization into account, we need the *first* check to
get scheduled between period_begin and
    period_end - (max_check_attempts - 1) * retry_interval - X
               - [some time for event handler latency and execution]
where X is a substantial delay programmed into the event handler,
nowhere to be found in the data available to Nagios itself.
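
The arithmetic from step 4, as a small Python sketch. The function and parameter names are invented for illustration; `max_check_attempts` stands for the check count used in the formula above, and `handler_overhead` for the bracketed allowance for event handler latency and execution. The sample numbers are made up.

```python
def first_check_window(period_begin, period_end, max_check_attempts,
                       retry_interval, x_delay, handler_overhead):
    """Return (earliest, latest) start times for the first check such that
    the retries, the event handler, and the last-ditch check X seconds later
    still fit into the check_period; None if no such start time exists."""
    latest = (period_end
              - (max_check_attempts - 1) * retry_interval
              - x_delay
              - handler_overhead)
    if latest < period_begin:
        return None  # the chain cannot fit into the valid timeframe at all
    return (period_begin, latest)

if __name__ == "__main__":
    # Window 02:00-04:00, 3 attempts 10 min apart, X = 30 min, 60 s overhead.
    print(first_check_window(7200, 14400, 3, 600, 1800, 60))  # (7200, 11340)
```

Note that `x_delay` has to be passed in by hand, which is exactly the problem: Nagios has no way of knowing it.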
Jochen Bern, Systemingenieur --- LINworks GmbH <http://www.LINworks.de/>
Postfach 100121, 64201 Darmstadt | Robert-Koch-Str. 9, 64331 Weiterstadt
PGP (1024D/4096g) FP = D18B 41B1 16C0 11BA 7F8C DCF7 E1D5 FAF4 444E 1C27
Tel. +49 6151 9067-231, Zentr. -0, Fax -299 - Amtsg. Darmstadt HRB 85202
Unternehmenssitz Weiterstadt, Geschäftsführer Metin Dogan, Oliver Michel