[Nagios-devel] [PATCH] Re: alternative scheduler

Andreas Ericsson ae at op5.se
Thu Dec 2 13:10:58 UTC 2010


On 12/02/2010 12:36 PM, Jochen Bern wrote:
> On 12/02/2010 10:46 AM, Andreas Ericsson wrote:
>> On 12/02/2010 10:03 AM, Jochen Bern wrote:
>>> Unless I *really* need new glasses, there's only three different kinds
>>> of such rescheduling code in the 3.2.x Nagios core:
>>> 1. Reschedule *exactly* check_interval / retry_interval from last due
>>> time (iff check_period allows this) - e.g., base/checks.c::1301ff :
>> This could trivially be changed by the simple expedient of scheduling the
>> checks with a random component and offsetting the check backwards in time
>> by half the random flex component.
> 
> (Which is what I've hacked into the core right now - as I mentioned, a
> random offset of -7..0 seconds, typically every check_interval = 5
> minutes, takes ~6h to undo the peak-building of the nightly logfile
> rotation.)
> 

If you use -15..+15 seconds it will spread a lot faster.
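
A minimal sketch of such a symmetric offset, assuming a helper that does not exist in the Nagios source (`jittered_next_check` and its parameters are made up for illustration, and `rand()` stands in for whatever PRNG the core would really use):

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/*
 * Hypothetical helper, not taken from the Nagios source: return the next
 * check time as last_due + interval, shifted by a random offset drawn
 * from [-flex, +flex] seconds.
 */
static time_t jittered_next_check(time_t last_due, int interval, int flex)
{
    int offset;

    assert(interval > 0 && flex >= 0 && flex < interval);

    /* rand() % n has a slight modulo bias; good enough for spreading load. */
    offset = (rand() % (2 * flex + 1)) - flex;

    return last_due + interval + offset;
}
```

If the offsets are independent, the spread of a check's accumulated drift grows with the variance of a single offset: about 80 s^2 for -15..+15 versus about 5 s^2 for -7..0, so the wider range should need roughly fifteen times fewer check cycles to reach the same spread.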

>>> 2. Reschedule to the *very first second* permitted by check_period -
>>> e.g., base/checks.c::278ff :
>> Here we could do a similar tweak, adding a random number between 0 and 60
>> to the scheduler. It wouldn't be perfect, but it would be better than the
>> current scheme, and with a half-decent PRNG it would mean checks would
>> stay smoothed out for the duration of Nagios' lifespan.
> 
> Where "smoothed out" is defined as "randomly distributed in the first
> minute of a valid timeframe, spreading further due to check_interval
> randomization for as long as the timeframe runs, and losing all the
> latter randomization as they skip over the next *in*valid timeframe".
> 

The "losing all the randomization" can be avoided if the checks are
stepped forward by whatever recheck interval we're currently using
instead of being pinned to the first second of the next valid timeframe.
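
Roughly, as a sketch: keep adding the interval until the check lands inside a valid timeframe, so it keeps its offset modulo the interval. Every name below is hypothetical, and `in_nightly_window` is only a toy stand-in for a real check_period lookup (02:00 to 04:00 UTC, no timezone handling):

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical predicate type: nonzero if t lies inside the check_period. */
typedef int (*period_check_fn)(time_t t);

/* Toy example period: 02:00:00 to 03:59:59 UTC every day. */
static int in_nightly_window(time_t t)
{
    long long sec_of_day = (long long)t % 86400;

    return sec_of_day >= 7200 && sec_of_day < 14400;
}

/*
 * Advance next_check in whole multiples of interval until it falls inside
 * a valid timeframe. Returns (time_t)-1 if no valid time is found after
 * max_steps steps.
 */
static time_t step_into_period(time_t next_check, int interval,
                               period_check_fn in_period, int max_steps)
{
    int i;

    assert(interval > 0 && in_period != NULL && max_steps >= 0);

    for (i = 0; i <= max_steps; i++) {
        if (in_period(next_check))
            return next_check;
        next_check += interval;
    }
    return (time_t)-1;
}
```

Note that stepping like this can jump clean over a valid timeframe that is shorter than the interval, so that case would still need separate handling.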

>>> Case 2: *Increase* next_check so as to stay within the check_period, but
>>> determining a max increment which simultaneously smoothes out the
>>> (potentially MANY) affected checks and avoids pushing the chain of
>>> subsequent processing (retry_interval / max_check_attempts if found
>>> non-OK, running event handlers, ...) *beyond* the valid timeframe is
>>> definitely nontrivial.)
>> Not really.
> 
> Let me play devil's advocate for a second and sketch my (so far)
> worst-case thought scenario:
> 
> 1. A *very* expensive check which should be done only once per day
> during a low-load period, as long as the result is OK.
> -->  check_period approximately == low-load period, check_interval larger
> than the length of the check_period's "valid" timeframe.
> 
> 2. In cases where the test returns non-OK, a certain (low) number of
> rechecks shall be done to guard against secondary influences (say, temp
> LAN hiccups).
> -->  max_check_retries and retry_interval such that their product is
> still reasonably lower than the length of the "valid" timeframe.
> 
> 3. As soon as the service turns HARD non-OK (rather random choice, the
> formulae would change if we'd instead use the last SOFT non-OK result,
> but the problem stays pretty much the same), an event handler triggers
> some corrective action (try to fix the problem within the low-load
> period). This action needs some time to complete - let's assume it
> doesn't agree well with the retry_interval. Once it's completed, we want
> a last-ditch check.
> Since we already set "too high" a check_period in step 1, we need the
> event handler to trigger the action, make an educated guess whether it
> might succeed, and if yes, schedule the last-ditch check through the
> external command interface (to be executed X seconds later).
> 
> 4. Now let's do the math: In order to make sure that the last-ditch
> check will still fall into the check_period, and not taking any
> retry_interval randomization into account, we need the *first* check to
> get scheduled between period_begin and
> 	period_end - (max_check_retries-1)*retry_interval - X
> 		- [some time for event handler latency&exec]
> where X is a substantial delay programmed into the event handler,
> nowhere to be found in the data available to Nagios itself.
> 

Or, to get something up and going quickly, we can just inform users
that the period in which they want such very specialized checks to run
should be longer than the desired
check_interval + (retry_interval * (max_check_attempts + 1)).
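
The rule of thumb above, written out as a small sketch (both function names are made up; all values must be in the same time unit):

```c
#include <assert.h>

/*
 * Hypothetical helper: the lower bound from the rule of thumb,
 * check_interval + retry_interval * (max_check_attempts + 1).
 */
static long min_period_length(long check_interval, long retry_interval,
                              int max_check_attempts)
{
    assert(check_interval > 0 && retry_interval > 0 && max_check_attempts > 0);

    return check_interval + retry_interval * (max_check_attempts + 1);
}

/* Nonzero if the valid timeframe is strictly longer than that bound. */
static int period_is_long_enough(long period_length, long check_interval,
                                 long retry_interval, int max_check_attempts)
{
    return period_length > min_period_length(check_interval, retry_interval,
                                             max_check_attempts);
}
```

For example, check_interval = 5, retry_interval = 1 and max_check_attempts = 3 give a bound of 5 + 1 * 4 = 9 time units, so the valid timeframe would have to be longer than that.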

As an aside, the proper way to smooth out load would be to assign to
each check a "load-score", which gets sampled every so often together
with the checks that ran during the past sample frame. Each scheduling
queue slot should
get a pre-defined maximum load. This would let hundreds of low-load
checks run at the same time, while heavy checks would be run almost in
serial. The load-score should probably be set automatically and at
least resemble online_cpus * 2 or something.

The code to make that happen wouldn't be exactly trivial though, and
cheap checks that are run in parallel with heavy ones will get unfairly
penalized by this system. That shouldn't matter much though, as they'll
quickly be separated so that checks with a high load-score aren't run
at the same time, and then the values for the lower-load plugins will
auto-adjust over time.
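
A rough sketch of the per-slot budget part of that idea. It does not cover the sampling or the automatic adjustment of scores, and every name in it is hypothetical (nothing like this exists in Nagios today):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scheduling queue slot. */
struct slot {
    double load;     /* sum of the load-scores placed in this slot */
    unsigned count;  /* number of checks placed in this slot */
};

/*
 * Place a check with the given load-score into the first slot at or after
 * `start` that still has room under max_load. An empty slot always
 * accepts, so a check heavier than max_load still runs, just alone.
 * Returns the slot index, or -1 if no slot fits.
 */
static int place_check(struct slot *slots, size_t nslots, size_t start,
                       double score, double max_load)
{
    size_t i;

    assert(slots != NULL && score >= 0.0 && max_load > 0.0);

    for (i = start; i < nslots; i++) {
        if (slots[i].count == 0 || slots[i].load + score <= max_load) {
            slots[i].load += score;
            slots[i].count++;
            return (int)i;
        }
    }
    return -1;
}
```

With a budget of 4.0 per slot, many checks scoring 0.25 share one slot, while checks scoring 3.5 each end up in a slot of their own, which is the "low-load in parallel, heavy almost in serial" behaviour described above.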

Or we could let users flag certain commands as "heavy-load", so that
Nagios schedules only a few such checks to run in parallel.
check_esx3 comes to mind as a suitable candidate for such an option.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.



