[Nagios-devel] [PATCH] Re: alternative scheduler

Andreas Ericsson ae at op5.se
Thu Dec 2 09:46:49 UTC 2010

On 12/02/2010 10:03 AM, Jochen Bern wrote:
> On 12/01/2010 08:55 PM, Adam Augustine wrote:
>> While DNX and mod_gearman do implement that specific functionality,
>> they are still subject to the scheduler/reaper bottlenecks. We (the
>> institution that started the DNX project) have played around with the
>> check scheduling parameters quite a bit over the years and even with
>> our best scheduling parameters and DNX actually executing the plugins,
>> we still see checks scheduled such that we have a large number of
>> checks scheduled to execute in a single second with several seconds
>> (3-5) of nothing scheduled to execute between.
> Agreed. That's also the reason why I don't use either so far; I don't
> have a problem (yet ...) with the short-term scheduling (scheduling "due
> now" checks onto executors), but I see unnecessary churn in the mid-term
> scheduling (schedule next due time of checks just completed).
> Unless I *really* need new glasses, there's only three different kinds
> of such rescheduling code in the 3.2.x Nagios core:
> 1. Reschedule *exactly* check_interval / retry_interval from last due
> time (iff check_period allows this) - e.g., base/checks.c::1301ff :
>     if(reschedule_check==TRUE){
>        next_service_check=(time_t)(temp_service->last_check
>           +(temp_service->check_interval*interval_length));
>     }

This could trivially be changed by the simple expedient of scheduling the
checks with a random component and offsetting the check backwards in time
by half the random flex component. That shouldn't really be necessary
though. See below.
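
As a rough illustration of the random-flex idea, here is a minimal sketch.
The helper name jittered_next_check and its parameters are made up for this
example and are not part of the Nagios source; it assumes the flex window is
smaller than the check interval.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper, not Nagios code: nominal next check time plus a
 * random offset, shifted back by half the flex so the result lands in
 * [nominal - flex/2, nominal + flex - flex/2]. */
static time_t jittered_next_check(time_t last_check, int interval_secs,
                                  int flex_secs)
{
	assert(interval_secs > 0);
	assert(flex_secs >= 0 && flex_secs < interval_secs);

	/* rand() % n has a slight modulo bias, which is fine for a sketch */
	int offset = rand() % (flex_secs + 1);   /* 0 .. flex_secs */

	return last_check + interval_secs + offset - flex_secs / 2;
}
```

With interval_secs = 300 and flex_secs = 20, a check last run at time T is
rescheduled somewhere in T+290 .. T+310 instead of always at T+300.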

> 2. Reschedule to the *very first second* permitted by check_period -
> e.g., base/checks.c::278ff :
>     /* make sure we rescheduled the next service check at a valid time */
>     get_next_valid_time(preferred_time,
>        &next_valid_time,svc->check_period_ptr);
>     [...]
>        svc->next_check=next_valid_time;

Here we could do a similar tweak, adding a random number of seconds between
0 and 60 to the scheduled time. It wouldn't be perfect, but it would be
better than the current scheme, and with a half-decent PRNG it would mean
checks would stay smoothed out for the duration of Nagios' lifespan.
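
A forward-only variant of that tweak might look like the sketch below. The
name spread_after is hypothetical; the only point is that the offset is never
negative, so the check cannot land before the first second that check_period
allows.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper, not Nagios code: push a check 0..max_spread seconds
 * past the first valid second, never before it. */
static time_t spread_after(time_t next_valid_time, int max_spread)
{
	assert(max_spread >= 0);

	return next_valid_time + rand() % (max_spread + 1);
}
```

Calling spread_after(next_valid_time, 60) gives the 0 to 60 second spread
described above. Whether the result still falls inside check_period would
have to be checked separately.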

> 3. Special (error) cases falling back to some hardcoded "check interval"
> (five minutes, one week, ...).

These would benefit from just being rescheduled the normal way and pushed
forward by check_interval number of seconds each time they're supposed to

> Neither case even *looks* at the list of already-scheduled check
> executions around the target time, much less does any smoothing.
> (For sake of completeness: A smoothing algorithm IMHO should:
> Case 1: *Decrease* next_check for at most a certain percentage of
> check_interval/retry_interval, so as to avoid consecutive faults in
> freshness checks and performance data processing (in the case of RRDs,
> violation of xff);

Not percentage. A fixed time would be both easier to implement and also
give a lot better behaviour in that it would be a lot less surprising
to users.

> Case 2: *Increase* next_check so as to stay within the check_period, but
> determining a max increment which simultaneously smoothes out the
> (potentially MANY) affected checks and avoids pushing the chain of
> subsequent processing (retry_interval / max_check_attempts if found
> non-OK, running event handlers, ...) *beyond* the valid timeframe is
> definitely nontrivial.)

Not really. The simple way of doing it is like so:

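struct scheduled_thingie sched_queue[1024];

/* find the least loaded second in [scheduled_time, scheduled_time + flex) */
uint lowest = scheduled_time & 1023;
for (i = scheduled_time; i < scheduled_time + flex; i++) {
	if (sched_queue[i & 1023].scheduled_items <
	    sched_queue[lowest].scheduled_items)
		lowest = i & 1023;
}
if (sched_item->when > 1023) {
	/* too far ahead for the 1024-slot queue; handling not shown */
} else {
	sched_item->next = sched_queue[lowest].list;
	sched_queue[lowest].list = sched_item;
	sched_queue[lowest].scheduled_items++;
}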
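
For anyone who wants to compile the idea, here is a self-contained sketch of
the least-loaded-slot selection. The struct layouts and the names SLOTS,
pick_offset and enqueue are assumptions made for this example; only the
1024-entry ring indexed by (time & 1023) and the flex window come from the
snippet above.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define SLOTS 1024	/* power of two, so (t & (SLOTS - 1)) wraps around */

struct sched_item {
	time_t when;			/* second the item was placed on */
	struct sched_item *next;
};

struct sched_slot {
	struct sched_item *list;
	unsigned scheduled_items;	/* length of list */
};

/* Offset (0 .. flex-1) of the least loaded second in the flex window
 * starting at scheduled_time. Ties go to the earliest second. */
static unsigned pick_offset(const struct sched_slot *q, time_t scheduled_time,
                            unsigned flex)
{
	unsigned best = 0;

	assert(flex >= 1 && flex <= SLOTS);
	for (unsigned d = 1; d < flex; d++) {
		if (q[(scheduled_time + (time_t)d) & (SLOTS - 1)].scheduled_items <
		    q[(scheduled_time + (time_t)best) & (SLOTS - 1)].scheduled_items)
			best = d;
	}
	return best;
}

/* Push the item onto the chosen slot and record where it landed. */
static void enqueue(struct sched_slot *q, struct sched_item *it,
                    time_t scheduled_time, unsigned flex)
{
	struct sched_slot *slot;

	it->when = scheduled_time + (time_t)pick_offset(q, scheduled_time, flex);
	slot = &q[it->when & (SLOTS - 1)];
	it->next = slot->list;
	slot->list = it;
	slot->scheduled_items++;
}
```

Enqueueing five items that all ask for second 100 with flex = 4 places them
on seconds 100, 101, 102, 103 and then 100 again, which is the smoothing
effect being discussed.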

When running checks, one simply has to grab the items in
sched_queue[sched_last_when].list and run the events there until a
time is encountered that doesn't match time(NULL), and then we
increment sched_last_when and move on to the next slot in the queue.
If that one's empty, we sleep one second and try again, or perhaps
issue an extra reaping (although that would be better done in a
separate thread, as has already been mentioned).
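
The dispatch side could be sketched roughly as follows. All names here
(run_due, count_run, the struct fields) are hypothetical, and the current
time is passed in as a parameter rather than read from time(NULL) so the
function can be exercised on its own. One deliberate difference from the
description above: instead of stopping at the first item whose time does not
match, this version walks the whole slot and skips items that belong to a
later lap of the ring.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define SLOTS 1024	/* power of two, so (t & (SLOTS - 1)) wraps around */

struct sched_item {
	time_t when;				/* absolute due time */
	void (*run)(struct sched_item *);	/* what to do when due */
	struct sched_item *next;
};

struct sched_slot {
	struct sched_item *list;
	unsigned scheduled_items;
};

/* Example callback used in the usage note: counts invocations. */
static unsigned runs_seen;
static void count_run(struct sched_item *it) { (void)it; runs_seen++; }

/* Run every item due in seconds *last_when + 1 .. now and advance
 * *last_when to now. Returns the number of items run. */
static unsigned run_due(struct sched_slot *q, time_t *last_when, time_t now)
{
	unsigned ran = 0;

	while (*last_when < now) {
		time_t t = ++*last_when;
		struct sched_slot *slot = &q[t & (SLOTS - 1)];
		struct sched_item **pp = &slot->list;

		while (*pp) {
			struct sched_item *it = *pp;

			if (it->when > t) {	/* later lap, leave queued */
				pp = &it->next;
				continue;
			}
			*pp = it->next;		/* unlink, then run */
			slot->scheduled_items--;
			it->next = NULL;
			it->run(it);
			ran++;
		}
	}
	return ran;
}
```

A main loop would call run_due(sched_queue, &sched_last_when, time(NULL))
with count_run or a real check launcher as the callback, and sleep for a
second (or do some reaping) whenever it returns 0.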

Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
