[Nagios-devel] alternative scheduler

Andreas Ericsson ae at op5.se
Thu Dec 2 14:46:48 UTC 2010

On 12/02/2010 03:08 PM, Max Schubert wrote:
> The problem with any smoothing or readjustment of time intervals comes
> in when performance metrics are being collected along with state - not
> having a stable interval between checks throws off intervals between
> data points in metric databases.

Every scenario where Nagios dumps a bunch of checks in the same second
already has the problem of data not coming in at regular intervals.
When Nagios gets to schedule checks itself and spread them nicely over
the initially calculated checking window and then gets to keep on doing
the checks with the normal check_interval (or even the retry_interval),
checks are smoothed out nicely.
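
To illustrate the smoothing described above, here is a minimal sketch of
spreading initial checks evenly over a window and then repeating each one
at a fixed interval. It is not Nagios' actual scheduling code; the function
names (spread_checks, next_run) and the plain-seconds time representation
are assumptions made for this example only.

```python
def spread_checks(start, window, names):
    """Assign each check a first run time, evenly spaced over the window.

    Hypothetical helper for illustration; times are plain seconds.
    """
    if not names:
        return {}
    step = window / len(names)  # gap between consecutive first runs
    return {name: start + i * step for i, name in enumerate(names)}


def next_run(first_run, check_interval, n):
    """Time of the n-th repeat when the check keeps a fixed interval."""
    return first_run + n * check_interval


# Three checks spread over a 300-second window starting at t=0.
schedule = spread_checks(0, 300, ["ping", "disk", "load"])
# Each check then repeats every 300 seconds from its own offset, so the
# spacing between data points of any one check stays constant.
later = {name: next_run(t, 300, 2) for name, t in schedule.items()}
```

As long as every check keeps its own offset, the checks never bunch up in
the same second and each metric series gets evenly spaced data points.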

The real problem arises when a user schedules an immediate check of all
services on a host, hostgroup, servicegroup or whatever, or when a
timeframe in a timeperiod starts and the preceding inactive timeframe
was supposed to contain a bunch of checks.

> Some amount of jitter in intervals can be accounted for when
> inserting data points into metric databases with some fairly simple
> math (truncating intervals to nearest minute for example) but if
> intervals are not pretty accurate then using metrics over time for
> trending and comparison gets to be much trickier and requires a lot of
> mathematical adjustments on view if we are say looking at trend lines
> for 10 or 20 elements at once - this then scales very poorly when
> wanting to view hundreds or thousands of metric lines at once - even
> if they are aggregated first (which is usually done in some fashion
> with huge #s of metrics).
> We have mitigated this issue a bit by adding truncation code before
> inserting metrics into our long term trending data warehouse - that
> means that what goes in falls on even minute intervals, making
> graphing a cheap operation even over many data points.
> Our longer term resolution to this will be to decouple fault
> management tests from metrics collection as the metrics really make us
> have to watch service latency and intervals for snmp delta metric
> collection hard - it is a PITA.  We plan on having an agent on every
> system that focuses on streaming metrics to collectors, thereby
> freeing the polling based tests from having to be locked into very
> accurate check intervals.
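
The truncation mentioned in the quoted text can be sketched roughly as
follows. This is only an illustration of the idea, not the poster's actual
code: the function names and the (epoch_seconds, value) tuple layout are
assumptions, and keeping the last sample per minute is just one possible
policy for collisions.

```python
def truncate_to_minute(epoch_seconds):
    """Round a Unix timestamp down to the start of its minute."""
    whole = int(epoch_seconds)
    return whole - whole % 60


def align_points(points):
    """Snap (epoch_seconds, value) samples onto even minute boundaries.

    If two samples land in the same minute, the later one wins
    (an assumed policy; averaging would be another option).
    """
    aligned = {}
    for ts, value in sorted(points):
        aligned[truncate_to_minute(ts)] = value
    return sorted(aligned.items())


# Samples that arrived with a few seconds of jitter.
samples = [(125, 1.0), (183.7, 2.0), (241, 3.0), (250, 4.0)]
aligned = align_points(samples)
```

With every series on the same minute grid, comparing many metric lines
needs no per-view interpolation, at the cost of up to a minute of timing
error per data point.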

Personally, I was never very happy with Nagios gaining support for
performance data and handling it specially, for this precise reason.
It does make sense not to check a bunch of things multiple times
though, and unless you reuse the data sent by your agents in order
to do faultchecking you'll be doing the work twice. The problem is
that faultchecking should use the values as they are *now*, not as
they were a minute or so ago, so the two are indeed hard to combine.

I suppose everyone will have to choose for themselves between precise
faultchecking, easy graphing or increased load on remote systems and
personnel to get both of them. It boils down to the old "Fast, good,
cheap. Pick any two." mantra again, really.

Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
