[Nagios-users] Alerting based on past-to-current trends?
jim at jimavery.me.uk
Fri Dec 10 16:26:44 UTC 2010
On 6 December 2010 19:02, Ian Ehrenwald <iehrenwald at tripadvisor.com> wrote:
> I was wondering if there was a straight-forward way to alert based on an average of past data plus a current perfdata entry. I understand I'm not explaining it very well that way, so here is the real-world example I am working with -
> I am polling a set of machines via SNMP for CPU load every 1 minute (looking at hrProcessorLoad). If the return value is at or above 95%, send out a WARNING. If the return value is 98% or above, send out a CRITICAL. The problem here is that it's OK for a process to take up 100% CPU for multiple seconds, and sometimes that high CPU usage coincides with the SNMP %CPU query, so I get a lot of false alerts.
> Is there a way to use past perfdata in conjunction with the current returned data to generate an average and send a WARNING or CRITICAL based on that new number? I only care to get alerted from Nagios if, for example, the %CPU has been at 100% for 5 minutes. Or am I just way over-thinking this and should be monitoring 1m, 5m, 15m UNIX load averages (which doesn't seem that accurate anyway)? What are other people doing to monitor CPU usage and alert on abnormal long periods of utilization?
Nagios will alert as soon as the plugin returns a non-OK status. You
can of course configure max_check_attempts and/or
first_notification_delay so that Nagios won't send a notification
until after a given time, but this won't stop it from appearing on
the web page for problem services straight away.
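
As a rough sketch of the max_check_attempts approach, assuming a
1-minute check and retry interval, a service definition could look
something like the one below. The host name, template name and check
command are made up for illustration; only the directive names are
real Nagios ones.

```
define service {
    use                       generic-service  ; assumed template name
    host_name                 appserver01      ; hypothetical host
    service_description       CPU Load
    check_command             check_snmp_cpu   ; hypothetical command name
    check_interval            1   ; poll every minute while OK
    retry_interval            1   ; re-check every minute while in a soft state
    max_check_attempts        5   ; 5 consecutive non-OK results before a hard state
    first_notification_delay  0   ; no extra delay once the state is hard
}
```

With those values a notification should only go out after roughly
five minutes of consecutive non-OK results, because notifications are
sent on hard states. A single OK result in between resets the count.
The soft problem state will still show up in the web interface as soon
as the first check fails.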
It would be great if you could get Nagios to display only hard status
alerts - I don't think you can though, not with ordinary Nagios Core
anyway. Some of the third-party Nagios front ends will do it; for
example, you can configure the icons in NagVis to display only hard
states.