[Nagios-devel] freshness_threshold bug - big problem

Rodney Ramos rodneyra at gmail.com
Fri Dec 17 11:10:32 UTC 2010


Dear Jochen,

So, as I understand it, you confirm the problem with this configuration:
check_interval 15, retry_interval 2 and max_check_attempts 4.

And from your log we have:

18:39:55 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 12s
  (threshold=0d 0h 15m 16s). I'm forcing an immediate check of the host.
18:40:05 HOST ALERT: Unfresh;DOWN;SOFT;1;(null)

18:56:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 59s
  (threshold=0d 0h 15m 17s). I'm forcing an immediate check of the host.
18:56:23 HOST ALERT: Unfresh;DOWN;SOFT;2;(null)

--> That's wrong. It should be at about 18:42:05, 2 minutes after SOFT 1, as
your retry_interval is 2 minutes.

19:28:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
  (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
19:28:23 HOST ALERT: Unfresh;DOWN;SOFT;3;CRITICAL: All life functions
terminated

--> That's wrong. It should be at about 18:58:23, 2 minutes after SOFT 2, as
your retry_interval is 2 minutes.

19:44:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
  (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
19:44:23 HOST ALERT: Unfresh;DOWN;HARD;4;CRITICAL: All life functions
terminated

--> That's wrong. It should be at about 19:30:23, 2 minutes after SOFT 3, as
your retry_interval is 2 minutes.
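
To make the gap concrete, here is a rough sketch in Python that compares
each observed HOST ALERT time with the time I would expect if every retry
came retry_interval after the previous attempt. The timestamps are copied
from the excerpt above; the helper names (parse, retry_report) are only
illustrative and not part of Nagios.

```python
from datetime import datetime, timedelta

RETRY_INTERVAL = timedelta(minutes=2)  # retry_interval 2 from the test configuration

# HOST ALERT entries copied from the log excerpt above: (attempt, state type, time)
alerts = [
    (1, "SOFT", "18:40:05"),
    (2, "SOFT", "18:56:23"),
    (3, "SOFT", "19:28:23"),
    (4, "HARD", "19:44:23"),
]


def parse(hms):
    """Turn an HH:MM:SS log timestamp into a datetime (the date part is irrelevant here)."""
    return datetime.strptime(hms, "%H:%M:%S")


def retry_report(alerts, retry_interval):
    """Return (attempt, expected time, observed time, delay) for every attempt after the first."""
    rows = []
    for (_, _, previous), (attempt, _, seen) in zip(alerts, alerts[1:]):
        expected = parse(previous) + retry_interval
        delay = parse(seen) - expected
        rows.append((attempt, expected.strftime("%H:%M:%S"), seen, delay))
    return rows


for attempt, expected, observed, delay in retry_report(alerts, RETRY_INTERVAL):
    print(f"attempt {attempt}: expected ~{expected}, observed {observed}, late by {delay}")
```

With these numbers every retry arrives at least 14 minutes later than
expected. The 30-minute delay for attempt 3 is partly because my excerpt
skips the 19:12:23 alert in your full log below, which reported SOFT;2 a
second time.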

I'd like to know whether the Nagios Core developers are already aware of
this problem, and whether they intend to fix it in the next release or
provide a patch.
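
In case it helps whoever looks into this, here is a small Python sketch
that pulls the threshold out of each freshness warning quoted above and
compares it with both intervals. It assumes interval_length is left at 60
seconds (I have not checked your nagios.cfg), and the helper names are
just illustrative.

```python
import re

INTERVAL_LENGTH = 60                     # assumption: 60 seconds per interval unit
CHECK_INTERVAL = 15 * INTERVAL_LENGTH    # check_interval 15
RETRY_INTERVAL = 2 * INTERVAL_LENGTH     # retry_interval 2

# Freshness warnings copied from the log excerpt above (wrapped lines rejoined).
warnings = [
    "18:39:55 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 12s (threshold=0d 0h 15m 16s).",
    "18:56:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 59s (threshold=0d 0h 15m 17s).",
    "19:28:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s (threshold=0d 0h 15m 18s).",
    "19:44:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s (threshold=0d 0h 15m 18s).",
]

DURATION = re.compile(r"(\d+)d (\d+)h (\d+)m (\d+)s")


def to_seconds(text):
    """Convert the first 'Nd Nh Nm Ns' duration found in text to seconds."""
    days, hours, minutes, seconds = map(int, DURATION.search(text).groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def thresholds(lines):
    """Return the threshold reported by each warning line, in seconds."""
    return [to_seconds(line.split("threshold=", 1)[1]) for line in lines]


for value in thresholds(warnings):
    print(f"threshold {value}s: check_interval + {value - CHECK_INTERVAL}s, "
          f"retry_interval + {value - RETRY_INTERVAL}s")
```

All four thresholds come out a few seconds above check_interval and far
above retry_interval, even while the host is in a SOFT state. That would
be consistent with the threshold being derived from check_interval only,
but it is just my reading of the log, not something I have confirmed in
the source.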

Thanks,
Rodney


On Thu, Dec 16, 2010 at 6:59 PM, Jochen Bern <Jochen.Bern at linworks.de> wrote:

> On 12/16/2010 12:03 PM, Rodney Ramos wrote:
> > As I've said before, I think that this is a Nagios Core bug. I've tested it
> > with Nagios 3.2.1 and I found the same problem.
> > I think it's a serious problem.
>
>
> Oh, wow. 8-O I can confirm the effect on my 3.2.3, but there seems to be
> *more* of a problem with host freshness checks. Test run with
> check_interval 15, retry_interval 2, max_check_attempts 4; log excerpt:
>
>
> 18:23:55 Warning: Host 'Unfresh' has no services associated with it!
> 18:24:28 EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;Unfresh;0;Manual
> Init to UP|
> 18:24:35 PASSIVE HOST CHECK: Unfresh;0;Manual Init to UP
>
> 18:39:55 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 12s
>   (threshold=0d 0h 15m 16s). I'm forcing an immediate check of the host.
> 18:40:05 HOST ALERT: Unfresh;DOWN;SOFT;1;(null)
>
> 18:51:12 Warning: Host 'Unfresh' has no services associated with it!
>
> 18:56:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 59s
>   (threshold=0d 0h 15m 17s). I'm forcing an immediate check of the host.
> 18:56:23 HOST ALERT: Unfresh;DOWN;SOFT;2;(null)
> 19:00:12 Warning: Host 'Unfresh' has no services associated with it!
> 19:12:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 45s
>   (threshold=0d 0h 15m 15s). I'm forcing an immediate check of the host.
> 19:12:23 HOST ALERT: Unfresh;DOWN;SOFT;2;CRITICAL: All life functions
> terminated
> 19:28:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
>   (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
> 19:28:23 HOST ALERT: Unfresh;DOWN;SOFT;3;CRITICAL: All life functions
> terminated
> 19:44:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
>   (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
> 19:44:23 HOST ALERT: Unfresh;DOWN;HARD;4;CRITICAL: All life functions
> terminated
> 20:00:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
>   (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
> 20:16:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 41s
>   (threshold=0d 0h 15m 17s). I'm forcing an immediate check of the host.
> 20:32:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
>   (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
> 20:48:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 45s
>   (threshold=0d 0h 15m 15s). I'm forcing an immediate check of the host.
> 21:04:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 45s
>   (threshold=0d 0h 15m 15s). I'm forcing an immediate check of the host.
>
>
> (The additional "no services" crud stems from my not getting the check
> command right the first time 'round, and having to re-reload the config.)
>
>
> I took excerpts of status.dat and retention.dat initially and after the
> first nine active checks, look at these current_attempt numbers:
>
>
> # for FIL in *.dat* ; do echo -n "${FIL}:  " | \
> > sed -e 's/_[a-z]*-/-/' -e 's/\.[a-z]*: */:/' ; \
> > egrep '(current_attempt|state_type|(current|last_hard)_state=)' \
> > $FIL | sed -e 's/\([a-z][a-z][a-z]\)[a-z]*\([_=]\)/\1\2/g' | \
> > tr '\n\t' '  ' ; echo "" ; done
> retention.dat-OK:       cur_sta=0 las_har_sta=0 cur_att=1 sta_typ=1
> retention.dat-1:        cur_sta=0 las_har_sta=0 cur_att=1 sta_typ=1
> retention.dat-2:        cur_sta=1 las_har_sta=0 cur_att=1 sta_typ=0
> retention.dat-3:        cur_sta=1 las_har_sta=0 cur_att=2 sta_typ=0
> retention.dat-4:        cur_sta=1 las_har_sta=0 cur_att=2 sta_typ=0
> retention.dat-5:        cur_sta=1 las_har_sta=0 cur_att=2 sta_typ=0
> retention.dat-6:        cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
> retention.dat-7:        cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
> retention.dat-8:        cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
> retention.dat-9:        cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
> status.dat-OK:   cur_sta=0  las_har_sta=0  cur_att=1  sta_typ=1
> status.dat-1:    cur_sta=1  las_har_sta=0  cur_att=1  sta_typ=0
> status.dat-2:    cur_sta=1  las_har_sta=0  cur_att=2  sta_typ=0
> status.dat-3:    cur_sta=1  las_har_sta=0  cur_att=2  sta_typ=0
> status.dat-4:    cur_sta=1  las_har_sta=0  cur_att=3  sta_typ=0
> status.dat-5:    cur_sta=1  las_har_sta=0  cur_att=4  sta_typ=1
> status.dat-6:    cur_sta=1  las_har_sta=1  cur_att=1  sta_typ=1
> status.dat-7:    cur_sta=1  las_har_sta=1  cur_att=1  sta_typ=1
> status.dat-8:    cur_sta=1  las_har_sta=1  cur_att=1  sta_typ=1
> status.dat-9:    cur_sta=1  las_har_sta=1  cur_att=1  sta_typ=1
>
>
> extinfo.cgi told me "1/4 (SOFT state)" at 19:03 (after the *2nd* active
> check, i.e., matching the data in retention.dat) but tells me "1/4 (HARD
> state)" right now (matching status.dat instead) ...
>
>
> Kind regards,
>                                                                J. Bern
> --
> Jochen Bern, Systemingenieur --- LINworks GmbH <http://www.LINworks.de/>
> Postfach 100121, 64201 Darmstadt | Robert-Koch-Str. 9, 64331 Weiterstadt
> PGP (1024D/4096g) FP = D18B 41B1 16C0 11BA 7F8C DCF7 E1D5 FAF4 444E 1C27
> Tel. +49 6151 9067-231, Zentr. -0, Fax -299 - Amtsg. Darmstadt HRB 85202
> Unternehmenssitz Weiterstadt, Geschäftsführer Metin Dogan, Oliver Michel
>