[Nagios-devel] freshness_threshold bug - big problem

Jochen Bern Jochen.Bern at LINworks.de
Thu Dec 16 20:59:40 UTC 2010


On 12/16/2010 12:03 PM, Rodney Ramos wrote:
> As I've said before I think that it is a Nagios Core bug. I've tested it
> with Nagios 3.2.1 and I found the same problem.
> I think it's a serious problem.


Oh, wow. 8-O I can confirm the effect on my 3.2.3, but there seems to be
*more* of a problem with host freshness checks. Test run with
check_interval 15, retry_interval 2, max_check_attempts 4; log excerpt:


18:23:55 Warning: Host 'Unfresh' has no services associated with it!
18:24:28 EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;Unfresh;0;Manual Init to UP|
18:24:35 PASSIVE HOST CHECK: Unfresh;0;Manual Init to UP

18:39:55 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 12s
   (threshold=0d 0h 15m 16s). I'm forcing an immediate check of the host.
18:40:05 HOST ALERT: Unfresh;DOWN;SOFT;1;(null)

18:51:12 Warning: Host 'Unfresh' has no services associated with it!

18:56:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 59s
   (threshold=0d 0h 15m 17s). I'm forcing an immediate check of the host.
18:56:23 HOST ALERT: Unfresh;DOWN;SOFT;2;(null)
19:00:12 Warning: Host 'Unfresh' has no services associated with it!
19:12:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 45s
   (threshold=0d 0h 15m 15s). I'm forcing an immediate check of the host.
19:12:23 HOST ALERT: Unfresh;DOWN;SOFT;2;CRITICAL: All life functions terminated
19:28:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
   (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
19:28:23 HOST ALERT: Unfresh;DOWN;SOFT;3;CRITICAL: All life functions terminated
19:44:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
   (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
19:44:23 HOST ALERT: Unfresh;DOWN;HARD;4;CRITICAL: All life functions terminated
20:00:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
   (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
20:16:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 41s
   (threshold=0d 0h 15m 17s). I'm forcing an immediate check of the host.
20:32:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
   (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
20:48:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 45s
   (threshold=0d 0h 15m 15s). I'm forcing an immediate check of the host.
21:04:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 45s
   (threshold=0d 0h 15m 15s). I'm forcing an immediate check of the host.


(The additional "no services" crud stems from my not getting the check
command right the first time 'round, and having to re-reload the config.)
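
For reference, a minimal sketch of how the state type and attempt counter
could be pulled out of such an excerpt. It assumes the abbreviated
"HH:MM:SS message" layout shown above rather than the raw "[epoch] message"
lines of nagios.log, and the function name alert_attempts is made up for
illustration.

```shell
# Print "time state_type attempt" for every HOST ALERT line of one host.
# Assumption: input lines look like "HH:MM:SS HOST ALERT: host;state;type;n;output".
alert_attempts() {
    awk -v host="$1" '
        $2 == "HOST" && $3 == "ALERT:" {
            line = $0
            # Drop the timestamp and the "HOST ALERT: " tag, keep the ;-list.
            sub(/^[^ ]+ HOST ALERT: /, "", line)
            n = split(line, f, ";")
            # f[1]=host, f[2]=state, f[3]=SOFT/HARD, f[4]=attempt number
            if (n >= 4 && f[1] == host) print $1, f[3], f[4]
        }'
}

# Usage (file name is hypothetical): alert_attempts Unfresh < excerpt.log
```

Fed the excerpt above, this lists attempts 1, 2, 2, 3, 4 at 18:40:05,
18:56:23, 19:12:23, 19:28:23 and 19:44:23, i.e. attempt 2 is reported twice
and consecutive attempts are roughly 16 minutes apart.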


I took excerpts of status.dat and retention.dat initially and after each of
the first nine active checks. Look at these current_attempt numbers:


# for FIL in *.dat* ; do echo -n "${FIL}:  " | \
> sed -e 's/_[a-z]*-/-/' -e 's/\.[a-z]*: */:/' ; \
> egrep '(current_attempt|state_type|(current|last_hard)_state=)' \
> $FIL | sed -e 's/\([a-z][a-z][a-z]\)[a-z]*\([_=]\)/\1\2/g' | \
> tr '\n\t' '  ' ; echo "" ; done
retention.dat-OK:       cur_sta=0 las_har_sta=0 cur_att=1 sta_typ=1
retention.dat-1:        cur_sta=0 las_har_sta=0 cur_att=1 sta_typ=1
retention.dat-2:        cur_sta=1 las_har_sta=0 cur_att=1 sta_typ=0
retention.dat-3:        cur_sta=1 las_har_sta=0 cur_att=2 sta_typ=0
retention.dat-4:        cur_sta=1 las_har_sta=0 cur_att=2 sta_typ=0
retention.dat-5:        cur_sta=1 las_har_sta=0 cur_att=2 sta_typ=0
retention.dat-6:        cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
retention.dat-7:        cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
retention.dat-8:        cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
retention.dat-9:        cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
status.dat-OK:   cur_sta=0  las_har_sta=0  cur_att=1  sta_typ=1
status.dat-1:    cur_sta=1  las_har_sta=0  cur_att=1  sta_typ=0
status.dat-2:    cur_sta=1  las_har_sta=0  cur_att=2  sta_typ=0
status.dat-3:    cur_sta=1  las_har_sta=0  cur_att=2  sta_typ=0
status.dat-4:    cur_sta=1  las_har_sta=0  cur_att=3  sta_typ=0
status.dat-5:    cur_sta=1  las_har_sta=0  cur_att=4  sta_typ=1
status.dat-6:    cur_sta=1  las_har_sta=1  cur_att=1  sta_typ=1
status.dat-7:    cur_sta=1  las_har_sta=1  cur_att=1  sta_typ=1
status.dat-8:    cur_sta=1  las_har_sta=1  cur_att=1  sta_typ=1
status.dat-9:    cur_sta=1  las_har_sta=1  cur_att=1  sta_typ=1


extinfo.cgi told me "1/4 (SOFT state)" at 19:03 (after the *2nd* active
check, i.e., matching the data in retention.dat) but tells me "1/4 (HARD
state)" right now (matching status.dat instead) ...
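
A minimal sketch for pulling the same four fields for a single host out of
one snapshot file, without abbreviating them. It assumes the usual block
layout of these files ("hoststatus {" blocks in status.dat, "host {" blocks
in retention.dat, each starting with a host_name= line); the function name
host_attempt_fields is made up for illustration.

```shell
# Print current_state, last_hard_state, current_attempt and state_type
# of one host from a status.dat or retention.dat snapshot, on one line.
# Assumption: host_name= comes before the other fields inside each block.
host_attempt_fields() {
    awk -v host="$1" '
        # Start of a host block; service and comment blocks do not match.
        /^[ \t]*(host|hoststatus)[ \t]*[{]/ { in_block = 1; wanted = 0; next }
        /^[ \t]*[}]/                         { in_block = 0; next }
        in_block && /^[ \t]*host_name=/ {
            sub(/^[ \t]*host_name=/, "")
            wanted = ($0 == host)
            next
        }
        in_block && wanted && /^[ \t]*(current_state|last_hard_state|current_attempt|state_type)=/ {
            sub(/^[ \t]*/, "")
            printf "%s%s", sep, $0
            sep = " "
        }
        END { print "" }
    ' "$2"
}

# Usage: host_attempt_fields Unfresh status.dat
```

Running it over matching status.dat and retention.dat snapshots side by side
makes disagreements in current_attempt and state_type easy to spot.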


Kind regards,
								J. Bern
-- 
Jochen Bern, Systemingenieur --- LINworks GmbH <http://www.LINworks.de/>
Postfach 100121, 64201 Darmstadt | Robert-Koch-Str. 9, 64331 Weiterstadt
PGP (1024D/4096g) FP = D18B 41B1 16C0 11BA 7F8C DCF7 E1D5 FAF4 444E 1C27
Tel. +49 6151 9067-231, Zentr. -0, Fax -299 - Amtsg. Darmstadt HRB 85202
Unternehmenssitz Weiterstadt, Geschäftsführer Metin Dogan, Oliver Michel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Unfresh.tgz
Type: application/x-compressed-tar
Size: 20593 bytes
Desc: not available
URL: <http://lists.nagios.com/pipermail/nagios-devel/attachments/20101216/786f493d/attachment.bin>