[Nagios-devel] [Nagios-users] external commands and segfault -- again

Andreas Ericsson ae at op5.se
Mon Jan 8 17:40:20 UTC 2007


bobi at netshel.net wrote:
> Hey Fellow Nagios-ites:
> 
> I've been having this *exact* same segfault problem for the last couple o'
> months.
> 
> And, after looking at David's stack trace output, it is segfaulting for
> him in the exact same way/place as it is for me.
> 
> Here's what I've found:
> 
> The core dumps that I've examined are all segfaulting when handling the
> expiration of a scheduled downtime.
> 
> Since David's stack trace looks identical to mine, I don't think it is in
> the external command processing, as he believes, but it is in the downtime
> expiration handling, as well.
> 
> Having examined about a dozen of these identical core dumps, I see that it
> is a corruption of the entire scheduled_downtime structure that is being
> passed into the handle_scheduled_downtime() function.
> 
> The handle_scheduled_downtime() function is being invoked by the high
> priority event processing logic in the event_execution_loop().  So it
> pulls an EVENT_SCHEDULED_DOWNTIME timed_event structure off of the high
> priority event list, and then hands it to handle_timed_event(), which in
> turn invokes the handle_scheduled_downtime() routine to handle the
> expiration of the specified downtime event.
> 
> The problem is, the scheduled_downtime structure is already corrupted
> while sitting in the high_priority list - well before it is dequeued by
> the event_execution_loop() logic.
> 
> I've walked the high priority list in memory with gdb to examine other
> timed_event structures and have noticed that only the scheduled_downtime
> structures associated with EVENT_SCHEDULED_DOWNTIME timed events are
> affected by the memory corruption.  In fact, one time, I found nine
> scheduled downtime expiration events sequentially listed in the high
> priority list and the first three had their scheduled_downtime structures
> corrupted and the remaining six were in pristine condition.
> 
> 
> So, I've narrowed it down to a couple of possibilities (feel free to add
> your own!):
> 
> 1. The scheduled_downtime structure is already corrupted when it is being
> added to the high priority timed event scheduling list, or
> 
> 
> 2. The scheduled_downtime structure is OK when it is added to the high
> priority list, but perhaps a bad pointer access is overwriting it with
> garbage at some other point in the program.  This might be somewhat
> painful to track down.
> 
> 
> Of the two, I suspect that the second one is the more likely candidate.
> 

I think the first, as it only happens with scheduled downtime stuff. 
Otherwise you'd see it on other high-prio events as well (unless you're 
extremely unlucky each time the crash happens).
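
One way to tell the two cases apart is to validate the structure at the
moment it is queued and again when it is dequeued. A minimal sketch,
assuming a made-up guarded_downtime struct with a guard word at each end
(this is not the real Nagios scheduled_downtime layout, and GUARD_MAGIC,
guard_init(), guard_ok() and check_at_enqueue() are names invented here
for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUARD_MAGIC 0xD0C0FFEEu  /* arbitrary marker value, an assumption */

/* Hypothetical stand-in for a downtime record; NOT the real Nagios struct. */
struct guarded_downtime {
    uint32_t head_guard;        /* must stay equal to GUARD_MAGIC */
    unsigned long downtime_id;
    long start_time;
    long end_time;
    uint32_t tail_guard;        /* must stay equal to GUARD_MAGIC */
};

/* Fill in a record and arm both guard words. */
static void guard_init(struct guarded_downtime *d, unsigned long id,
                       long start, long end)
{
    d->head_guard = GUARD_MAGIC;
    d->downtime_id = id;
    d->start_time = start;
    d->end_time = end;
    d->tail_guard = GUARD_MAGIC;
}

/* Returns 1 if both guards are intact and the times are plausible, else 0. */
static int guard_ok(const struct guarded_downtime *d)
{
    return d != NULL
        && d->head_guard == GUARD_MAGIC
        && d->tail_guard == GUARD_MAGIC
        && d->end_time >= d->start_time;
}

/* Call right before the record is handed to the event queue. */
static void check_at_enqueue(const struct guarded_downtime *d)
{
    assert(guard_ok(d));
}
```

If check_at_enqueue() ever fires, the data was bad before it reached the
list (case 1). If only a guard_ok() call on the dequeue side fails,
something scribbled on it afterwards (case 2).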

> 
> Some other notes:
> 
> 1. The timed event expirations that segfault Nagios seem to be "randomly"
> chosen.
> 
> We have some regularly submitted (via cron) scheduled downtimes that will
> work fine for weeks, and then one of them will come up for expiration and
> trigger this scheduled-downtime-expiration bug.  I've also seen it happen
> with ad-hoc scheduled downtime submissions via the CGI interface.
> 
> I've seen it happen with "regular" scheduled downtimes as well as the new
> "triggered" scheduled downtime.  We thought it might have been related to
> the new triggered downtime, since that was one of the first events causing
> a segfault.  But then after eliminating the use of triggered downtimes
> altogether, the segfaults still occur with the regular scheduled downtime
> expirations.
> 
> 2. I've had this problem with Nagios 2.4, 2.5 and 2.6.  So, "upgrading"
> hasn't gotten rid of it.
> 
> 3. We are currently running Nagios 2.6 on a 64-bit Linux platform: SLES-9
> x86-64, Kernel 2.6.5-7.267-smp
> 

This is the culprit, I guess. As this isn't a widespread problem, I 
wouldn't be surprised if it's related to 64-bit archs (kernel-2.6.5 is 
fairly ancient too, but that shouldn't matter as this is the only app 
you're seeing it in).

I'm guessing this actually is an SMP-system and that SuSE doesn't 
install SMP kernels on all systems, correct? If so, this could also be a 
source of problems for you. Nagios doesn't follow the pthread guidelines 
very closely and does some pretty inappropriate things post-fork() for 
being a threaded application. This could be one of those problems that 
doesn't happen on single-cpu systems because the only cpu doesn't have 
anything to compete with when racing for the memory.
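
I haven't pinned down which post-fork() behaviour would be involved, so
the following is only a generic illustration of one of those guidelines:
a lock that other threads may hold should be in a known state on both
sides of fork(). A minimal sketch using pthread_atfork(), where
state_lock and all the function names are invented for the example and
do not exist in Nagios:

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical lock guarding some shared state; not a Nagios symbol. */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

/* Take the lock before fork() so no other thread holds it mid-update. */
static void before_fork(void)       { pthread_mutex_lock(&state_lock); }
/* Release it again on both sides once fork() has returned. */
static void after_fork_parent(void) { pthread_mutex_unlock(&state_lock); }
static void after_fork_child(void)  { pthread_mutex_unlock(&state_lock); }

/* Register the handlers once at startup. Returns 0 on success. */
static int install_fork_handlers(void)
{
    return pthread_atfork(before_fork, after_fork_parent, after_fork_child);
}

/*
 * Fork, and let the child report through its exit status whether it
 * could acquire the lock. Returns 0 if it could, 1 if not, -1 on error.
 */
static int child_can_lock_after_fork(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        int rc = pthread_mutex_trylock(&state_lock);
        _exit(rc == 0 ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}
```

Without such handlers, a child forked while another thread holds the
lock inherits it in the locked state with no thread left to release it.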


> 4. We don't have any other segfault problems with other apps on this
> system.
> 
> 
> So I'm still trying to figure out *what* is overwriting the
> scheduled_downtime structures with garbage in memory.
> 
> Any ideas, based upon this additional information?
> 

Upgrade glibc and the kernel and pray. Other than that, I guess running 
it in valgrind and/or gdb for a long period of time, or chucking 
assert()s and printf()s at the Nagios code and seeing where it breaks, 
are the only options.
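
As a rough idea of what I mean by assert()s and printf()s: an audit
routine that walks the event list and can be called from several places,
so the first call that complains brackets where the damage happens. This
is a sketch only. demo_event is a simplified stand-in rather than the
real timed_event, DEMO_EVENT_TYPE_MAX is an assumed bound, and I'm
assuming the list is kept sorted by run time:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified event node; the real timed_event differs. */
struct demo_event {
    int event_type;
    long run_time;
    struct demo_event *next;
};

#define DEMO_EVENT_TYPE_MAX 99  /* assumed upper bound on valid types */

/*
 * Walk the list and return the number of implausible nodes, logging
 * each one together with the caller-supplied location tag.
 */
static int audit_event_list(const struct demo_event *head, const char *where)
{
    int bad = 0;
    long prev = 0;  /* run time of the last node that looked sane */

    for (const struct demo_event *e = head; e != NULL; e = e->next) {
        if (e->event_type < 0 || e->event_type > DEMO_EVENT_TYPE_MAX ||
            e->run_time < prev) {
            fprintf(stderr,
                    "[audit %s] suspicious event at %p (type=%d run_time=%ld)\n",
                    where, (const void *)e, e->event_type, e->run_time);
            bad++;
        } else {
            prev = e->run_time;
        }
    }
    return bad;
}

/* Abort (and leave a core dump) as soon as the list looks wrong. */
#define AUDIT_OR_DIE(head) assert(audit_event_list((head), __func__) == 0)
```

Sprinkling AUDIT_OR_DIE() before and after suspect calls should make the
core dump land much closer to whatever does the overwriting.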


btw, thanks for the nicely detailed problem report.


-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231



