[Nagios-devel] [Nagios-users] external commands and segfault -- again

Andreas Ericsson ae at op5.se
Mon Jan 8 17:40:20 UTC 2007

bobi at netshel.net wrote:
> Hey Fellow Nagios-ites:
> I've been having this *exact* same segfault problem for the last couple o'
> months.
> And, after looking at David's stack trace output, it is segfaulting for
> him in the exact same way/place as it is for me.
> Here's what I've found:
> The core dumps that I've examined are all segfaulting when handling the
> expiration of a scheduled downtime.
> Since David's stack trace looks identical to mine, I don't think it is in
> the external command processing, as he believes, but it is in the downtime
> expiration handling, as well.
> Having examined about a dozen of these identical core dumps, I see that it
> is a corruption of the entire scheduled_downtime structure that is being
> passed into the handle_scheduled_downtime() function.
> The handle_scheduled_downtime() function is being invoked by the high
> priority event processing logic in the event_execution_loop().  So it
> pulls an EVENT_SCHEDULED_DOWNTIME timed_event structure off of the high
> priority event list, and then hands it to handle_timed_event(), which in
> turn invokes the handle_scheduled_downtime() routine to handle the
> expiration of the specified downtime event.
> The problem is, the scheduled_downtime structure is already corrupted
> while sitting in the high_priority list - well before it is dequeued by
> the event_execution_loop() logic.
> I've walked the high priority list in memory with gdb to examine other
> timed_event structures and have noticed that only the scheduled_downtime
> structures associated with EVENT_SCHEDULED_DOWNTIME timed events are
> affected by the memory corruption.  In fact, one time, I found nine
> scheduled downtime expiration events sequentially listed in the high
> priority list and the first three had their scheduled_downtime structures
> corrupted and the remaining six were in pristine condition.
> So, I've narrowed it down to a couple of possibilities (feel free to add
> your own!):
> 1. The scheduled_downtime structure is already corrupted when it is being
> added to the high priority timed event scheduling list, or
> 2. The scheduled_downtime structure is OK when it is added to the high
> priority list, but perhaps a bad pointer access is overwriting it with
> garbage at some other point in the program.  This might be somewhat
> painful to track down.
> Of the two, I suspect that the second one is the more likely candidate.

I think the first, as it only happens with scheduled downtime stuff. 
Otherwise you'd see it on other high-prio events as well (unless you're 
extremely unlucky each time the crash happens).
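
One way to tell the two cases apart might be to checksum the structure right before it is queued and re-check it when it is dequeued. Below is a rough sketch of that idea. Every name in it (demo_downtime, demo_seal() and so on) is made up for illustration and is not the real Nagios definition, and the field list is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a downtime entry; NOT the real Nagios struct. */
struct demo_downtime {
    int type;
    unsigned long downtime_id;
    long start_time;
    long end_time;
    uint32_t checksum;   /* set when the entry is queued */
};

/* Fold the 8 bytes of v into an FNV-1a style hash, low byte first. */
static uint32_t demo_mix(uint32_t h, uint64_t v)
{
    for (int i = 0; i < 8; i++) {
        h ^= (uint32_t)(v & 0xffu);
        h *= 16777619u;
        v >>= 8;
    }
    return h;
}

/* Hash the payload fields one by one, so struct padding is never read. */
static uint32_t demo_checksum(const struct demo_downtime *dt)
{
    uint32_t h = 2166136261u;
    h = demo_mix(h, (uint64_t)dt->type);
    h = demo_mix(h, (uint64_t)dt->downtime_id);
    h = demo_mix(h, (uint64_t)dt->start_time);
    h = demo_mix(h, (uint64_t)dt->end_time);
    return h;
}

/* Call right before the entry goes onto the event list. */
static void demo_seal(struct demo_downtime *dt)
{
    dt->checksum = demo_checksum(dt);
}

/* Returns 1 if the payload still matches the stored checksum, else 0. */
static int demo_is_intact(const struct demo_downtime *dt)
{
    return dt->checksum == demo_checksum(dt);
}

/* Call when the entry is pulled off the list; aborts if it was damaged. */
static void demo_check_on_dequeue(const struct demo_downtime *dt)
{
    assert(demo_is_intact(dt));
}
```

If the check fails at dequeue time, something scribbled over the entry after it was queued (case 2). If the check passes but the contents are still garbage, the entry was already bad when it was queued (case 1).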

> Some other notes:
> 1. The timed event expirations that segfault Nagios seem to be "randomly"
> chosen.
> We have some regularly submitted (via cron) scheduled downtimes that will
> work fine for weeks, and then one of them will come up for expiration and
> trigger this scheduled-downtime-expiration bug.  I've also seen it happen
> with ad-hoc scheduled downtime submissions via the CGI interface.
> I've seen it happen with "regular" scheduled downtimes as well as the new
> "triggered" scheduled downtime.  We thought it might have been related to
> the new triggered downtime, since that was one of the first events causing
> a segfault.  But then after eliminating the use of triggered downtimes
> altogether, the segfaults still occur with the regular scheduled downtime
> expirations.
> 2. I've had this problem with Nagios 2.4, 2.5 and 2.6.  So, "upgrading"
> hasn't gotten rid of it.
> 3. We are currently running Nagios 2.6 on a 64-bit Linux platform: SLES-9
> x86-64, Kernel 2.6.5-7.267-smp

This is the culprit, I guess. As this isn't a widespread problem, I 
wouldn't be surprised if it's related to 64-bit archs (kernel-2.6.5 is 
fairly ancient too, but that shouldn't matter as this is the only app 
you're seeing it in).

I'm guessing this actually is an SMP-system and that SuSE doesn't 
install SMP kernels on all systems, correct? If so, this could also be a
source of problems for you. Nagios doesn't follow the pthread guidelines
very closely and does some pretty inappropriate things post-fork() for 
being a threaded application. This could be one of those problems that 
doesn't happen on single-cpu systems because the only cpu doesn't have 
anything to compete with when racing for the memory.

> 4. We don't have any other segfault problems with other apps on this
> system.
> So I'm still trying to figure out *what* is overwriting the
> scheduled_downtime structures with garbage in memory.
> Any ideas, based upon this additional information?

Upgrade glibc and the kernel and pray. Other than that, I guess running 
it in valgrind and/or gdb for a long period of time or chucking 
assert()'s and printf()'s at the Nagios code and seeing where it breaks 
is the only solution.
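
As a rough idea of what the assert()/printf() approach could look like, here is a sketch that brackets the data with guard words and reports when one of them has been trampled. The wrapper struct, the guard value and the function names are all hypothetical and not taken from the Nagios source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_CANARY 0xDEADBEEFu   /* arbitrary guard value, an assumption */

/* Hypothetical wrapper; the field layout is illustrative only. */
struct guarded_downtime {
    uint32_t head_canary;
    unsigned long downtime_id;
    long end_time;
    uint32_t tail_canary;
};

/* Fill in the payload and plant both guard words. */
static void guarded_init(struct guarded_downtime *g,
                         unsigned long id, long end_time)
{
    g->head_canary = DEMO_CANARY;
    g->downtime_id = id;
    g->end_time = end_time;
    g->tail_canary = DEMO_CANARY;
}

/* Returns 1 if both guard words are intact; otherwise logs and returns 0. */
static int guarded_check(const struct guarded_downtime *g, const char *where)
{
    int ok = g->head_canary == DEMO_CANARY && g->tail_canary == DEMO_CANARY;
    if (!ok)
        fprintf(stderr, "downtime %lu damaged at %s (head=%08x tail=%08x)\n",
                g->downtime_id, where,
                (unsigned)g->head_canary, (unsigned)g->tail_canary);
    return ok;
}

/* Hard stop variant: abort (and dump core) at the first damaged entry. */
static void guarded_require(const struct guarded_downtime *g, const char *where)
{
    assert(guarded_check(g, where));
}
```

Sprinkling guarded_require() calls at the points where downtime entries are created, queued, dequeued and freed should narrow down the window in which the damage happens, since the core dump then comes from the first check that notices it rather than from the eventual crash.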

btw, thanks for the nicely detailed problem report.

Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
