[Nagios-devel] [Nagios-users] external commands and segfault -- again
bobi at netshel.net
Mon Jan 8 17:55:43 UTC 2007
Hey Fellow Nagios-ites:
I've been having this *exact* same segfault problem for the last couple o'
And, after looking at David's stack trace output, it is segfaulting for
him in the exact same way/place as it is for me.
Here's what I've found:
The core dumps that I've examined all show a segfault while handling the
expiration of a scheduled downtime.
Since David's stack trace looks identical to mine, I don't think the
problem is in the external command processing, as he believes, but rather
in the downtime expiration handling, in his case as well as mine.
Having examined about a dozen of these identical core dumps, I see that it
is a corruption of the entire scheduled_downtime structure that is being
passed into the handle_scheduled_downtime() function.
The handle_scheduled_downtime() function is being invoked by the high
priority event processing logic in the event_execution_loop(). So it
pulls an EVENT_SCHEDULED_DOWNTIME timed_event structure off of the high
priority event list, and then hands it to handle_timed_event(), which in
turn invokes the handle_scheduled_downtime() routine to handle the
expiration of the specified downtime event.
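To make the call chain above concrete, here is a minimal sketch in C. It is not the Nagios source: every demo_* name, the struct layouts, and the numeric event type values are assumptions made up for illustration. It only mirrors the flow described above, where the loop dequeues a due event, passes it to a dispatcher, and the dispatcher hands the attached downtime record to the downtime handler.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical event type values, NOT the real Nagios constants. */
#define DEMO_EVENT_SCHEDULED_DOWNTIME 1
#define DEMO_EVENT_OTHER 2

/* Simplified stand-in for a scheduled downtime record (assumed layout). */
typedef struct demo_downtime {
    unsigned long downtime_id;
    int expired;                 /* set by the handler, for the demo only */
} demo_downtime;

/* Simplified stand-in for a timed event on the high priority list. */
typedef struct demo_event {
    int event_type;
    time_t run_time;
    void *event_data;            /* points at a demo_downtime for downtime events */
    struct demo_event *next;
} demo_event;

/* Stand-in for the downtime handler: marks the downtime as expired. */
int demo_handle_scheduled_downtime(demo_downtime *dt) {
    if (dt == NULL) return -1;
    dt->expired = 1;
    return 0;
}

/* Stand-in for the dispatcher: routes an event by its type. */
int demo_handle_timed_event(demo_event *ev) {
    if (ev == NULL) return -1;
    switch (ev->event_type) {
    case DEMO_EVENT_SCHEDULED_DOWNTIME:
        /* If event_data was overwritten while queued, the crash shows up here. */
        return demo_handle_scheduled_downtime((demo_downtime *)ev->event_data);
    default:
        return 0;
    }
}

/* Stand-in for one pass of the loop: dequeue and dispatch every due event.
   Returns the number of events dispatched. */
int demo_run_due_events(demo_event **head, time_t now) {
    int dispatched = 0;
    while (*head != NULL && (*head)->run_time <= now) {
        demo_event *ev = *head;
        *head = ev->next;        /* dequeue from the front of the list */
        ev->next = NULL;
        demo_handle_timed_event(ev);
        dispatched++;
    }
    return dispatched;
}
```

The point of the sketch is that the handler only ever sees what was already sitting behind event_data, so a crash inside it says nothing about where the data went bad.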
The problem is, the scheduled_downtime structure is already corrupted
while sitting in the high_priority list - well before it is dequeued by
the event_execution_loop() logic.
I've walked the high priority list in memory with gdb to examine other
timed_event structures and have noticed that only the scheduled_downtime
structures associated with EVENT_SCHEDULED_DOWNTIME timed events are
affected by the memory corruption. In fact, one time, I found nine
scheduled downtime expiration events listed sequentially in the high
priority list: the first three had their scheduled_downtime structures
corrupted and the remaining six were in pristine condition.
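The manual gdb walk described above could in principle be automated inside the process. The following is a rough sketch only: the walk_* names, the field layout, and the plausibility rules (a known type code, start time not after end time) are all assumptions and not taken from the Nagios headers. Note that it can only catch a record whose contents were overwritten; if the pointer itself is garbage, dereferencing it would crash just like the handler does.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical event type values, NOT the real Nagios constants. */
#define WALK_EVENT_SCHEDULED_DOWNTIME 1
#define WALK_EVENT_OTHER 2

/* Assumed, simplified downtime record. */
typedef struct walk_downtime {
    int type;                    /* assumed: 1 = service, 2 = host */
    time_t start_time;
    time_t end_time;
} walk_downtime;

/* Assumed, simplified timed event node. */
typedef struct walk_event {
    int event_type;
    void *event_data;
    struct walk_event *next;
} walk_event;

/* Plausibility test: a healthy record has a known type and an ordered window.
   Returns 1 if the record looks sane, 0 otherwise. */
int walk_downtime_looks_sane(const walk_downtime *dt) {
    if (dt == NULL) return 0;
    if (dt->type != 1 && dt->type != 2) return 0;
    return dt->start_time <= dt->end_time;
}

/* Walk the list and count downtime events whose record fails the test. */
int walk_count_suspect_downtimes(const walk_event *head) {
    const walk_event *ev;
    int suspect = 0;
    for (ev = head; ev != NULL; ev = ev->next) {
        if (ev->event_type != WALK_EVENT_SCHEDULED_DOWNTIME)
            continue;            /* other event types were not affected */
        if (!walk_downtime_looks_sane((const walk_downtime *)ev->event_data))
            suspect++;
    }
    return suspect;
}
```

Calling something like walk_count_suspect_downtimes() at regular intervals and logging the first non-zero result would narrow the time window in which the overwrite happens.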
So, I've narrowed it down to a couple of possibilities (feel free to add
1. The scheduled_downtime structure is already corrupted when it is being
added to the high priority timed event scheduling list, or
2. The scheduled_downtime structure is OK when it is added to the high
priority list, but perhaps a bad pointer access is overwriting it with
garbage at some other point in the program. This might be somewhat
painful to track down.
Of the two, I suspect that the second one is the more likely candidate.
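One way to tell the two possibilities apart would be to record a checksum of the downtime record at the moment it is queued and compare it again when it is dequeued. The sketch below shows the idea under stated assumptions: the guard_* names and the struct layout are hypothetical, and FNV-1a is simply a convenient small hash, not anything Nagios uses. A record that already looks wrong at enqueue time points to possibility 1; a checksum that matched at enqueue but differs at dequeue points to possibility 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed, simplified downtime record. */
typedef struct guard_downtime {
    unsigned long downtime_id;
    long start_time;
    long end_time;
} guard_downtime;

/* Bookkeeping kept next to the queued event (hypothetical). */
typedef struct guard_record {
    const guard_downtime *data;
    uint32_t checksum_at_enqueue;
} guard_record;

/* Fold len bytes at p into an FNV-1a hash state h. */
uint32_t guard_mix(uint32_t h, const void *p, size_t len) {
    const unsigned char *bytes = (const unsigned char *)p;
    size_t i;
    for (i = 0; i < len; i++) {
        h ^= bytes[i];
        h *= 16777619u;          /* 32-bit FNV prime */
    }
    return h;
}

/* Hash each field separately so struct padding never enters the checksum. */
uint32_t guard_downtime_checksum(const guard_downtime *dt) {
    uint32_t h = 2166136261u;    /* 32-bit FNV offset basis */
    h = guard_mix(h, &dt->downtime_id, sizeof dt->downtime_id);
    h = guard_mix(h, &dt->start_time, sizeof dt->start_time);
    h = guard_mix(h, &dt->end_time, sizeof dt->end_time);
    return h;
}

/* Call when the event is added to the list. */
void guard_on_enqueue(guard_record *rec, const guard_downtime *dt) {
    rec->data = dt;
    rec->checksum_at_enqueue = guard_downtime_checksum(dt);
}

/* Call just before the handler runs. Returns 1 if the record is unchanged. */
int guard_intact_at_dequeue(const guard_record *rec) {
    return guard_downtime_checksum(rec->data) == rec->checksum_at_enqueue;
}
```

This does not say who did the overwriting, only whether it happened while the record was queued, which is enough to rule one of the two possibilities in or out.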
Some other notes:
1. The timed event expirations that segfault Nagios seem to be "randomly"
We have some regularly submitted (via cron) scheduled downtimes that will
work fine for weeks, and then one of them will come up for expiration and
trigger this scheduled-downtime-expiration bug. I've also seen it happen
with ad-hoc scheduled downtime submissions via the CGI interface.
I've seen it happen with "regular" scheduled downtimes as well as the new
"triggered" scheduled downtime. We thought it might have been related to
the new triggered downtime, since that was one of the first events causing
a segfault. But then after eliminating the use of triggered downtimes
altogether, the segfaults still occur with the regular scheduled downtime
2. I've had this problem with Nagios 2.4, 2.5 and 2.6. So, "upgrading"
hasn't gotten rid of it.
3. We are currently running Nagios 2.6 on a 64-bit Linux platform: SLES-9
x86-64, Kernel 2.6.5-7.267-smp
4. We don't have any other segfault problems with other apps on this
So I'm still trying to figure out *what* is overwriting the
scheduled_downtime structures with garbage in memory.
Any ideas, based upon this additional information?
> David G Schlecht wrote:
>> Helmut W. Januschka <h.januschka <at> krone.at> writes:
>>> You may move the strlen() to a separate variable
>>> And after that reproduce the gdb session and use "info locals" to see
>>> What are the actual values of the variables and maybe just the strlen
>>> Also try "print name1" and "print name1[i]" and look if there are some
>> unterminated strings
>>> And backtrace to where the call to the hashfunc2 occurs and have a
>>> look at
>> the value which originally got
>>> sent in
>>> Maybe it's just an old version or a bad implementation of the tool
>>> submitting the
>> external command (sure nagios
>>> shouldn't segfault at all ;))
>> Thanks for your reply, Helmut. From the segfault, it appears that
>> both name1 and name2 are corrupt as both are outside the program's
>> address space. Since name1 is received as a const, it's unlikely
>> that this routine (hashfunc2) is causing the problem. The calling
>> routine (find_service) doesn't change the contents of the variable
>> so it's not likely the problem. Tracing all the way back,
>> event_execution_loop seems the most likely cause of the segfault.
>> However this is not a trivial routine and I'm not able to debug it.
>> Is there anyone familiar enough with the code available to take a look?
>> This problem doesn't occur with each external command, but only
>> once every 200-300 times.
>> Any help would be most appreciated.
> I think you'd have better luck posting this to the nagios-devel list, so
> I'm cross-posting it there now. Provided you're subscribed there, we
> should be able to drop this from the nagios-users list where it really
> doesn't belong.
> What command is being sent into the command-pipe when nagios crashes?
> Have you made any modifications to the code?
> Does this happen with latest CVS code (without any local modifications)?
> Andreas Ericsson andreas.ericsson at op5.se
> OP5 AB www.op5.se
> Tel: +46 8-230225 Fax: +46 8-230231