[Nagios-devel] [PATCH] circular dependency check speedup

Glenn Herteg gherteg at gwoslabs.com
Mon Dec 20 20:02:31 UTC 2010

 >> I've done some organoleptic analysis of the patch (looked at it).

The trouble with organoleptic analysis is that it's axiomatic in the
benchmarking business that your guess (or mine) as to where time is
being spent is simply wrong.  The software+hardware computational
model is much too complex for people to have accurate intuition
about how all the elements interact.  Actual measurements are needed.

 > I'll respond in more detail in a few days.

GroundWork developed the circular dependency patch by profiling
the code while running preflight operations on a large customer
configuration.  A pre-flight operation lists the following counts:

     Checked 35471 services.
     Checked 3362 hosts.
     Checked 116 host groups.
     Checked 0 service groups.
     Checked 290 contacts.
     Checked 225 contact groups.
     Checked 0 service escalations.
     Checked 51031 service dependencies.
     Checked 0 host escalations.
     Checked 0 host dependencies.
     Checked 475 commands.
     Checked 256 time periods.

Total size of the set of Nagios config files is under 18 MB.

To skip immediately to the final results, here are the timings from
test runs while successive stages of this patch were being developed:

     baseline:  pre-flight:  859.113 seconds
     stage 1:   pre-flight:  185.139 seconds
     stage 2:   pre-flight:   42.856 seconds
     stage 3:   pre-flight:    3.107 seconds

The baseline run was taken using a stock Nagios 3.2.0 release.
The remaining runs were taken using patched versions of Nagios 3.2.1,
which does not differ from 3.2.0 in its calculations for circular
dependency analysis.

The timings shown above are wall-clock measurements for end-to-end
pre-flight operations.  We also took detailed call-graph profile
timings (using gprof) during this development effort to guide
the code changes, but those reports were not retained afterward,
so I don't have corresponding function-level timings available
now to document each stage of patch development.  The claim of a
speedup of around 20,000x for the circular dependency analysis is
based on improvements seen in those function-level timings during
the development.


Stage 1:

Initial profiling showed that most of the time was being spent
in pre_flight_circular_check().  There's no i/o in that routine
or the ones it calls, so right away we know the problem we face is
computational, not due to any system file-cacheing effects.  Also,
there's not much code in that routine that could be the source of the
problem, so the nature of the bad performance was fairly obvious:
an O(n^^2) clearing of the circular_path_checked values within
independent data structures in a linked list.  To address that,
we made the following changes for the first stage of patching:

* make a contiguous array for
   servicedependency_circular_path_checked, effectively removing the
   circular_path_checked flags from the individual data structures
   and also changing the boolean flags from 4 bytes down to 1 byte
* use a call to memset() to clear the entire array in one operation,
   rather than walking a linked list

Extra time is now taken to prepare the new contiguous array, but
the timing results clearly show the benefit far outweighs the cost.

The revised code is still O(n^^2) (as we still have to clear just
as many flags, and just as often), but one dimension of this square
has been radically shrunk.  Modern processors are highly optimized
for clearing a continuous swath of memory to a fixed value, since
clearing large areas to zero is such a commonly needed action.
So we use that mechanism instead of jumping randomly around in
memory.  (Aside from benefits of using probably-microcoded processor
capabilities for the contiguous clearing, such as highly-efficient
multimedia instructions, we also avoid a tremendous amount of
potential cache-line aliasing and attendant inefficiency in memory
accesses, and possibly extra page faults.)  With contiguous boolean
flags, we were also able to change their type from int to unsigned
char, reducing the space and thus further reducing the amount of
time needed for the clearing action.

These changes resulted in a 4.64x speedup in overall pre-flight time
from the baseline (down from 859.113 seconds to 185.139 seconds).


Stage 2:

Profiling now showed check_for_circular_servicedependency_path()
as the bottleneck.  So we made the following changes for the second
stage of patching:

* make a contiguous array for servicedependency_list, so it can be
   walked more efficiently, similar to above
* factor some repeated pointer dereferencing out of loops
* re-order servicedependency_struct members, based on timing tests
   and examination of the compiled assembly code
* simplify some boolean tests (e.g., no need to test for
   (found==TRUE) [a specific non-zero value] when (found!=FALSE) or
   just (found) compares against zero [a highly optimized operation]
   and gives the same useful result; and no need to even store
   a transient boolean value in a memory location when it can be
   tested directly after production)
* drop some redundant NULL-pointer tests

Once again in this stage, extra time is now taken to prepare the new
contiguous array, but the timing results clearly show the benefit
far outweighs the cost.

This stage of patching reduced the critical bottleneck in the
processing to a single loop, compiled to 6 assembly instructions, one
of which was a no-op filler instruction, with the loop having only
one external memory reference per iteration.  Basically, you can't
get any tighter than that and do any useful work.  The remaining
performance problem was that this loop was being executed for
1.3 billion iterations when analyzing the customer dataset, which
(even at only about 27.6 nanoseconds per iteration) still adds up
to macroscopic time.  (Remember what an O(n^^2) algorithm is all
about, at the scale of 51031 service dependencies, half of which
are execution dependencies and half of which are notification
dependencies.  (51031/2)^^2 + (51031/2)^^2 = 1.3 billion.)

These changes resulted in a 20.0x speedup in overall pre-flight
time from baseline (down from 859.113 seconds to 42.856 seconds),
and a 4.32x speedup from the previous stage (down from 185.139
seconds to 42.856 seconds).


Stage 3:

check_for_circular_servicedependency_path() remained as the
bottleneck, so we analyzed why we had so many iterations of its
internal looping.  That turned out to be because a lot of data in
essentially random order (with respect to what we care about in
this analysis) was being searched linearly, and again by jumping
around in memory while walking the full length of a linked list --
another clear opportunity for structure+algorithm optimization.
So we made the following changes for the third stage of patching:

* sort the servicedependency_list contiguous array
* use a specialty binary search to find the first dependent service
   we care about in the sorted list, if present
* change the search for dependent services to examine only a short
   sequence of actual candidates, starting at that first dependent
* re-order more servicedependency_struct members, again based on
   a timing study
* perform some trivial code cleanup from the previous patch stages

One might wonder whether the additional O(n*log(n)) time now taken
to sort might be a significant overhead to bear, but the timing
results clearly show the benefit far outweighs the cost.

Why do we use a specialty search instead of some standard library
function?  With respect to the usual suspects:

* lfind(3) would be slow; its linear search of an array is exactly
   what we're trying to avoid here.
* bsearch(3) [binary search of a sorted array] and tfind(3) [find
   a matching element in a binary tree] both return an unspecified
   element if there are multiple elements that match the key.

What we need instead is to always find the first element in a
contiguous series of matching elements, if any such elements exist,
so that we can walk just the full length of that segment of the
sorted array and then be done.  So we need to implement our own
routine to perform the search.

This stage of patching avoids walking the full list of dependencies
when only a few (now in contiguous sequence, and easy to quickly
identify) need to be analyzed.  Profiling showed that the dependency
analysis, which was taking many hundreds of seconds in the original
code, had now been reduced to only a few hundredths of a second;
hence the claimed 20,000x speedup.

These changes resulted in a 276.5x speedup in overall pre-flight
time from baseline (down from 859.113 seconds to 3.107 seconds),
and a 13.79x speedup from the previous stage (down from 42.856
seconds to 3.107 seconds).


The discussion above reflects improvements in circular service
dependency analysis.  Exact corresponding changes were also made to
improve circular host dependency analysis.  That accounts for the
nearly-identical code in the patch, as different data structures
for hosts and services are being handled.  In this regard, we
are simply mirroring the structure of the code we are patching,
following the requirements of C-language type handling.


An additional part of testing involved purposely introducing a number
of circular dependencies of various types into the test dataset,
to establish that the revised code still found exactly the same
errors as the original baseline code.


What's the practical impact of this patch?  Well, it can prevent
lengthy monitoring outages when Nagios is bounced.  For example,
I would be completely unsurprised if the effect that Rodney Ramos
reported a couple of weeks ago ("NSCA dropping packet with stale
timestamp") is due to the existing dependency-analysis performance,
if his configuration involves a lot of dependencies.  While Nagios
is spending time analyzing dependencies, it won't be reading its
command pipe, and pipe writers such as NSCA will quickly fill the
pipe and be unable to write more.  That could stall its reading
of its own client connections to the point that when it finally
does read them, the packets it finds are then stale.  Look back in
the archives and you'll see that nobody responded to his request
for help.  This patch may be the solution for his situation.


So what's the upshot of these changes, with regard to the previously
published critique?

* The several stages of patch development make it clear that
   algorithm changes and structure changes can go hand-in-hand to
   achieve significant performance improvements.

* All of the structure changes were made based on observed timing
   improvements at the stage when they were implemented, not on
   guesswork about what the compiler might or might not be doing.

* When we re-order certain structure members, there is a potential
   tradeoff:  some member accesses will become much faster, while
   others that were presumably formerly fast will now become slower.
   This tradeoff seemed reasonable on the following grounds.

   The accesses we care about for circular dependency analysis are
   grouped together at a singular point in time, and have a very
   visible effect then, during process startup.  Slower accesses for
   other members will be spread out during normal operation across
   the entire day, and should therefore be essentially unnoticeable,
   causing imperceptible and acceptable latency by eating very
   slightly into the processor idle time.  No, we haven't analyzed
   the rest of the code, nor have we made timing measurements to
   prove this hypothesis, but that's our gut feeling.  Yes, I know
   how I started this writeup.

* All tests were run in an x86_64 environment, as these days it
   is the sweet spot of the computing industry.  Five years ago,
   I would not have said that, but in the last couple of years, it
   has become difficult to purchase a serious machine which is not
   multi-core and 64-bit capable.  That's clearly where the market is
   today and in the future, for anyone who cares about performance
   at scale (which is what this patch is intended to address).
   The x86_64 architecture suffers slightly with respect to 32-bit
   x86 because it uses more memory bandwidth to fetch wider objects,
   but it more than makes up for this by architectural extensions
   that make more registers available.

* An i386 (32-bit x86) environment was not tested, nor did we run
   tests on UltraSPARC, PowerPC, Itanium, or any other architectures.
   Naturally, we have no guarantees that improvements targeted to
   the x86_64 won't slightly decrease performance on other machines.
   But given the wide popularity of the x86_64 architecture, targeting
   the most commonly used machines as the basis for optimization seems
   more than reasonable.  Whatever small degradation might appear
   on these other platforms should be much more than made up for by
   the dramatic overall performance improvement afforded by the patch.

Rather than fighting this patch because the structure member changes
will break NEB compatibility in the 3.2.X releases, and wasting
everyone's time trying to disentangle the synergistic effect of
both algorithm and structure changes, perhaps a different approach
is in order.  The commentator has already expressed a strong desire
to move to the 3.3.X release series:

 >> Do you have any ETA about the release date of the 3.3 series ?
 > No. Ethan is still the only one who can push the release button.
 > Otherwise I'd have cut a release quite a while ago.

So perhaps adoption of this patch should be treated as the triggering
event to move forward and declare this new minor release level.



More information about the Nagios-devel mailing list