[rbldnsd] ipv6 support: beginnings of an ip6trie dataset

Jeff Dairiki dairiki at dairiki.org
Tue Mar 19 22:59:44 MSK 2013


On Tue, Mar 19, 2013 at 12:26:14PM +0400, Michael Tokarev wrote:
> 13.03.2013 01:44, Jeff Dairiki wrote:
> > On Sat, Mar 02, 2013 at 11:26:55AM -0800, Jeff Dairiki wrote:
> []
> > Okay, it took longer than two or three days, but I've completely
> > redone the trie implementation (several times, at this point...) I
> > think it was worth it. The current implementation uses less memory
> > than the previous one by roughly a factor of three.  For details on
> > the implementation, see the essay-like comments at the top of btrie.c.
> > 
> > The code is at:
> > 
> >    https://github.com/dairiki/rbldnsd
> > 
> > The actual code is in the 'ipv6' branch:
> > 
> >    https://github.com/dairiki/rbldnsd/tree/ipv6
> > 
> > If you don't want to deal with the git, you can fetch a tarball from:
> > 
> >    https://github.com/dairiki/rbldnsd/tarball/ipv6/rbldnsd.tar.gz
> 
> This is really _excellent_ stuff.

Thank you!

> I never really bothered to
> understand how these tries work, I just borrowed the trie
> implementation from somewhere else, still not understanding
> the inner workings.  To my shame.
>
>
> But your code looks very good.  I have a few minor comments,
> but these are just that - minor, the core thing is good.
> 
> Now looking at it all, maybe we should change all the types
> used in the code to always be in network byte order, the
> same way your trie works?  I mean, the ip4addr_t &friends?

Personally, I don't think that is particularly important.
Performance-wise I think it is a non-issue; the only reason I see to
consider the change is for code readability or aesthetics.  If one
were starting from scratch, then possibly it would be worth using
network byte-order (though there are still advantages and
disadvantages to either choice.)  As things stand, there are only a
few (three?) places in the code where there is a byte-order impedance
mismatch (where byte-order needs to be swapped).
 
> On the other hand, how do you think, if we change the trie
> prefix to use words instead of bytes?  I *guess* using native
> word comparisons instead of memcmp should work better?

I started out using words instead of bytes, but there are a few
reasons why storing the prefix as bytes worked out better.

1. Bytes pack better.  The whole point of much of this exercise was to
optimize the memory usage of the trie.  As currently implemented, the
LC-node structure contains either 3 or 7 bytes of prefix (depending on
processor word/pointer size).  Each node also has one pointer (to
point to child nodes), and one byte (containing the prefix length, and
a couple of other bit flags).  The whole arrangement is chosen so that
the node fits into a structure the size of two pointers.  It would be
hard to do this if the relative prefix were stored as words.
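A rough sketch of the packing (illustrative only --- struct name and
field names are mine, not the actual btrie.c layout) looks something
like this; the prefix array absorbs exactly the bytes left over after
the child pointer and the length/flags byte:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch (not the real btrie.c declaration): an LC-node
 * packed into the size of two pointers.  One pointer addresses the
 * child array; the remaining space holds one length/flags byte plus
 * 3 (on 32-bit) or 7 (on 64-bit) bytes of relative prefix. */
struct lc_node {
    struct lc_node *children;           /* child node array */
    uint8_t len_flags;                  /* prefix length + flag bits */
    uint8_t prefix[sizeof(void *) - 1]; /* 3 or 7 prefix bytes */
};
```

Since the byte fields exactly fill out the second pointer-sized slot,
the compiler adds no padding and the node stays two pointers wide.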

2. Bytes are closer to bits.  Conceptually, the "relative prefix"
stored in each LC-node starts at some arbitrary bit position of the
overall prefix.  In my implementation, however, I've chosen to store
the relative prefix starting at the closest previous whole-byte
boundary.  This "wastes" some bits, but it also eliminates the need to
perform 128-bit shifts of ip6 addresses (in other words, we shift by
bytes instead of bits) as we traverse the trie.  Using words instead
of bytes "wastes" significantly more bits in the node structure.
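To make the byte-shifting point concrete, here is a minimal sketch
(my own helper, not from btrie.c) of how starting each relative
prefix on a byte boundary lets traversal index bytes of the key
directly, with no 128-bit shift of the ip6 address ever needed:

```c
#include <assert.h>
#include <stdint.h>

/* Extract the bit at position 'bitpos' (0 = most significant) of a
 * 16-byte IPv6 address.  Because prefixes start on byte boundaries,
 * trie traversal only ever needs byte indexing plus a small in-byte
 * shift -- never a shift of the whole 128-bit value. */
static int get_bit(const uint8_t addr[16], unsigned bitpos)
{
    return (addr[bitpos / 8] >> (7 - bitpos % 8)) & 1;
}
```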


> But I'm not sure how significant this speed difference is,
> to start with.

I'd guess the difference is almost certainly insignificant.  (Gcc, at
least, is very good at optimizing memcmp() invocations.)

With modern processors, their multi-level caches, and the increase in
CPU speed relative to main memory access speed, I think cache
efficiency is the main thing to keep in mind when looking at this sort
of code.  It is the cache-misses that are going to take the time.  To
that end, the important things for us are: minimizing the total trie
depth; and having compact node structures which fit well into a CPU
cache line.  Using bytes rather than words for the prefix helps with
both of those criteria.

It is possible that my common_prefix() could be optimized to do
comparisons by words rather than bytes, at least on architectures
which can efficiently load non-aligned words.  Then again, I suspect
that most compilers are probably doing that for us already (when
optimization is enabled).

https://github.com/dairiki/rbldnsd/blob/ipv6/btrie.c#L648

In any case, I've done some (though not a lot of) profiling on my code, and
have yet to see common_prefix() show up as being a significant
consumer of time.
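For reference, the shape of a byte-at-a-time common-prefix routine is
roughly as follows (a simplified sketch under my own assumptions, not
the actual btrie.c code): compare whole bytes until they differ, then
count the matching leading bits inside the differing byte.

```c
#include <assert.h>
#include <stdint.h>

/* Return the number of leading bits (up to 'maxbits') shared by the
 * bit strings 'a' and 'b'.  The byte loop is the sort of thing an
 * optimizing compiler unrolls or vectorizes on its own, which is why
 * hand-written word-at-a-time variants rarely pay off. */
static unsigned common_prefix(const uint8_t *a, const uint8_t *b,
                              unsigned maxbits)
{
    unsigned i, bits = 0;
    unsigned nbytes = maxbits / 8 + (maxbits % 8 ? 1 : 0);

    for (i = 0; i < nbytes; i++) {
        uint8_t diff = a[i] ^ b[i];
        if (diff) {
            /* count matching leading bits within the differing byte */
            while (!(diff & 0x80)) {
                diff <<= 1;
                bits++;
            }
            return bits < maxbits ? bits : maxbits;
        }
        bits += 8;
    }
    return maxbits;
}
```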



> When I wrote rbldnsd initially, computers
> were much slower, and it was really important to get every
> CPU cycle out of it, -- for example, my mirror of CBL&DSBL&SH,
> which were delivering about 10000PPS (packets per second)
> was run on a PIII-class 600MHz celeron CPU.  Nowadays it
> should be much much faster...

Yes, just a bit! ;-)
 
> > Some notes, in order of decreasing significance:
> > 
> > Untested on Big-Endian Systems
> > ------------------------------
> > 
> > The new trie code does some hairy bit-field packing.  As of yet, it
> > has only been tested in an Intel, little-endian environment.  I've
> > attempted to ensure that it should work on big-endian machines, but
> > that is currently untested.  (Please send me reports of success or
> > failure.)  (Also see 'tests' below.)
> 
> In a few days I'll test on several architectures, including big and
> little endian and 32&64 bits.  These are Debian porter boxes.

Very good.

BTW, What are the significant non-linux/non-gcc targets for rbldnsd?

I ask because it seems that it would significantly clean up some parts
of the code if we could rely on having a working <stdint.h>.
(E.g. just use uint8_t instead of the various usages of ip6oct_t,
'unsigned char', and now btrie_oct_t.)
 
> > Other Changes
> > -------------
> > 
> > I've also plugged the new trie code into the ip4trie (and acl)
> > datasets.  For the ip4trie, with 64k random 32-bit IP4 addresses, the
> > new code uses less memory by a factor of roughly three.  That means
> > that the ip4trie dataset (containing only 32-bit addresses) now uses
> > only 50% more memory than the ip4set dataset.
> 
> With this, maybe we should just drop all other ip4* things altogether,
> how do you think?  Maybe ip4tset can be kept, but ip4set (with its
> very confusing semantic) may go.  Ip4tset is okay with ip4addr_t
> being in network byte order too (it does not matter how to order
> things there).  With nowadays memory sizes it does not matter anymore
> that new ip4trie uses 50% more memory than ip4set.

So long as the syntax differences between ip4set and ip4trie don't
cause much pain for users, that sounds reasonable.  E.g. Are there
valid ip4set zone files which are not valid ip4trie zone files?  (I'm
not familiar enough with the syntax details to say for sure.)
 
> Did you try comparing _speed_ of these?  Especially the load speed?
> What is very nice about ip4*set is their very fast loading, due to
> inline qsort implementation and native comparisons.  Ip4trie was quite
> a bit slower here.

I have only done a very cursory speed comparison --- it would be worth
looking into this a bit further --- but as far as I know, the speed
differences are on the same order as the memory usage differences.

While comparing memory consumption of the various datasets, I did note
the elapsed 'user time' for the zone load reported by rbldnsd.  Here
are my raw notes (the numbers after the slashes are the load times):

   64k randomly generated /32 ip addresses

   | implementation      | 32-bit arch | 64-bit arch |
   |---------------------+-------------+-------------|
   | old ip4trie         | 2.5M /.2s   | 4.0M /.2s   |
   | ip4trie using btrie | 0.73M /.2s  | 1.4M /.2s   |
   | ip4set              | 0.50M /.2s  | 1.0M /.2s   |

   1M randomly generated /32 ip addresses

   | implementation      | 32-bit arch | 64-bit arch |
   |---------------------+-------------+-------------|
   | old ip4trie         | 40.0M /.54s | 64.0M /.58s |
   | ip4trie using btrie | 11.7M /.36s | 23.4M /.27s |
   | ip4set              |  8.0M /.18s | 16.0M /.17s |

   64k randomly generated /64 ip6 addresses (only from IANA-assigned
   prefixes - not sure that matters)

   | implementation      | 32-bit arch | 64-bit arch |
   |---------------------+-------------+-------------|
   | ip6trie             | 1.2M /.2s   | 1.4M /.3s   |
   | ip6tset             | 0.5M /.4s   | 0.5M /.3s   |


There is a significant difference in load time only for very large (1
million IP address) data sets.  Even then, the difference is not large
--- on the same order as the differences in memory use (and I suspect
that is not a coincidence.)

I ran these tests on a fairly fast CPU (i7-870 2.93GHz).  It would
probably be enlightening to try this again on slower hardware (and, of
course, with more precise timing, and more than one run per data
point).

> Lookup speed shouldn't be an issue here, or at least not LARGE issue.

Agreed.

> But dataset loading time is important.
 
> > The ACL dataset now allows both IP4 and IP6 entries.
> > 
> > The NEWS and debian/changelog files, and the rbldnsd.8 man page have
> > been updated to reflect the changes I've made.  Refer to them for
> > slightly more detailed descriptions of what has been done.
> 
> There's no need to update debian/changelog really.  It is specific
> to debian and doesn't even belong to the main source tarball.  But
> that doesn't hurt either :)
>
> 
> > Tests
> > -----
> > 
> > There are a number of tests for the new trie code.  Some are written
> > in C; there are also now some python scripts which test rbldnsd by
> > running it as a foreground daemon and sending it DNS queries.  Both
> > the C self-tests and the python tests provide near 100% coverage of
> > the trie insertion and lookup code. (The test coverage of the "zone
> > dump" code is spotty at present.)
> 
> I'm not sure we really need this "zone dump" for the larger datasets
> anymore.  The v6 datasets may just print some comment saying the
> operation isn't supported, and that's all.  This whole code was more
> or less just a compatibility hack, when rbldnsd was new, so it was
> easy for a DNSBL to provide master-format data for secondaries who
> are running BIND, so it was possible to combine rbldnsd and BIND
> nameservers for one zone.  I don't know how relevant this is today,
> plus the fact that rbldnsd does not support any way of zone transfers..
> dunno.

I don't have strong feeling either way.  I guess the question here is
whether anyone is currently using the feature.
 
> > You can run the tests like this:
> > 
> >   ./configure
> >   make
> >   make check
> > 
> > To run the python test scripts, you'll need to have python 2.6 or 2.7
> > installed, as well as the pydns library (which in debian is in the
> > python-dns package.)
> > 
> > The new trie code also makes use of run-time assertions (or "sanity
> > checks").  By default these are disabled, to compile them in, run
> > configure with the --enable-asserts argument.
> 
> Yeah, that's an example of minor things which I don't like -- this
> your game with asserts.  You redefine standard meaning of DEBUG/NDEBUG
> slightly, and make it confusing a bit.

Okay, how so?  Tell me what you'd rather see.  I'd be happy to
change it.  (Or feel free to change things yourself.)

I agree that the 'configure --enable-asserts' is a bit awkward.  My
goal with that was to ensure that -DNDEBUG gets set by default for all
production compilations.  If there's a better way, I'm for it.

> Especially with your own assert-like constructs.
 
> Another such example is the "realloc" code on top of mempool, -- maybe
> it is better to use real malloc/realloc here, and use mempool just for
> some memory objects which don't need realloc?  The idea of mempool was
> to simplify memory deallocation when the object isn't needed anymore,
> to free it all very fast in one go without worrying about memory leaks.
> But it really wasn't designed for realloc.  If we can walk over whole
> tree freeing all objects while we go, that should be better than to
> hack realloc on top of mempool, I think, but sure thing I may be wrong.

In this case, I think, the trie code does benefit from the specialized
memory management.

The trie code has a need for frequent resizing of fairly small memory
allocations.  Also, the majority of allocations are of just a few
discrete sizes.  (Allocations are made in multiples of the node size,
from 1 to a theoretical maximum of 96 nodes (on 64-bit machines).  In
practice, the majority of allocations are for smaller size arrays ---
the histogram of allocation sizes is roughly exponentially decreasing,
with most allocations being below eight or so nodes in size.  The
plurality of allocations is for single-node arrays.)

This is a tough usage pattern for a general malloc/realloc design to
handle efficiently.  I know that it used to be (15-20 years ago --- I
don't know how true it still is) that many malloc implementations were
horribly inefficient (in terms of memory usage) under conditions like
this (frequent resizing of small allocations.)  In any case, the
system malloc/realloc has to add a (private) header to each allocation
to track its size.  This is inefficient for small allocations.  (The
trie code always knows how big its arrays are, so our custom allocator
does not need to store size information separately.)
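The "no per-allocation header" point can be sketched as follows.  This
is an illustrative toy (the names and layout are mine, not the actual
rbldnsd mempool API): because the trie always knows how many nodes
each array holds, freed arrays can simply be threaded onto per-size
free lists, with no size header stored anywhere.

```c
#include <assert.h>
#include <stdlib.h>

#define NODE_SIZE 16 /* e.g. the size of two 64-bit pointers */
#define MAX_NODES 96 /* theoretical maximum array size, in nodes */

/* One free list per array size; a freed array's first word links it
 * to the next free array of the same size. */
static void *free_lists[MAX_NODES + 1];

static void *node_alloc(unsigned nnodes)
{
    void *p = free_lists[nnodes];
    if (p) {
        free_lists[nnodes] = *(void **)p; /* pop from free list */
        return p;
    }
    return malloc((size_t)nnodes * NODE_SIZE);
}

static void node_free(void *p, unsigned nnodes)
{
    /* the caller supplies the size, so no header is needed */
    *(void **)p = free_lists[nnodes];
    free_lists[nnodes] = p;
}
```

A general-purpose realloc cannot do this, because free() receives only
a pointer and must recover the size from a hidden header.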

In our case, where we build the trie once, then toss it and
rebuild it from scratch (rather than incrementally modifying it) when
reloading, I suspect the mempool strategy of requesting memory from
malloc in larger chunks, then freeing it en masse is still helpful
in terms of memory fragmentation and total memory usage.  (Of course,
that would depend heavily on the peculiarities of whatever malloc
implementation is in use.)

Anyhow, the point is that the trie code has rather specialized memory
requirements, and that (knowing those requirements) it is possible for
a custom allocator to allocate memory more efficiently than a
generalized system malloc can do.  (Especially so in our case where we
do care about total memory usage and memory fragmentation.)

(Which is not to say, that there is not room for improvement in my
memory allocation code.  But I did do enough testing/profiling to
convince myself that performance of my memory management code is not
an issue under the test cases I ran (i.e. mostly datasets of randomly
generated IP CIDR ranges).)

 
> Now, the question is: are you okay to integrate this stuff into main
> code?

Yes, of course!  That was my hope/intention all along.

> We've a few difference in licensing terms here, which may be
> problematic a bit.

No problem.  I chose the BSD license because it's one of the least
restrictive of the "open source" licenses.  If it turns out to be a
problem, I'm happy to release it under a different license.
 
> Thank you very much for all this hard work.  I really appreciate it!
> 
> /mjt
> 

Jeff

