[rbldnsd] Re: rbldnsd-0.995 and duplicate entries

Thu Oct 6 20:55:40 MSD 2005

[Cc'ing rbldnsd mailinglist, for future reference.
  If you want to discuss rbldnsd, please use the mailinglist.
  Thanks]

Sami Farin wrote:
> hi.
> 
> I wanted to sort and remove duplicates from my list of
> dynamic hosts, which I serve with rbldnsd.
> There are some duplicates and some stupidity
> like listing a /16 netblock by listing all of the 256
> /24 netblocks.  First I thought I could modify rbldnsd
> to spit out human-readable text file with -D option
> (dumphumanzone()) but I noticed rbldnsd does not remove
> duplicated entries, when the network sizes differ:
> for example if both 82.22 and 82.22.33 are listed, rbldnsd does
> not remove 82.22.33.  That causes extra /24 to be listed
> in the stats:
> 
> 2005-10-06 17:53:21.915621500 rbldnsd: ip4set:easynet.nl/rbldnsd.dynablock.txt: 20051006 145254: e32/24/16/8=579103/319512/2071/0
> 
> and it also gives too many useless (duplicated) lines with -d option.
> 82.22 and 82.22.33 have the same TXT record, so there's no point
> in including 82.22.33.

Looks like you somewhat misunderstood what rbldnsd is for.
It isn't meant to optimize your data: its primary goal
is to be a nameserver, not a net-range processor/optimizer.
It performs some minor optimisations on the data, so that
real duplicates will not go into the DNS replies.  These are
minor optimisations that affect its speed only very little
(data loading speed in this case).

The case you described above requires a quite major and slow
optimisation pass, and is not relevant for DNS operations -
it does not matter whether rbldnsd replies to the query
using the 82.22.33 entry or the 82.22 entry, if both give the
same A/TXT result, *and* there are only two entries like that
for the given query.

Note that bold "and" - suppose you have *another* 82.22
entry, with a different A/TXT (eg, a dynamic block like you have,
*and*, say, a spammer-infested block, both marked with appropriate
TXT records).  In this case, without the 82.22.33 entry, rbldnsd
will return TWO records in reply, while with that "small entry",
it will return only one, for 82.22.33.  Like:

      *.22.82  Dynamic block
      *.22.82  Spammer XYZ
   *.33.22.82  Dynamic block

Note that this is exactly how named will handle this set of
records.  "Most specific entry wins", something like that
(this rule only works when all entries are on an octet boundary,
even in rbldnsd ip4set - due to how DNS works and due to some
speed/correctness compromise, but that's another story).
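
A rough sketch of that rule, in Python rather than rbldnsd's C.
The names (ENTRIES, build_table, lookup) are made up for
illustration, and this is not how ip4set is actually implemented;
it only models octet-boundary entries:

```python
# Hypothetical model of "most specific entry wins" for entries
# that all sit on octet boundaries.  Not rbldnsd code.
ENTRIES = [
    ("82.22", "Dynamic block"),
    ("82.22", "Spammer XYZ"),
    ("82.22.33", "Dynamic block"),
]

def build_table(entries):
    """Group TXT values by prefix, dropping only exact duplicates."""
    table = {}
    for prefix, txt in entries:
        values = table.setdefault(prefix, [])
        if txt not in values:
            values.append(txt)
    return table

def lookup(table, ip):
    """Return every TXT value of the longest matching prefix."""
    octets = ip.split(".")
    for n in (4, 3, 2, 1):          # try /32 first, then /24, /16, /8
        key = ".".join(octets[:n])
        if key in table:
            return table[key]
    return []

table = build_table(ENTRIES)
print(lookup(table, "82.22.33.7"))  # ['Dynamic block']
print(lookup(table, "82.22.44.7"))  # ['Dynamic block', 'Spammer XYZ']
```

Drop the 82.22.33 line from ENTRIES and the first query returns
both 82.22 records instead of one.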

Note also that master-format data dump (-d) was implemented
as a side feature, like a quick hack, sort of, without much
optimisation, because I don't think bind is suitable to
serve large DNSBL zones (or else why would rbldnsd exist?), and
for small zones, it's basically irrelevant whether the data
is compact/optimized or not, as long as it is correct.

> When this bug gets fixed, I start coding the -D feature...

Thanks.  I consider it to be somewhat rude to name something a
bug without understanding it.

> Any free tips and tricks?  -d is line-oriented, but -D should
> print sequential /16 /24 and /32 networks in one line,
> e.g.
> 80.237.53.225-230
> instead of
> 80.237.53.225
> 80.237.53.226
> 80.237.53.227
> 80.237.53.228
> 80.237.53.229
> 80.237.53.230

If you want to create an optimizer for network ranges (I think
there are several already available, but I haven't looked), you'd
better use somewhat different data structures internally, at least
not the same ones as in ip4set.  Well, if you want to write an optimal
master-format dump, ip4set structures may be of some use...  For
CIDR optimisation, take a look at ip4trie - it's optimized for
single-IP lookups, but with minor modifications it can be used
for range-set optimisation (but the whole structure is quite
trivial anyway).
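
To give an idea of what such a CIDR optimisation pass does, here
is a minimal sketch using Python's standard ipaddress module
instead of ip4trie.  The helper names (to_network, collapse) are
hypothetical, only octet-prefix entries like "82.22" are parsed,
and TXT values are ignored, so it is only valid for entries that
share the same A/TXT:

```python
import ipaddress

def to_network(entry):
    """Turn an octet-prefix entry such as '82.22' into 82.22.0.0/16."""
    octets = entry.split(".")
    padded = octets + ["0"] * (4 - len(octets))
    return ipaddress.ip_network(".".join(padded) + "/" + str(8 * len(octets)))

def collapse(entries):
    """Drop covered networks and merge adjacent ones."""
    return list(ipaddress.collapse_addresses(to_network(e) for e in entries))

# 82.22.33 is covered by 82.22; the 256 /24s of 10.1 merge into one /16.
entries = ["82.22", "82.22.33"] + ["10.1.%d" % i for i in range(256)]
for net in collapse(entries):
    print(net)
# 10.1.0.0/16
# 82.22.0.0/16
```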

You can take some infrastructure from rbldnsd, like address
parsing etc, but that's basically it - such an optimizer should
be a separate application, without all the DNS/network/etc
baggage of rbldnsd.  IMHO of course.
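
As an example of how small such a separate tool can be, here is
a sketch of the range-style output asked about above
(80.237.53.225-230).  It is Python, not rbldnsd code; the
function name format_ranges is made up, and it only handles
single addresses, merging runs that stay within one /24:

```python
import ipaddress

def format_ranges(addresses):
    """Merge consecutive IPv4 addresses into 'a.b.c.d-e' style lines."""
    nums = sorted({int(ipaddress.IPv4Address(a)) for a in addresses})
    lines = []
    i = 0
    while i < len(nums):
        j = i
        # Extend the run while the next address is adjacent and
        # stays inside the same /24, so only the last octet varies.
        while (j + 1 < len(nums) and nums[j + 1] == nums[j] + 1
               and nums[j + 1] >> 8 == nums[i] >> 8):
            j += 1
        first = str(ipaddress.IPv4Address(nums[i]))
        if j == i:
            lines.append(first)
        else:
            lines.append("%s-%d" % (first, nums[j] & 0xFF))
        i = j + 1
    return lines

ips = ["80.237.53.%d" % n for n in range(225, 231)] + ["80.237.54.1"]
print(format_ranges(ips))
# ['80.237.53.225-230', '80.237.54.1']
```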

> I guess this is the hardest part of the coding.
> I am still studying the code to see where to start.

I think the hardest part is to try to mix-n-match two quite
different tasks in one application, basing them on the same
data structures.

And speaking of dynablock lists, I have several separate
points, the most important of which is:

  Usually, if you maintain such a list, you know where each
  entry comes from and why (using appropriate comments etc),
  so that it will be possible to find what's wrong after
  receiving a complaint and so on.  When you optimize it,
  you lose this info, and make maintenance of the list more
  difficult.  For such a list, maintenance is the #1 problem;
  efficiency/optimisation of data does not really matter -
  tools such as rbldnsd will do their best to keep it running
  without spending many CPU cycles or much memory - +/- several
  milliseconds on each reload and several kilobytes of
  memory is nothing compared to maintenance costs...

Or something like that, anyway.

/mjt