[rbldnsd] Re: rbldnsd-0.995 and duplicate entries

Sami Farin safari-rbldnsd at safari.iki.fi
Thu Oct 6 22:20:00 MSD 2005


On Thu, Oct 06, 2005 at 08:55:40PM +0400, Michael Tokarev wrote:
> [Cc'ing rbldnsd mailinglist, for future reference.
>  If you want to discuss rbldnsd, please use the mailinglist.
>  Thanks]
> 
> Sami Farin wrote:
> >hi.
> >
> >I wanted to sort and remove duplicates from my list of
> >dynamic hosts, which I serve with rbldnsd.
> >There are some duplicates and some stupidity
> >like listing a /16 netblock by listing all of the 256
> >/24 netblocks.  First I thought I could modify rbldnsd
> >to spit out a human-readable text file with a -D option
> >(dumphumanzone()), but I noticed rbldnsd does not remove
> >duplicated entries when the network sizes differ:
> >for example if both 82.22 and 82.22.33 are listed, rbldnsd does
> >not remove 82.22.33.  That causes extra /24 to be listed
> >in the stats:
> >
> >2005-10-06 17:53:21.915621500 rbldnsd: ip4set:easynet.nl/rbldnsd.dynablock.txt: 20051006 145254: e32/24/16/8=579103/319512/2071/0
> >
> >and it also gives too many useless (duplicated) lines with -d option.
> >82.22 and 82.22.33 have the same TXT record, so there's no point
> >in including 82.22.33.
> 
> Looks like you somewhat misunderstood what rbldnsd is for.

I know what it's for.

... 
> Note also that the master-format data dump (-d) was implemented
> as a side feature, like a quick hack, sort of, without much
> optimisation, because I don't think bind is suitable to

I hoped I could do a quick hack of my own and sort out that
file once and for all (it only needs to be done once)...
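
A sketch of that kind of quick hack (my own code, nothing from rbldnsd; `parse_ip4set` is a name I made up): Python's ipaddress module can expand ip4set-style truncated entries into CIDR networks and drop the ones a broader listing already covers:

```python
import ipaddress

# Hypothetical helper: ip4set entries may be truncated, so expand
# them to CIDR (1 octet -> /8, 2 -> /16, 3 -> /24, 4 -> /32).
def parse_ip4set(entry):
    octets = entry.split(".")
    prefix = {1: 8, 2: 16, 3: 24, 4: 32}[len(octets)]
    padded = ".".join(octets + ["0"] * (4 - len(octets)))
    return ipaddress.ip_network(f"{padded}/{prefix}")

entries = ["82.22", "82.22.33", "80.237.53.225"]
nets = sorted(parse_ip4set(e) for e in entries)

# collapse_addresses drops networks covered by a broader one (and
# merges adjacent ones), so 82.22.33.0/24 vanishes under 82.22.0.0/16.
for net in ipaddress.collapse_addresses(nets):
    print(net)
# prints:
# 80.237.53.225/32
# 82.22.0.0/16
```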

> serve large DNSBL zones (or else why would rbldnsd exist?), and
> for small zones, it's basically irrelevant whether the data
> is compact/optimized or not, as long as it is correct.
> 
> >When this bug gets fixed, I start coding the -D feature...
> 
> Thanks.  I consider it to be somewhat rude to name something a
> bug without understanding it.

In my example there was only one TXT record I'd like to give out.

But I can call it a feature I don't like, then :-O

> >Any free tips and tricks?  -d is line-oriented, but -D should
> >print sequential /16 /24 and /32 networks in one line,
> >e.g.
> >80.237.53.225-230
> >instead of
> >80.237.53.225
> >80.237.53.226
> >80.237.53.227
> >80.237.53.228
> >80.237.53.229
> >80.237.53.230
> 
> If you want to create an optimizer for network ranges (I think
> there are several already available, but I haven't looked), you'd

I know verge.net.au's aggregate, but for the example shown
above it would give this:

80.237.53.225/32
80.237.53.226/31
80.237.53.228/31
80.237.53.230/32

and when I convert that again to "range" type list:

80.237.53.225 - 80.237.53.225
80.237.53.226 - 80.237.53.227
80.237.53.228 - 80.237.53.229
80.237.53.230 - 80.237.53.230

I'd rather edit and maintain human-readable data.

Those example /31s and /32s are easy, but if there are loads
of them to maintain by hand, it's no fun.
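
Merging consecutive addresses into that human-readable form could look like this -- a rough sketch, with `merge_ranges`/`fmt_range` being names of my own invention:

```python
import ipaddress

def merge_ranges(addrs):
    """Merge IPv4 addresses into runs of consecutive addresses."""
    addrs = sorted(ipaddress.ip_address(a) for a in addrs)
    runs = [[addrs[0], addrs[0]]]
    for a in addrs[1:]:
        if int(a) == int(runs[-1][1]) + 1:
            runs[-1][1] = a                 # extends the current run
        else:
            runs.append([a, a])             # starts a new run
    return [tuple(r) for r in runs]

def fmt_range(start, end):
    """'80.237.53.225-230' when the run stays inside one /24,
    otherwise fall back to 'start - end'."""
    if start == end:
        return str(start)
    if int(start) >> 8 == int(end) >> 8:    # same first three octets
        return f"{start}-{str(end).rsplit('.', 1)[1]}"
    return f"{start} - {end}"

ips = ["80.237.53.%d" % i for i in range(225, 231)]
for s, e in merge_ranges(ips):
    print(fmt_range(s, e))
# prints: 80.237.53.225-230
```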

> better use somewhat different data structures internally, at least
> not the same as in ip4set.  Well, if you want to write optimal
> master-format dump, ip4set structures may be of some use...  For
> CIDR optimisation, take a look at ip4trie - it's optimized for
> single-IP lookups, but with minor modification it can be used
> for range sets optimisation (but the whole structure is quite
> trivial anyway).
> 
> You can take some infrastructure from rbldnsd, like address
> parsing etc, but that's basically it - such optimizer should
> be a separate application, without all the DNS/network/etc
> baggage of rbldnsd.  IMHO, of course.
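
(As an aside, the trie idea -- dropping prefixes that a shorter listed prefix already covers -- can be shown with a toy binary trie.  This is my own illustration, not the actual ip4trie code:)

```python
# Toy illustration of the trie idea -- my own sketch, not ip4trie.
class Node:
    def __init__(self):
        self.children = [None, None]  # 0-bit / 1-bit branches
        self.terminal = False         # a listed prefix ends here

def insert(root, addr, plen):
    """Insert a prefix (addr as a 32-bit int, plen bits); entries
    already covered by a shorter prefix are silently dropped."""
    node = root
    for i in range(plen):
        if node.terminal:             # an ancestor prefix covers us
            return
        bit = (addr >> (31 - i)) & 1
        if node.children[bit] is None:
            node.children[bit] = Node()
        node = node.children[bit]
    node.terminal = True
    node.children = [None, None]      # prune now-redundant subtrees

def dump(node, addr=0, depth=0, out=None):
    """Walk the trie and collect the surviving (addr, plen) prefixes."""
    if out is None:
        out = []
    if node.terminal:
        out.append((addr, depth))
        return out
    for bit in (0, 1):
        if node.children[bit] is not None:
            dump(node.children[bit], addr | (bit << (31 - depth)),
                 depth + 1, out)
    return out

def fmt(addr, plen):
    dotted = ".".join(str((addr >> s) & 0xFF) for s in (24, 16, 8, 0))
    return f"{dotted}/{plen}"

root = Node()
insert(root, (82 << 24) | (22 << 16) | (33 << 8), 24)  # 82.22.33.0/24
insert(root, (82 << 24) | (22 << 16), 16)              # 82.22.0.0/16
print([fmt(a, p) for a, p in dump(root)])
# prints: ['82.22.0.0/16']  -- only the covering /16 survives
```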

Surely.  I just happen to maintain a list in a format rbldnsd eats,
and I haven't found an optimizer for the ip4set format yet.

If I hack up a -D option which spits out CIDRs, maybe I can
pipe it to a modified aggregate which makes saner "ranges".
Yes, I'll probably do that, if nobody knows of a program
which optimizes the file the way I want..
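
The pipeline could amount to something like this (just what I imagine doing; `cidrs_to_ranges` is a made-up name):

```python
import ipaddress

def cidrs_to_ranges(cidrs):
    """Collapse CIDRs and re-join the adjacent blocks that CIDR
    aggregation is forced to split, yielding (first, last) ranges."""
    nets = ipaddress.collapse_addresses(
        ipaddress.ip_network(c) for c in cidrs)
    ranges = []
    for n in nets:
        first, last = n[0], n[-1]
        if ranges and int(first) == int(ranges[-1][1]) + 1:
            ranges[-1] = (ranges[-1][0], last)   # adjacent: extend
        else:
            ranges.append((first, last))
    return ranges

# aggregate's output from the example above:
cidrs = ["80.237.53.225/32", "80.237.53.226/31",
         "80.237.53.228/31", "80.237.53.230/32"]
for first, last in cidrs_to_ranges(cidrs):
    print(f"{first} - {last}")
# prints: 80.237.53.225 - 80.237.53.230
```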

> >I guess this is the hardest part of the coding.
> >I am still studying the code to see where to start.
> 
> I think the hardest part is to try to mix-n-match two quite
> different tasks in one application, basing them on the same
> data structures.
> 
> And speaking of dynablock lists, I have several separate
> points, most important of which is:
> 
>  Usually, if you maintain such a list, you know where each
>  entry comes from and why (using appropriate comments etc),
>  so that it will be possible to find what's wrong after
>  receiving a complaint and so on.  When you optimize it,
>  you lose this info, and make maintenance of the list more
>  difficult.  For such a list, maintenance is the #1 problem,
>  efficiency/optimisation of data does not really matter -

But when the list is not optimized, there can be many matches
for any one IP address, so it too turns into a maintenance problem.

The file for zone dul.dnsbl.sorbs.net, available via rsync,
is 139284 lines; optimized[1] it is 86975 lines.
But I don't know their "master" input file format -- maybe
they have secret comments for each line.

[1] Not a perfect optimization, but: lines not starting
with ! and lines starting with ! were aggregated separately,
and 86975 is the sum of the two counts.

>  tools such as rbldnsd will do their best to keep it running
>  without spending many CPU cycles or much memory - +/- several
>  milliseconds on each reload and several kilobytes of
>  memory is nothing compared to maintenance costs...

That's right.  I don't care much about a couple of kilobytes
of extra memory used, and I am happy with rbldnsd,
but not with the maintenance of the data.
 
> Or something like that, anyway.

-- 
 