[rbldnsd] Re: rbldnsd and CIDR heirarchy

Michael Tokarev rbldnsd@corpit.ru
Thu, 07 Aug 2003 16:00:55 +0400


Aaron Hopkins wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> 
> 
>>>    0.0.0.0/0:0.0.0.0:use=unallocated
>>
>>Hmm.  This one is invalid for rbldnsd.  If you want to list the whole 'net,
>>you have to use at least two entries:
> 
> Sorry if I wasn't clear here.  I wasn't providing the data set that I had
> hacked to make partially work, but rather the one that I had hoped to work.

Ok... I'm assuming the next example is of the same sort... ;)

[]
>>Another issue with storing netmask is that lookups will need to be done
>>differently (e.g. using patricia trie), which in turn requires different
>>data structures (e.g. for patricia trie, each entry will be at least 20
>>bytes long compared to currrent 8).
> 
> Looking over the code, I believe it should be possible to do what I want
> without needing a patricia tree.  By sorting the input data by masklen
> (either by defining the input data as order-dependent or having rbldnsd do
> it itself) and offering the parser something like:
> 
>     $SOA 2d noupdate.mxes.org. root.mxes.org. 0 1h 15m 14d 1d
>     $NS 2d a.ns.mxes.org
>     $NS 2d b.ns.mxes.org
>     $TTL 1d
>     209.151.224.0/22	127.0.0.2	use=allocated;asn=65432
>     209.151.224.0/19	127.0.0.2	use=allocated;asn=11051

It should be either
     209.151.224.0/19	use=allocated;asn=11051
or
     209.151.224.0/19	:127.0.0.2:use=allocated;asn=11051
(127.0.0.2 is the default anyway).  I showed the wrong syntax too
in my previous email ;)

>     192.168.0.0/16	127.0.0.2	use=rfc1918
>     172.16.0.0/12	127.0.0.2	use=rfc1918
>     10.0.0.0/8		127.0.0.2	use=rfc1918
>     224.0.0.0/3		127.0.0.2	use=multicast
>     0.0.0.0/1		127.0.0.2	use=unallocated
>     128.0.0.0/1		127.0.0.2	use=unallocated

ok

> The parser could be made to just not add duplicate (expanded) entries into
> the lookup table.  209.151.224.0/22 would fill in 209.151.224 through 
> 209.151.227 with "use=allocated;asn=65432", and then when it
> goes to insert 209.151.224.0/19, it skips over 209.151.224 through
> 209.151.227 as they already have entries and only applies
> "use=allocated;asn=11051" to 209.151.228 through 209.151.255.

I see what you mean now.  Ok... ;)

> 10 and 224-255 also get filled in and are left alone when 0/1 and 128/1 get
> added.
> 
> Not using a patricia tree for this does require completely rebuilding the
> lookup table to add new data, but rbldnsd seems to do that anyway.

Well, not quite.  See below.

>>But in fact, this is the result of mixing different things together, not
>>the design.  You're trying to mix "networks by usage" together with hosts
>>with various problems.
> 
> This isn't a problem specific to hosts.  In my example above,
> 209.151.224.0/22 returning "asn=65432" and the remainder of 209.151.224.0/19
> returning "asn=11051" shows the same problem.  Returning two TXT records,
> one with "asn=11051" and one with "asn=65432" is just wrong.
> 
> What is really happening is that rbldnsd is using CIDR notation as an input
> format without any expectation that you'll ever have overlapping or
> conflicting entries, which is the whole point of CIDR in routing.

ok.  You're looking at this from the routing table perspective, which,
by definition, gets sorted by prefix length.  Which is not the case IMHO
for a DNSBL, where all matching entries actually *may* comes together.
E.g, a hypotetical SPEWS case:

   10.8.1.0/24  spammer XYZ
   10.8.0/25    ISP A hosting spammer XYZ
   10.8.0/19    ISP B netspace (upstream of ISP A, hosting spammer XYZ)

Well, this set of records may be handled by the same rules as a routing
table (returning most close record only).  But it is not a mistake to
return ALL records, as in fact all 3 applies.

Perhaps you're right here: rbldnsd should return most close entry only.
This will solve your problem as well.  But I'm afraid this will not as
easy as it may look like - from the implementation point of view, that
is.  Such a cases are rare for a DNSBL, and even if all 3 records will
be returned for 10.8.1.0 (note again results will NOT be consistent if
e.g. the spammer is on 10.8.0.1/25, which will be stored in /32 array),
and I just thought I'd leave this as is (until you wrote this your
email.. ;)

The problem is with current implementation.  I first build all the
arrays as data comes in, in the order of input.  At this stage,
original prefix lengths gets lost (they aren't stored anywhere as
there was no usage for them).  And only after the whole data has
been read, arrays gets sorted using qsort.  I.e., when a record
gets added into array, no lookup is done whenever something is
present already or not: at this stage, array isn't sorted yet, so
the lookup will be very slow.  I tried to keep arrays sorted when
reading data (by always inserting next entry into an appropriate
place), but ugh...

  $ time rbldnsd-load-test-current x:ip4set:list.dsbl.org
  2 sec
  $ time rbldnsd-load-test-keep-sorted x:ip4set:list.dsbl.org
  ^C (i'm tired to wait ;)
  10 min

;)
(not a real things, but this is very close to what I got when
probing several various methods).  The difference is HUGE - qsort
at the end compared to all other variants I tried.

I'm afraid this current implementation just cannot be used for this
task, unless input is formatted in a way to avoid intersecting ranges.
Which, in turn, requires some rather funny preprocessing.  I tried to
keep rbldnsd as simple as possible (not to say that was successeful ;).

[]
>>It seems you're trying to classify the whole IPv4 space,
>>which is umm.. difficult.
> 
> I am in fact trying to classify the whole of IPv4 space.  I'm collecting and
> parsing a lot of data from many different sources and trying to make as much
> information available about each IP as possible with a single TXT lookup.

Note that single TXT *lookup* does not really mean single TXT *record*
(at least, host entries may come in a TXT separate from ASN info - which
is much simpler to maintain anyway.  Still, there's a problem with several
ASNs "covering" the same range).

>>Either way, this is here we go.  It *seems* you're trying
>>to do something "unusual" (as if there "usual" things ;).
> 
> I'm not running a DNSBL, and you won't be able to treat this zone as such. 
> But I believe tools like SpamAssassin might be able to make very interesting
> use of it.

Yes, I see.  Current implementation will NOT work for this.  It is possible
however to implement e.g. that same patricia trie as a separate dataset
(ptrie? in addition to ip4set; note it is really *set*, not a routing table).
Or, to "pre-sort" the data.. which may be rather difficult too, hmm (requires
substraction operation for CIDRs).

>>BTW, this question you asked is interesting.  May I share
>>this email with rbldnsd@corpit.ru mailinglist subscribers?
>>(I wanted to Cc this list, but thought I'd ask first).
> 
> Please, go ahead.

Thank you.

/mjt