[rbldnsd] Re: rbldnsd and CIDR heirarchy

Michael Tokarev rbldnsd@corpit.ru
Sun, 17 Aug 2003 19:40:44 +0400


[intentionally top-posting]

Aaron, please review current 0.98pre version: it includes ip4trie
dataset (implemented as an excersise - i tried to understand how
the ptrie works), which seems to be exactly what you need.  Currently,
it only allows *one* value per CIDR range, but this seems to be ok.
Note please this is experimental feature, and I can't guarantee it
will not be removed (there is no real application for such a dataset
IMHO).

/mjt

Michael Tokarev wrote:
> Aaron Hopkins wrote:
> 
>> -----BEGIN PGP SIGNED MESSAGE-----
>>
>>
>>>>    0.0.0.0/0:0.0.0.0:use=unallocated
>>>
>>>
>>> Hmm.  This one is invalid for rbldnsd.  If you want to list the whole 
>>> 'net,
>>> you have to use at least two entries:
>>
>>
>> Sorry if I wasn't clear here.  I wasn't providing the data set that I had
>> hacked to make partially work, but rather the one that I had hoped to 
>> work.
> 
> 
> Ok... I'm assuming the next example is of the same sort... ;)
> 
> []
> 
>>> Another issue with storing netmask is that lookups will need to be done
>>> differently (e.g. using patricia trie), which in turn requires different
>>> data structures (e.g. for patricia trie, each entry will be at least 20
>>> bytes long compared to currrent 8).
>>
>>
>> Looking over the code, I believe it should be possible to do what I want
>> without needing a patricia tree.  By sorting the input data by masklen
>> (either by defining the input data as order-dependent or having 
>> rbldnsd do
>> it itself) and offering the parser something like:
>>
>>     $SOA 2d noupdate.mxes.org. root.mxes.org. 0 1h 15m 14d 1d
>>     $NS 2d a.ns.mxes.org
>>     $NS 2d b.ns.mxes.org
>>     $TTL 1d
>>     209.151.224.0/22    127.0.0.2    use=allocated;asn=65432
>>     209.151.224.0/19    127.0.0.2    use=allocated;asn=11051
> 
> 
> It should be either
>     209.151.224.0/19    use=allocated;asn=11051
> or
>     209.151.224.0/19    :127.0.0.2:use=allocated;asn=11051
> (127.0.0.2 is the default anyway).  I showed the wrong syntax too
> in my previous email ;)
> 
>>     192.168.0.0/16    127.0.0.2    use=rfc1918
>>     172.16.0.0/12    127.0.0.2    use=rfc1918
>>     10.0.0.0/8        127.0.0.2    use=rfc1918
>>     224.0.0.0/3        127.0.0.2    use=multicast
>>     0.0.0.0/1        127.0.0.2    use=unallocated
>>     128.0.0.0/1        127.0.0.2    use=unallocated
> 
> 
> ok
> 
>> The parser could be made to just not add duplicate (expanded) entries 
>> into
>> the lookup table.  209.151.224.0/22 would fill in 209.151.224 through 
>> 209.151.227 with "use=allocated;asn=65432", and then when it
>> goes to insert 209.151.224.0/19, it skips over 209.151.224 through
>> 209.151.227 as they already have entries and only applies
>> "use=allocated;asn=11051" to 209.151.228 through 209.151.255.
> 
> 
> I see what you mean now.  Ok... ;)
> 
>> 10 and 224-255 also get filled in and are left alone when 0/1 and 
>> 128/1 get
>> added.
>>
>> Not using a patricia tree for this does require completely rebuilding the
>> lookup table to add new data, but rbldnsd seems to do that anyway.
> 
> 
> Well, not quite.  See below.
> 
>>> But in fact, this is the result of mixing different things together, not
>>> the design.  You're trying to mix "networks by usage" together with 
>>> hosts
>>> with various problems.
>>
>>
>> This isn't a problem specific to hosts.  In my example above,
>> 209.151.224.0/22 returning "asn=65432" and the remainder of 
>> 209.151.224.0/19
>> returning "asn=11051" shows the same problem.  Returning two TXT records,
>> one with "asn=11051" and one with "asn=65432" is just wrong.
>>
>> What is really happening is that rbldnsd is using CIDR notation as an 
>> input
>> format without any expectation that you'll ever have overlapping or
>> conflicting entries, which is the whole point of CIDR in routing.
> 
> 
> ok.  You're looking at this from the routing table perspective, which,
> by definition, gets sorted by prefix length.  Which is not the case IMHO
> for a DNSBL, where all matching entries actually *may* comes together.
> E.g, a hypotetical SPEWS case:
> 
>   10.8.1.0/24  spammer XYZ
>   10.8.0/25    ISP A hosting spammer XYZ
>   10.8.0/19    ISP B netspace (upstream of ISP A, hosting spammer XYZ)
> 
> Well, this set of records may be handled by the same rules as a routing
> table (returning most close record only).  But it is not a mistake to
> return ALL records, as in fact all 3 applies.
> 
> Perhaps you're right here: rbldnsd should return most close entry only.
> This will solve your problem as well.  But I'm afraid this will not as
> easy as it may look like - from the implementation point of view, that
> is.  Such a cases are rare for a DNSBL, and even if all 3 records will
> be returned for 10.8.1.0 (note again results will NOT be consistent if
> e.g. the spammer is on 10.8.0.1/25, which will be stored in /32 array),
> and I just thought I'd leave this as is (until you wrote this your
> email.. ;)
> 
> The problem is with current implementation.  I first build all the
> arrays as data comes in, in the order of input.  At this stage,
> original prefix lengths gets lost (they aren't stored anywhere as
> there was no usage for them).  And only after the whole data has
> been read, arrays gets sorted using qsort.  I.e., when a record
> gets added into array, no lookup is done whenever something is
> present already or not: at this stage, array isn't sorted yet, so
> the lookup will be very slow.  I tried to keep arrays sorted when
> reading data (by always inserting next entry into an appropriate
> place), but ugh...
> 
>  $ time rbldnsd-load-test-current x:ip4set:list.dsbl.org
>  2 sec
>  $ time rbldnsd-load-test-keep-sorted x:ip4set:list.dsbl.org
>  ^C (i'm tired to wait ;)
>  10 min
> 
> ;)
> (not a real things, but this is very close to what I got when
> probing several various methods).  The difference is HUGE - qsort
> at the end compared to all other variants I tried.
> 
> I'm afraid this current implementation just cannot be used for this
> task, unless input is formatted in a way to avoid intersecting ranges.
> Which, in turn, requires some rather funny preprocessing.  I tried to
> keep rbldnsd as simple as possible (not to say that was successeful ;).
> 
> []
> 
>>> It seems you're trying to classify the whole IPv4 space,
>>> which is umm.. difficult.
>>
>>
>> I am in fact trying to classify the whole of IPv4 space.  I'm 
>> collecting and
>> parsing a lot of data from many different sources and trying to make 
>> as much
>> information available about each IP as possible with a single TXT lookup.
> 
> 
> Note that single TXT *lookup* does not really mean single TXT *record*
> (at least, host entries may come in a TXT separate from ASN info - which
> is much simpler to maintain anyway.  Still, there's a problem with several
> ASNs "covering" the same range).
> 
>>> Either way, this is here we go.  It *seems* you're trying
>>> to do something "unusual" (as if there "usual" things ;).
>>
>>
>> I'm not running a DNSBL, and you won't be able to treat this zone as 
>> such. But I believe tools like SpamAssassin might be able to make very 
>> interesting
>> use of it.
> 
> 
> Yes, I see.  Current implementation will NOT work for this.  It is possible
> however to implement e.g. that same patricia trie as a separate dataset
> (ptrie? in addition to ip4set; note it is really *set*, not a routing 
> table).
> Or, to "pre-sort" the data.. which may be rather difficult too, hmm 
> (requires
> substraction operation for CIDRs).
> 
>>> BTW, this question you asked is interesting.  May I share
>>> this email with rbldnsd@corpit.ru mailinglist subscribers?
>>> (I wanted to Cc this list, but thought I'd ask first).
>>
>>
>> Please, go ahead.
> 
> 
> Thank you.
> 
> /mjt
> 
> _______________________________________________
> rbldnsd mailing list
> rbldnsd@corpit.ru
> http://www.corpit.ru/mailman/listinfo/rbldnsd