[rbldnsd] Re: rbldnsd and CIDR heirarchy
Michael Tokarev
rbldnsd@corpit.ru
Sun, 17 Aug 2003 19:40:44 +0400
[intentionally top-posting]
Aaron, please review current 0.98pre version: it includes ip4trie
dataset (implemented as an excersise - i tried to understand how
the ptrie works), which seems to be exactly what you need. Currently,
it only allows *one* value per CIDR range, but this seems to be ok.
Note please this is experimental feature, and I can't guarantee it
will not be removed (there is no real application for such a dataset
IMHO).
/mjt
Michael Tokarev wrote:
> Aaron Hopkins wrote:
>
>> -----BEGIN PGP SIGNED MESSAGE-----
>>
>>
>>>> 0.0.0.0/0:0.0.0.0:use=unallocated
>>>
>>>
>>> Hmm. This one is invalid for rbldnsd. If you want to list the whole
>>> 'net,
>>> you have to use at least two entries:
>>
>>
>> Sorry if I wasn't clear here. I wasn't providing the data set that I had
>> hacked to make partially work, but rather the one that I had hoped to
>> work.
>
>
> Ok... I'm assuming the next example is of the same sort... ;)
>
> []
>
>>> Another issue with storing netmask is that lookups will need to be done
>>> differently (e.g. using patricia trie), which in turn requires different
>>> data structures (e.g. for patricia trie, each entry will be at least 20
>>> bytes long compared to currrent 8).
>>
>>
>> Looking over the code, I believe it should be possible to do what I want
>> without needing a patricia tree. By sorting the input data by masklen
>> (either by defining the input data as order-dependent or having
>> rbldnsd do
>> it itself) and offering the parser something like:
>>
>> $SOA 2d noupdate.mxes.org. root.mxes.org. 0 1h 15m 14d 1d
>> $NS 2d a.ns.mxes.org
>> $NS 2d b.ns.mxes.org
>> $TTL 1d
>> 209.151.224.0/22 127.0.0.2 use=allocated;asn=65432
>> 209.151.224.0/19 127.0.0.2 use=allocated;asn=11051
>
>
> It should be either
> 209.151.224.0/19 use=allocated;asn=11051
> or
> 209.151.224.0/19 :127.0.0.2:use=allocated;asn=11051
> (127.0.0.2 is the default anyway). I showed the wrong syntax too
> in my previous email ;)
>
>> 192.168.0.0/16 127.0.0.2 use=rfc1918
>> 172.16.0.0/12 127.0.0.2 use=rfc1918
>> 10.0.0.0/8 127.0.0.2 use=rfc1918
>> 224.0.0.0/3 127.0.0.2 use=multicast
>> 0.0.0.0/1 127.0.0.2 use=unallocated
>> 128.0.0.0/1 127.0.0.2 use=unallocated
>
>
> ok
>
>> The parser could be made to just not add duplicate (expanded) entries
>> into
>> the lookup table. 209.151.224.0/22 would fill in 209.151.224 through
>> 209.151.227 with "use=allocated;asn=65432", and then when it
>> goes to insert 209.151.224.0/19, it skips over 209.151.224 through
>> 209.151.227 as they already have entries and only applies
>> "use=allocated;asn=11051" to 209.151.228 through 209.151.255.
>
>
> I see what you mean now. Ok... ;)
>
>> 10 and 224-255 also get filled in and are left alone when 0/1 and
>> 128/1 get
>> added.
>>
>> Not using a patricia tree for this does require completely rebuilding the
>> lookup table to add new data, but rbldnsd seems to do that anyway.
>
>
> Well, not quite. See below.
>
>>> But in fact, this is the result of mixing different things together, not
>>> the design. You're trying to mix "networks by usage" together with
>>> hosts
>>> with various problems.
>>
>>
>> This isn't a problem specific to hosts. In my example above,
>> 209.151.224.0/22 returning "asn=65432" and the remainder of
>> 209.151.224.0/19
>> returning "asn=11051" shows the same problem. Returning two TXT records,
>> one with "asn=11051" and one with "asn=65432" is just wrong.
>>
>> What is really happening is that rbldnsd is using CIDR notation as an
>> input
>> format without any expectation that you'll ever have overlapping or
>> conflicting entries, which is the whole point of CIDR in routing.
>
>
> ok. You're looking at this from the routing table perspective, which,
> by definition, gets sorted by prefix length. Which is not the case IMHO
> for a DNSBL, where all matching entries actually *may* comes together.
> E.g, a hypotetical SPEWS case:
>
> 10.8.1.0/24 spammer XYZ
> 10.8.0/25 ISP A hosting spammer XYZ
> 10.8.0/19 ISP B netspace (upstream of ISP A, hosting spammer XYZ)
>
> Well, this set of records may be handled by the same rules as a routing
> table (returning most close record only). But it is not a mistake to
> return ALL records, as in fact all 3 applies.
>
> Perhaps you're right here: rbldnsd should return most close entry only.
> This will solve your problem as well. But I'm afraid this will not as
> easy as it may look like - from the implementation point of view, that
> is. Such a cases are rare for a DNSBL, and even if all 3 records will
> be returned for 10.8.1.0 (note again results will NOT be consistent if
> e.g. the spammer is on 10.8.0.1/25, which will be stored in /32 array),
> and I just thought I'd leave this as is (until you wrote this your
> email.. ;)
>
> The problem is with current implementation. I first build all the
> arrays as data comes in, in the order of input. At this stage,
> original prefix lengths gets lost (they aren't stored anywhere as
> there was no usage for them). And only after the whole data has
> been read, arrays gets sorted using qsort. I.e., when a record
> gets added into array, no lookup is done whenever something is
> present already or not: at this stage, array isn't sorted yet, so
> the lookup will be very slow. I tried to keep arrays sorted when
> reading data (by always inserting next entry into an appropriate
> place), but ugh...
>
> $ time rbldnsd-load-test-current x:ip4set:list.dsbl.org
> 2 sec
> $ time rbldnsd-load-test-keep-sorted x:ip4set:list.dsbl.org
> ^C (i'm tired to wait ;)
> 10 min
>
> ;)
> (not a real things, but this is very close to what I got when
> probing several various methods). The difference is HUGE - qsort
> at the end compared to all other variants I tried.
>
> I'm afraid this current implementation just cannot be used for this
> task, unless input is formatted in a way to avoid intersecting ranges.
> Which, in turn, requires some rather funny preprocessing. I tried to
> keep rbldnsd as simple as possible (not to say that was successeful ;).
>
> []
>
>>> It seems you're trying to classify the whole IPv4 space,
>>> which is umm.. difficult.
>>
>>
>> I am in fact trying to classify the whole of IPv4 space. I'm
>> collecting and
>> parsing a lot of data from many different sources and trying to make
>> as much
>> information available about each IP as possible with a single TXT lookup.
>
>
> Note that single TXT *lookup* does not really mean single TXT *record*
> (at least, host entries may come in a TXT separate from ASN info - which
> is much simpler to maintain anyway. Still, there's a problem with several
> ASNs "covering" the same range).
>
>>> Either way, this is here we go. It *seems* you're trying
>>> to do something "unusual" (as if there "usual" things ;).
>>
>>
>> I'm not running a DNSBL, and you won't be able to treat this zone as
>> such. But I believe tools like SpamAssassin might be able to make very
>> interesting
>> use of it.
>
>
> Yes, I see. Current implementation will NOT work for this. It is possible
> however to implement e.g. that same patricia trie as a separate dataset
> (ptrie? in addition to ip4set; note it is really *set*, not a routing
> table).
> Or, to "pre-sort" the data.. which may be rather difficult too, hmm
> (requires
> substraction operation for CIDRs).
>
>>> BTW, this question you asked is interesting. May I share
>>> this email with rbldnsd@corpit.ru mailinglist subscribers?
>>> (I wanted to Cc this list, but thought I'd ask first).
>>
>>
>> Please, go ahead.
>
>
> Thank you.
>
> /mjt
>
> _______________________________________________
> rbldnsd mailing list
> rbldnsd@corpit.ru
> http://www.corpit.ru/mailman/listinfo/rbldnsd