[rbldnsd] BL data preprocessing
Michael Tokarev
rbldnsd@corpit.ru
Sun, 30 Mar 2003 17:45:33 +0400
Dmitry Agaphonov wrote:
> On Sun, 30 Mar 2003, Michael Tokarev wrote:
>
> MT> Dmitry, I've a question: why did you ask in the first place? Do you
> MT> feel rbldnsd is too slow loading data?
>
> No, its data loading looks fast enough. Well, here's what the data is
> and what I'm concerned about.
>
> I'm collecting a set of public dnsbl zones, unrolling some structures like
> $GENERATE or $ORIGIN into a set of IPs, stripping any records other than
> IN A (stripping TXT too), converting to a plain list of IPs (123.45.67.89)
> or subnets (123.45.67 or 123.45) and merging the results into one list.
Aha. Ok.
> I'm stripping TXT because the purpose of the list is just to block as
> many abusive hosts as possible without explaining the reason when
> blocking. The reason is given to a blocked client in some other way.
>
> The merged data is more than 1300000 entries (and will grow) and contains
> duplicates (1). I tried to remove the duplicates in a very simple way, by
> building a hash of all records in a Perl script, but it uses too much
> memory:
Hmm.. 1300000 entries, at 4 bytes each, is about 5000 Kb, or 5 Mb, of IP
addresses. That's quite ok - provided you're using ip4set, NOT ip4vset (for
the latter, the memory requirement is three times that of ip4set, and you
don't use TXTs or different As anyway). It should be pretty OK to have that
amount of data.
Rbldnsd will not try to remove dups: it's impossible _when loading_ (well,
it's certainly possible, of course, but it would make rbldnsd much slower).
Once the data is loaded into an unsorted array, qsort is damn fast at
sorting it - provided your OS has a good qsort implementation, like e.g.
the one in glibc: not all qsorts are made equal, and a "traditional" qsort
is not good at sorting already-sorted data - pre-sorted input is the worst
case for a traditional quicksort, where the complexity degrades to N*N
(N being the number of entries), as opposed to N*log(N) in the normal case.
After reading and sorting, it would be possible to walk over the sorted
array and remove dups, but I didn't bother doing so: at that point the data
has already been read into memory (so the memory is allocated anyway),
removing some entries will not make much difference to subsequent searches
(search complexity is log(N)), and in most cases the data will NOT contain
many duplicates (so checking a large array for dups would be an unnecessary
waste of time in the common case).
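For illustration, the walk itself is trivial. A minimal sketch, assuming
entries are bare 32-bit addresses as in ip4set - the names below are mine,
not taken from the rbldnsd sources:

  #include <stdlib.h>

  typedef unsigned ip4addr_t;  /* 4-byte host-order IPv4 address */

  static int cmp_ip4(const void *a, const void *b) {
    ip4addr_t x = *(const ip4addr_t *)a, y = *(const ip4addr_t *)b;
    return x < y ? -1 : (x > y ? 1 : 0);  /* not x - y: that may overflow */
  }

  /* after qsort(e, n, sizeof(*e), cmp_ip4): squeeze out duplicates
   * in one linear pass, returning the new entry count */
  static unsigned dedup_sorted(ip4addr_t *e, unsigned n) {
    unsigned i = 0, j;
    for (j = 1; j < n; ++j)
      if (e[j] != e[i])
        e[++i] = e[j];  /* each surviving entry is moved at most once */
    return n ? i + 1 : 0;
  }

Note how an entry is copied only when a new distinct value is seen - this
is the "non-dumb" way I warn about below.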
Well, with TXT records included, things aren't that nice anymore, but in
this case no dups can be removed either, because there are no dups (entries
with the same IP refer to different As or TXTs). Hmm. Since ip4vset uses
one large array for all entries, and every entry is 12 bytes long (on i386),
as opposed to ip4set, where one entry is 4 bytes long - only an IP address -
I'm wondering whether it's ok to make the memory requirement per entry even
bigger (by one extra pointer) and use an array of POINTERS to entries, not
an array of entries. This way, rbldnsd would require more memory in total,
but the largest single object would be 3 times smaller. Interesting, is it
worth the effort? ;) But we can't address more than 2Gb anyway... ;)
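To illustrate the size difference (my guess at the layouts, not the actual
rbldnsd declarations):

  typedef unsigned ip4addr_t; /* an ip4set entry: 4 bytes, the address only */

  struct vset_entry {         /* an ip4vset-style entry: 12 bytes on i386 */
    ip4addr_t addr;
    const char *a_rr;         /* A record value */
    const char *txt_rr;       /* TXT record value */
  };

  /* the variant in question: one big array of pointers (4 more bytes per
   * entry on i386), each pointing to a separately-allocated 12-byte entry -
   * more memory in total, but the largest single object becomes N*4 bytes
   * instead of N*12, i.e. 3 times smaller */
  struct vset_entry **entries;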
> VSZ RSS
> 171276 149972
>
> It removes about 250000 duplicate entries. Then the list still contains
> some excessive data (2), like this:
That's less than 1/4 of the entries. I don't think you'll gain anything by
removing dups, even if rbldnsd did that.
Either way, for this particular case, I think you could write a little
program that'll do the work. Take a look at the rbldnsd_ip4set.c file to
see how the data gets read into memory. This reading part could be
extracted into a separate executable, with duplicate removal performed
after that, writing the results to another file. (But be warned about two
things: the qsort implementation issue as per the above, and - don't
attempt to remove dups in a "dumb" way; that is, move every entry
at most once, not many times... ;)
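Such a preprocessing filter could be as small as the following sketch
(it reuses cmp_ip4() and dedup_sorted() from above, and handles plain
dotted-quad entries only - note that inet_aton() would silently mis-parse
truncated subnet forms like 123.45.67, so those need separate handling):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <arpa/inet.h>

  #define MAXENT 2000000  /* room for ~2M entries, an 8Mb static array */

  int main(void) {
    static ip4addr_t e[MAXENT];
    unsigned i, n = 0;
    char line[64];
    struct in_addr a;
    while (n < MAXENT && fgets(line, sizeof(line), stdin)) {
      line[strcspn(line, " \t#\r\n")] = '\0';  /* cut comments and EOL */
      if (inet_aton(line, &a))                 /* skip non-address lines */
        e[n++] = ntohl(a.s_addr);
    }
    qsort(e, n, sizeof(*e), cmp_ip4);
    n = dedup_sorted(e, n);
    for (i = 0; i < n; ++i) {
      a.s_addr = htonl(e[i]);
      puts(inet_ntoa(a));
    }
    return 0;
  }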
[]
> So, my question about data preprocessing turns into the following. Is
> there a need to handle duplicate entries (1) and excessive data (2) in a
> list before rbldnsd loads it?
I hope I answered this above. But if you really have many dups (e.g. if
after removing dups you'd be left with half the entries), it may be worth
removing them. It all depends on what you want to get and at what cost: if
rbldnsd works fine on the original data, why bother removing dups and
wasting resources _here_ (and making the whole thing less easy to use)?
BTW, it seems rbldnsd may benefit from the ability to load data from
several files into one zone... Ok, point taken - that would already be
good to have for me too... ;) The only question is how to specify all
this stuff.
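(Purely as a guess at possible syntax - nothing is decided yet - the zone
specification on the command line could simply accept a comma-separated
file list, something like:

  rbldnsd -b 127.0.0.1 bl.example.com:ip4set:list1,list2

where both list1 and list2 would be merged into the same ip4set zone.)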
> MT> Please try out rbldnsd-0.74pre2 released today (see
> MT> http://www.corpit.ru/mjt/rbldnsd/) - it contains some
> MT> rather significant modifications in zone loading code.
>
> Will try definitely!
Hmm. In your case, it will make NO difference AT ALL. The changes affect
TXT records only, not the way IP addresses are stored or read.
/mjt