[rbldnsd] BL data preprocessing

Dmitry Agaphonov rbldnsd@corpit.ru
Sun, 30 Mar 2003 12:30:57 +0400 (MSD)


On Sun, 30 Mar 2003, Michael Tokarev wrote:

MT> Dmitry, I've a question: why did you ask in the first place?  Do you
MT> feel rbldnsd is too slow at loading data?

No, its data loading looks fast enough.  Let me explain what the data is
and what I'm concerned about.

I'm collecting a set of public DNSBL zones, unrolling structures like
$GENERATE or $ORIGIN into sets of IPs, stripping all records other than
IN A (including TXT), converting everything to a plain list of IPs
(123.45.67.89) or subnets (123.45.67 or 123.45), and merging the results
into one list.
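As an illustration of the unrolling step: under an origin such as
67.45.123.bl.example., a directive like `$GENERATE 89-92 $ A 127.0.0.2`
expands to the reversed addresses 123.45.67.89 through 123.45.67.92.  A
simplified sketch of that expansion (the zone name is hypothetical, and
real $GENERATE supports more left-hand-side modifiers than the plain "$"
handled here):

```python
def unroll_generate(origin, lo, hi):
    """Expand $GENERATE lo-hi under a reversed-octet DNSBL origin
    into plain dotted IPs (origin is e.g. "67.45.123.bl.example.").
    Only the plain "$" left-hand side is handled in this sketch.
    """
    # the first three labels of the origin are the reversed upper octets
    o3, o2, o1 = origin.split(".")[:3]
    return ["%s.%s.%s.%d" % (o1, o2, o3, i) for i in range(lo, hi + 1)]

print(unroll_generate("67.45.123.bl.example.", 89, 92))
# ['123.45.67.89', '123.45.67.90', '123.45.67.91', '123.45.67.92']
```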

I'm stripping TXT because the purpose of the list is just to block as
many abusive hosts as possible, without explaining the reason at blocking
time.  The reason is given to a blocked client in some other way.

The merged data is more than 1300000 entries (and will grow) and contains
duplicates (1).  I tried to remove the duplicates in a very simple way,
by loading all records into a hash in a Perl script, but it uses too much
memory:

   VSZ    RSS
171276 149972

It removes about 250000 duplicate entries.  Then, the list still contains
some excessive data (2) like this:

123.45.67      (or even 123.45)
123.45.67.89
123.45.67.90
123.45.67.91
123.45.67.92
123.45.67.123
123.45.67.234
...
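For what it's worth, both problems can be handled in one streaming pass
over a sorted list, without keeping a hash of every entry in memory.  A
sketch of the idea (the input here is a small in-memory list for
illustration; in practice one would pipe the file through an external
`sort` first and stream the result):

```python
def prune(sorted_entries):
    """Drop exact duplicates (1) and entries already covered by a
    listed subnet (2).

    Assumes entries are plain dotted prefixes/IPs ("123.45",
    "123.45.67", "123.45.67.89") in lexicographic order; with a
    digits-and-dots alphabet, everything covered by a prefix sorts
    directly after that prefix, so a small stack suffices.
    """
    kept = []      # surviving entries
    covers = []    # stack of kept prefixes that may still cover lines
    prev = None
    for line in sorted_entries:
        if line == prev:                  # exact duplicate (1)
            continue
        prev = line
        # discard covering prefixes we have sorted past
        while covers and not line.startswith(covers[-1] + "."):
            covers.pop()
        if covers:                        # inside a broader subnet (2)
            continue
        kept.append(line)
        covers.append(line)
    return kept

entries = sorted([
    "123.45.67", "123.45.67.89", "123.45.67.89",
    "123.45.67.90", "123.46.1.2",
])
print(prune(entries))   # ['123.45.67', '123.46.1.2']
```

This keeps memory proportional to the output rather than to the whole
input, at the cost of the external sort.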


So, my question about data preprocessing turns into the following: is
there a need to handle duplicate entries (1) and excessive data (2) in a
list before rbldnsd loads it?


MT> Please try out rbldnsd-0.74pre2 released today (see
MT> http://www.corpit.ru/mjt/rbldnsd/) - it contains some
MT> rather significant modifications in zone loading code.

I'll definitely try it!


-- 
Dmitry Agaphonov