We have created the RXL plugin. It implements RBLs (Realtime Blackhole Lists) and RWLs (Realtime Whitehole Lists). Both techniques rely on the same technology; the difference between them is the score assigned to the rules that query each kind of list. RBL rules should carry a positive score and RWL rules a negative one.
Realtime lists are built on DNS (Domain Name System). To check whether 18.104.22.168 is listed in DNSWL.ORG (http://www.dnswl.org), we resolve the domain 168.22.104.18.list.dnswl.org (the address with its octets reversed, followed by the list suffix), following the instructions provided by DNSWL.ORG. Queries to other lists work the same way with a different suffix. For instance, to check the same server against the Spamhaus ZEN list we use the suffix zen.spamhaus.org (168.22.104.18.zen.spamhaus.org). Check this list of RXLs.
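The lookup described above can be sketched in a few lines of Python. This is only an illustration of the general DNS list convention, not the plugin's code: the function names rxl_query_name and rxl_lookup are hypothetical, and the address 192.0.2.10 is a documentation address chosen for the example.

```python
import ipaddress
import socket


def rxl_query_name(ip, list_suffix):
    """Build the DNS name to resolve: reversed IPv4 octets plus the list suffix."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # also validates the address
    return ".".join(reversed(octets)) + "." + list_suffix


def rxl_lookup(ip, list_suffix):
    """Return the address answered by the list, or None when the IP is not listed.

    Assumption: a failed resolution is treated as "not listed".
    """
    try:
        return socket.gethostbyname(rxl_query_name(ip, list_suffix))
    except socket.gaierror:
        return None


print(rxl_query_name("192.0.2.10", "zen.spamhaus.org"))
```

Only rxl_lookup touches the network; building the query name is a pure string operation, which keeps it easy to test.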
We are now developing a cache for the queries. The documentation will be updated accordingly.
You can now call rxl_check and rxl_check_octect to develop your rules: rxl_check(<list_suffix>[, <number_received_header>]) and rxl_check(<list_suffix>, <octet number>, <octet value>[, <number_received_header>]). For instance:
header IN_ZEN_SPAMHAUS rxl_check("zen.spamhaus.org", 3)
score IN_ZEN_SPAMHAUS 3
header IN_SBL_SPAMHAUS rxl_check("zen.spamhaus.org", 4, 2, 3)
describe IN_SBL_SPAMHAUS According to the Spamhaus documentation, the address is in SBL when the result is 127.0.0.2
score IN_SBL_SPAMHAUS 1
header IN_DNSWL rxl_check("list.dnswl.org", 3)
score IN_DNSWL -3
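The octet form of the check in the IN_SBL_SPAMHAUS rule can be read as a comparison on the address returned by the list. The following Python sketch shows one plausible reading, assuming octets are numbered from 1 starting at the left, so that octet 4 of 127.0.0.2 is 2. The helper name octet_matches is hypothetical and is not part of the plugin.

```python
def octet_matches(answer, octet_number, octet_value):
    """Return True when the given octet of a list answer equals the expected value.

    answer is the dotted address returned by the list, or None if not listed.
    Assumption: octet_number is 1-based and counted from the left.
    """
    if answer is None:
        return False
    octets = answer.split(".")
    return int(octets[octet_number - 1]) == octet_value


# An answer of 127.0.0.2 matches "octet 4 equals 2".
print(octet_matches("127.0.0.2", 4, 2))
```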
SPF is now working. We are now adding caches for SPF responses in order to reduce processing time. The idea behind the cache is to avoid computing the same SPF response for an email 5 times. We used the same scheme in the Bayes plugin.
We have completed the HTML part parsing functionality in Wirebrush. Bayes processing is now faster because HTML tags, which were treated as tokens in previous versions, have been removed.
Wirebrush will be presented on March 15 at IPL, Leiria (Portugal). David, Noemi and Moncho will explain to computer science students the main features of Wirebrush that improve the speed of spam filters.
Parsing HTML is now possible. By next Monday we expect Wirebrush to be parsing HTML parts. The HTML parser is again a finite-state machine, so it will run fast.
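To illustrate why a finite-state machine suits this job, here is a minimal Python sketch that extracts text from HTML in a single pass. It is not the Wirebrush parser: the three states and the function name strip_tags are assumptions made for the example, and it ignores comments, scripts and character entities.

```python
def strip_tags(html):
    """Return the text outside HTML tags using a three-state machine."""
    TEXT, TAG, QUOTE = range(3)
    state, quote, out = TEXT, "", []
    for ch in html:
        if state == TEXT:
            if ch == "<":
                state = TAG              # a tag starts: stop emitting characters
            else:
                out.append(ch)
        elif state == TAG:
            if ch == ">":
                state = TEXT             # the tag ends: resume emitting
            elif ch in "\"'":
                state, quote = QUOTE, ch  # attribute value: '>' inside is not a tag end
        else:  # QUOTE
            if ch == quote:
                state = TAG
    return "".join(out)


print(strip_tags('<a href="x>y">Hi</a> there'))
```

Each character is examined exactly once and the machine keeps no history beyond its current state, which is what makes this style of parser fast.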
The SPF and RelayLists plugins are coming soon. We expect to get them working by next Monday.
EMLStructureParser is now able to dump text from text/* parts. We are finishing the code. Naïve Bayes will be able to take advantage of this feature.
The extraction of text from an email is done only once. We use a caching scheme to avoid computing it repeatedly.
We also check for some errors during parsing in order to detect malformed RFC 2822 files.
All of this is done with a finite-state machine that uses a stack when parsing body parts.
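The idea of a state machine with a stack can be sketched as follows in Python. This is a simplified illustration and not the EMLStructureParser code: the name leaf_parts, the three states and the input format (a list of body lines plus the top-level boundary) are assumptions, and each header is assumed to fit on a single line. The stack holds the boundaries of the multipart containers that are currently open.

```python
import re

BOUNDARY_RE = re.compile(r'boundary="?([^";]+)"?', re.IGNORECASE)


def leaf_parts(lines, top_boundary):
    """Return (depth, content_type, text) for every non-multipart part."""
    stack = [top_boundary]          # innermost open boundary is stack[-1]
    state = "PREAMBLE"              # PREAMBLE -> HEADERS -> BODY
    ctype, pending, body, parts = "text/plain", None, [], []

    for line in lines:
        opens = bool(stack) and line == "--" + stack[-1]
        closes = bool(stack) and line == "--" + stack[-1] + "--"
        if opens or closes:
            if state == "BODY":     # a leaf part just ended: record it
                parts.append((len(stack), ctype, "\n".join(body)))
            if closes:
                stack.pop()         # leave the current multipart container
                state = "PREAMBLE"
            else:
                state, ctype, pending, body = "HEADERS", "text/plain", None, []
        elif state == "HEADERS":
            if line == "":          # a blank line ends the part headers
                if ctype.startswith("multipart/") and pending:
                    stack.append(pending)   # descend into the nested container
                    state = "PREAMBLE"
                else:
                    state = "BODY"
            else:
                if line.lower().startswith("content-type:"):
                    ctype = line.split(":", 1)[1].split(";")[0].strip().lower()
                match = BOUNDARY_RE.search(line)
                if match:
                    pending = match.group(1)
        elif state == "BODY":
            body.append(line)
        # lines seen in PREAMBLE (preamble or epilogue text) are ignored
    return parts
```

A caller interested only in text for the Bayes tokenizer could keep the entries whose content type starts with text/ and discard the rest.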
Imagine a filter containing more than 1 call to the check_bayes function provided by the bayes plugin (for instance, 9 rules similar to those included in the Debian or Ubuntu SpamAssassin filters). Every time check_bayes is called, the probability of the email being spam is computed, which involves several lookups in a Berkeley database. This means a lot of unnecessary computation. Why compute the same probability 9 times? To solve this problem, we have implemented a small configurable cache. A cache able to store 1 element is enough to avoid the problem. Moreover, the cache can be useful when the same email is delivered to more than one address in the same domain (and so filtered with the same filter), since it only needs to be classified once.
Now, a filter with 9 Bayes rules runs in less than 35 milliseconds. Really fast. I want to congratulate us :).
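The effect of the cache described above can be illustrated with a short Python sketch. It is not the Wirebrush implementation: the class name SmallCache, the least-recently-used eviction policy and the use of a message identifier as key are assumptions made for the example.

```python
from collections import OrderedDict


class SmallCache:
    """A tiny cache with a configurable capacity (assumed LRU eviction)."""

    def __init__(self, capacity=1):
        self.capacity = capacity
        self._items = OrderedDict()

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute(key) only on a miss."""
        if key in self._items:
            self._items.move_to_end(key)      # mark as most recently used
            return self._items[key]
        value = compute(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict the least recently used entry
        return value
```

With capacity 1, nine consecutive rules asking for the probability of the same message trigger a single computation; the other eight calls are answered from the cache.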
Yesterday we completed the development of MIME part parsing. Now we will work on improving tokenization for Naïve Bayes so that non-text parts are not tokenized.
Yesterday we worked on fixing some memory leaks and making Wirebrush more stable. We are now resolving some final issues with freeing dynamic memory.