Spam filtering via modified Bloom filter
I attended a good talk today, by Kang Li of the University of Georgia. He's experimenting with new spam filtering technology based on a modified Bloom filter. The goal is very high performance and he seems to have achieved it.
Bloom filtering was invented in the '70s by Burton Bloom. It's an extremely fast method for determining if an element is in a set. False positives are possible but false negatives are not. The odds of a false positive can be made very low if the hash bitmap is sufficiently large.
While Bloom filtering will only tell you if an element is a member of a set, Kang Li's variant allows you to attach a coarse quantitative value to the set elements. In this case, the quantity is "spaminess".
The bottom line is that Kang's variant allows extremely fast lookups of tokens for the purpose of measuring spaminess. the algorithm is a couple orders of magnitude faster than most of the Bayesian filter techniques currently in use. If the hash bitmap is large enough, the accuracy is also comparable.
Disadvantages: limited incremental training. Once the filter is made, tokens can be added to it, but the spaminess of existing tokens can't be changed.
Unfortunately, Kang's research is not yet on line. I'll try to update this post as new information becomes available.
Bloom filtering was invented in the '70s by Burton Bloom. It's an extremely fast method for determining if an element is in a set. False positives are possible but false negatives are not. The odds of a false positive can be made very low if the hash bitmap is sufficiently large.
While Bloom filtering will only tell you if an element is a member of a set, Kang Li's variant allows you to attach a coarse quantitative value to the set elements. In this case, the quantity is "spaminess".
The bottom line is that Kang's variant allows extremely fast lookups of tokens for the purpose of measuring spaminess. the algorithm is a couple orders of magnitude faster than most of the Bayesian filter techniques currently in use. If the hash bitmap is large enough, the accuracy is also comparable.
Disadvantages: limited incremental training. Once the filter is made, tokens can be added to it, but the spaminess of existing tokens can't be changed.
Unfortunately, Kang's research is not yet on line. I'll try to update this post as new information becomes available.
3 Comments:
This *does* sound interesting; I'd been looking into the use of Bloom filters for anti-spam a while back, but never came up with anything useful.
Thanks for the pointer!
This looks like the paper:
Fast Statistical Spam Filter by Approximate Classifications
"Disadvantages: limited incremental training. Once the filter is made, tokens can be added to it, but the spaminess of existing tokens can't be changed."
A counting Bloom filter is an improvement to standard a Bloom filter as it allows dynamic additions and deletions of set membership information.
http://en.wikipedia.org/wiki/Bloom_filter#Counting_filters
Post a Comment
<< Home