Bayesian Filtering vs Spammers

Like Joel, I've been using Spambayes to keep spam under control for a while now, and it really does work great. But spammers are a crafty lot, and most spam now comes padded with a long list of valid words to try to get around the way Bayesian filtering works.


Take a look at this email:


Ujs aws Have more fr.e.e dkmr gm o Be in control of your destiny... s pmxipl hu vuock Say no to the workforce and the rat race... rx vnymi lc ko Don't be emp1oyed . . . be Se1f-employed. c lxb uf dh We show you how to make the change. ya dxemf ii uy

Find the pot of Go1d at the end of the rainbow. ok m r mewmys r fifiek k mq Quit service available at website  n cs qb robgu d dldekp dg tl The JanKang Co located at No.101 Village Guanjiayin SongShan District Chifeng city the Inner Mongolia Autonom.ous Regi.on China gives you the advertis.ement below. t les p wd ds yjaib jq utrjg  Francis Bacon, BAritish phlosopher , h 10ba59g,ryvxx ck cnxh.

The bogus text in this message is rendered invisible: the message is sent as HTML, and the garbage text, which is there only to confuse the spam filters, is set to the same colour as the background. Words that would have a negative score in the filter, such as “free” and “self-employed”, are obfuscated as “fr.e.e” and “se1f-employed” (a one instead of the letter l).
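To make the obfuscation trick concrete, here's a rough sketch of how a tokenizer might undo it before scoring. This is my own illustration, not code taken from Spambayes; the function name `normalize_token` and the look-alike table are made up for the example, and it only handles the two tricks seen in the message above.

```python
import re

# Hypothetical table of digits that spammers swap in for look-alike letters.
LOOKALIKES = {"1": "l", "0": "o"}

def normalize_token(token: str) -> str:
    """Undo two simple obfuscations: dots inside a word and digit look-alikes."""
    word = token.lower()
    # Drop a dot only when it sits between two letters ("fr.e.e" -> "free").
    word = re.sub(r"(?<=[a-z])\.(?=[a-z])", "", word)
    # Swap a digit only when it has a letter on both sides ("se1f" -> "self").
    word = re.sub(r"(?<=[a-z])[01](?=[a-z])",
                  lambda m: LOOKALIKES[m.group()], word)
    return word

print(normalize_token("fr.e.e"))         # free
print(normalize_token("Se1f-employed"))  # self-employed
print(normalize_token("No.101"))         # no.101 (left alone)
```

The catch is that the same rules would also mangle legitimate text such as domain names ("example.com" loses its dot), so a real parser would need to be a good deal more careful than this.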

Another message has paragraphs like:

item blockade caviar teleprinter stamford fomalhaut catatonia achromatic dallas yamaha preoccupy alluvial egypt winslow abdomen selector discordant cornucopia video typewritten administratrix coach catatonia parr percy bergson whirlpool walkover deceptive profane drudgery glutamic enlargeable donnelly consolidate imprecision anagram decimate mathieu sentinel ambition chemisorb monitor repeal shutoff confocal ampex repairman consume burette cb package bold catkin commonality airflow certiorari speedup nanking seek stem squashberry ethos abridge refinery exclamatory dichloride mahogany hello columbine mackintosh dusk peaceful cortex don't baboon brilliant ace transoceanic centric detain mini attract daybed insurmountable tote fmc already ambiguity empathy olympia murre playwriting bohemia aforethought heathkit hymnal seismic carbonate monitor pacifist valiant wildcat tuesday canterelle problem integrand rhenish epicurean arcsin reflect finessing burro eerie belligerent anthony burdock charcoal augur canto brag acre camille o'dell glisten appointe plaid wheel schiller lacuna conscript indubitable loquacity hero gossip filigree hughes speck bivalve crap mcdaniel mendelevium seminarian spumoni fiendish bodybuilder podge accession awesome dovetail familial extol stellar case choreograph arisen indomitable bloomington instruct flanders claire axolotl quintet lipread discrepant bettor aphid damascus

These are, of course, all valid words, so when you do click the “This is Spam” button to train Spambayes, you're probably undermining its ability to correctly identify good mail (generating more false positives).
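Here's a small sketch of why that training hurts. It uses a deliberately simplified per-word estimate (the word's spam frequency as a share of its combined spam and ham frequency), which is an assumption for illustration and not Spambayes's actual scoring. The counts are invented.

```python
def spam_probability(spam_count, ham_count, total_spam, total_ham):
    """Simplified per-word estimate: spam frequency over combined frequency."""
    spam_freq = spam_count / total_spam
    ham_freq = ham_count / total_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # unseen word: no evidence either way
    return spam_freq / (spam_freq + ham_freq)

# Invented counts: "tuesday" appears in 20 of 100 good messages and in no spam.
before = spam_probability(0, 20, total_spam=100, total_ham=100)

# Now train on 10 more spam messages that were padded with "tuesday".
after = spam_probability(10, 20, total_spam=110, total_ham=100)

print(before, after)  # 0.0 0.3125
```

A perfectly innocent word that used to be strong evidence of good mail has drifted a third of the way towards "spam", and the same thing happens to every other padding word in the message.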

What's the solution? No solution is going to work forever, but two things would help against this round of threats to the effectiveness of Bayesian filtering. First, the parser could be made smart enough to know when spammers are using tricks to hide text (there are lots of ways of doing this, so the parser would have to be pretty darn smart). Second, look at the source of the email you're filtering before clicking the “Delete as Spam” button. If the message has lots of words that appear in regular conversation, you're probably better off just deleting the message rather than training the filter on it.
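As a taste of what a "smarter parser" might do, here's a minimal sketch that catches just one hiding trick: text in a `<font color=...>` that matches the `<body bgcolor=...>`. The class name `HiddenTextFinder` is made up, it assumes colours are written identically in both places, and it ignores CSS, tables, tiny fonts, and every other way of hiding text.

```python
from html.parser import HTMLParser

class HiddenTextFinder(HTMLParser):
    """Separate text drawn in the page's background colour from visible text."""

    def __init__(self):
        super().__init__()
        self.background = None  # value of <body bgcolor=...>, lowercased
        self.color_stack = []   # effective colour of each open <font> tag
        self.hidden = []        # text the reader cannot see
        self.visible = []       # text the reader can see

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            bgcolor = attrs.get("bgcolor")
            self.background = bgcolor.lower() if bgcolor else None
        elif tag == "font":
            color = attrs.get("color")
            # A <font> without a colour inherits the enclosing one.
            inherited = self.color_stack[-1] if self.color_stack else None
            self.color_stack.append(color.lower() if color else inherited)

    def handle_endtag(self, tag):
        if tag == "font" and self.color_stack:
            self.color_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        current = self.color_stack[-1] if self.color_stack else None
        if current is not None and current == self.background:
            self.hidden.append(text)
        else:
            self.visible.append(text)

finder = HiddenTextFinder()
finder.feed('<body bgcolor="#FFFFFF"><font color="#ffffff">dkmr gm o</font>'
            'Be in control of your destiny...</body>')
print(finder.hidden)   # ['dkmr gm o']
print(finder.visible)  # ['Be in control of your destiny...']
```

A filter could then score only the visible text, or treat the mere presence of hidden text as a spam clue in its own right.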