An End to Spam
John Phillips | November 18, 2004
If you are getting overwhelmed with spam, you can put an end to it, and you won’t have to pay Bill Gates a nickel for every message you send, or wait until 2006.
The solution? Something called a Bayesian filter.
A Bayesian filter is a type of content filter. Content filters examine the text of a message and make a judgment about whether the message is spam or not. Simple content filters effectively just scan for words commonly used in spam messages, words like “viagra”, “mortgage” or “free”. In these simple filters, the list of spam-predictive words is hard-wired into the filter. The more sophisticated ones, including the Bayesian filters, effectively learn what you consider spam, and use this “knowledge” to block incoming spam.
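To make the contrast concrete, here is a minimal sketch of the simple kind of filter, with a hard-wired word list. The list and the function name are made up for illustration and are not taken from any real product.

```python
# Hypothetical hard-wired word list; a real filter would ship a much longer one.
SPAM_WORDS = {"viagra", "mortgage", "free"}

def looks_like_spam(message: str) -> bool:
    """Return True if any hard-wired spam word appears in the message."""
    # Lowercase the text and strip common punctuation from each word.
    words = (w.strip(".,!?\"'()") for w in message.lower().split())
    return any(w in SPAM_WORDS for w in words)

print(looks_like_spam("Get a FREE mortgage quote today!"))   # True
print(looks_like_spam("Lunch on Thursday?"))                 # False
print(looks_like_spam("Feel free to call me after lunch."))  # True
```

The last call shows the weakness of the approach: a harmless message gets flagged because it happens to contain one word from the list.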
How Bayesian Filters Work
Named after the Reverend Thomas Bayes (b. 1702, London - d. 1761, Tunbridge Wells, Kent) and based on a statistical theory he developed, Bayesian filters work by examining a collection of spam and non-spam messages and by classifying the words[1] used in these collections. To be effective, you need to “train” the filter with about 600[2] spam and non-spam messages.
A Bayesian filter examines all the words in the spam and non-spam collections. Then it assigns a probability to each word, based on the likelihood of that word appearing in a spam or “good” message. Think of it this way: most words are neutral; they don’t predict anything at all about the messages. Words like “the” and “is” are likely to appear with equal frequency in the spam and good message collections. Words like “free”, “TODAY” and “Viagra!” are far more likely to be used in spam messages. The words that best predict a non-spam message are harder to generalize; they depend on what you regularly discuss in email. [3]
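The per-word probability described above can be sketched roughly as follows. This is a simplified illustration and not the formula of any particular filter: the function name, the tiny sample collections and the clamping to the range 0.01 to 0.99 are all assumptions, and it expects both collections to be non-empty.

```python
from collections import Counter

def word_probabilities(spam_msgs, good_msgs):
    """Estimate, for each word, the probability that a message containing it is spam."""
    # Count each word once per message, so one repetitive message can't dominate.
    spam_counts = Counter(w for m in spam_msgs for w in set(m.split()))
    good_counts = Counter(w for m in good_msgs for w in set(m.split()))
    probs = {}
    for word in spam_counts.keys() | good_counts.keys():
        spam_freq = spam_counts[word] / len(spam_msgs)  # share of spam containing the word
        good_freq = good_counts[word] / len(good_msgs)  # share of good mail containing it
        p = spam_freq / (spam_freq + good_freq)
        probs[word] = min(0.99, max(0.01, p))           # assumed clamp: never fully certain
    return probs

# Tiny made-up collections; a real filter needs hundreds of messages.
spam = ["the free Viagra! offer ends TODAY", "this is the free offer"]
good = ["the meeting is today", "the report is attached"]

probs = word_probabilities(spam, good)
print(probs["the"], probs["free"], probs["meeting"])  # 0.5 0.99 0.01
```

The neutral word “the” lands at 0.5, a word seen only in spam is pushed to the top of the range, and a word seen only in good mail is pushed to the bottom.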
A well-written content filter doesn’t simply mark every message containing a “spammy” word as spam. Instead, an intelligent content filter will use a small number of words that have the highest predictive value. Say that all words are assigned a probability of between 0 and 100%. The neutral words are all around a 50% probability. The words that appear mostly in legitimate messages have spam probabilities that are close to zero. The best “spam predictor” words have probabilities near 100%. An intelligent filter will find the dozen or so words whose probabilities are closest to 0% and 100% [4]. It then combines the probabilities of these most predictive words into a single probability for the message. So in a message with several “spammy” words and some words that strongly predict that the message is not spam, the probabilities tend to cancel out, and the message isn’t marked as spam.
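Here is a rough sketch of that combining step, using the usual naive Bayes combination of independent probabilities. The word probabilities are invented for the example, the function names are hypothetical, and the sketch assumes every probability is strictly between 0 and 1. A plain average is included only for comparison with the point made in footnote 4.

```python
from math import prod

def spam_score(words, probs, n=12, unknown=0.5):
    """Combine the n most extreme word probabilities into one spam probability."""
    ps = [probs.get(w, unknown) for w in set(words)]
    # Keep only the n probabilities farthest from the neutral 0.5.
    ps = sorted(ps, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    spam_side = prod(ps)
    good_side = prod(1 - p for p in ps)
    return spam_side / (spam_side + good_side)

def naive_average(words, probs, unknown=0.5):
    """The simple average that footnote 4 warns against, for comparison."""
    ps = [probs.get(w, unknown) for w in set(words)]
    return sum(ps) / len(ps)

# Hypothetical per-word probabilities, as a trained filter might hold them.
probs = {"free": 0.95, "viagra": 0.99, "budget": 0.02, "kickoff": 0.01}

pitch = ["free", "viagra"]
mixed = pitch + ["budget", "kickoff"]
padded = pitch + "i have a dream that one day this nation will rise up".split()

print(spam_score(pitch, probs))      # very close to 1
print(spam_score(mixed, probs))      # below 0.5: strong words on each side offset
print(spam_score(padded, probs))     # still very close to 1
print(naive_average(padded, probs))  # dragged toward 0.5 by the padding
```

Padding the pitch with words the filter has never seen barely moves the combined score, while the plain average drifts toward neutral.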
It seems like spammers would be able to defeat this kind of filter by just including a lot of benign words in the message, words that would tend to offset the “spammy” words. However, and here’s the rub, the words that predict a good message vary a lot between people. They are based on the vocabulary common to each person’s own messages. So it’s hard for spammers to predict what words are used in your legitimate messages.
Also, a well-written filter will only consider the words with the highest predictive value, both for and against. Further, it will consider an equal number of words on each side of the spam and non-spam argument. So a message containing the phrase “buy Viagra without a prescription” will have several words that strongly predict it is spam. To offset this, spammers would need to include several words that are never used in any messages you’ve marked as spam, and are used in your legitimate messages. They can’t include common words, because these get used in both spam and legitimate messages. They can’t just add novel words, like “hydrogenated” or “nadir”, because any words that don’t appear in either your spam or legitimate collections have no predictive value. Adding these kinds of words doesn’t offset the “spammy” ones.
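The selection step can be sketched like this. The probabilities, the function name and the limit of six words per side are assumptions made for the example. Words the filter has never seen are skipped, and words sitting at exactly 0.5 count for neither side.

```python
def most_telling(words, probs, per_side=6):
    """Pick up to per_side strongest spam words and per_side strongest good words."""
    # Words missing from probs are novel: they carry no evidence and are skipped.
    known = {w: probs[w] for w in set(words) if w in probs}
    spammy = sorted((w for w in known if known[w] > 0.5), key=known.get, reverse=True)
    benign = sorted((w for w in known if known[w] < 0.5), key=known.get)
    return spammy[:per_side], benign[:per_side]

# Hypothetical probabilities from a trained filter.
probs = {"buy": 0.90, "Viagra": 0.99, "prescription": 0.95,
         "without": 0.60, "a": 0.50, "the": 0.50}

pitch = "buy Viagra without a prescription".split()
padded = pitch + "the hydrogenated nadir".split()

print(most_telling(pitch, probs))
print(most_telling(pitch, probs) == most_telling(padded, probs))  # True
```

Padding the pitch with a common word and two novel words changes nothing: the same words are selected, and there is still nothing on the benign side to offset them.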
Also, most filters don’t actually work with the words, but rather with unique strings of characters. This allows them to consider and weigh punctuation, capitalization, and “misspellings” like “v1agra”. (In fact, since these misspellings are never used in real messages, they have even higher predictive value than the words they replace. [5] Only the simplest content filters are fooled by these tricks.)
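A minimal sketch of this kind of tokenizing, along the lines of footnote 1, might look like the following. The exact set of separator characters is an assumption; real filters each make their own choices here.

```python
import re

# Assumed separators: whitespace, parentheses, commas, periods, semicolons, colons, quotes.
SEPARATORS = re.compile(r'[\s(),.;:"]+')

def tokenize(message: str) -> list:
    """Split a message into case-sensitive tokens, keeping '!' and digits inside tokens."""
    return [t for t in SEPARATORS.split(message) if t]

print(tokenize("FREE v1agra, free!! (free today)"))
# ['FREE', 'v1agra', 'free!!', 'free', 'today']
```

Because case and exclamation marks are preserved, “FREE”, “free!!” and “free” are three separate tokens, each with its own probability, and “v1agra” is a token in its own right.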
In practice, the only way to fool a good Bayesian filter is not to include a bunch of tame words, but rather to avoid words that have an obvious sales pitch. And this hits the spammers where they live. It’s very hard to sell Viagra without ever mentioning it. And that’s how spammers make money, because some very small percentage of credulous people respond to their sales pitch. Eliminate the sales pitch, and you decrease the effectiveness of spam.
A Word About False Positives
A false positive is when a legitimate message gets marked as spam. It’s the email equivalent of failing a urine test because you had a poppy seed muffin for breakfast. It’s easy to make a filter that catches every spam message. You just set up a rule that diverts all incoming mail to the junk folder. It catches all the spam, right? The trick is to make a filter that traps spam, but lets legitimate messages through. The test of any spam filter is the number of false positives it generates.
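As a small worked example, the false positive rate is just the share of legitimate messages that get marked as spam. The function name is made up; the first set of figures is the one quoted from Paul Graham in footnote 6.

```python
def false_positive_rate(false_positives: int, legitimate_total: int) -> float:
    """Share of legitimate messages wrongly marked as spam, as a percentage."""
    return 100 * false_positives / legitimate_total

# Footnote 6: five false positives out of about 7740 legitimate emails.
print(round(false_positive_rate(5, 7740), 2))  # 0.06

# The "divert everything to junk" rule: every legitimate message is a false positive.
print(false_positive_rate(7740, 7740))         # 100.0
```

The rule that junks everything catches all the spam, but it scores as badly as possible on the measure that matters.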
This is my beef with SpamAssassin, a filter my host uses. It catches most of the spam I get, but has a very high false positive rate. Nearly every confirmation email I get, whether for an online purchase or a web site like the New York Times, is marked as spam. I don’t know the exact percentage, but it feels like at least 20% false positives. The upshot of all this is that I don’t trust the filter at all.
Bayesian filters have very low false positive rates.
Paul Graham started much of the interest in Bayesian filters with his essay A Plan for Spam. Graham has written a Bayesian filter with a very low false positive rate. He reports it as being between .03 and .06%[6], although he writes that these numbers are not trustworthy: “It’s hard to say what the overall false positive rate is, because we’re up in the noise, statistically.” ( from http://paulgraham.com/better.html )
Jon Udell writes about SpamBayes:
Meanwhile, a minor miracle has occurred. I actually look forward to fetching my e-mail. Scanning the many messages landing in Spam, and marking them as read, is quick because there have so far been no — I repeat, no — false positives. I haven’t yet delegated ultimate power to my new assistant; I still review its decisions. But my confidence grows daily, and I’m close to routing the crap straight to the bit bucket where it belongs. ( from http://www.infoworld.com/article/03/05/16/20TCspam1.html?s=tc )
The following comes from the testimonials page of SpamSieve:
I have been using SpamSieve now for almost a year and I find that it zaps 99.5% of all the spam I get. It’s so accurate that I have taken to emptying the Spam folder without reviewing for false positives. Yeah it’s THAT good. (And, yeah, I have turned the Junk Mail Filter off, all I use is SpamSieve.)
Bob Williams, Microsoft MVP Entourage-Office Mac ( from http://www.c-command.com/spamsieve/testimonials.shtml )
So if you are suffering from an overload of spam, get hold of one of the products below. If you use Windows, you may have a harder time installing a filter, and may find that your favorite email client isn’t supported unless it’s Outlook.
SpamSieve (Mac OS X) commercial product http://www.c-command.com/spamsieve/
SpamBayes (Windows) Free Outlook Plug-in (hard to install) http://spambayes.sourceforge.net/
InBoxer (Windows) Based on SpamBayes, commercial product http://www.inboxer.com/index.shtml
Bogofilter (need to be a UNIX geek to install this one) http://bogofilter.sourceforge.net/
Footnotes
[1] The term “word” is an oversimplification. Messages are broken down into tokens. A token is a string of characters in the message. Other characters, such as spaces, parentheses, commas and the like, are used to break the message into tokens. So a filter might break a message into the following different tokens: “free”, “free!”, “free!!”, “FREE”, and “viagra”. In fact, I think this was the subject of a spam that Suzy Q. Hotpants sent me this morning.
[2] This figure is according to Paul Graham’s essay A Plan for Spam. My essay is based largely on his.
[3] I’m really generalizing here about what words are in the average spam message. The main point of Bayesian filters is that by examining messages that you have marked as spam, they build a model of what you consider spam. Even if the words “free today only!” appeared regularly in both your legitimate and spam message collections, the analysis of the spam collection would still yield other words that predict the spam messages. The only way this wouldn’t be true would be if messages were assigned to the spam and non-spam collections in a statistically random way. In other words, a pattern will emerge that can be used to effectively predict spam and good messages, as long as there is any consistency in either the good or spam collections.
[4] It’s important not to use just a simple average of all the words. Spammers could defeat this by including a lot of innocent words, say by including Martin Luther King’s “I Have a Dream” speech at the end of their sales pitch for penis enlargement. A really perverse spammer might even try to combine the speech and the sales pitch. By avoiding a simple average, a filter avoids a simple dilution of the probability.
[5] The word “v1agra” has no predictive value the first time your filter sees it, but it has enormous predictive value the second time, assuming the filter catches the message or you mark it as spam.
[6] From http://paulgraham.com/better.html “When I tried writing a Bayesian spam filter, it caught 99.5% of spam with less than .03% false positives.” Later in the same article he writes, “I’ve had a total of five false positives so far, out of about 7740 legitimate emails, a rate of .06%.”