A very interesting, thorough and clueful discourse on spam from James Gleick in today's New York Times (via Slashdot). It's an in-depth examination of spam, its history and its mechanisms, and it's well worth reading. Although I'd recommend you read it all, and not skip to the conclusion, the conclusion is pretty damn good:
[...] two simple measures might be enough to stem the tide:
- Forging Internet headers should be made illegal. The system depends on accurate information about senders and servers and relays; no one needs a right to falsify this information.
- Unsolicited bulk mail should carry a mandatory tag. That alone would put consumers back in control; all the complex technological challenge of identifying the spam would vanish.
We need to be able to say no. No, I'm not looking for a good time. No, I don't want to "e-mail millions of PayPal members." No, I don't want an anatomy-enlargement kit. No, I don't want my share of the Nigerian $25 million. I just want my in-box. It belongs to me, and I want it back.
One thing of note: the article mentions SpamSieve, a Bayesian filtering tool that works with pretty much any Mac OS X email client. Rather than following a fixed set of rules, SpamSieve learns from examples, in the following way:
- First you find a whole bunch of spam, and you tell it that these messages are spam
- Then you find a bunch of legitimate emails, and you tell it that these are OK
- Whenever you receive email, it analyses it, and if it thinks it's spam it marks it as such. You tell it whenever it gets anything wrong, and in time it gets better.
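The three steps above can be sketched in a few lines of Python. This is a toy naive Bayes classifier written for illustration, not SpamSieve's actual algorithm; the class name `TinyBayesFilter`, the tokenizer, and the sample messages are all made up for the example.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    """Toy naive Bayes spam filter (illustrative only, not SpamSieve's code)."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Lowercase word tokens; a real filter would also look at headers, URLs, etc.
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        # Steps 1 and 2: the user labels a message as "spam" or "ham" (legitimate).
        # Step 3's corrections are just more calls to train() with the right label.
        self.token_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def spam_log_odds(self, text):
        # Prior odds from how many messages of each kind were seen (add-one smoothed).
        score = math.log((self.message_counts["spam"] + 1) /
                         (self.message_counts["ham"] + 1))
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        spam_total = sum(self.token_counts["spam"].values())
        ham_total = sum(self.token_counts["ham"].values())
        for token in self.tokenize(text):
            # Laplace smoothing so an unseen token never produces a zero probability.
            p_spam = (self.token_counts["spam"][token] + 1) / (spam_total + len(vocab) + 1)
            p_ham = (self.token_counts["ham"][token] + 1) / (ham_total + len(vocab) + 1)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text):
        # Positive log-odds means the message looks more like past spam than past ham.
        return self.spam_log_odds(text) > 0


spam_filter = TinyBayesFilter()
spam_filter.train("buy cheap pills now", "spam")
spam_filter.train("win money now cheap offer", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday with the team", "ham")

print(spam_filter.is_spam("cheap pills offer"))    # True
print(spam_filter.is_spam("monday team meeting"))  # False
```

Everything the filter knows comes from the messages it has been shown, which is why the corrections in the last step matter so much.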
All this works very well, up to a point. At its best I was getting 97.2% accuracy, as SpamSieve became better and better trained on my email. Then I subscribed to a new mailing list - and it marked nearly all of the messages as spam. I had to tell it that all these messages were in fact legit - and the accuracy rate plummeted to 96%. (It has since recovered to 96.3%.)
Spammers are now, according to James Gleick, misspelling words like penis or viagra to confuse such programs. I haven't seen any of these misspellings yet - according to SpamSieve - but I don't doubt they'll come. And this is the weakness of such spam-filtering software: it bases its filtering entirely on knowledge of what was bad in the past. It's fighting the last war.
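To see why a misspelling can slip through, here is a minimal sketch of a filter that scores each word separately. The probabilities, the neutral score of 0.5 for unknown words, and the function names are hypothetical values chosen for the illustration, not figures from SpamSieve.

```python
# Hypothetical per-token spam probabilities a trained filter might hold.
learned = {"viagra": 0.99, "meeting": 0.05}

# Assumed score for a token the filter has never seen: no evidence either way.
NEUTRAL = 0.5


def token_score(token):
    # The lookup is by exact spelling, so a near-miss counts as a new word.
    return learned.get(token.lower(), NEUTRAL)


def mark_as_spam(token, probability=0.99):
    # Stand-in for the user correcting the filter after a miss.
    learned[token.lower()] = probability


print(token_score("viagra"))   # 0.99
print(token_score("v1agra"))   # 0.5
mark_as_spam("v1agra")
print(token_score("v1agra"))   # 0.99
```

The misspelled word only becomes evidence after someone has seen it and flagged it, so the filter is always one step behind.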
Happily, trainable filters like SpamSieve will be able to adapt more easily than rule-based systems, because they keep learning from each correction. But still, this is something to bear in mind: there is an arms race between spammers and anti-spammers, and there is no guarantee that the anti-spammers will win.