Statistics vs the Internet, part 2

Bayesian filtering is a wonderful thing. Early spam filtering used a hard-coded list of spam phrases and features, which worked fine as long as the spammers didn't know about said list. In 2002, Paul Graham proposed using statistics rather than human-originated lists to fight spam, and chances are these days that your ISP and/or your mail client uses Bayesian spam filtering to trap spam. When you train Apple's Mail client to look for spam, this is what it's doing: noting which words and phrases occur in spam messages and which occur in genuine mail that you want to read, and slowly building up a huge corpus of words and associated scores. Eventually it can look at an incoming email, total up the score of each individual word, and decide whether it's likely to be spam or not. When spammers start writing spams a bit differently, the first few will slip through the filter, but as you train it with the new spams, any further spams of a similar ilk will get caught.
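For the curious, here's roughly what "words and associated scores" looks like in code. This is a minimal sketch of the general idea, not Paul Graham's exact algorithm and certainly not what Apple's Mail client actually runs: the class name WordScoreFilter, its methods and the training messages are all made up for illustration, and it assumes spam and genuine mail are equally likely to start with. Each word gets a score (the log of how much more often it turns up in spam than in genuine mail, with add-one smoothing so unseen words don't blow things up), and a message's score is just the total.

```python
import math
import re
from collections import Counter


class WordScoreFilter:
    """Toy word-frequency spam filter (hypothetical, for illustration only)."""

    def __init__(self):
        self.spam_words = Counter()  # word -> occurrences in spam
        self.ham_words = Counter()   # word -> occurrences in genuine mail
        self.spam_total = 0          # total words seen in spam
        self.ham_total = 0           # total words seen in genuine mail

    @staticmethod
    def tokenize(text):
        # Lowercase and split into runs of letters, digits and apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, is_spam):
        """Add one message's words to the spam or genuine-mail counts."""
        words = self.tokenize(text)
        if is_spam:
            self.spam_words.update(words)
            self.spam_total += len(words)
        else:
            self.ham_words.update(words)
            self.ham_total += len(words)

    def word_score(self, word):
        """Log-odds for one word: positive leans spam, negative leans genuine."""
        vocab = len(set(self.spam_words) | set(self.ham_words))
        # Add-one smoothing: a word never seen before still gets a small,
        # non-zero probability on both sides.
        p_spam = (self.spam_words[word] + 1) / (self.spam_total + vocab + 1)
        p_ham = (self.ham_words[word] + 1) / (self.ham_total + vocab + 1)
        return math.log(p_spam / p_ham)

    def score(self, text):
        """Total up the score of each individual word in the message."""
        return sum(self.word_score(w) for w in self.tokenize(text))

    def is_spam(self, text):
        # Assumes equal prior odds, so the threshold is simply zero.
        return self.score(text) > 0


# Made-up training data, far too small to be useful in practice.
spam_filter = WordScoreFilter()
spam_filter.train("cheap pills buy now", is_spam=True)
spam_filter.train("buy cheap watches now", is_spam=True)
spam_filter.train("lunch meeting tomorrow at noon", is_spam=False)
spam_filter.train("notes from the meeting attached", is_spam=False)

print(spam_filter.is_spam("buy cheap pills"))
print(spam_filter.is_spam("meeting notes tomorrow"))
```

Training on a new style of spam just means calling train() on it with is_spam=True, which is why the first few slip through and the rest get caught. A real filter would need thousands of messages, and would likely be pickier about which words it pays attention to.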

Barring the occasional blip, statistical analysis has effectively solved the spam problem. My email address has been on web pages since about 1998 (so I'm on every single spammer's list), and any email to any address at illuminated.co.uk gets delivered to me (so if anyone tries to send random email to xyiagr@illuminated.co.uk, I'll get it too). But because I've got a bloody good client-side spam filter, only 2 or 3 of the 2,000-odd spams I get every day ever get through to my inbox.

So the question now becomes: if we can use statistics to effectively neutralise spam, can we apply statistics to other annoyances on the web? We already have popup-blockers and ad blockers; is there a way of streamlining the Internet experience even more?

These guys think so (via Ben Hammersley). They're trying to build up a corpus of stupidity, so eventually you'll be able to check web pages - and parts of web pages - against the stupidity corpus, and silently remove the ones that don't pass the shambling moron test.

So far, so good. What takes this beyond "fun idea, might work" and into "exceptionally brilliant" territory is the corpus they're using to train the stupid filter. They're using YouTube comments.