Bayesian Experiences

2003-04-28

As y'all probably know, I've been interested in Bayesian filtering. I've been using the Mozilla 1.3 implementation because, even though I don't think it's going to work in the long term, I figure you ought to get it while the getting's good.

On the plus side, it's been pretty good, with easily a 97%+ success rate on both true positives and true negatives.

The downside is that the false positives have been pretty bad. I cleaned out my spam folder today and here's what got thrown into the spam folder:

The renewal notice for jerf.org.
An order confirmation for an order I was just about to call the company about, because I had seen the charge on my credit card but thought they had never confirmed the order.
A couple of job notices from my college (i.e., not spam).
A newsletter from my department.

My experience with the Bayesian filters is matching what I would expect from my earlier experiments. The problem is that the spammers end up owning certain types of emails, and even legitimate emails of that type get filtered as spam. Since I am now easily getting 100 spams every couple of days, the risk of missing a real mail in a sea of spam is becoming increasingly real.

Domain renewal spams are pretty common, so my registrar doesn't stand a chance of getting through the filter. I've received more "order confirmation" spams than I've gotten real order confirmations. Real job opportunities, which are valuable emails around these times, are swamped by the crap shoveled out by the spammers, including Monster.com and "affiliates". And finally, the number of fake "newsletters" I get is pretty impressive.
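
To make the "owning" effect concrete, here's a toy sketch of a naive Bayes scorer in Python. This is not Mozilla's actual algorithm; the function names, the add-one smoothing, and the training messages are all made up for illustration. The only point is that when nine out of ten messages using order-confirmation vocabulary are spam, a genuine confirmation built from the same words scores as spam too.

```python
import math
from collections import Counter

def train(messages):
    """Count, per class, how many messages contain each token."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for label, text in messages:
        totals[label] += 1
        counts[label].update(set(text.lower().split()))
    return counts, totals

def spam_probability(text, counts, totals):
    """Toy naive Bayes score with add-one smoothing, summed in log space."""
    log_odds = math.log(totals["spam"] / totals["ham"])
    for token in set(text.lower().split()):
        p_spam = (counts["spam"][token] + 1) / (totals["spam"] + 2)
        p_ham = (counts["ham"][token] + 1) / (totals["ham"] + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical training data: spammers "own" the order-confirmation words.
training = (
    [("spam", "order confirmation your order has shipped click here")] * 9
    + [("ham", "order confirmation your order has shipped")] * 1
    + [("ham", "lunch on friday with the department")] * 9
    + [("spam", "cheap pills click here")] * 1
)
counts, totals = train(training)

real_confirmation = "order confirmation your order has shipped"
personal_note = "lunch on friday"

print(spam_probability(real_confirmation, counts, totals))  # close to 1
print(spam_probability(personal_note, counts, totals))      # close to 0
```

The legitimate confirmation contains no word that is unusual for spam, so nothing pulls its score back down. Real filters are more sophisticated than this, but as far as I can tell the same pressure applies.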

As spammers continue to widen their targets, the range of topics they "spammify" is going to grow, and the problem is going to get worse for me.

Ah well. It's an improvement, and there's still some time before the Bayesian filter is of negative value to me. But I fear that day may well precede even the widespread use of Bayesian filters I predicated my original predictions on. By the time "the public" gets access to this technology it may already be useless.