As y'all probably know I've been interested in Bayesian filtering. I've been using the Mozilla 1.3 implementation, because even though I don't think it's going to work in the long term I figure you ought to get it while the getting's good.
On the plus side, it's been pretty good, with easily a 97%+ success rate on both correct positives and correct negatives.
The downside is that the false positives have been pretty bad. I cleaned out my spam folder today and here's what got thrown into the spam folder:
- The renewal notice for jerf.org.
- An order confirmation for an order I was just about to call the company about, because I thought they hadn't confirmed my order, but I saw they had charged my credit card.
- A couple of job notices from my college (i.e., not spam)
- A newsletter from my department
My experience with the Bayesian filters is matching what I would expect from my earlier experiments. The problem is that the spammers end up owning certain types of emails, and even legitimate emails of that type get filtered as spam. Since I am now easily getting 100 spams every couple of days, the risk of missing a mail in a sea of spam is becoming increasingly real.
Domain renewal spams are pretty common, so my registrar doesn't stand a chance of getting through the filter. I've received more "order confirmation" spams than I've gotten order confirmations. Real job opportunities, which are valuable emails around these times, are swamped by the crap shoveled out by the spammers, including Monster.com and "affiliates". And finally, the number of fake "newsletters" I get is pretty impressive.
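The "owning" effect is easy to reproduce with a toy filter. The sketch below is a minimal naive Bayes scorer, not the Mozilla 1.3 implementation; the training messages, the names `train` and `spam_probability`, and the add-one smoothing are all assumptions made up for illustration.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercase, whitespace-split, de-duplicated tokens of a message."""
    return set(text.lower().split())


def train(messages):
    """Count how many spam and non-spam messages contain each token."""
    spam, ham = Counter(), Counter()
    n_spam = n_ham = 0
    for text, is_spam in messages:
        if is_spam:
            spam.update(tokens(text))
            n_spam += 1
        else:
            ham.update(tokens(text))
            n_ham += 1
    return spam, ham, n_spam, n_ham


def spam_probability(text, model):
    """Combine per-token evidence as log-odds (add-one smoothing is an assumption)."""
    spam, ham, n_spam, n_ham = model
    log_odds = math.log((n_spam + 1) / (n_ham + 1))
    for tok in tokens(text):
        p_spam = (spam[tok] + 1) / (n_spam + 2)
        p_ham = (ham[tok] + 1) / (n_ham + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical training mail: every message mentioning "domain renewal" is spam.
TRAINING = [
    ("domain renewal notice act now", True),
    ("urgent domain renewal expires today", True),
    ("cheap domain renewal offer", True),
    ("lunch on friday", False),
    ("meeting notes attached", False),
    ("see you at the game", False),
]

model = train(TRAINING)
legit = "renewal notice for your domain jerf.org"
print(spam_probability(legit, model))
```

With this made-up training set the genuine renewal notice scores well above 0.9, purely because "domain" and "renewal" have only ever been seen in spam. Nothing in the message itself can rescue it.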
As spammers continue to widen their targets, the range of topics they're going to "spammify" is going to get worse for me.
Ah well. It's an improvement, and there's still some time before the Bayesian filter is of negative value to me. But I fear that point may well arrive even before the widespread use of Bayesian filters I predicated my original predictions on. By the time "the public" gets access to this technology it may already be useless.
One major outcome of this project is the discovery that simple tree matching algorithms don't work.
I tried an algorithm where I took the three best nodes from a tree and compared that to a database of the three best nodes from the collected text works. The expectation is that the numerical value of each sense represents meaning spread out across many words and would to some degree adequately represent the "high points" of the text snippet.
By simply matching the senses and their parents (with a bit of a decay factor), the hope was that similar topics of discussion could be bundled together.
This did not work well. One problem was that text with very few identifiable nouns tended to be the best match for a given text snippet despite having no value as a match. Another problem was that the common nouns tended to dominate the matching, as if everything was matching on "thing", which made the resulting matches almost random.
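A rough sketch of this kind of matching shows how the shared ancestors creep in. Everything here is an assumption for illustration: the toy `PARENT` table, the 0.5 decay, and the product-sum score are stand-ins, not the project's actual data or scoring.

```python
# Hypothetical toy hierarchy: each sense maps to its parent sense.
PARENT = {
    "dog": "animal",
    "cat": "animal",
    "animal": "thing",
    "car": "vehicle",
    "vehicle": "thing",
}


def top_nodes(senses, n=3):
    """Keep the n highest-valued senses of a snippet."""
    ranked = sorted(senses.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])


def expand(senses, decay=0.5):
    """Spread each sense's value up to its ancestors, decaying at every step."""
    weights = {}
    for sense, value in senses.items():
        node, weight = sense, value
        while node is not None:
            weights[node] = weights.get(node, 0.0) + weight
            node, weight = PARENT.get(node), weight * decay
    return weights


def match(a, b, decay=0.5):
    """Score two snippets by the senses and ancestors they have in common."""
    wa = expand(top_nodes(a), decay)
    wb = expand(top_nodes(b), decay)
    return sum(wa[s] * wb[s] for s in wa.keys() & wb.keys())


dogs = {"dog": 1.0}
cats = {"cat": 1.0}
cars = {"car": 1.0}

print(match(dogs, cats))  # related: share "animal" and "thing"
print(match(dogs, cars))  # unrelated, yet still nonzero via "thing"
```

Even in this tiny example the unrelated pair gets a positive score through the root sense alone. In a real snippet with many common nouns, those shared ancestors add up and can swamp the differences that actually matter.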
The root problem seems to be that there is too much commonality in your average text snippet, where the commonality dominates the differences, and it is actually the differences we are interested in.
To get around this, we hope to use better matching methodologies that match on whole trees. Unfortunately, on the timescale of this project this will preclude frequent searching, as matching a tree against 1000 other trees will be time-consuming.
Here's an idea I had for my next blog-style thing, since iRights is within a few months of basically wrapping up.
I think it would be great to have a blog-like thing that tracks predictions: Who makes them, when they make them, and whether or not they come true. Kinda like Long Bets, but tracking anyone who makes a prediction at all, and no money; just reputation points. Also, rather than waiting for people to enter them, we record predictions of people who may not even be aware of the site at all.
I've wanted this sort of thing before, but what really makes me want it is all the disaster predictions that people on the left were making about the war. I want to hold them accountable, and not just let the wrong predictions slide off.
For that matter, I want to hold myself accountable. I refrain from making a lot of predictions because I try to pretend this site already exists and I'm being tracked. So the few I have on record are carefully chosen. A mechanism for actually doing so would be nice.
Maybe I'll try to work that into my copious spare time someday. To really work, this needs some formal help and some tools. For instance, it's not as easy to define a prediction as you might like. My Bayesian spam filtering prediction from a few months ago has preconditions, which if they are never met means the prediction never goes into effect. So you need the idea of a "precondition", which you might not immediately consider. And we'd need a scoring system. And for extra fun, allow site members themselves to declare a vote. (This makes it a little like an ideas commons, which has been implemented, but I want to focus on the accountability of people, not just abstract ideas.)
I think this would be really helpful in a lot of ways, not just as entertainment. Is there someone who is almost always right? Is there an ideology that is almost always right? Is "the left" more right than "the right" when it comes to predictions? Is my Senator constantly spewing rhetoric, or do his predictions of doom/success come true? Am I personally full of it?
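To make the "precondition" idea concrete, here is one possible shape for a prediction record and a reputation tally. It is only a sketch: the field names, the `Status` values, the +1/-1 scoring, and the sample people and dates are all hypothetical, and member voting is left out.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending"   # not resolved yet
    VOID = "void"         # precondition never met, so it never took effect
    CORRECT = "correct"
    WRONG = "wrong"


@dataclass
class Prediction:
    who: str
    made_on: date
    claim: str
    precondition: Optional[str] = None
    precondition_met: Optional[bool] = None  # None means not yet known
    came_true: Optional[bool] = None         # None means not yet known

    def status(self):
        if self.precondition is not None:
            if self.precondition_met is None:
                return Status.PENDING
            if not self.precondition_met:
                return Status.VOID
        if self.came_true is None:
            return Status.PENDING
        return Status.CORRECT if self.came_true else Status.WRONG


def reputation(predictions):
    """Tally points per person: +1 correct, -1 wrong, 0 for void or pending."""
    points = {Status.CORRECT: 1, Status.WRONG: -1}
    scores = {}
    for p in predictions:
        scores[p.who] = scores.get(p.who, 0) + points.get(p.status(), 0)
    return scores


# Hypothetical sample records.
sample = [
    Prediction("alice", date(2000, 1, 1), "claim A", came_true=True),
    Prediction("alice", date(2000, 1, 2), "claim B",
               precondition="X becomes widespread", precondition_met=False),
    Prediction("bob", date(2000, 1, 3), "claim C", came_true=False),
]
print(reputation(sample))
```

Treating an unmet precondition as void rather than wrong is a design choice: it keeps a conditional prediction from counting against someone when the condition simply never came up.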
I thought this was really touching (via InstaPundit):
A captured Iraqi colonel being held in one of the hangars listened in astonishment as his information minister praised Republican Guard soldiers for recapturing the airport.
He looked at his captors and, as he realised that what he had heard was palpably untrue, his eye filled with tears. Turning to a translator, he asked: "How long have they been lying like this?"
A man who finds out suddenly and undeniably that his entire career, and quite possibly his entire life, has been built on a foundation of lies, with nowhere to hide from the awful truth. The thought of it almost brought me to tears; there are no words for how horrible that must have been. All I can do is thank God I'm not an Iraqi.
The Justice Department lifted a requirement Monday that the FBI ensure the accuracy and timeliness of information about criminals and crime victims before adding it to the country's most comprehensive law enforcement database.
The system, run by the FBI's National Crime Information Center, includes data about terrorists, fugitives, warrants, people missing, gang members and stolen vehicles, guns or boats. [Privacy Digest]