Free idea for aggregator writers

2003-03-06

The next generation of news aggregators should use Bayesian-style filtering to allow a user to indicate what kind of stories they like, possibly even sorting the stories into categories, because you may not be able to capture a user's preferences in a single filter.
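To make the idea concrete, here is a minimal sketch of a multi-category naive Bayes filter for stories. It is an illustration under assumptions, not code from any existing aggregator or email filter: the names StoryFilter, train, scores, and classify are hypothetical, the tokenizer is deliberately crude, and add-one smoothing is just one common choice.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class StoryFilter:
    """Multinomial naive Bayes over user-defined categories (hypothetical API)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.story_counts = Counter()            # category -> stories seen
        self.vocabulary = set()

    def train(self, text, category):
        """Record one story the user has filed under `category`."""
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.story_counts[category] += 1
        self.vocabulary.update(words)

    def scores(self, text):
        """Return a log-probability score per category (higher is a better fit)."""
        words = tokenize(text)
        total_stories = sum(self.story_counts.values())
        # The extra 1 reserves a slot for words never seen in training.
        vocab_size = len(self.vocabulary) + 1
        result = {}
        for category, seen in self.story_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            score = math.log(seen / total_stories)
            for word in words:
                # Add-one smoothing so an unseen word does not rule out a category.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            result[category] = score
        return result

    def classify(self, text):
        """Return the best-scoring category, or None if nothing has been trained."""
        scores = self.scores(text)
        return max(scores, key=scores.get) if scores else None
```

A user would train it by filing stories as they read, for example train(story_text, "tech") or train(story_text, "skip"), and the aggregator would call classify on each incoming story. Keeping one set of counts per category is what lets a single user hold several distinct interests at once.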

This is not a new idea any more than Bayesian email filtering is, but perhaps the focus it has received in the email filtering role will encourage people to recognize its power in other applications as well.

There are some interesting complications that arise: You probably still want to have the aggregator display all of your featured channels, but have the Bayesian filter cast a wider net and select out the best articles from 200 or 300 channels that you would not want to read in full. In other words, have the Bayes-filtered channels be in addition to current channel selection mechanisms, not in place of them. This also makes the bandwidth considerations more important, because if there's anything worse than being slammed by aggregators, it's having most of them discard the content before showing it to a user.
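The "in addition to, not in place of" arrangement might look something like the sketch below. Everything here is assumed for illustration: build_reading_list is a hypothetical helper, stories are plain dicts with a "link" key, and score is whatever function the filter exposes for rating a story.

```python
import heapq


def build_reading_list(featured_stories, wide_net_stories, score, top_n=20):
    """Keep every featured story, then append the top_n best wide-net stories."""
    # Featured channels are shown in full, exactly as aggregators do today.
    seen = {story["link"] for story in featured_stories}
    # Stories already shown via a featured channel are not listed twice.
    candidates = [s for s in wide_net_stories if s["link"] not in seen]
    # Only the highest-scoring handful of the wide net reaches the user.
    picks = heapq.nlargest(top_n, candidates, key=score)
    return list(featured_stories) + picks
```

The filter never hides anything from a featured channel; it only decides which stories from the much larger pool are worth adding.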

One possible bandwidth answer is to allow semi-centralized servers to receive many people's filter specifications (which, if transmitted efficiently, need not be that large: on the order of 100KB to 1MB), and have the semi-centralized server run the filters on a wide variety of channels that only it subscribes to, and send the stories directly back to the users it is proxying for. There's no reason that desktop aggregators can't set this up on an ad-hoc basis, within their own software groupings. (While you're at it, you can do some of the more conventional caching schemes that have already been proposed elsewhere.)
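A rough sketch of such a proxy follows. It makes several simplifying assumptions that are mine, not part of any existing protocol: a filter specification is reduced to a table of per-word weights, it travels as zlib-compressed JSON, and FilterProxy, pack_filter, and unpack_filter are hypothetical names. A real Bayesian filter would ship its word counts or probabilities instead.

```python
import json
import zlib
from collections import defaultdict


def pack_filter(word_weights):
    """Serialize a filter specification compactly for upload to the proxy."""
    return zlib.compress(json.dumps(word_weights, sort_keys=True).encode("utf-8"))


def unpack_filter(blob):
    """Inverse of pack_filter."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


class FilterProxy:
    """Scores each fetched story once per registered user (hypothetical design)."""

    def __init__(self):
        self.filters = {}  # user id -> {word: weight}

    def register(self, user_id, blob):
        """Store the filter specification a user's aggregator uploaded."""
        self.filters[user_id] = unpack_filter(blob)

    def route(self, stories, threshold=1.0):
        """Return {user_id: [stories whose summed word weight meets threshold]}."""
        outbox = defaultdict(list)
        for story in stories:
            words = story["text"].lower().split()
            for user_id, weights in self.filters.items():
                score = sum(weights.get(word, 0.0) for word in words)
                if score >= threshold:
                    outbox[user_id].append(story)
        return dict(outbox)
```

The point of the arrangement is that each channel is downloaded once by the proxy, however many users it serves, and each user receives only the stories that passed their own filter.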

The best part is that Bayesian filtering can actually work fairly well in this application because, unlike spammers, the channel writers are not trying to attack and bypass the filters. It's not a hostile environment in that sense.

I'd love to write this but I'm already booked with other personal projects. Do see if you can take Bayesian code from the email filter projects to make it easier; in the case of the code I've examined, there's little or nothing email-specific about the Bayesian code, certainly nothing that would get in the way or be hard to remove if it did.