- iRi

2006-04-22

I've come to realize that this isn't a "blog" in any meaningful sense. I just don't update it enough. It's more a website that happens to have an RSS feed if you care.

So as blogs go, this site is fairly poorly linked.

Nevertheless, I periodically check Technorati, just in case. I'm sometimes surprised. But at least as of this writing, the top three hits are clearly spam blogs. What I find most odd is how they pluck postings not from my current feed, but from the distant past. This is probably to foil the easiest countering methodology, which would be noticing certain blogs that just clone other blogs. (In that case, regardless of whether they are spam or legitimate aggregators, you probably don't want them in your Technorati results, if you're Technorati.)

The only posts I notice in the spam blogs are posts I made that linked to other posts, which then get picked up by Technorati as links back to this site. Since this is relatively rare, presumably there are also a lot of other spam blogs picking up my stuff that I don't notice.

This is interesting because this blog is a nothing. Several of the "canonical" weblog lists don't have iRi. Whoever is collecting the list of sites to... ahhh... "repurpose" is pretty thorough.

This brings up the interesting question of whether this can be prevented from destroying the value of Technorati both as a search engine and as a way of telling who is linking back to you. Probably the most powerful approach would be something that most people would probably call "Bayesian Fingerprinting", which would be wrong but that won't stop anybody who isn't an academic. You could "fingerprint" a blog with some sort of "bag of words"-type vector, which could tell with high reliability, albeit not perfect reliability, whether many blog posts are authored by the same person. Then, any blog whose fingerprint changes radically from post to post, over and over, could be assumed to be a spam blog.
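To make the fingerprinting idea concrete, here is a minimal sketch, not a tested detector. It assumes a plain word-count vector as the "fingerprint" and cosine similarity between consecutive posts as the comparison; the function names and the 0.2 threshold are made up for illustration, and a real system would need far more careful features and tuning.

```python
import math
import re
from collections import Counter


def fingerprint(text):
    """Bag-of-words vector for one post: lowercase word -> count."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_consecutive_similarity(posts):
    """Average similarity between each post and the one that follows it."""
    prints = [fingerprint(p) for p in posts]
    sims = [cosine(x, y) for x, y in zip(prints, prints[1:])]
    # With fewer than two posts there is nothing to compare, so report
    # "fully consistent" rather than flagging the blog.
    return sum(sims) / len(sims) if sims else 1.0


def looks_like_spam_blog(posts, threshold=0.2):
    """Hypothetical rule: flag blogs whose posts rarely resemble each other.

    The threshold is an arbitrary illustrative value, not a measured one.
    """
    return mean_consecutive_similarity(posts) < threshold
```

Raw word counts are dominated by common words like "the", so in practice you would probably want some kind of weighting before trusting the numbers; the sketch only shows the shape of the comparison.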

Can this be defeated? Yeah, probably. First of all, while I'm sure this would work to some degree, I have no idea exactly how discriminating it could be. It may not be good enough. Second, a spammer who made any attempt at all to pick similar posts out of the pool of "all weblog posts" and group them into one spam blog would defeat the system; defeating it would be easier than staying ahead.

Then again, I said the same thing about Bayesian spam filtering, and while I still believe my mathematical analysis is correct that Bayesian filters can be destroyed, it is evident that no spammer has ever had the background to figure out the process I outlined. (I know some spammers read it, but I've not yet seen any evidence they comprehended it.) The same thing could happen here.

I find myself continuously coming back to my old "LinkBack" model, a "trackback" kind of system I invented about a year or two before Trackback, but never deployed beyond a test system. It would have worked much like Trackback does now, except that instead of a blog "pinging" another blog, a central server tracked the blogs and extracted the "Trackbacks" automatically, and then compiled the "LinkBacks" into a report that the central server could push up to the linked blog. The core sociological aspect of my idea was based around the belief that any global, open resource will be polluted, no matter what, so the key is to not have a global, open resource. Thus, my goal was to have many, many, many "LinkBack" communities and corresponding servers, each managed according to whatever policies that community saw fit; I figured most would go through an initial openness period, until they discovered that didn't work, then close the membership up somehow. Any blog could belong to any number of these communities.
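For the curious, the central-server step might look roughly like the sketch below. This is a reconstruction for illustration, not the test system I actually built: the `LinkBackServer` class, the regex-based link extraction, and the idea of identifying a blog by its hostname are all assumptions, and fetching posts and pushing the report to the linked blog are left out entirely.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Crude link extraction for illustration; a real crawler would parse the HTML.
LINK_RE = re.compile(r'href="(https?://[^"]+)"')


class LinkBackServer:
    """One community's server: it only tracks blogs that are members."""

    def __init__(self):
        self.members = set()  # hostnames of member blogs (assumed identifier)

    def add_member(self, host):
        self.members.add(host)

    def compile_reports(self, posts):
        """Build a LinkBack report per linked blog.

        posts: iterable of (post_url, html) pairs already fetched by the server.
        Returns {linked_host: [(source_post_url, target_url), ...]}.
        """
        reports = defaultdict(list)
        for post_url, html in posts:
            source_host = urlparse(post_url).hostname
            if source_host not in self.members:
                continue  # non-members never show up in anyone's report
            for target in LINK_RE.findall(html):
                target_host = urlparse(target).hostname
                # Only links between two different member blogs count.
                if target_host in self.members and target_host != source_host:
                    reports[target_host].append((post_url, target))
        return dict(reports)
```

The point of the design is the membership check: since the server decides whose posts it reads, a spam blog outside the community cannot inject itself into the report no matter what it links to.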

Come to think of it, this is a trivial addition to Trackback if the problem ever becomes acute enough; accepting Trackbacks only from known blogs, or whitelisting known blogs and sharing the whitelists with "communities" of blogs, would be pretty easy.
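As a rough illustration of how small that addition is, here is a sketch of the whitelist check. The function name and the choice to key the whitelist on hostname are my assumptions; nothing here is part of the Trackback specification, and how communities would actually exchange their lists is left open.

```python
from urllib.parse import urlparse


def accept_trackback(ping_url, own_whitelist, community_whitelists=()):
    """Accept a ping only if the pinging blog's host is on a known whitelist.

    own_whitelist: set of hostnames this blog trusts directly.
    community_whitelists: iterable of sets shared by communities this blog
    belongs to (a hypothetical sharing mechanism, not defined here).
    """
    host = urlparse(ping_url).hostname
    if host is None:
        return False  # not a parseable absolute URL, so reject it
    allowed = set(own_whitelist).union(*community_whitelists)
    return host in allowed
```

Usage would be a single call in the ping handler, for example `accept_trackback(url, my_hosts, [community_a, community_b])`, with anything that returns False dropped.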

None of this is an option for Technorati, of course, so I'll be intrigued to see how they handle this problem, if indeed they do at all. iRi will most likely continue to be a wonderful way to check up on their progress, as I'm not expecting a glut of real incoming links anytime soon.