Bandwidth efficiency idea for RSS - rproxy

"Bandwidth efficiency" for RSS has come and gone as an issue, but it will come again; all improvements on the last pass were linear in nature, meaning that as more people come online the problem will rear its head again later. And next time, the "low-hanging" fruit will be gone.

There are two fundamental problems with the current system, plus a third added in an update:

  1. When an RSS file changes, the entire file is transferred. This puts a hard floor on how much bandwidth can be saved by any technique that merely avoids downloading the file when it hasn't changed, such as using ETags. For instance, Instapundit's RSS 1.0 file is 10KB; Scripting News's RSS file is 15KB. Start doing the bandwidth math (see the sketch after this list) and that's a lot of transfer. Many sites are even less efficient and serve multi-hundred-KB RSS files, many without knowing it. Every time the site changes, everybody gets a whole new copy. Very inefficient.
  2. There is only one source for the changes. When Scripting News changes, everybody has to hammer Scripting.com.
  3. Update: A third problem is that the only way to scale up right now is to spend more on bandwidth, money the blogger may not have. See this later posting.
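
To make the "bandwidth math" in the first item concrete, here is a back-of-envelope sketch; the subscriber count and polling rate are invented for illustration:

```python
# All numbers are hypothetical, purely to show the scale involved.
feed_size_kb = 10      # a feed the size of Instapundit's
subscribers = 10_000   # assumed subscriber count for a popular weblog
polls_per_day = 24     # aggregators commonly poll hourly

daily_kb = feed_size_kb * subscribers * polls_per_day
print(f"{daily_kb / (1024 * 1024):.1f} GB/day")  # ~2.3 GB/day, one feed
```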

Ideally, to keep the RSS system from imploding as more people come online, we need to reduce the number of bytes flowing per update, and we need to partially decentralize the system.

Observation: Weblogs follow a Zipf distribution, so it suffices to "fix" the model only for high-traffic websites. Small fry like, *ahem*, iRights do not have this problem and probably never will.

rproxy is a now-defunct proxy program that I found via a link on Slashdot today. The page still has a good explanation of how it works, but the useful bit of code is actually librsync, an LGPL'ed library implementing the rsync delta algorithm. (In plain English, LGPL'ed software may be freely used by anyone, including commercial programs, as long as improvements to the library itself are distributed; unlike the GPL, the LGPL's "viral" nature is explicitly limited to the licensed code, not to all code that uses the licensed code.)
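
To make the rsync idea concrete, here is a toy sketch of the signature/delta/patch cycle. This is not librsync's actual API, and real rsync also slides a cheap rolling checksum across every byte offset so it can find matches that have shifted; this block-aligned version deliberately omits that:

```python
import hashlib

BLOCK = 512  # block size in bytes; real implementations tune this

def signature(old: bytes) -> dict:
    """Strong hash of each fixed-size block -> block index.

    The aggregator computes this over its stale copy of the RSS file and
    sends it upstream (in rproxy's case, a proxy did this for ordinary
    HTTP traffic)."""
    return {
        hashlib.md5(old[i:i + BLOCK]).hexdigest(): i // BLOCK
        for i in range(0, len(old), BLOCK)
    }

def delta(new: bytes, sig: dict) -> list:
    """Encode the new file as copy/literal instructions against the old one."""
    ops = []
    for i in range(0, len(new), BLOCK):
        block = new[i:i + BLOCK]
        idx = sig.get(hashlib.md5(block).hexdigest())
        # A matching block costs a tiny "copy" token instead of its bytes.
        ops.append(("copy", idx) if idx is not None else ("literal", block))
    return ops

def patch(old: bytes, ops: list) -> bytes:
    """The aggregator rebuilds the new file from its old copy plus the delta."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg * BLOCK:(arg + 1) * BLOCK] if kind == "copy" else arg
    return bytes(out)

# A feed that grew by one item: three blocks match, only the tail travels.
old = b"<rss>" + b"old item " * 200 + b"</rss>"
new = old.replace(b"</rss>", b"new item </rss>")
assert patch(old, delta(new, signature(old))) == new
```

The win is in the shape of the exchange: the aggregator sends a small signature of its stale copy and gets back a delta, so a 15KB feed with one new item costs a fraction of a full download.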

With two changes to the current RSS system, I think all practical bandwidth problems into the indefinite future can be solved:

  1. Get RSS clients and servers using librsync to transfer RSS files instead of downloading them whole, so that only the changed portions cross the wire (the sketch above shows the underlying signature/delta exchange). This saves bytes.
  2. Create a new tag for RSS that specifies alternate download locations, and let the RSS aggregator randomly choose which source to use. (Sources whose files are older than the current RSS file would be discarded for some period of time; this may also mean timestamping the RSS file itself, if the software is not already doing so.) These mirrors can be set up just like current mirrors of free software: updated via rsync, with people in turn rsyncing off of the mirrors. A weblog popular enough to need this can probably find people willing to host mirrors without any particular problem. A sketch of what such a tag might look like follows this list.
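
As for what the alternate-location tag might look like, here is one hypothetical sketch; the `<mirror>` element is invented for illustration and is not part of RSS 1.0, RSS 2.0, or any other published specification:

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical markup: <mirror> is an invented element, shown only to
# illustrate the idea of advertising alternate download locations.
FEED = """\
<rss version="2.0">
  <channel>
    <title>Example Weblog</title>
    <link>http://example.org/</link>
    <mirror>http://mirror1.example.net/feeds/example.rss</mirror>
    <mirror>http://mirror2.example.org/feeds/example.rss</mirror>
  </channel>
</rss>"""

def pick_source(feed_xml: str, canonical: str) -> str:
    """Randomly choose among the canonical URL and any listed mirrors,
    spreading the polling load across all of them."""
    channel = ET.fromstring(feed_xml).find("channel")
    mirrors = [m.text for m in channel.findall("mirror")]
    return random.choice([canonical] + mirrors)

print(pick_source(FEED, "http://example.org/index.rss"))
```

An aggregator would also remember the timestamp of the newest copy it has seen, and temporarily drop any mirror that serves something older, per the caveat in the second item.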

It's simple, the two changes are orthogonal, and they complement each other nicely. The only "problem" with this plan is that it requires foresight: the RSS aggregator and producer communities need to start implementing it before it becomes a problem. Once it is a problem, time pressure will push half-baked solutions out the door, and, frankly, all the half-baked solutions that will work were mined out last time. It's time for the community to turn its attention to this proactively.

While I'd be happy to help out with this, the fact is that I am neither an RSS aggregator author nor a site large enough to merit this treatment, nor do I anticipate becoming either any time soon. I'm not certain there's much I can do right now except point out the problem and a potential solution, one that would not be too difficult to implement and doesn't depend on "boiling the ocean". So I guess this is "bread cast upon the water".