Compression

Below the fold, a discussion of compression, used as a clear example of a principle I intend to relate to other programming principles, and indeed to engineering principles in general.

The pigeonhole principle is named for the fact that if you have five holes and six pigeons that need holes, then no matter how you arrange the pigeons, at least one hole ends up with two pigeons in it.

A simple and air-tight proof that no algorithm can possibly compress all sets of data is based on this principle. A real-world compression algorithm defines a reversible mapping of input bytes to output bytes. (It has to be "reversible", or you can't write a perfect decompressor.) Imagine starting with a compression algorithm that maps every input to itself (the identity mapping). Now, if you map the five-byte input ABCDE to the four-byte output WXYZ (20% compression), the input string WXYZ must go somewhere else. It can only grow in the putatively compressed output, because every output of four bytes or fewer is already taken, and the only slot we freed up is the five-byte ABCDE. In order for a compression algorithm to shrink some inputs, some others must grow.

This is not a rigorous proof, but the rigorous proof takes this form.
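To make the counting concrete, here is a small Python sketch (a toy illustration, not a proof script): it counts every byte string of length n and every strictly shorter string, showing there simply aren't enough short outputs to go around.

```python
# A lossless compressor must be injective (reversible), so it cannot
# map all strings of length n into the strictly smaller set of all
# shorter strings -- by the pigeonhole principle, something must grow.
n = 4
inputs = 256 ** n                          # every n-byte string
shorter = sum(256 ** k for k in range(n))  # every string of length 0..n-1
print(inputs)            # 4294967296
print(shorter)           # 16843009
print(inputs > shorter)  # True, and it holds for every n
```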

If this is true, then how can compression algorithms do useful work? Like many mathematical concepts, there are many equivalent ways of looking at the answer. The best formulation for my purpose is that algorithms can take advantage of the fact that not all data sets are equally probable.

The best way we can measure how much of the total possible space we will probably use is with a mathematical concept from information theory called entropy. This gives us a way to look at a data set and compute the number of bits it "really uses". Truly random data will tend to measure around 8 bits of entropy per byte, meaning that it is "really using" all 8 bits and is therefore incompressible; an endlessly repeating character will have an entropy very close to 0 bits per byte, meaning that it isn't "really" using those bits and is therefore highly compressible.
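As a rough illustration, here's a short Python sketch that computes the order-0 (frequency-only) entropy of a byte string. Real entropy estimates also account for context between symbols, but even this crude measure cleanly separates random data from repetitive data.

```python
import math
import os
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Order-0 Shannon entropy of the byte distribution, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy_bits_per_byte(os.urandom(1 << 20)))  # ~8.0: "really using" every bit
print(entropy_bits_per_byte(b"A" * (1 << 20)))     # 0.0: not "really using" any of them
```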

The entropy of English text is around 0.6 to 1.3 bits per character. Let's use 1 bit per character for computational convenience. A modern ASCII-based encoding of this text uses 8 bits per character. Using these numbers, we can compare the total number of possible data strings of a given length with the portion of that possibility space that is "really consumed" by possible English phrases.

For a message of a mere 15 characters, the number of possible byte-based messages is 2^(8*15) = 2^120, which is approximately 1.3*10^36. The number of possible English messages using this measure is 2^(1*15) = 2^15, or 32,768.
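The arithmetic is easy to reproduce; a few lines of Python also give the size of the gulf discussed below:

```python
byte_messages = 2 ** (8 * 15)     # every possible 15-byte string
english_messages = 2 ** (1 * 15)  # ~1 bit of entropy per character
print(f"{byte_messages:.2e}")     # ~1.33e+36
print(english_messages)           # 32768
print(f"{byte_messages / english_messages:.0e}")  # ~4e+31
```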

Don't lose sight of the point here by taking that number too literally; it's a probabilistic thing. Arguing about piddling differences here or there is unimportant when the gulf is already 31 orders of magnitude, and growing exponentially with every byte!

If you take a moment to digest those numbers (10^36 is not an easy thing to wrap your mind around!), you might get an intuitive glimpse of how compression algorithms manage to work. The data that we are interested in is a vanishingly small island in a sea of useless random data. Any compression algorithm must expand some inputs to shrink others, but we can let that useless random garbage "take" the expansion, while we concentrate our compression mojo on that little island.
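You can watch this trade-off play out with any off-the-shelf compressor. Here's a sketch using Python's standard-library zlib, with a repeated sentence standing in for English text (which exaggerates the redundancy of real prose, but makes the point vividly):

```python
import os
import zlib

# A repetitive, English-like island vs. the sea of random bytes.
text = b"The quick brown fox jumps over the lazy dog. " * 2000
noise = os.urandom(len(text))

for name, data in [("text", text), ("noise", noise)]:
    compressed = zlib.compress(data, 9)
    print(f"{name}: {len(data)} -> {len(compressed)} bytes")

# Typical result: the text shrinks to a tiny fraction of its size,
# while the random data comes out a few bytes *larger* than it went in.
```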

Nothing is free. If you want compression, you have to pay. But sometimes it's a really, really good deal.