Foreword
When I first began programming, I had no idea what data compression was nor why it mattered. Luckily, my Apple II Plus computer came with 0.000048 GB of memory (48 KB), which was quite a lot in 1979, and was enough to let me explore programming and computer graphics without worrying much about how my programs and data fit into memory. Thanks, Woz!
After programming for a few years, I had discovered:
- Data compression took time and could slow down my software.
- Changing my data organization could make the compressed data smaller.
- There was a bewildering variety of complicated data compression algorithms.
This led to the realization that compression was not a rigid black box; rather, it was a flexible tool that greatly influenced the quality of my software and could be manipulated in several ways:
- Changing compression algorithms could make my software run faster.
- Pairing my data organization with the right compression algorithm could make my data smaller.
- Choosing the wrong data organization or algorithm could make my data larger (and/or run slower).
Ah! Now I knew why data compression mattered. If things weren’t fitting into memory or were decompressing too slowly, I could slightly change my data organization to better fit the compression algorithm. I’d simply put numbers together in one group, strings in another, build tables of recurring data types, or truncate fractions into integers. I didn’t need to do the hard work of evaluating and adopting new compression algorithms if I could fit my data to the algorithm.
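To make the idea concrete, here’s a tiny, made-up sketch (in Python, with zlib standing in for whichever LZ-style compressor you happen to have; the record layout is invented purely for illustration). Grouping the numbers and strings separately and truncating the fractions shrinks the compressed output without touching the compressor at all:

```python
# A toy sketch of "fitting the data to the algorithm". zlib stands in here
# for a generic LZ-style compressor; the record layout is made up.
import json
import random
import zlib

random.seed(1)
records = [{"name": f"item-{i % 50}", "value": random.uniform(0, 100)}
           for i in range(1000)]

# Naive organization: strings and full-precision floats interleaved per record.
naive = json.dumps(records).encode()

# Reorganized: numbers together in one group (truncated to integers),
# strings together in another.
grouped = json.dumps({
    "names": [r["name"] for r in records],
    "values": [int(r["value"]) for r in records],
}).encode()

print(len(zlib.compress(naive)), len(zlib.compress(grouped)))
```

The second layout is smaller before compression (the fractional digits are gone) and friendlier to the compressor (similar values sit next to one another).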
Then, I began making video games professionally, and most of the game data was created by not-so-technical artists, designers, and musicians. It turned out that math was not their favorite topic of discussion, and they were less than excited about changing the game data so that it would take advantage of my single go-to compression algorithm.
Well, if the data organization couldn’t be improved, that left choosing the best
compression algorithm to pair up with all of this great artistic data.
I surveyed the various compression algorithms and found there were a couple of
broad categories suitable for my video game data:
Lossless
- De-duplication (LZ)
- Entropy (Huffman, Arithmetic)
Lossy
- Reduced precision (truncation or decimation)
- Image/video
- Audio
For text strings and binary data, I used LZ to compress away repeating duplicate data patterns. Pixel data went through lossy vector quantization (VQ) to map pixels to a color palette. Audio data went through lossy decimation and linear predictive coding (LPC) to reduce the bits per second. The output of any of those compressors could then go into a lossless Huffman compressor for additional statistical entropy compression, if the CPU was fast enough.
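In today’s terms, that kind of pipeline looks roughly like the following sketch (Python, with a synthetic sine wave standing in for real audio and zlib standing in for the Huffman stage; the VQ and LPC steps are omitted):

```python
# Lossy stage (decimation plus reduced precision) feeding a lossless stage.
import math
import zlib

samples = [math.sin(i / 10.0) for i in range(8000)]   # fake "audio"

# Original: 16 bits per sample.
raw = b"".join(int((s + 1) * 32767).to_bytes(2, "little") for s in samples)

# Lossy: keep every 4th sample (decimation) and store it at 8-bit precision.
lossy = bytes(int((s + 1) * 127) for s in samples[::4])

# Lossless: a statistical/deduplication pass on top of the lossy output.
print(len(raw), len(lossy), len(zlib.compress(lossy)))
```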
During the 1980s and 1990s, I worked on about 30 games, most of which used these
compression algorithms along with simple data build tools that performed limited
optimizations of the data organization.
But then, around the year 2000, the situation became more complex. There is an ongoing arms race between the tools that generate data and the software that displays and analyzes it, with consequences for software performance, storage size, network congestion, and the efficient pairing of compression algorithms with data organization.
This data flood has been partially offset by larger storage (Blu-ray discs, terabyte hard drives, cloud storage), faster multicore CPUs, new lossless compression algorithms such as BWT, ANS, and PAQ, as well as dramatic improvements in lossy codecs for image, video, and audio data. However, data sizes are growing faster each year and dwarf the slower improvements in network bandwidth, compression algorithms, and storage capacity.
Which brings us to the present and why this book matters.
How can a programmer learn which algorithms to use on their data and which data
changes will help or hinder a particular algorithm? What would really help is an overview of the major data compression algorithms to guide developers through the myriad choices now available. Most developers don’t need to wade through all the theory and math details required to implement these algorithms; instead, they need a road map of the strengths and weaknesses of these algorithms, and how to take advantage of them for specific use cases.
I’ve greatly enjoyed implementing, using, and watching the evolution of data compression algorithms over the past 37 years. I hope this book will help demystify data compression and provide a starting point for software engineers to learn about compression algorithms and help them make better software.
CHAPTER 1
Let’s Not Be Boring
Welcome to the first chapter of a book about a niche section of computing. We’re supposed to set the stage here for the entire book (that’s what the publisher says), and really hook the reader (that would be you). We’re expected to talk about history, a handful of basics, and anything else that we can do to try and ease you into the topic of compression as gently (but interestingly) as possible. Without math. Because math is hard. (Yep, that’s the last time we’ll say this.)
But let’s be real, that’s boring for you to read, and for us to write.
So here’s what we’re going to do, instead. This book is about compression. And compression is all about the most compact representation of data. So, we’re going to run through this introductory stuff in the shortest, most compressed form possible.
First, we’re going to talk about buckets. Then, we’re going to introduce you to this rebel named Claude Shannon, who pretty much ruined our lives while simultaneously creating every important thing that you love about computers. Finally, we are going to reveal to you the one essential thing you need to know about data compression. And without going out of our way (hardly!), we’ll make clear how compression pays off in better, cheaper, and faster apps.
Do we have a deal?
The Five Buckets of Compression Algorithms
Data compression algorithms are a really, really big space. Fortunately, these algorithms fall into a few buckets, which makes things a lot easier to understand. To throw the words at you, they are variable-length codes, statistical compression, dictionary encodings, context modeling, and multicontext modeling. Each of these five high-level buckets contains a horde of algorithm variations, which is a good thing; each variation differs slightly in intended input data, performance, memory constraints, and output sizes. Picking the correct variant means carrying out tests on your data and the encoders to find the one that works best.
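As a minimal sketch of what “carrying out tests on your data” can look like in practice (our example, using Python’s stock codecs and a placeholder payload in place of your real data):

```python
# Run the same payload through a few stock compressors and compare sizes.
# A real evaluation would also measure speed and memory, on your actual data.
import bz2
import lzma
import zlib

data = b"an example payload with some repetition " * 1000  # placeholder data

for name, compress in (("zlib", zlib.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)):
    print(name, len(compress(data)))
```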
Now, you can use these buckets together, because some buckets contain algorithms
whose entire purpose is to transform the data so that another bucket can be more
efficient at compressing it.
For you to be viewed as a compression guru, you need to understand the buckets,
how they fit together, and what types of variants to use from which bucket for your own data sets.
Let’s get started.
Claude Shannon Is Infuriating!
Back in the 1940s, a statistical researcher named Claude Shannon published several papers detailing research he did while working in the military during World War II, and later at Bell Labs.
Claude was a pretty smart guy (and very good at math). Before he left the University of Michigan in 1936, he’d racked up bachelor’s degrees in engineering and mathematics.
He then went on to do a bunch of crazy post-graduate stuff at the Massachusetts
Institute of Technology, and his master’s thesis, “A Symbolic Analysis of Relay and Switching Circuits”, became the foundation of modern electrical switch-based computing.
In 1948, Shannon published A Mathematical Theory of Communication, which
detailed how to best encode information that a sender wants to transmit, thus inventing the entire field of Information Theory. Messages can be encoded in many ways—think “alphabet” or “Morse code”—but for every message, there is a most efficient way to encode it, where “efficient” means using the fewest possible letters or symbols (or bits, or units of information). What “fewest” boils down to depends on the information content of the message. Shannon invented a way of measuring the information content of a message and called it information entropy.
Data compression is a practical application of Shannon’s research, which asks, “How compact can we make a message before we can no longer recover it?” (It’s important to note that according to modern information theory, there is a point at which removing any more bits removes the ability for you to uniquely recover your data stream properly. So, our compression goal is to remove as many bits as possible to get to this point, and then remove no more.)
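If you’d like to poke at that limit yourself, here’s a small sketch (ours, not Shannon’s notation) that measures a message’s entropy in bits per symbol, which is the floor a simple symbol-by-symbol encoder can’t beat for that distribution:

```python
# Shannon entropy of a message, in bits per symbol: the compression floor
# for a simple encoder that treats each symbol independently.
from collections import Counter
from math import log2

def entropy_bits_per_symbol(message: str) -> float:
    counts = Counter(message)
    total = len(message)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(entropy_bits_per_symbol("ABABABABABAB"))           # predictable: 1.0 bit/symbol
print(entropy_bits_per_symbol("what fresh chaos is this?"))  # less predictable: higher
```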
So wait...why is he infuriating?
Well, although we can thank Mr. Shannon for helping to create the modern computers on which this book is being typed (and on which you’re most likely reading it), his work on information theory is directly the thing we’re trying to defeat. You can look at data compression as a rebellion against information entropy. Every compression algorithm computer scientists write tries to disprove Claude Shannon’s research, and compress the data further than its measured entropy. We scrape and pull and steal any bits we can from a message, to make it as small as possible, each time trying to break below Shannon’s definition of entropy and get to a new level of information understanding. Millions upon millions of hours of engineering time over the past 60 years have been solely dedicated to creating algorithms to defeat—or cleverly sneak around—a concept created by this brilliant man.
The Only Thing You Need to Know about Data Compression
OK, here’s what you need.
Data compression works via two simple ideas:
- Reduce the number of unique symbols in your data (smallest possible “alphabet”).
- Encode more frequent symbols with fewer bits (fewest bits for most common “letters”).
Boom. Done. That’s it.
Sixty years of compression research boiled down to two bullet points. Every single algorithm in data compression focuses on doing one of these two things. It transforms the data to be more compressible by shuffling or reducing the number of symbols, or it takes advantage of the fact that some symbols are more common than others, and encodes more common symbols with fewer bits.
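Here’s a quick illustration of both ideas (ours, with zlib as a convenient off-the-shelf compressor): bytes drawn from a smaller alphabet, or from a heavily skewed distribution, compress far better than uniformly random bytes:

```python
# Idea 1: fewer unique symbols. Idea 2: skewed (non-uniform) symbol frequencies,
# which let a compressor give the common symbols shorter codes.
import random
import zlib

random.seed(0)
n = 10_000

uniform = bytes(random.randrange(256) for _ in range(n))       # 256 symbols, no skew
small_alphabet = bytes(random.randrange(4) for _ in range(n))  # only 4 unique symbols
skewed = bytes(random.choices(range(256), weights=[1000] + [1] * 255, k=n))

for label, data in (("uniform", uniform),
                    ("small alphabet", small_alphabet),
                    ("skewed", skewed)):
    print(label, len(zlib.compress(data)))
```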
What makes applied data compression so complex is that there’s a gazillion ways to do these two things, depending on the kind of data you have. You’ll need to take the following considerations into account:
- Different data needs to be treated differently. Words in a book and floating-point numbers, for example, respond to very different algorithms.
- Some data can be transformed first to make it more compressible.
- Data might be skewed. For example, temperature data taken in summer might be
skewed toward high temperatures; that is, it might contain a lot more high temperatures than near-freezing ones.
Your challenge as a programmer is to figure out the best way, or combination of ways, for compressing any block of data that a user throws at your application. And your challenge as a content developer is to figure out how to throw data at your users and not break their bank accounts.
That, my dear adventurer, is what the rest of this book is about. It’s your field guide to understanding what in the compression world is worthy of your attention, and how the algorithms work conceptually, so that you can choose the right ones and apply them to your super awesome social/mobile/web/media application data.
A World Built on Data Compression
Let’s be clear about this: the computing world that you live in, right now, is built entirely on the back of data compression algorithms.
Yup. Every piece of it.
Every web page, image, song, cat video, streaming Internet movie, selfie, video game download, microtransaction, and OS update works only because of compression
algorithms. In fact, you can’t throw a single bit of data around the Internet without running into some compressed content.
What’s so amazing about data compression technology is that it’s responsible for
some of the largest changes in personal computing over the past 40 years, and no one knows about it.
For example, do you download or stream music instead of buying a CD? If so, you
have compression to thank.
Music compression
See, in 1996, a joint working group (a bunch of smart people from different companies) unveiled the MP3 file format. This new audio format changed the nature of audio on computers. Until that point, the WAV file format was the most dominant and accepted format for creating, storing, and transferring audio data. Everyone used it, but the files were irresponsibly huge. A three-minute song could be roughly 30 MB in size and take around 9 minutes to download. Forget about streaming!
The invention of MP3 meant that anyone could get a full-length, three-minute song in about 1–3 MB of data at impressive audio quality levels. Users could even plop CDs into their computers and convert an entire album to the MP3 format to listen to digitally.
This combination of smaller file size and good quality gave birth to one of the biggest consumer innovations of our time: Napster. This service made it possible for people to trade MP3s with one another, free of cost. Of course, this opened up a massive legal problem: folks would buy a CD, convert the audio to MP3, and then share it with their friends, who never had to pay for the original disc. As you can imagine, the companies that made money off CD sales were infuriated and did everything within their power to successfully shut down the Napster service.
And so, the late 1990s/early 2000s were riddled with legal battles and governmental policy changes attempting to stop this kind of music sharing. There was even legislation proposed that would make the use of the MP3 format illegal.
Apple, rather than fighting this new digital phenomenon, decided to build a product around it. In 2001, it launched the iPod, one of the first portable devices dedicated to storing and playing MP3 files. With it came the iTunes Store, where customers could legally purchase MP3 files for personal use. Today digital music distribution has become the new normal, with a plethora of companies trying to find better ways to sell music to you.
The massive success of the iPod product eventually led to the development and
release of the iPhone device, changing the face of personal computing forever. (But that’s a different story.)
Image compression
Let’s cycle back in time a bit further to the birth of the Internet. In 1978, when some of the earliest connections of the Internet were created, the amount of data sent was pretty minimal. The small number of users would primarily send and receive text data, or images that were created entirely out of characters, as demonstrated in Figure 1-1.
Figure 1-1. A castle, made of ASCII art. Source: Wikipedia, no author.
The issue at hand was that real image information, stored in 24-bits-per-pixel format, was entirely too hefty for early connection modems. So, of course, the compression gurus went right at it. To test their new image compression algorithms, they needed a corpus of images. Being part of a male-dominated industry, they might have had a bias toward gentlemen’s magazines for source material; thus, they ended up with the now famous Lena image (see Figure 1-2), a picture of Lena Söderberg from the pages of the November 1972 issue of Playboy magazine.
Figure 1-2. “Lena.” Original full portrait photographed by Dwight Hooker and published as “Playmate of the Month” in the November 1972 issue of Playboy magazine. This 512 x 512 electronic/mechanical scan of a section of the full portrait was created by Alexander Sawchuk et al. and is available from the USC-SIPI image database. Licensed under Fair Use via Wikipedia.
When they unveiled the results of their research, they used a cropped, PG-13 version of the image in their paper, and provided the original version for others to test their own compression algorithms on, as well. For a long time, Lena was the gold-standard corpus image for testing the majority of image compression algorithms. Thankfully, since then, less controversial image corpora have been created. (The Kodak company’s image test suite is our personal favorite.) However, Lena is often still included as a litmus test in many image compression papers today.
Video compression
Fast-forward to 2005 and the launch of YouTube, a website where users could upload any video they recorded, for free, for everyone to see.
Until this point, the dominant format for sending around video information had been MOV, which was nothing more advanced than a series of JPG images strung together in order. Unsurprisingly, the files were insanely large. So, the idea that you could just load a web page and watch a video was mind-boggling.
Genome mapping
In 2008, in an attempt to tackle disease and human mortality, scientists started to map and test the human genome. A single genome sequence represents an enormous
amount of data—more than 14 GB just to describe the makeup of a human. These
data sizes were larger than most systems were able to handle (and cloud computing
hadn’t become a big thing yet).
Compression, once again, came to the rescue. Researchers found that BWT was the most efficient way to store DNA information in a compressed form, and they could even perform operations on it without having to decompress it first.
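For the curious, here’s a naive sketch of the BWT itself (ours, and vastly slower than the suffix-array constructions real genomics tools use). The transform doesn’t shrink anything on its own; it just reorders symbols so that repeats cluster together, which a later compression stage can exploit:

```python
# Naive Burrows-Wheeler transform: sort all rotations of the input and take
# the last column. O(n^2 log n); fine for a demo, not for a genome.
def bwt(text: str, end: str = "$") -> str:
    text += end
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("GATTACAGATTACA"))  # note the runs of repeated letters in the output
```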
By 2014, researchers had created one of the fastest protein folders on the planet, combining scalable cloud computing and compressed data transfer between host computers.
Compression and the economy
So, you see, compression has been at the heart of many massive changes in computing technology and culture. The reason for this lies in simple economic theory: compressed files are smaller files. Meaning, it takes less time to transfer them, and it costs less to do so, as well. Distributors pay less to distribute, and customers pay less to consume. In a modern world in which computing time is literally money, compression represents the most economically viable way to shorten the gap between content distributors and content consumers.