Compression has an important role to play in Big Data technologies: it makes both storage and transport of data more efficient. Why, then, are there so many compression formats, and what do we have to balance when deciding which format to use?

When data is compressed it becomes smaller, so both disk I/O and network I/O become faster, and it goes without saying that you also save storage space. On the other hand, compressing and decompressing data takes time, so the CPU load increases. These are essentially the two sides of the trade-off. Besides that, some compression formats support splitting the compressed data while others do not.

So, to summarize: on one side of the trade-off are storage space, disk I/O, and network I/O; on the other side is CPU load.
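As a rough illustration of this trade-off, the sketch below uses only codecs from the Python standard library (gzip, bz2, and lzma stand in for the codecs typically used in Hadoop) and compresses the same repetitive sample data with each one, reporting compression ratio against CPU time. The sample data and sizes are arbitrary, and the exact numbers will vary by machine and data.

    import bz2, gzip, lzma, time

    # Repetitive, CSV-like sample data (~5 MB) as a stand-in for typical log/record data.
    data = b"user_id,event,timestamp\n" * 200000

    for name, compress in (("gzip", gzip.compress), ("bz2", bz2.compress), ("lzma", lzma.compress)):
        start = time.perf_counter()
        out = compress(data)
        elapsed = time.perf_counter() - start
        # Higher ratio = smaller output; higher cpu = more time spent compressing.
        print(f"{name:5s}  ratio={len(data) / len(out):6.1f}x  cpu={elapsed:.3f}s")

Typically the codecs that squeeze the data hardest also burn the most CPU time, which is exactly the balance described above.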

Since disk space is cheap, the usual preference is to optimize for low CPU load rather than for the best compression ratio. Two algorithms that lead the way in this approach are LZO and Snappy, so it is no wonder both are very popular in the Big Data space. Snappy is faster than LZO but has one drawback: it is not splittable. This means the file has to be compressed as a whole before writing and can only be decompressed after the whole file has been reassembled (as opposed to block-level compression and decompression at each node).
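The splittability point can be illustrated with a small Python sketch. Here a single gzip stream stands in for any whole-file, non-splittable format: it cannot be decompressed starting from an arbitrary offset, whereas independently compressed blocks (the idea behind splittable, block-oriented formats) can each be read in isolation. The block size and sample data are arbitrary illustrative choices, not anything a specific codec mandates.

    import gzip, zlib

    data = b"some line of data\n" * 100000

    # Whole-file compression: a single gzip stream cannot be decoded from the middle.
    whole = gzip.compress(data)
    try:
        gzip.decompress(whole[len(whole) // 2:])  # pretend a split starts here
    except Exception as e:
        print("mid-stream decompression fails:", type(e).__name__)

    # Block-level compression: compress fixed-size chunks independently, so any
    # block can be decompressed on its own by whichever node receives it.
    block_size = 256 * 1024
    blocks = [zlib.compress(data[i:i + block_size]) for i in range(0, len(data), block_size)]
    one_block = zlib.decompress(blocks[3])  # read a single block in isolation
    print("block 3 decompressed bytes:", len(one_block))

This is why a non-splittable format forces a single reader to process the whole file, while a splittable one lets each block be handled independently in parallel.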
