Huffman compression

The Huffman compression algorithm is named after its inventor, David Huffman, formerly a professor at MIT.
Huffman compression is a lossless compression algorithm that is ideal for compressing text or program files. This probably explains why it is used a lot in compression programs like ZIP or ARJ.

How Huffman compression works

Huffman compression belongs to a family of algorithms with a variable codeword length. That means that individual symbols (characters in a text file for instance) are replaced by bit sequences that have a distinct length. So symbols that occur a lot in a file are given a short sequence while others that are used seldom get a longer bit sequence.

A practical example will show you the principle: Suppose you want to compress the following piece of data:
ACDABA

Since these are 6 characters, this text is 6 bytes or 48 bits long. With Huffman encoding, the file is searched for the most frequently appearing symbols (in this case the character ‘A’ occurs 3 times), and then a tree is built that replaces the symbols by shorter bit sequences. In this particular case, the algorithm would use the following substitution table: A=0, B=10, C=110, D=111. If these code words are used to compress the file, the compressed data look like this:
01101110100

This means that 11 bits are used instead of 48, a compression ratio of 4 to 1 for this particular file.

Huffman encoding can be further optimized in two different ways:

Adaptive Huffman code dynamically changes the codewords according to the change of probabilities of the symbols.
Extended Huffman compression can encode groups of symbols rather than single symbols.

Advantages and disadvantages

This compression algorithm is mainly efficient in compressing text or program files. Images, like they are often used in prepress, are better handled by other compression algorithms.