Also mentioned in lectures was the concept of a Huffman code, which is a method of lossless data compression (reducing the amount of data needed to store or transmit information). Recall that a Huffman code can be extracted from a Huffman tree – one basic algorithm for building a Huffman tree is:
1. Start with an empty set T.
2. Add each symbol to T as a single-node tree, weighted by its probability.
3. While there are multiple trees in T:
   1. Remove the lowest-probability tree from T (break ties in favour of smaller tree size), and call this tree A.
   2. Remove the next lowest-probability tree from T (again breaking ties in favour of smaller tree size), and call this tree B.
   3. Make a new tree C by joining A and B, and set the probability of C to p(A) + p(B).
   4. Add the new tree C to T.
4. Return the only tree remaining in T as the Huffman tree.
Note: we will cover algorithm concepts in more detail in later classes; for now you only need to know the algorithm itself.
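To make the steps concrete, here is a minimal Python sketch of this algorithm using a priority queue. The function name, the tuple-based tree representation, and the final tie-breaking counter are our own choices for illustration, not something fixed by the algorithm above.

```python
import heapq
from itertools import count

def build_huffman_tree(probabilities):
    """Build a Huffman tree from a dict of {symbol: probability}.

    Each heap entry is (probability, tree_size, tiebreak, tree), so the
    lowest-probability tree is removed first and ties are broken by smaller
    tree size, as in the algorithm above. A leaf is just a symbol; an
    internal node is a (left, right) pair.
    """
    tiebreak = count()  # keeps comparisons from ever reaching the trees themselves
    heap = [(p, 1, next(tiebreak), symbol) for symbol, p in probabilities.items()]
    heapq.heapify(heap)

    while len(heap) > 1:                              # multiple trees remain in T
        p_a, size_a, _, tree_a = heapq.heappop(heap)  # lowest-probability tree A
        p_b, size_b, _, tree_b = heapq.heappop(heap)  # next lowest, tree B
        tree_c = (tree_a, tree_b)                     # join A and B into C
        heapq.heappush(heap, (p_a + p_b, size_a + size_b, next(tiebreak), tree_c))

    return heap[0][3]  # the only tree left in T is the Huffman tree

# One possible tree for the example system below; the exact shape can vary
# with how remaining ties are broken.
tree = build_huffman_tree({"A": 0.2, "B": 0.1, "C": 0.3, "D": 0.3, "E": 0.1})
```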
As with entropy, a worked example of building a Huffman tree is discussed in lectures. For example, one possible Huffman tree corresponding to the system shown earlier (with entropy of 2.171 bits) would be:
{{/Labs/01/Images/fig1.svg}}
The Huffman code can then be extracted from this tree by walking the path from the root to each symbol's leaf:
| Symbol | P(Symbol) | Code | Code length |
|---|---|---|---|
| A | 0.2 | 11 | 2 |
| B | 0.1 | 100 | 3 |
| C | 0.3 | 00 | 2 |
| D | 0.3 | 01 | 2 |
| E | 0.1 | 101 | 3 |
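Continuing the sketch above (and reusing `build_huffman_tree`), the codes can be read off with a simple recursive walk from the root. Which branch is labelled 0 and which 1 is arbitrary, so the codes you obtain may differ from the table while having the same lengths.

```python
def extract_codes(tree, prefix=""):
    """Walk from the root, appending '0' for one branch and '1' for the other;
    the code for a symbol is the path taken to reach its leaf."""
    if not isinstance(tree, tuple):  # a leaf: the path so far is the symbol's code
        return {tree: prefix}
    left, right = tree
    codes = extract_codes(left, prefix + "0")
    codes.update(extract_codes(right, prefix + "1"))
    return codes

codes = extract_codes(tree)  # maps each symbol to its bit string
```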
On average, the code length in this system is:
$$ 0.2 \times 2 + 0.1 \times 3 + 0.3 \times 2 + 0.3 \times 2 + 0.1 \times 3 = 2.2 \text{ bits} $$
Recall that the entropy of this system is $2.171$ bits, so the efficiency of this Huffman coding is:
$$ \frac{2.171}{2.200} = 98.7\% $$
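The same arithmetic as a quick check in Python, using the probabilities and code lengths from the table and the entropy value of 2.171 bits quoted above:

```python
# (probability, code length) pairs taken from the table above
table = {"A": (0.2, 2), "B": (0.1, 3), "C": (0.3, 2), "D": (0.3, 2), "E": (0.1, 3)}

average_length = sum(p * length for p, length in table.values())  # 2.2 bits per symbol
efficiency = 2.171 / average_length                               # about 0.987, i.e. 98.7%
```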
Contrast this with a naïve encoding of our symbols, with three bits per symbol:
| Symbol | P(Symbol) | Code | Code length |
|---|---|---|---|
| A | 0.2 | 000 | 3 |
| B | 0.1 | 001 | 3 |
| C | 0.3 | 010 | 3 |
| D | 0.3 | 011 | 3 |
| E | 0.1 | 100 | 3 |
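By the same measure, the efficiency of this naïve three-bit encoding is noticeably lower:

$$ \frac{2.171}{3.000} \approx 72.4\% $$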
As an exercise, build a Huffman tree for the following system and fill in the code and code length for each symbol:

| Symbol | P(Symbol) | Code | Code length |
|---|---|---|---|
| A | 0.0625 | | |
| B | 0.25 | | |
| C | 0.5 | | |
| D | 0.0625 | | |
| E | 0.125 | | |