
Huffman Compression

Experimenting with simple text compression

Overview

This project was created in my ADEN class to learn how Huffman Compression works.

We created a program that uses Huffman Compression to encode and decode text files, then analyzed the compression results based on how large the source text was.

Project Planning

The project was divided into five phases:

  1. Generating the weights
  2. Creating a Huffman Tree
  3. BinaryIO
  4. Encoding
  5. Decoding

Generating the weights

This was the simplest part of the project. For this phase, we wrote functions to load and read a file and keep track of how many times each character appears in it. The character frequencies are the basis for the whole compression: if we know how often a character shows up, we can create a variable-length code for every character.
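The weight-generation step can be sketched in a few lines of Python (the filename below is just a placeholder, not the actual file we used):

```python
from collections import Counter

def generate_weights(path):
    """Count how many times each character appears in a text file."""
    with open(path, encoding="utf-8") as f:
        return Counter(f.read())

# Example (hypothetical filename):
# weights = generate_weights("war_and_peace.txt")
```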

After generating the weights for the novel War and Peace, I visualized them in a graph.

A graph showing the character frequency of the novel War and Peace

The numbers along the bottom represent the ASCII code for each character. You can see that character 32 shows up in a relatively high percentage of the novel. This is because ASCII 32 is the space character.

Creating the Huffman Tree

In this stage of the project, we used the weights generated in the last step to create a frequency-weighted binary tree. Characters that showed up more frequently sit closer to the top of the tree. While traversing the tree to look for a specific character, you keep a running string of 1s and 0s: moving to the right child appends a 1, and moving to the left child appends a 0. Every character gets a unique string, and the length of that string is inversely related to the character's frequency in the source text. So a frequent character like "a" ends up with a short string.
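Here is a minimal Python sketch of how such a tree can be built with a priority queue and then walked to produce the codes. The tuple-based node layout and helper names are just one possible representation, not necessarily what we used in class:

```python
import heapq

def build_tree(weights):
    """Build a Huffman tree from a {char: count} mapping.
    Nodes are (char, left, right) tuples; leaves carry the char."""
    # Heap entries are (weight, tiebreak, node); the tiebreak keeps
    # comparisons away from the node tuples when weights are equal.
    heap = [(w, i, (ch, None, None)) for i, (ch, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # two lightest subtrees...
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (None, left, right)))  # ...merge them
        count += 1
    return heap[0][2]

def make_codes(node, prefix="", codes=None):
    """Walk the tree: a left edge appends '0', a right edge appends '1'."""
    if codes is None:
        codes = {}
    ch, left, right = node
    if ch is not None:
        codes[ch] = prefix or "0"  # edge case: a one-character alphabet
        return codes
    make_codes(left, prefix + "0", codes)
    make_codes(right, prefix + "1", codes)
    return codes
```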

BinaryIO

In this phase of the project, I wrote methods to convert the binary string created in the previous step into actual binary bytes. This was done with bitmasks and shifts.
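A rough sketch of that bit-packing idea, shifting each bit into a byte one at a time (an illustration of the technique, not our class code):

```python
def bits_to_bytes(bitstring):
    """Pack a string of '0'/'1' characters into bytes.
    The final byte is zero-padded on the right if needed."""
    out = bytearray()
    for i in range(0, len(bitstring), 8):
        chunk = bitstring[i:i + 8].ljust(8, "0")
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | (bit == "1")  # shift left, mask in the next bit
        out.append(byte)
    return bytes(out)
```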

Encoding

This phase of the project combined everything from earlier steps to encode (or compress) a text file.

For every character in the input file, we appended its binary string encoding to a long running string. Every time the running string reached 8 bits, we wrote those bits out as a byte to the new file.
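Putting the pieces together, the encoding step can be sketched as follows. Here `codes` is assumed to be a table mapping each character to its bit string; returning the true bit count alongside the bytes makes it possible to ignore the final byte's padding later:

```python
def encode(text, codes):
    """Translate every character into its bit string, then pack into bytes.
    Returns (packed_bytes, bit_count); bit_count excludes the padding."""
    bits = "".join(codes[ch] for ch in text)
    # Pad to a multiple of 8 so the last partial byte can be written.
    padded = bits.ljust((len(bits) + 7) // 8 * 8, "0")
    data = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return data, len(bits)
```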

Decoding

In order to decode a compressed file, we need both the Huffman tree (built from the specific weights for that file) and the file we want to decode.

For every bit in the encoded file, we traversed the Huffman tree (going left at a 0 and right at a 1) until reaching a leaf, which gives the character equivalent.

This process is simply repeated until the entire file has been read.
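The traversal above can be sketched like this, reusing the (char, left, right) tuple layout assumed in the earlier sketches; the bit count tells the loop where the real data ends and the padding begins:

```python
def decode(data, nbits, tree):
    """Walk the Huffman tree bit by bit: 0 goes left, 1 goes right.
    Leaves carry a character; internal nodes carry None."""
    out = []
    node = tree
    for i in range(nbits):                          # stop before the padding bits
        bit = (data[i // 8] >> (7 - i % 8)) & 1     # extract bit i, MSB first
        node = node[2] if bit else node[1]          # 1 -> right child, 0 -> left
        ch, left, right = node
        if ch is not None:                          # reached a leaf: emit and restart
            out.append(ch)
            node = tree
    return "".join(out)
```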

Results

An analysis of weight usage.

Below is a graph showing how the choice of weights used to compress a file ultimately resulted in different levels of file compression.

The Cat in the Hat weights were not as good as the War and Peace weights, because The Cat in the Hat is a much shorter text and does not contain enough characters to accurately represent how language is distributed.

A graph showing the compression of various files based on which source file weights were used to compress them.

Key Learnings

  1. I learned how binary strings can be used for compression, and gained a better understanding of how ASCII works.
  2. I found it interesting that David Huffman invented this system as a graduate student, in a term paper written in place of a final exam.
  3. I wonder how newer compression methods work and how they differ from this system.