Huffman Compression
Experimenting with simple text compression
Overview
This project was created in my ADEN class to learn how Huffman Compression works.
We created a program that uses Huffman Compression to encode and decode text files, then analyzing the compression based on how large the source text was.
Project Planning
The project was divided into five phases:
Creating the Huffman Tree
In this stage of the project, we took the weights that were generated in the last step to create a frequency-weighted tree. Characters that showed up more frequently were closer to the top of the tree. While traversing the tree to look for a specific character, you keep a running string containing 1's and 0's. If you move to the right child, you add a 1 to the string, and if you move to the left you add a 0. Every character has a unique string, and the length of the string is inversely proportional to the frequency of that character in the source text. So if you use a character like a frequently, the string matching with a will be short.
BinaryIO
In this phase of the project, I wrote methods to convert the binary string created from the previous step to an actual binary number. This was done through bitmasks.
Encoding
This phase of the project combined everything from earlier steps to encode (or compress) a text file.
For every character in the input file, we added the binary string encoding to a long running string. Every 8 characters, we added that string to a new file.
Decoding
In order to decode a compressed file, we need both the huffman tree (containing the specific weights for the file) and the file we want to decode.
For every binary string written in the encoded file, we traversed the huffman tree to find the character equivalent (going left at a 0 and right at a 1).
This process was just repeated until the entire file was read.
Results
An analysis on weight usage.Below is a graph showing how changing which weights you used to compress a file ultimately resulted in different levels of file compression.
The Cat and the Hat weights were not as good as the War and Peace weights, because the Cat and the Hat is a much shorter text and does not see enough characters to accurately represent how language is distributed.
Key Learnings