Lab 3: Huffman Coding
Huffman coding is a fundamental encoding in the field of data compression. It forms
the basis for many modern compression approaches, including the DEFLATE algorithm
used to compress data in ZIP files. It is a variable-length code: each character
may have a different number of bits (rather than 8 for each, as in ASCII).
The principle that allows the Huffman code to obtain very good compression rates is the
representation of the most frequent characters with the
smallest number of bits. That is, if
the letter “E” occurs most in a corpus of text, “E” will be represented by a short string of
bits, such as “1”. Conversely, infrequent letters, such as “X”, will be represented with
longer codes, such as “001011”.
Although Huffman encoding saves space by using only as many bits as are required, the
variable-length encoding does present problems: if “1” represents “E” and “0” represents
“S”, what happens when we need to represent “R” with “01”? How do we tell “R” apart
from “S” followed by “E”?
The solution is to avoid using codes that could clash with each other. For example, we
would not represent “S” with “0”; we would represent it with “01”, which can be
distinguished from “1” (the leading “0” clues us in).
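The clash described above can be checked mechanically: a code table is safe when no code is a prefix of another. The following is a minimal sketch of such a check; the class name PrefixCheck, the method isPrefixFree, and the code "00" chosen for “R” in the second table are assumptions made for illustration, not part of the provided lab classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class PrefixCheck {
    // Returns true if no code in the table is a prefix of (or equal to) another code.
    static boolean isPrefixFree(Map<Character, String> codes) {
        List<String> values = new ArrayList<>(codes.values());
        for (int i = 0; i < values.size(); i++) {
            for (int j = 0; j < values.size(); j++) {
                if (i != j && values.get(j).startsWith(values.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "0" for S is a prefix of "01" for R, so this table is ambiguous.
        System.out.println(isPrefixFree(Map.of('E', "1", 'S', "0", 'R', "01"))); // false
        // Lengthening S to "01" (and giving R the assumed code "00") removes the clash.
        System.out.println(isPrefixFree(Map.of('E', "1", 'S', "01", 'R', "00"))); // true
    }
}
```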
This encoding can be naturally modeled using a tree where moving left corresponds to a
“0”, while moving right corresponds to a “1”. No codes can conflict because no node can
have two parents.
Huffman’s discovery was to realize that this tree could be constructed by assigning each
character a priority based on its frequency in the input, storing the characters and
frequencies in leaves of the tree, and iteratively building the tree by merging the two
elements with the lowest frequency into one node. Elements with low frequencies do this
early and frequently, so they are placed at the bottom of the tree (long code lengths),
while elements with higher frequencies do this later, and end up near the top (short code
lengths).
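The merging process can be sketched with a priority queue ordered by frequency. This is only an illustration under stated assumptions: the Node type, the class name HuffmanSketch, and the frequencies in main are hypothetical, and the classes provided for the lab may be organized differently. The exact codes produced depend on how ties are broken, so only the code lengths should be relied on.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class HuffmanSketch {
    // Hypothetical node type; the lab's provided classes may differ.
    static class Node {
        final Character value; // null for interior nodes
        final int freq;
        final Node left, right;

        Node(Character value, int freq, Node left, Node right) {
            this.value = value;
            this.freq = freq;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Repeatedly merges the two lowest-frequency nodes until one root remains.
    static Node build(Map<Character, Integer> freqs) {
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.freq, b.freq));
        for (Map.Entry<Character, Integer> e : freqs.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(null, a.freq + b.freq, a, b));
        }
        return queue.poll(); // null if the frequency table was empty
    }

    // Walks the tree, appending "0" for left and "1" for right.
    static void collect(Node node, String prefix, Map<Character, String> out) {
        if (node == null) {
            return;
        }
        if (node.isLeaf()) {
            out.put(node.value, prefix);
            return;
        }
        collect(node.left, prefix + "0", out);
        collect(node.right, prefix + "1", out);
    }

    public static void main(String[] args) {
        Map<Character, Integer> freqs = new HashMap<>();
        freqs.put('E', 7); // assumed counts, matching "EEEEESSREE"
        freqs.put('S', 2);
        freqs.put('R', 1);
        Map<Character, String> codes = new HashMap<>();
        collect(build(freqs), "", codes);
        System.out.println(codes); // E gets a 1-bit code; S and R get 2-bit codes
    }
}
```

Note that a table with a single character yields an empty code in this sketch; a fuller implementation would handle that case separately.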
Once the tree is built, the characters are replaced by their codes.
The text “EEEEESSREE”, which consumes 80 bits in ASCII, would be
encoded as: 1111100000111, which is only 13 bits,
less than 2 characters in ASCII.
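The replacement step can be sketched as a simple table lookup per character. The mappings used here (E = "1", S = "00", R = "01") are inferred from the 13-bit string in the example and should be treated as an assumption; the class name EncodeSketch is hypothetical and is not one of the provided lab classes.

```java
import java.util.HashMap;
import java.util.Map;

class EncodeSketch {
    // Replaces each character with its code, producing ASCII '0' and '1' characters.
    static String encode(String text, Map<Character, String> codes) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            String code = codes.get(c);
            if (code == null) {
                throw new IllegalArgumentException("No code for '" + c + "'");
            }
            out.append(code);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> codes = new HashMap<>();
        codes.put('E', "1");  // assumed mappings, inferred from the example
        codes.put('S', "00");
        codes.put('R', "01");
        String bits = encode("EEEEESSREE", codes);
        System.out.println(bits);          // 1111100000111
        System.out.println(bits.length()); // 13
    }
}
```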
If the mappings are also stored, it is possible to decompress the code by simply re-creating
the tree from the mappings and traversing either left or right depending on the
next bit in the code, then outputting the letter when a leaf is reached and starting again
from the root. The code we just produced would then translate back to “EEEEESSREE”, the
original input string.
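The round trip can be checked without a tree at all. The sketch below swaps in a different technique from the tree traversal described above: it inverts the mapping table and accumulates bits until they match a complete code, which works only because no code is a prefix of another. The mappings (E = "1", S = "00", R = "01") are inferred from the example bit string, and the class name DecodeSketch is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class DecodeSketch {
    // Table-lookup decoding: not the tree traversal, but equivalent for prefix-free codes.
    static String decode(String bits, Map<Character, String> codes) {
        Map<String, Character> byCode = new HashMap<>();
        for (Map.Entry<Character, String> e : codes.entrySet()) {
            byCode.put(e.getValue(), e.getKey());
        }
        StringBuilder out = new StringBuilder();
        StringBuilder pending = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            pending.append(bit);
            Character hit = byCode.get(pending.toString());
            if (hit != null) {
                out.append(hit);
                pending.setLength(0); // start matching the next code
            }
        }
        if (pending.length() > 0) {
            throw new IllegalArgumentException("Trailing bits do not form a complete code");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> codes = new HashMap<>();
        codes.put('E', "1");  // assumed mappings, inferred from the example
        codes.put('S', "00");
        codes.put('R', "01");
        System.out.println(decode("1111100000111", codes)); // EEEEESSREE
    }
}
```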
I have given you classes that will generate frequency tables from the text, construct a
Huffman tree from the frequency table, and Huffman-encode a string. Essentially, I have
written the encoding algorithm for you. For this lab, I would like you to write the decoding
algorithm, given the encoded string and the mappings between
characters and codes. This entails:
• Reading in the mappings. See the section below on Serialization.
• Reconstructing the Huffman tree from the mappings (remember, “0” means “left”
  and “1” means “right”; create nodes as you traverse. Also remember that only the
  values in the leaves matter; the interior nodes can have null values).
• Reading the input string one character at a time (in a real Huffman code, it should
  be one bit at a time, but I’ve simplified your task by outputting ASCII “1”s and
  “0”s instead of packing bits) and traversing the tree left or right based on the
  value of the character. You only output a character once you hit a leaf.
The mappings between characters and codes are stored in a HashMap<Character, String>.
I’ve provided Javadoc documentation for the classes I’ve written. Both the
documentation and the source are available on the course website.
Serialization
Serialization is the process of converting an object into a sequence of bytes that can be stored
in a file, transmitted over a socket, or otherwise sent around in a way that objects usually
cannot. Deserialization is the process of converting this representation back into an
object.
Any object that implements the Serializable interface (which defines no methods)
can be serialized.
In Java, you may serialize an object by writing it to an ObjectOutputStream. Its constructor
takes another (byte-based) output stream, which is the
underlying stream the object will be written to. For example, you may write an object
called “HashMap<Character, String> targ” out to disk as follows:
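FileOutputStream fout = new FileOutputStream("serializedobject.txt"); // file name assumed to match the read example
ObjectOutputStream oout = new ObjectOutputStream(fout);
oout.writeObject(targ);
oout.close();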
Or you may read it from disk as follows:
FileInputStream fin = new FileInputStream("serializedobject.txt");
ObjectInputStream oin = new ObjectInputStream(fin);
HashMap<Character, String> targ = (HashMap<Character, String>) oin.readObject();
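Putting the write and read halves together, a round trip might look like the following sketch. It uses try-with-resources so the streams are closed even if an exception is thrown. The class name SerializationSketch, the helper names save and load, and the temporary file are assumptions made for illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

class SerializationSketch {
    // Writes the map to the given file using Java's built-in serialization.
    static void save(HashMap<Character, String> map, File file) throws IOException {
        try (ObjectOutputStream oout = new ObjectOutputStream(new FileOutputStream(file))) {
            oout.writeObject(map);
        }
    }

    // Reads the map back; the cast is unchecked because generic types are erased at runtime.
    @SuppressWarnings("unchecked")
    static HashMap<Character, String> load(File file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream oin = new ObjectInputStream(new FileInputStream(file))) {
            return (HashMap<Character, String>) oin.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<Character, String> targ = new HashMap<>();
        targ.put('E', "1"); // assumed example mappings
        targ.put('S', "00");
        targ.put('R', "01");

        File file = File.createTempFile("serializedobject", ".bin"); // hypothetical temp file
        file.deleteOnExit();
        save(targ, file);
        System.out.println(load(file).equals(targ)); // true
    }
}
```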