Lab 3: Huffman Coding


Dec 2, 2013





Overview:


The Huffman code is a fundamental encoding in the field of data compression. It forms
the basis for many modern compression approaches, including the DEFLATE algorithm
used to compress data in ZIP files. It is a variable-length code, meaning each character
may have a different number of bits (rather than a fixed 8 bits each, as with ASCII text
stored one byte per character).


The principle that allows the Huffman code to obtain very good compression rates is the
representation of the most frequent characters with the smallest number of bits. That is,
if the letter “E” occurs most in a corpus of text, “E” will be represented by a short string
of bits, such as “1”. Conversely, infrequent letters, such as “X”, will be represented with
longer codes, such as “001011”.


Although Huffman encoding saves space by using only as many bits as are required, the
variable-length encoding does present problems: if “1” represents “E” and “0” represents
“S”, what happens when we need to represent “R” with “01”? How do we tell “R” apart
from “SE”?


The solution is to avoid using codes that could clash with each other: no code may be a
prefix of another code. For example, we would not represent “S” with “0”; we would
represent it with “01”, which can be distinguished from “1” (the leading “0” clues us in).
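The clash described above can be checked mechanically. The following is a minimal sketch, not part of the provided lab classes: the class name PrefixCheck and the method isPrefixFree are made up for illustration. It reports whether any code in a list is a prefix of another.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not one of the lab's provided classes.
class PrefixCheck {
    // Returns true if no code is a prefix of (or equal to) a different entry in the list.
    static boolean isPrefixFree(List<String> codes) {
        for (int i = 0; i < codes.size(); i++) {
            for (int j = 0; j < codes.size(); j++) {
                if (i != j && codes.get(j).startsWith(codes.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // E="1", S="0", R="01": S is a prefix of R, so "01" is ambiguous.
        System.out.println(isPrefixFree(Arrays.asList("1", "0", "01")));  // prints false
        // E="1", S="00", R="01": no code is a prefix of another.
        System.out.println(isPrefixFree(Arrays.asList("1", "00", "01"))); // prints true
    }
}
```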


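This encoding can be naturally modeled using a binary tree where moving left corresponds
to a “0”, while moving right corresponds to a “1”. Each node has exactly one parent, so
every leaf is reached by exactly one path from the root, and because characters are stored
only in the leaves, no code can be a prefix of another.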





Huffman’s discovery was to realize that this tree could be constructed by assigning each
character a priority based on its frequency in the input, storing the characters and
frequencies in leaves of the tree, and iteratively building the tree up by merging the two
elements with the lowest frequency into one node. Elements with low frequencies do this
early and frequently, so they are placed at the bottom of the tree (long code lengths),
while elements with higher frequencies do this later, and end up near the top (short code
lengths).
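The merging procedure described above can be sketched with a priority queue. This is an illustrative sketch only, not the code provided for the lab: the names HuffmanBuildSketch, Node, countFrequencies, buildTree, and collectCodes are assumptions. The exact bits assigned to each character depend on which merged element goes left and which goes right, so only the code lengths are determined by the frequencies.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch; the lab's provided classes may be organized differently.
class HuffmanBuildSketch {
    static class Node {
        final Character symbol; // null for interior nodes
        final int freq;
        final Node left, right;

        Node(Character symbol, int freq, Node left, Node right) {
            this.symbol = symbol;
            this.freq = freq;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Counts how often each character occurs in the text.
    static Map<Character, Integer> countFrequencies(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        return freq;
    }

    // Repeatedly merges the two lowest-frequency nodes until one root remains.
    static Node buildTree(Map<Character, Integer> freq) {
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.freq, b.freq));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(null, a.freq + b.freq, a, b));
        }
        return queue.poll(); // null if the input was empty
    }

    // Walks the tree, appending "0" for left and "1" for right.
    // Note: a text with a single distinct character yields the empty code "".
    static void collectCodes(Node node, String prefix, Map<Character, String> out) {
        if (node == null) {
            return;
        }
        if (node.isLeaf()) {
            out.put(node.symbol, prefix);
            return;
        }
        collectCodes(node.left, prefix + "0", out);
        collectCodes(node.right, prefix + "1", out);
    }

    public static void main(String[] args) {
        Map<Character, String> codes = new HashMap<>();
        collectCodes(buildTree(countFrequencies("EEEEESSREE")), "", codes);
        // E (7 occurrences) gets a 1-bit code; S (2) and R (1) get 2-bit codes.
        System.out.println(codes);
    }
}
```

A priority queue is used here because it makes "take the two lowest-frequency elements" cheap at every step; a sorted list that is re-sorted after each merge would also work for small inputs.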


Once the tree is built,
the characters are replaced by their codes:


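[Figure: Huffman tree for E, S, and R. The root's right branch (“1”) leads to the leaf E;
its left branch (“0”) leads to an interior node whose children are the leaves S (“00”)
and R (“01”).]

E -> 1
S -> 00
R -> 01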


The text “EEEEESSREE”, which consumes 80 bits in uncompressed ASCII, would be
encoded as: 1111100000111, which is only 13 bits long, less than 2 characters in ASCII.
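The arithmetic above can be reproduced with a few lines of code. This is a minimal sketch using the mapping from this handout (E -> 1, S -> 00, R -> 01); the class name EncodeSketch and the method encode are made up for illustration and are not the provided encoder.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the lab's provided encoder.
class EncodeSketch {
    // Replaces each character with its code and concatenates the results.
    static String encode(String text, Map<Character, String> codes) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            String code = codes.get(c);
            if (code == null) {
                throw new IllegalArgumentException("No code for character: " + c);
            }
            bits.append(code);
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> codes = new HashMap<>();
        codes.put('E', "1");
        codes.put('S', "00");
        codes.put('R', "01");

        String text = "EEEEESSREE";
        String encoded = encode(text, codes);
        System.out.println(encoded);            // prints 1111100000111
        System.out.println(encoded.length());   // prints 13
        System.out.println(text.length() * 8);  // prints 80
    }
}
```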


If the mappings are also stored, it is possible to decompress the code by simply
re-creating the tree from the mappings and traversing either left or right depending on the
next bit in the code, then outputting the letter when a leaf is reached. The code we just
discovered:

1111100000111

would then translate back to “EEEEESSREE”, the original input string.


Lab Work:


I have given you classes that will generate frequency tables from the text, construct a
Huffman tree from the frequency table, and Huffman-encode a string. Essentially, I have
written the encoding algorithm for you. For this lab, I would like you to write the
decoding algorithm, given the encoded string and the serialized mappings between
characters and codes. This entails:


1. Deserializing the mappings. See the section below on Serialization.

2. Reconstructing the Huffman tree from the mappings (remember, “0” means “left”
and “1” means “right”; create nodes as you traverse. Also remember that only the
values in the leaves matter; the interior nodes can have null values).

3. Reading the input string one character at a time (in a real Huffman code, it should
be one bit at a time, but I’ve simplified your task by outputting ASCII “1”s and
“0”s instead of packing bits) and traversing the tree left or right based on the
value of the character. You only output a character once you hit a leaf, after which
you start again from the root.
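Steps 2 and 3 can be sketched as follows. This is one possible shape under stated assumptions, not the required solution: the names DecodeSketch, Node, buildTree, and decode are hypothetical, and the tree classes provided for the lab may look different.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; adapt it to the classes provided for the lab.
class DecodeSketch {
    static class Node {
        Character value; // stays null for interior nodes
        Node left, right;
    }

    // Step 2: rebuild the tree, creating nodes while following each code.
    static Node buildTree(Map<Character, String> codes) {
        Node root = new Node();
        for (Map.Entry<Character, String> e : codes.entrySet()) {
            Node cur = root;
            for (char bit : e.getValue().toCharArray()) {
                if (bit == '0') {
                    if (cur.left == null) cur.left = new Node();
                    cur = cur.left;
                } else if (bit == '1') {
                    if (cur.right == null) cur.right = new Node();
                    cur = cur.right;
                } else {
                    throw new IllegalArgumentException("Codes may contain only '0' and '1'");
                }
            }
            cur.value = e.getKey(); // the end of the code is a leaf
        }
        return root;
    }

    // Step 3: walk the tree one character ('0' or '1') at a time.
    static String decode(String bits, Node root) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            if (bit != '0' && bit != '1') {
                throw new IllegalArgumentException("Input may contain only '0' and '1'");
            }
            cur = (bit == '0') ? cur.left : cur.right;
            if (cur == null) {
                throw new IllegalArgumentException("Input does not match the mappings");
            }
            if (cur.left == null && cur.right == null) {
                out.append(cur.value); // leaf reached: emit and restart at the root
                cur = root;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> codes = new HashMap<>();
        codes.put('E', "1");
        codes.put('S', "00");
        codes.put('R', "01");
        System.out.println(decode("1111100000111", buildTree(codes))); // prints EEEEESSREE
    }
}
```

The sketch silently ignores trailing bits that stop in the middle of a code; whether that should be an error is a design choice left open here.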


The mappings between characters and codes are stored in a HashMap<Character, String>.


I’ve provided Javadoc documentation for the classes I’ve written. Both the
documentation and the source are available on the course website.


Serialization:


Serialization is the process of converting an object into a sequence of bytes that can be
stored in a file, transmitted over a socket, or otherwise sent around in a way that objects
usually can’t be. Deserialization is the process of converting this representation back
into an object. Any object whose class implements the Serializable interface (which
defines no methods) can be serialized, provided the objects it refers to are serializable
as well.


In Java, you may serialize an object by writing it to an ObjectOutputStream.
ObjectOutputStream’s constructor takes another output stream, which is the underlying
stream the serialized bytes will be written to. For example, you may write an object
declared as “HashMap<Character, String> targ” out to disk as follows:


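// Requires java.io.FileOutputStream and java.io.ObjectOutputStream.
// These calls can throw IOException.
FileOutputStream fout = new FileOutputStream("serializedobject.txt");
ObjectOutputStream oout = new ObjectOutputStream(fout);
oout.writeObject(targ);
oout.close();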


Or you may read it from disk as follows:


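// Requires java.io.FileInputStream and java.io.ObjectInputStream.
// readObject() can throw IOException and ClassNotFoundException.
FileInputStream fin = new FileInputStream("serializedobject.txt");
ObjectInputStream oin = new ObjectInputStream(fin);
// The cast is unchecked, so the compiler will issue a warning.
HashMap<Character, String> targ = (HashMap<Character, String>) oin.readObject();
oin.close();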
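To try serialization without touching the disk, the same two stream classes can be wrapped around in-memory byte streams. The following is a sketch only: the class name SerializationSketch and the methods serialize and deserialize are hypothetical and not part of the provided classes. It uses try-with-resources so that the streams are closed even if an exception is thrown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

// Hypothetical sketch, not one of the lab's provided classes.
class SerializationSketch {
    // Serializes the map into a byte array instead of a file.
    static byte[] serialize(HashMap<Character, String> map) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oout = new ObjectOutputStream(bytes)) {
            oout.writeObject(map);
        }
        return bytes.toByteArray();
    }

    // Reads the map back; the cast is unchecked, hence the suppressed warning.
    @SuppressWarnings("unchecked")
    static HashMap<Character, String> deserialize(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream oin = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (HashMap<Character, String>) oin.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<Character, String> targ = new HashMap<>();
        targ.put('E', "1");
        targ.put('S', "00");
        targ.put('R', "01");

        HashMap<Character, String> copy = deserialize(serialize(targ));
        System.out.println(copy.equals(targ)); // prints true
    }
}
```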