CS23005 Design and analysis of algorithms, Autumn 2004

CS23005 Design and analysis of algorithms

Autumn 2004--2005

Assignment 3

Deadline

Part Deadline Files to submit
I August 20, 2004       `codegen1.c`, `holmes1.hc`. (corrsponding to holmes_adv.in)
II August 27, 2004       `codegen2.c`, `holmes2.hc`. (corrsponding to holmes_adv.in)
III September 3, 2004       `hzip.c`, `hunzip.c`, `huffman.res`.

Part	Deadline	Files to submit
I	August 20, 2004	`codegen1.c`, `holmes1.hc`. (corrsponding to holmes_adv.in)
II	August 27, 2004	`codegen2.c`, `holmes2.hc`. (corrsponding to holmes_adv.in)
III	September 3, 2004	`hzip.c`, `hunzip.c`, `huffman.res`.

Summary

In this exercise you work on compression of English text files. Huffman codes are prefix codes, that are popularly used for compression and that are optimal under certain assumptions. Read Cormen-Lieserson-Rivest-Stein (pp 385-392) to know how Huffman codes are generated.

Part I: Character-based Huffman coding

Read an input text file and compute the frequency of each character occurring in the entire text. For your convenience three text files are given below. These files consist only of lower-case letters, digits, punctuation symbols and white space characters (space, new line, tab). Based on the frequency of the different characters generate the corresponding Huffman tree. Use either a heap or a priority queue for the construction of the tree. Write in a file the Huffman codes for the different characters. In order to distinguish between codes generated by different input files, replace the extension of the input file name by .hc. For example, for input file infile.txt, the code file should be named infile.hc.

Command line: codegen <input_file>

Example:

$ codegen file1.txt
Code file "file1.hc" generated...
$ codegen file2.asc
Error: File "file2.asc" not found.
No code file generated...
$ codegen
Give a file name : file3.in
Code file "file3.hc" generated...
$ cat file1.hc
a 001
b 10010
c 110011
d 1010
...
z 111101111
. 010
* 111010
...

Part II: Exploiting repetition of words

For further compression one may replace frequently occurring words by some short codes. Modify your program codegen of Part I so that hundered most frequent words are considered along with individual characters. Here "word" means a contiguous sequence of lower-case alphabetic letters. For example, "i" and "am" are words, whereas "i'm", "i am" and "am." are not words. Words of length one are essentially characters and need not be considered.

Read the input text file to identify and store in a binary search tree all the words appearing in the text. Store against each node of the tree the frequency (number of occurrence) of the word. After the entire input is read, prepare the list of hundred most frequently occurring words.

Finally, construct the Huffman tree whose leaves are the characters present in the input file together with the most frequently occurring words located as above. Save the corresponding Huffman codes in the .hc file as in Part I.

Example:

$ codegen file1.txt
Code file "file1.hc" generated...
$ codegen file2.asc
Error: File "file2.asc" not found.
No code file generated...
$ codegen
Give a file name : file3.in
Code file "file3.hc" generated...
$ cat file1.hc
a 001
b 10010
c 110011
d 1010
...
z 111101111
. 010
* 111010
...
is 1011011111101
holmes 011001101000101011
...

Part III: Encoding and decoding

Write the Huffman encoding and decoding routines based on the code generated in Part II. This part requires manipulation of characters bit-by-bit. Use bit-wise routines (shift, and, or etc.).

For encoding, read the code file (the corresponding .hc file by default) and store the information in a temporary hash table. Then read the input file. For each word read from the input, check the existence of it in the hash table. If the code is found, replace the entire word by its code, else replace the individual letters of the word by their codes. Also replace non-alphabetic characters by their codes.

Rewrite the input file by its compressed version which has a .hz appended to the name of the input file. Your program should support an optional command-line argument that supplies the directive of using a code file of the user's choice (in place of the default .hc file).

Command line: hzip [optional_code_file] <input_file>

Examples:

$ hzip file1.txt
Input file        : "file1.txt"
Code file         : "file1.hc"
Compressed file   : "file1.txt.hz"
Compression ratio : 0.85
$ hzip file2.in
Input file        : "file2.in"
Code file "file2.hc" not found. Aborting...
$ hzip file1.hc file2.in
Input file        : "file2.in"
Code file         : "file1.hc"
Compressed file   : "file2.in.hz"
Compression ratio : 0.88

For decoding, first read the code file (by default, the corresponding .hc file, unless another specific file name is supplied in the command line). Create a hash table indexed by codes.

Read the input file bit-wise and substitute each bit-sequence which is the encoding of a character or a word by the corresponding character or word. Also replace the zipped file (the .hz) file by the unzipped file.

Command line: hunzip [optional_code_file] <input_file>

Examples:

$ hunzip file1.txt.hz
Using code file "file1.hc".
"file1.txt.hz" is uncompressed to "file1.txt".
$ hunzip file2.in.hz
Code file "file2.hc" not found. Aborting.
$ hunzip file1.hc file2.in.hz
Using code file "file1.hc".
"file2.in.hz" is uncompressed to "file2.in".

Report in the file huffman.res the results you get from the following experiments:

Correctness of compression and decompression
Comparison of compression ratios for codegen1 and codegen2. For example, compare:
```
hzip holmes1.hc holmes_adv.in
hzip holmes2.hc holmes_adv.in
```

Comparison of compression ratios for native and foreign code files. For example, compare:

hzip holmes1.hc holmes_adv.in
hzip holmes1.hc holmes_mem.in
hzip holmes1.hc holmes_ret.in
hzip holmes2.hc holmes_adv.in
hzip holmes2.hc holmes_mem.in
hzip holmes2.hc holmes_ret.in

Input files

You are asked to experiment with the following English text files.

Home