CS151 Lab 7: File compression
Fall 2006
November 14, 2006

Complete the first 3 chapters (through the build-huffman-tree function) in lab, optionally with a partner. The rest you must do by yourself. Write both your name and your partner's name on the homework when you hand it in. If you are in the 9:30am MWF section, hand in your solution by emailing it to robby@cs.uchicago.edu. If you are in the TTh section, email your solution to bboven@gmail.com and mulmuley@cs.uchicago.edu. It must be in the appropriate mailbox before lab starts in week 8.

1 Introduction

At the lowest level, computers represent data as sequences of bits (0 or 1). The normal way to represent a message as a sequence of bits is to use a table that associates bit patterns with characters and then translate each letter of the message according to the table. The standard table that many computers use is called the ASCII table, which represents every character on the keyboard (and a few more besides) as a sequence of exactly 8 bits. Here is one portion of the ASCII table:

Character   ASCII encoding
I           01001001
M           01001101
P           01010000
S           01010011

In ASCII, the message MISSISSIPPI would be represented like this:

M        I        S        S        I        S        S        I        P        P        I
01001101 01001001 01010011 01010011 01001001 01010011 01010011 01001001 01010000 01010000 01001001

If you save the word MISSISSIPPI in DrScheme, that sequence of bits is how it will be written out in the saved file. There are a number of advantages to representing messages with ASCII, but it is not particularly good for generating short encodings of particular messages. In situations where we really need messages to be short (maybe because we want to transmit a message quickly across a network, or save it on a disk that doesn't have much space left) we can often do dramatically better. The message MISSISSIPPI, for instance, doesn't use most of the letters of the alphabet at all, so an encoding scheme that didn't let us write those letters down at all would be fine.
Furthermore, it uses I and S four times each, but P only twice and M only once: for that reason, it would be a good trade to use an encoding table that had short representations for I and S and longer representations for P and M. The following alternative encoding produces a much shorter encoding for the message MISSISSIPPI:
Character   Alternative encoding
I           0
S           11
P           101
M           100

While ASCII needs 88 bits, the alternative encoding needs just 21. The goal of this lab is to implement an algorithm called Huffman coding that determines the best encoding table for a particular message, and then encodes or decodes messages according to that table. As a demonstration of the technique's practical application, you will use it to write a program that compresses and decompresses files. For this lab, you will need to use the following teachpack:

http://www.cs.uchicago.edu/~jacobm/151-2006-fall/huffman-utils.ss

Huffman coding is named after its inventor, David Huffman (1925-1999). He invented it in 1951 as a final project for a class he was taking; his instructor listed it as a possible paper topic without mentioning that it was a major unsolved problem at the time!

2 Gathering statistics

The first step of the algorithm is to determine the frequencies of each letter in the input.

;; A statistics is a (listof frequency)
;; A frequency is:
;;   (make-frequency character number)
(define-struct frequency (token count))

Note. The frequency structure is provided by the teachpack. Do not define it yourself.

Characters are a built-in category of primitive values, each representing one letter (or numeral, or punctuation mark, et cetera). They can be written down directly with the syntax #\x (for the character corresponding to a lower-case x). Characters can be tested for equality using char=?. The main advantage of characters is that we can get them out of strings: for instance, given the string "MISSISSIPPI" we can use the built-in function string->list: (string->list "MISSISSIPPI") produces (list #\M #\I #\S #\S #\I #\S #\S #\I #\P #\P #\I).

Write a function frequencies : (listof character) -> statistics, which takes a message represented as a list of characters and produces statistics containing the frequency with which each token appears in the message.
For instance,

(frequencies (list #\M #\I #\S #\S #\I #\S #\S #\I #\P #\P #\I))

should be

(list (make-frequency #\M 1)
      (make-frequency #\I 4)
      (make-frequency #\S 4)
      (make-frequency #\P 2))
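One way to structure frequencies is to walk the message, bumping a character's count if it already has a record and appending a new record otherwise. The following is a sketch of one possible design, not the required solution; bump is a helper name of our own choosing, and the define-struct is repeated here only so the sketch is self-contained (in the lab it comes from the teachpack):

```scheme
;; Repeated from the teachpack so this sketch runs standalone:
(define-struct frequency (token count))

;; bump : character statistics -> statistics
;; adds one to ch's count, or appends a new record if ch is not yet present
(define (bump ch stats)
  (cond
    [(empty? stats) (list (make-frequency ch 1))]
    [(char=? ch (frequency-token (first stats)))
     (cons (make-frequency ch (+ 1 (frequency-count (first stats))))
           (rest stats))]
    [else (cons (first stats) (bump ch (rest stats)))]))

;; frequencies : (listof character) -> statistics
(define (frequencies message)
  (cond
    [(empty? message) empty]
    [else (bump (first message) (frequencies (rest message)))]))
```

Note that this version may list the records in a different order than the example above; that is harmless, because the Huffman algorithm in the next section sorts by count anyway.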
3 Building Huffman trees

;; A huffman-tree is either:
;; - (make-leaf character number)
;; - (make-branch huffman-tree huffman-tree (listof character) number)
(define-struct leaf (token count))
(define-struct branch (l r tokens count))

The key idea behind Huffman coding is the Huffman tree. Given a particular message, a Huffman tree for that message is a binary tree whose leaves are characters, one per distinct character in the message. Additionally, for every subtree, the total frequency of all the tokens on the left side is as nearly equal to the total frequency of all the tokens on the right side as possible.

Huffman's algorithm for building these trees is as follows. It takes as its input the statistics generated in the last section, for instance:

(list (make-frequency #\M 1)
      (make-frequency #\I 4)
      (make-frequency #\S 4)
      (make-frequency #\P 2))

It turns each of these frequencies into a trivial binary tree consisting of just the input character and its frequency, and sorts them by frequency (lowest to highest):

(list (make-leaf #\M 1)
      (make-leaf #\P 2)
      (make-leaf #\S 4)
      (make-leaf #\I 4))

From this point on the algorithm works on lists of trees sorted by frequency. It successively removes the first two trees from the list and combines them into a single branch whose character list is the combination of the two subtrees' character lists and whose frequency is the sum of the two subtrees' frequencies. It inserts this new branch into the list (making sure to maintain sorted order) and repeats the process until only one tree is left. That tree is the output. For instance, here are the successive steps the algorithm would take on the example above:

Stage 0:

(list (make-leaf #\M 1)
      (make-leaf #\P 2)
      (make-leaf #\S 4)
      (make-leaf #\I 4))

Stage 1:

(list (make-branch (make-leaf #\M 1)
                   (make-leaf #\P 2)
                   (list #\M #\P)
                   3)
      (make-leaf #\S 4)
      (make-leaf #\I 4))
Stage 2:

(list (make-leaf #\I 4)
      (make-branch (make-branch (make-leaf #\M 1)
                                (make-leaf #\P 2)
                                (list #\M #\P)
                                3)
                   (make-leaf #\S 4)
                   (list #\M #\P #\S)
                   7))

Stage 3:

(list (make-branch (make-leaf #\I 4)
                   (make-branch (make-branch (make-leaf #\M 1)
                                             (make-leaf #\P 2)
                                             (list #\M #\P)
                                             3)
                                (make-leaf #\S 4)
                                (list #\M #\P #\S)
                                7)
                   (list #\I #\M #\P #\S)
                   11))

Write the function build-huffman-tree : statistics -> huffman-tree, which builds the Huffman tree that corresponds to the given frequencies.

4 Encoding a message

The Huffman tree for a message is a representation of the optimal table for encoding that message: the code for each letter is just the path from the root of the tree to that letter, with 0 representing going down the left branch and 1 representing going down the right branch.

Write the function encode-message : (listof character) huffman-tree -> (listof bit), where a bit is either 0 or 1. For instance,
(define message (string->list "MISSISSIPPI"))
(define freqs (frequencies message))
(define tree (build-huffman-tree freqs))
(encode-message message tree)

should be

(list 1 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 0 1 0)

5 Decoding a message

To decode a message, one needs the encoded version of the message and the Huffman tree that was used to encode it. Write the function decode-message : (listof bit) huffman-tree -> (listof character), which decodes a message encoded with encode-message. For instance,

(list->string (decode-message (list 1 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 0 1 0) tree))

should be

"MISSISSIPPI"

6 An application: file compression

In the introduction we mentioned that computers store messages as sequences of bits. That is not quite the whole truth: the sequences must be exact multiples of 8, since computers arrange memory into 8-bit bytes. When using ASCII you never need to think about this, since every character in ASCII is represented as a whole byte, so you can't end up with a message that doesn't fill some exact number of bytes; but with the encodings that come from Huffman tables it is possible. The problem is this: when you're reading a compressed message off of a disk, you will always read it as a whole number of bytes, but somewhere between 0 and 7 of the last bits were not a part of the encoding of the original message.

The standard way to deal with this is to add a special end-of-message token to the end of every message when encoding it. With that token added, the encoding process can proceed almost exactly as normal: eom is counted just like a character when computing statistics, generating a Huffman tree, and encoding the message, the only difference being that the encoder must ensure that the length of its final encoding is a multiple of 8 bits by padding the ending (after the encoding of the eom token) with arbitrary bits.
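The padding step described above can be written as a small helper. This is a sketch under our own naming (pad-to-byte-boundary is not a required function); it pads with 0s, although any bits would do, since everything after the eom code is ignored by the decoder:

```scheme
;; pad-to-byte-boundary : (listof bit) -> (listof bit)
;; appends 0s to bits until its length is a multiple of 8
(define (pad-to-byte-boundary bits)
  (cond
    [(= 0 (remainder (length bits) 8)) bits]
    [else (pad-to-byte-boundary (append bits (list 0)))]))
```

For example, a 21-bit encoding would be padded out to 24 bits, and an encoding whose length is already a multiple of 8 is returned unchanged.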
With this done, the decoder can take advantage of the fact that eom appears at the end of every message and stop decoding as soon as it decodes an end-of-message token, even if there are more bits available for decoding.

Change the definition of a frequency from section 2 as follows:

;; A frequency is:
;;   (make-frequency token number)
;; A token is either:
;; - a character
;; - eom

Then modify all parts of your program that need to change to make proper use of the eom token. Once you have done that, you are ready to write the final compression and decompression functions. To help with that, the huffman-utils.ss teachpack provides one new data definition and four functions:

;; A compressed-data is
;; (make-compressed-data statistics (listof bit))
;; NOTE: the length of the list of bits must be a multiple of 8
(define-struct compressed-data (stats bits))

;; file->list : string -> (listof character)
;; produces a list of characters corresponding to the entire named file

;; write-compressed-data-to-file : compressed-data string -> boolean
;; writes the contents of the given compressed-data structure into a file.
;; Returns true on success, or false if something went wrong
;; (for instance the file couldn't be written)

;; read-compressed-data-from-file : string -> compressed-data
;; reads a compressed data file into a compressed-data structure
;; Note: the length of the bits returned is always a multiple of 8

;; list->file : (listof character) string -> boolean
;; makes a file with the given string as its name and the given list
;; of characters as its contents. Returns true on success,
;; false if something went wrong.

Note. The compressed-data structure is provided by the teachpack. Do not define it yourself.

Use these helpers to define the following functions:

compress-file : string string -> boolean, which compresses the contents of the file named by the first string and places the compressed version in the file named by the second string.

uncompress-file : string string -> boolean, which expects the contents of the file named by the first string to be compressed data, uncompresses that data, and writes the result to the file named by the second string.

(The provided helpers do a small bit of magic for you: in addition to writing and reading your bit list, they write the statistics at the beginning of the file and read them back in. Building this functionality yourself is not particularly difficult, but since it isn't particularly interesting we figured we'd save you the trouble.)
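To show how the pieces fit together, here is one possible shape for the two functions. This is an outline rather than a complete solution: it assumes the teachpack helpers above plus your own frequencies, build-huffman-tree, encode-message, and decode-message, already extended to handle the eom token and padding as described in section 6, so it will not run on its own.

```scheme
;; compress-file : string string -> boolean
(define (compress-file in-name out-name)
  (local [(define message (file->list in-name))
          ;; the statistics must already include a count for the eom token
          (define stats (frequencies message))
          (define tree (build-huffman-tree stats))]
    (write-compressed-data-to-file
     (make-compressed-data stats (encode-message message tree))
     out-name)))

;; uncompress-file : string string -> boolean
(define (uncompress-file in-name out-name)
  (local [(define data (read-compressed-data-from-file in-name))
          ;; rebuild the same tree the compressor used from the saved statistics
          (define tree (build-huffman-tree (compressed-data-stats data)))]
    (list->file (decode-message (compressed-data-bits data) tree)
                out-name)))
```

Note the symmetry: the statistics stored in the compressed file are exactly what uncompress-file needs to rebuild the tree that decodes the bits.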