Text Compression through Huffman Coding

Huffman codes are a very effective technique for compressing data; they typically produce savings between 20% and 90%.

Preliminary example

We are given a 100,000-character text (maybe a book, or a long report) and we want to store it on a computer hard disk. Information is stored on disks as sequences of zeroes and ones.

Simplest option: use (extended) ASCII codes! Encode each character in a two-byte code, like the one that Java uses. In this way the resulting file will be (forgetting about spaces and carriage returns) 200,000 bytes long!

Slightly more careful option: use a reduced-size code, relative only to the characters that actually occur in the text (this will not save anything in the worst case).

More careful option: compute the character frequencies first (!) and then associate longer sequences of bits with characters that occur less frequently. In general such codes, also known as variable-length codes, may give significant savings in the amount of space needed to store a given (very long) text. (Figure: letter frequencies of English, data from http://www.anujseth.com/crypto/history.html; average codeword length 4.227.)

Terminology

We want to define a code, i.e. a mapping from an alphabet to words (sequences, strings) over another alphabet, that minimises the length of the encoded string. We consider only codes in which no codeword is also a prefix of some other codeword. Such codes are called prefix codes. It is possible to show that the optimal data compression achievable by any code can always be achieved with a prefix code; therefore there is no loss of generality in restricting attention to prefix codes. For example:

symbol   codeword
a        0
b        10
c        110
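To see the arithmetic behind these savings concretely, here is a small sketch. The character counts are made up for illustration; only the code a→0, b→10, c→110 comes from the text above.

```python
# Space comparison: fixed two-byte code vs the variable-length prefix code
# a->0, b->10, c->110. The character counts are hypothetical.

freq = {"a": 50_000, "b": 30_000, "c": 20_000}   # made-up frequencies
code = {"a": "0", "b": "10", "c": "110"}

fixed_bits = sum(freq.values()) * 16             # two bytes per character
variable_bits = sum(freq[c] * len(code[c]) for c in freq)

print(fixed_bits, variable_bits)   # 1600000 170000: a saving of about 89%
```

With these (skewed) frequencies the variable-length code uses about 89% less space, at the top end of the 20%-90% range quoted above.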
Encoding/Decoding

Prefix codes are desirable because of their simple encoding/decoding procedures.

Encoding: given the source text, simply concatenate the codewords representing each character. With the code above, abababa becomes 0 10 0 10 0 10 0 (the spaces are shown only to give a clearer description of the encoding; they are NOT part of the encoded text).

Decoding: since no codeword is a prefix of any other codeword, the codeword that begins an encoded file is unambiguous. To decode:
1. identify the initial codeword;
2. translate it back to the original character;
3. remove the codeword from the file;
4. repeat the decoding process on the remainder of the encoded file.
If the encoded sequence is 1010110, we can decode it as b b c.

Data Structure

To be efficient, the decoding process needs a convenient representation for the prefix code, so that the initial codeword can easily be picked off. A binary tree whose leaves are the given characters provides one such representation. We interpret the binary codeword for a character as the path from the root of the tree to that character, where 0 means "go to the left child" and 1 means "go to the right child".

Property

Claim. An optimal code for a file is always represented by a tree in which every non-leaf node has exactly two children. (Informal argument: if one of the children is missing then, in some sense, we are losing the opportunity to use shorter codewords for some of the symbols in the alphabet.) So, if the text alphabet C has |C| characters, then the tree for an optimal prefix code has exactly |C| leaves, one for each letter in C, and exactly |C| - 1 internal nodes. This is a simple graph-theoretic property of any tree whose internal nodes all have exactly two children.

Exercises
1. Write a decoder for the code given above;
2. What is its time complexity?
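One possible solution to the first exercise, sketched in Python (the tuple representation of the tree is my own choice, not part of the notes): walk the code tree from the root, going left on 0 and right on 1, and emit a character whenever a leaf is reached.

```python
# Tree-based decoder for the code a=0, b=10, c=110.
# A leaf is a character; an internal node is a (left, right) pair.
# None marks the unused subtree: no codeword starts with 111 in this code.
TREE = ("a", ("b", ("c", None)))

def decode(bits: str, tree=TREE) -> str:
    out, node = [], tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):   # reached a leaf: one codeword complete
            out.append(node)
            node = tree             # restart from the root
    return "".join(out)

print(decode("1010110"))     # -> bbc
print(decode("0100100100"))  # -> abababa
```

Each input bit causes exactly one step down the tree, so the decoder runs in time linear in the length of the encoded text, which answers the second exercise for this representation.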
Cost of a tree

Given a tree T corresponding to some prefix code, it is a simple matter to associate a cost function with it. For each character c in the alphabet C, let f(c) denote the frequency of c and let d_T(c) denote the depth of c's leaf in T (note that d_T(c) is also the length of the codeword for c). We will use B(T) to represent the cost of the tree:

    B(T) = Σ_{c ∈ C} f(c) d_T(c)

(when the f(c) are relative frequencies, this is the average codeword length). An example is due.

Constructing a Huffman code

Huffman invented a greedy algorithm that constructs an optimal prefix code, called a Huffman code. The algorithm builds the tree corresponding to an optimal code in a bottom-up manner: it begins with a set of |C| leaves and performs a sequence of merging operations to create the final tree. In the pseudo-code that follows we assume that C is a set of characters and that each c ∈ C is associated with an object with a defined frequency f[c]. (Who computes the frequencies?) A priority queue Q, keyed on f, is used to identify the two least frequent objects to merge together. The result of the merger of two objects is a new object whose frequency is the sum of the frequencies of the two objects that were merged. The generic node z is an object containing two pointers, left[z] and right[z], to the left and right child of z in the tree.

HUFFMAN(C)
    n ← |C|
    Q ← C
    for i ← 1 to n − 1
        z ← ALLOCATE-NODE()
        left[z] ← x ← EXTRACT-MIN(Q)
        right[z] ← y ← EXTRACT-MIN(Q)
        f[z] ← f[x] + f[y]
        INSERT(Q, z)
    return EXTRACT-MIN(Q)

Priority Queues

A priority queue is a data structure for maintaining a set S of elements, each with an associated value (or key). A priority queue supports the following operations:
INSERT(S, x) inserts the element x into S;
MIN(S) returns the element of S with minimal key;
EXTRACT-MIN(S) removes and returns the element of S with minimal key.
A priority queue can be implemented in many ways (exercise!). Different implementations lead to different complexity results.
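The pseudo-code above can be sketched in Python using the standard-library heapq module as the priority queue. The node layout, tie-breaking counter, and example frequencies below are my own choices, not part of the notes.

```python
import heapq
from itertools import count

def huffman(freq):
    """Return {character: codeword} for the frequency map `freq`."""
    tick = count()  # unique tie-breaker so the heap never compares trees
    # Each queue entry is (frequency, tie_breaker, tree); a tree is either
    # a character (leaf) or a (left, right) pair (internal node).
    q = [(f, next(tick), ch) for ch, f in freq.items()]
    heapq.heapify(q)                           # Q <- C
    for _ in range(len(freq) - 1):             # n - 1 merging operations
        fx, _, x = heapq.heappop(q)            # x <- EXTRACT-MIN(Q)
        fy, _, y = heapq.heappop(q)            # y <- EXTRACT-MIN(Q)
        heapq.heappush(q, (fx + fy, next(tick), (x, y)))  # INSERT(Q, z)
    codes = {}
    def walk(tree, prefix):                    # read codewords off the tree
        if isinstance(tree, str):
            codes[tree] = prefix or "0"        # degenerate 1-symbol alphabet
        else:
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
    walk(q[0][2], "")
    return codes

codes = huffman({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
# The most frequent character, 'a', receives a one-bit codeword.
```

The two EXTRACT-MIN calls per iteration match the pseudo-code directly; the final walk simply converts the tree representation into a codeword table.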
Analysis

From the discussion above about the efficiency of elementary priority queue operations (when optimally implemented), it follows that HUFFMAN on an alphabet of n characters takes O(n log n) time. In general the running time is dominated by (a) the time to populate the data structure Q, plus (b) n − 1 multiplied by the time to extract the minimum element from Q.

To complete the analysis of this procedure we need to prove that it actually works!

Key properties

An optimisation problem can be solved optimally by a greedy algorithm if it has the following two features:
greedy-choice: an optimal solution can be reached by making a locally optimal choice at each step;
optimal-substructure: an optimal solution is formed by optimal solutions to subproblems.

Claim 1. Let C be an alphabet and let the frequency function f be defined for each c ∈ C. Let x and y be the two characters having the lowest frequencies. There exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit.

(Proof idea) Let T be a tree representing an arbitrary optimal prefix code. We show how to modify it into a new tree T'' (again representing an optimal prefix code) but having x and y as sibling leaves of maximum depth. The codewords for x and y in T'' will then have the same length and differ only in the last bit.

Details

Given T, let a and b be any two characters that are sibling leaves of maximum depth. Without loss of generality assume f(a) ≤ f(b) and f(x) ≤ f(y). Since x and y have the two lowest frequencies, it must be f(x) ≤ f(a) and f(y) ≤ f(b) (otherwise T wouldn't be minimal). Now define T' from T by exchanging a with x, and T'' from T' by exchanging b with y. Finally compute B(T) − B(T'). Most of the terms simplify, and what is left,

    B(T) − B(T') = (f(a) − f(x)) (d_T(a) − d_T(x)),

is non-negative; the same holds for B(T') − B(T''). Therefore the transformation generates a new optimal tree and, furthermore, x and y have the desired property in T''.
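The "most of the terms simplify" step can be checked concretely. The following sketch uses a toy tree and frequencies of my own (not from the notes): after exchanging a deepest sibling leaf a with the low-frequency character x, the cost drops by exactly (f(a) − f(x))(d_T(a) − d_T(x)) ≥ 0.

```python
# Numeric check of the exchange step in the proof of Claim 1.
# A tree is a nested tuple: a leaf is a character, an internal node a pair.

def depths(tree, d=0, out=None):
    out = {} if out is None else out
    if isinstance(tree, str):
        out[tree] = d                # depth of this leaf = codeword length
    else:
        depths(tree[0], d + 1, out)
        depths(tree[1], d + 1, out)
    return out

def cost(tree, f):
    """B(T) = sum of f(c) * d_T(c) over all leaves c."""
    return sum(f[c] * d for c, d in depths(tree).items())

f = {"x": 1, "y": 2, "a": 10, "b": 20}   # x, y have the lowest frequencies
T  = ("x", (("a", "b"), "y"))            # a, b: sibling leaves of max depth
T1 = ("a", (("x", "b"), "y"))            # T': exchange a with x

dT = depths(T)
gain = cost(T, f) - cost(T1, f)
assert gain == (f["a"] - f["x"]) * (dT["a"] - dT["x"])   # = 9 * 2 = 18
```

Because f(a) ≥ f(x) and a sits at least as deep as x, both factors are non-negative, so the exchange never increases the cost of the tree.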
Claim 2. Let T be a tree representing an optimal prefix code over an alphabet C, and let the frequency function f be defined for each c ∈ C. Consider any two characters x and y appearing as sibling leaves in T, and let z be their parent. Then, considering z as a character with frequency f(z) = f(x) + f(y), the tree T' obtained from T by removing the leaves x and y represents an optimal prefix code for the alphabet C' = (C − {x, y}) ∪ {z}.
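The arithmetic behind this optimal-substructure claim is the identity B(T) = B(T') + f(x) + f(y): every occurrence of x or y in T sits one level deeper than z does in T'. A quick check on a toy tree (my own example; the frequencies are made up):

```python
# Checking the cost identity behind Claim 2: replacing sibling leaves x, y
# by a single leaf z with f(z) = f(x) + f(y) lowers the cost by exactly
# f(x) + f(y). Trees are nested tuples: leaf = character, node = pair.

def cost(tree, f, d=0):
    """B(T) = sum of f(c) * depth(c) over the leaves of the tree."""
    if isinstance(tree, str):
        return f[tree] * d
    return cost(tree[0], f, d + 1) + cost(tree[1], f, d + 1)

f = {"x": 5, "y": 9, "a": 45, "z": 14}   # f(z) = f(x) + f(y)
T     = ("a", ("x", "y"))                # x, y sibling leaves, parent z
T_red = ("a", "z")                       # z replaces the subtree {x, y}

assert cost(T, f) == cost(T_red, f) + f["x"] + f["y"]
```

Since the difference B(T) − B(T') is the constant f(x) + f(y), a cheaper tree for C' would immediately yield a cheaper tree for C, which is the contradiction the proof of Claim 2 relies on.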