14 Data Compression by Huffman Encoding

Milton Banks
5 years ago
14.1 Introduction

In order to save disk storage space, it is useful to be able to compress files (or memory blocks) of data so that they take up less room. However, we don't want to lose or corrupt the data, so we want to use lossless data compression. Huffman encoding is probably the simplest method of lossless data compression (although not the most effective).

We should also note here that not all files can be compressed. A simple explanation is that if every file could be compressed, then a compressed file could itself be compressed into another file, and so on, until the original file was reduced to nothing! Clearly this is not possible.

Before we go any further with Huffman encoding, let's remind ourselves about probability.

14.2 Probability

Probability is the mathematician's way of communicating how likely an event is. There are two ways of calculating probabilities of events:

(i) Using past experience, e.g. What is the probability that you will live to be 90 years of age? Life insurance companies hold a database of statistics recording how many people live to that age. They can use this to estimate how long you are likely to live.

(ii) By calculation using the formula

$$p(\text{event}) = \frac{\text{No of ways event can happen}}{\text{Total no of possible outcomes}}$$

where p(event) means 'probability of event happening'.

e.g. 1 The probability of scoring 5 or more when throwing a normal 'fair' die is

$$\frac{\#\{5,6\}}{\#\{1,2,3,4,5,6\}} = \frac{2}{6} = \frac{1}{3}$$

e.g. 2 A character is read from a file. Assuming that all 256 (8-bit, extended ASCII) characters are equally likely to be read, what is the probability that the character is alphanumeric? Alphanumerics are {'a'... 'z'}, {'A'... 'Z'} and {'0'... '9'}, therefore the required probability is given by:

$$p(\text{char is alphanumeric}) = \frac{26 + 26 + 10}{256} = \frac{62}{256} \approx 0.242$$

14.3 Huffman Encoding

Huffman encoding uses a similar principle to Morse code. In Morse code, the length of the code reflects the frequency of occurrence of that character in English language text. So, for example, 'e' = . (i.e. the shortest code) and 'q' = --.- (i.e. almost the longest code).
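The two calculated probabilities in section 14.2 can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are made up for the example, and it assumes, as the text does, that all 256 character values are equally likely.

```python
from fractions import Fraction
import string

# Example 1: scoring 5 or more with a fair six-sided die.
favourable = {5, 6}
outcomes = {1, 2, 3, 4, 5, 6}
p_die = Fraction(len(favourable), len(outcomes))

# Example 2: an alphanumeric character among 256 equally likely values.
# string.ascii_letters is 'a'..'z' plus 'A'..'Z'; string.digits is '0'..'9'.
alphanumerics = string.ascii_letters + string.digits
p_alnum = Fraction(len(alphanumerics), 256)

print(p_die)           # 1/3
print(float(p_alnum))  # 0.2421875
```

Exact fractions are used so that the result matches the hand calculation term by term before it is rounded.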

14.3.1 Creating the Huffman Codes

In Huffman coding, instead of using ASCII, a new code is devised, which depends on which characters are in the file (or message) and how often they occur. The coded file or message will usually take up less space than the original ASCII representation. Huffman codes are of variable length and are created using a tree structure. The algorithm used for this is:

1. Across the bottom of the page, list each different character that appears in the 'message'.
2. Write against each character the probability of it occurring. (These are the leaf nodes of the tree. At the moment they have no parents.)
3. If there is only one parent-less node, then go to step 6.
4. Find the two parent-less nodes (leaf or internal node) with the lowest sum of probabilities. Join them by adding a common (internal) parent node. Give this parent node a probability equal to their sum.
5. Go to step 3.
6. Assign binary codes to the characters by 'walking' down the tree from root to leaf, giving a '0' for each left branch and a '1' for each right branch. The code for each character is obtained by reading from the root to each leaf node.

Example

Consider the following block of 60 characters:

    ABABCBABDF  BEBCBDBEBF  BDBDBABCBA
    FABABCCCDE  FABCFABBAA  FCAAABABCD

The distribution and hence probability of each character is thus:

    Character   No of occurrences   Probability
    A           15                  0.25
    B           21                  0.35
    C            9                  0.15
    D            6                  0.10
    E            3                  0.05
    F            6                  0.10

So, steps 1 and 2 of the algorithm give us:

[Figure: six parent-less leaf nodes A (0.25), B (0.35), C (0.15), D (0.10), E (0.05), F (0.10)]

Step 3 of the algorithm allows us to go on to step 4. At step 4, we can choose either (E and F) or (E and D), since both pairs have the lowest sum, 0.15. We'll use (E and F) giving us:

[Figure: E and F joined under a parent node of probability 0.15]
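The occurrence counts and probabilities for the example block can be reproduced mechanically. The following is a small sketch using Python's standard `collections.Counter`; the names `MESSAGE`, `counts` and `probabilities` are chosen for illustration only.

```python
from collections import Counter

# The 60-character example block, written as six groups of ten characters.
MESSAGE = (
    "ABABCBABDF" "BEBCBDBEBF" "BDBDBABCBA"
    "FABABCCCDE" "FABCFABBAA" "FCAAABABCD"
)

counts = Counter(MESSAGE)   # steps 1 and 2: one entry per distinct character
total = len(MESSAGE)
probabilities = {ch: counts[ch] / total for ch in sorted(counts)}

for ch in sorted(counts):
    print(ch, counts[ch], round(probabilities[ch], 2))
```

Each printed row corresponds to one leaf node of the tree before any parents have been added.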

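We now go back to step 3 and then do step 4 again. This time we'll link C and D giving:

[Figure: C and D joined under a parent node of probability 0.25]

We keep repeating this procedure until only one parent-less node, the root, remains. At this point, the root node should have the value 1.0. We now go to step 6 and label all the left branches with the value '0' and all the right branches with the value '1' giving:

[Figure: the completed tree with each left branch labelled '0' and each right branch labelled '1']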
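The six steps of the algorithm can be sketched in Python as follows. This is a minimal sketch rather than the method as drawn on the page: it keeps the parent-less nodes in a priority queue (`heapq`), works on raw counts instead of probabilities, and breaks ties by insertion order. Because ties are resolved differently, the individual codes may differ from those read off the hand-drawn tree, although the total encoded length is the same. The function name `huffman_codes` is an assumption made for this example.

```python
import heapq
from itertools import count

def huffman_codes(counts):
    """Return {symbol: bit string} for a dict mapping symbols to counts."""
    order = count()  # tie-breaker, so the heap never has to compare nodes
    # Steps 1-2: one parent-less leaf per symbol.
    # Heap entries are (weight, order, node); a node is either a symbol
    # (leaf) or a (left, right) tuple (internal node).
    heap = [(w, next(order), sym) for sym, w in sorted(counts.items())]
    if not heap:
        return {}
    heapq.heapify(heap)

    # Steps 3-5: join the two lightest parent-less nodes until one remains.
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(order), (left, right)))

    # Step 6: walk from the root, '0' for left branches, '1' for right.
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"  # a lone symbol still needs one bit

    walk(heap[0][2], "")
    return codes

example_counts = {"A": 15, "B": 21, "C": 9, "D": 6, "E": 3, "F": 6}
codes = huffman_codes(example_counts)
total_bits = sum(example_counts[s] * len(codes[s]) for s in codes)
print(codes)
print(total_bits)  # 144
```

Using integer counts avoids floating-point ties that could make the merge order depend on rounding.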

We can now read off the codes of each character as follows:

    A = 00
    B = 01
    C = 100
    D = 101
    E = 110
    F = 111

Note: It is only by chance that the codes for A to F go up in a binary sequence. If we had made different decisions during the creation of the tree, we could have got different codes.

The first 10 characters of the encoded message (ABABCBABDF) are thus:

    00010001100010001101111

Exercise

Produce another version of this tree and codes. Compare the total number of bits in the encoded message using the two sets of codes. (Think about how to do this. You don't have to encode the message to find out how many bits are needed.)

14.4 Decoding

In order to decode the Huffman codes, we need a copy of the encoding tree at the receiving end. Unfortunately, this means we also have to transmit the frequency data for each character in the character set (or send the codes with separators first). So, if the character set is ASCII, we have to send 256 values, which are the counts for each character, as well as the encoded data. This is unfortunate as it increases the size of the compressed file. Fortunately, for large files the overhead involved is acceptably small. (And there are ways to reduce the size of the count data.) We also have to ensure that both the transmitting end and the receiving end use the same rules to create the tree.

Recreating the tree

To recreate the tree, we use the counts to produce the probabilities as before. However, we have already seen that more than one tree can be created from any set of probability data. Hence we have to make sure that both the transmitting end and the receiving end follow the same rules for creating the tree. Rules which should ensure that both ends create the same tree (and which might help to produce a nice neat tree) are:

1. Write the characters along the bottom of the page with the highest probability on the left and descending across the page to the lowest on the right. Any characters with the same probability should be written in 'ASCII' order.
2. When there is a choice of nodes to use, always use the one furthest over to the right, even if it is at a higher level.
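The claim that the count overhead becomes acceptably small for large files can be illustrated with some rough arithmetic. The sketch below rests on assumptions that are not in the text: each of the 256 counts is stored as a hypothetical 4-byte integer, the original file uses one byte per character, and every file has the same character distribution as the 60-character example, for which an optimal code needs 144 bits (an average of 2.4 bits per character).

```python
HEADER_BYTES = 256 * 4      # assumption: one 4-byte count per character value
BITS_PER_60_CHARS = 144     # assumption: example distribution, 2.4 bits/char

def compressed_bytes(n_chars):
    """Estimated size of header plus Huffman-coded data, in bytes."""
    payload_bits = n_chars * BITS_PER_60_CHARS // 60
    payload_bytes = (payload_bits + 7) // 8   # round up to whole bytes
    return HEADER_BYTES + payload_bytes

for n in (60, 60_000, 6_000_000):
    size = compressed_bytes(n)
    # Ratio above 1 means the "compressed" file is larger than the original.
    print(n, size, round(size / n, 3))
```

Under these assumptions the 60-character message grows once the counts are included, while for millions of characters the ratio approaches 2.4/8 = 0.3 and the fixed header is negligible.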

Following these rules, the data we used in the example would give this tree:

[Figure: the tree built with the two rules, with leaves in left-to-right order B A C D F E]

and the codes would be:

    A = 01
    B = 00
    C = 10
    D = 110
    E = 1111
    F = 1110

The transmitted data for the first 10 characters of the message (ABABCBABDF) would now be the frequency values followed by:

    01000100100001001101110

Decoding the message

Having recreated the tree, decoding the message is simply a matter of reading the encoded bits in turn, moving down the tree from the root node until a leaf node is reached and hence decoding the character. Thus, taking 0100010010... in order we would go:

    From the root node:   0 - left,  1 - right = A
    Go back to the root:  0 - left,  0 - left  = B
    Go back to the root:  0 - left,  1 - right = A
    Go back to the root:  0 - left,  0 - left  = B
    Go back to the root:  1 - right, 0 - left  = C
    etc.

There is no problem about knowing when the next character starts and no conflict between which codes mean which character.

14.5 Instantaneous Codes

Huffman codes are an example of instantaneous codes. These are codes in which it is guaranteed that no complete code is identical to the first few bits (a prefix) of any other code. e.g. in our codes, there is no character with the code '0' which could be confused with the code for A or B, and no character has the code '111' which could be confused with E or F.
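Decoding and the instantaneous (prefix-free) property can both be demonstrated in a few lines. The sketch below assumes the code table B = 00, A = 01, C = 10, D = 110, F = 1110, E = 1111, which is what the two tie-breaking rules give for the example counts. Instead of walking an explicit tree it accumulates bits until they match a complete code, which is equivalent precisely because no code is a prefix of another. All function names are illustrative.

```python
# Assumed code table for the example (left branch = '0', right branch = '1').
CODES = {"B": "00", "A": "01", "C": "10", "D": "110", "F": "1110", "E": "1111"}

def is_prefix_free(codes):
    """True if no code is the start of a different code."""
    words = list(codes.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    lookup = {code: ch for ch, code in codes.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit                 # take one more branch down the tree
        if current in lookup:          # reached a leaf: emit it
            decoded.append(lookup[current])
            current = ""               # go back to the root
    if current:
        raise ValueError("bit stream ends in the middle of a code")
    return "".join(decoded)

bits = encode("ABABCBABDF", CODES)
print(bits)                  # 01000100100001001101110
print(decode(bits, CODES))   # ABABCBABDF
print(is_prefix_free(CODES)) # True
```

A table such as {"A": "0", "B": "01"} fails the `is_prefix_free` check, and with it a decoder could not tell whether a leading 0 is a complete A or the start of a B.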
