CS02b Project 2 String compression with Huffman trees

PROJECT OVERVIEW

We've discussed how characters can be encoded into bits for storage in a computer. ASCII (7-8 bits per character) and Unicode (16+ bits per character) are two related standardized encodings used by Java and other computer systems. However, an encoding can use fewer bits (less memory) when the "alphabet" (the set of all possible characters) is small: fewer characters need representation, so fewer unique bit combinations (character codes) are required.

Huffman trees are a method of producing an encoding that uses few bits for the most commonly used characters and more bits for rarely used characters, saving bits overall. This is one form of "compression": minimizing the memory required to store data.

For this project, you will complete a class whose primary functions are to:

1. build a Huffman tree based on a text passage,
2. encode a text passage into a "bit String" (a String of 1s and 0s) using an encoding from the tree, and
3. decode a bit String using the encoding represented by the tree.

Your starter code also comes with a function that builds a "standard" Huffman tree for you. Note that even small variations in Huffman trees will result in different encodings for some characters, which is why this standardized tree is provided: it lets you test your decode function in a way that PRECISELY matches the original encoding.

This project also gives you the opportunity to see and manage a larger amount of code than most of our assignments, as well as practice working within and understanding a code base produced by another person. This is a valuable skill in programming: most advanced work is the result of multiple complex pieces of code, produced by multiple people, working together!

Required Questions

Once your program is functioning, you should use it, and write some additional output statements in key locations in your code, to help you answer the following questions:

1. Consider encoding the passage from I Know Why the Caged Bird Sings (cagedbirdpassage.txt).

   a. How many unique characters are present in that passage? (In this case we mean text characters, not story characters!)
   b. If all characters were assigned unique codes using the same number of bits, how many bits would be required for each? (See our class example where we assigned unique 3-bit codes to the characters from "a man, a plan, a canal, panama". Why did we need 3 bits?)
   c. How many bits total would be required for the entire passage, using the number of bits from part (b) for each character?
   d. How many bits do you save by instead encoding the passage using your Huffman tree generated from the same text?

2. Consider encoding the passage from The Hobbit (thehobbitpassage.txt).
   a. How many bits does it take to encode it using a Huffman tree generated from its own text?
   b. How many bits does it take to encode it using the provided "standard" Huffman tree, which was generated from a different text passage?
   c. How do you explain this difference?

3. Using the standard Huffman tree provided, what's encoded in mysterypassage.txt?

DETAILED INSTRUCTIONS

Drag and drop the provided file, HuffmanTree.java, into the src folder for your CS02b project within the Eclipse package explorer. If you like, you can instead make a brand new Java Project for this assignment and put your file in that src folder. If Eclipse asks, choose "copy", which allows Eclipse to manage the file in your workspace folder without worrying about what you do with the original.

Save the following text files in your project folder (but outside your src folder):

   thehobbitpassage.txt
   cagedbirdpassage.txt
   mysterypassage.txt

This does not need to be done through the Eclipse package explorer, although if you want the files to show up there, you'll need to select your project folder and choose "refresh" from the menu (which is likely hotkeyed to F5).

Get to know the provided source code. An overview is provided below. Look at the different sections and make sure you have a clear picture of the different options, variables and functions.

Complete the code responsible for the three major tasks: tree building, encoding, and decoding. You can do these in any order and test them individually if you like. The standard tree can be used for both encoding and decoding without building your own, and the mysterypassage.txt file is already encoded using that standard tree, so you can decode it without first encoding your own files.

Thoroughly test your code. You may un-comment or add your own println statements to give feedback throughout your program. One thing you might like to do is create a very small text file, for instance containing a single line with just a few characters, to test the basics before trying your code on real English text. With a small enough file, you can manually check a generated tree or encoding for correctness. And once your encoding and decoding are working, you should be able to encode a file, decode it using the same tree, and confirm that the original text comes back out.

When your code is working (or perhaps in the process of testing), activate the standard tree and decode the mystery passage. If the result is not intelligible, your decode functions are probably not correct!

Answer questions 1 and 2 above in a multi-line comment at the end of your source code. You may run your program multiple times for this, changing which files are encoded/decoded as necessary. You may also switch between using the provided standard tree and your own built tree. Feel free to add any extra output statements to give you additional information as your code runs. There is no specific console output required as long as it's clear what your code is doing, the three primary operations work correctly, and you've answered the questions.

Submit the following via e-mail:

1. your HuffmanTree.java source code, with a comment at the bottom answering the two questions
2. your decoded.txt file showing the decoded version of mysterypassage.txt (you may rename the decoded file if you like)

Optionally, feel free to experiment and make additions to the code. My complete version of this project includes two extensions that are unfinished in your starter code:

1. the gapcheck function, which "fills in gaps" in a frequency map so the resulting tree is more flexible.
2. the buildbitrep function of the HuffmanParent class, which is part of a set of functions that can translate the tree itself to a bit String of 0s and 1s.
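The encode-then-decode check suggested in the testing step can be tried by hand on a tiny example before you touch real text. The sketch below is only an illustration under stated assumptions: the class and method names (RoundTripSketch, sampleCodes, and so on) are hypothetical and do not come from HuffmanTree.java, the three-character code table is written out by hand rather than generated from a tree, and the decoder here matches bits against a reversed table instead of walking a tree the way the project's decode method does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; none of these names come from HuffmanTree.java.
class RoundTripSketch {

    // A hand-written table for a three-character alphabet (an assumption for this example).
    static Map<Character, String> sampleCodes() {
        Map<Character, String> codes = new HashMap<>();
        codes.put('a', "0");
        codes.put('b', "10");
        codes.put('c', "11");
        return codes;
    }

    // Encode by looking each character up in the character-to-bit-String table.
    static String encode(String text, Map<Character, String> codes) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            bits.append(codes.get(c));
        }
        return bits.toString();
    }

    // Decode by growing a candidate code one bit at a time until it matches a table entry.
    // This only works when no code is the beginning of another code, as in a Huffman encoding.
    static String decode(String bits, Map<Character, String> codes) {
        Map<String, Character> reverse = new HashMap<>();
        for (Map.Entry<Character, String> entry : codes.entrySet()) {
            reverse.put(entry.getValue(), entry.getKey());
        }
        StringBuilder text = new StringBuilder();
        StringBuilder candidate = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            candidate.append(bit);
            Character match = reverse.get(candidate.toString());
            if (match != null) {
                text.append(match);
                candidate.setLength(0); // start collecting the next code
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> codes = sampleCodes();
        String original = "abacab";
        String bits = encode(original, codes);
        String restored = decode(bits, codes);
        System.out.println(bits);                                   // 010011010
        System.out.println(restored.equals(original) ? "round trip OK" : "MISMATCH");
    }
}
```

With an input this small you can check every bit by hand, which is exactly the point of starting with a tiny file.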

CODE OVERVIEW

As mentioned, there are a few distinct sections and processes going on in the code. This section describes the overall plan, but you should look at the code itself to get a sense of how it all fits together; it's thoroughly commented. By the way, this may be a good time to re-open that "outline" panel in Eclipse we've kept closed/minimized this whole time!

Terminology note: The term "prefix" is used in several places in this code, meaning "the first part of a String". In this program, prefixes are often extended one or two characters at a time, for instance while working your way recursively down a Huffman tree. Upon reaching a leaf, the "prefix" represents the path taken through the parent nodes to get there.

Tree node classes

These only appear at the end, but understanding them is key to working with the tree structure, so you may want to look at them first.

The tree is represented by interconnected HuffmanNode objects. Since HuffmanNode is abstract, each node must be one of the two subclasses, HuffmanParent or HuffmanLeaf. HuffmanParent objects keep track of the structure of the tree, while HuffmanLeaf objects store the actual characters at the end of each series of branches.

The two subclasses have appropriate (different) definitions for the same recursive methods to enable tree operations (see below). Because HuffmanNode declares these methods, polymorphism is used to call them recursively from parent nodes, regardless of whether their children are HuffmanLeaf objects or are themselves HuffmanParent objects with even more nodes below them. Thus, the recursive calls travel easily down the tree and into the leaves.

Constants

These static final variables at the top of the HuffmanTree class cannot be changed once the program has begun. In an application produced for the general public, these variables would probably be replaced with configuration files, a user interface, or some other way to specify exactly what the program is supposed to be doing without having to change the source code. In our case, these variables serve our needs just fine. Take a look at what each is supposed to do and where each is used.

One variable you might like to add to this section is a file name of your own, so that you can easily test your trees on any piece of text you like. You can create a new file in Eclipse through the "new" options.

To answer the questions, you will definitely need to change which files are encoded and decoded by changing the file selections represented in these constants. One thing you might like to do while testing is set your program to decode the very same file that was just generated during encoding (ENCODE_OUT_F).

Main method

This long method is already finished, although there are a few lines where System.out.println calls may be commented in or out, and you may add your own println statements as well. This may be useful while testing and answering questions. For the most part you shouldn't have to change existing code in this method; instead, focus on making the individual methods work correctly when called FROM the main method.

Tree building methods

You will need to complete the genfrequencymap and maptotree methods for a tree to be built correctly. The first creates a Map from Characters to their frequencies based on a char[], and the second uses that Map to build the actual Huffman tree. These are both used by the gentree method, which is already complete but may be a useful place to put some additional output statements.

Two (overloaded) genstdtree methods are provided and already completed; they generate a standard tree from the included bit String instead of building a brand new one. You don't need to change these, although it may be interesting to see how they work (see the last section of this document for how trees are encoded as bits).

Encoding methods

As we discussed, there are two main approaches to encoding a sequence of characters: mapping each possible character to its node within the tree ahead of time, and following the chain of parent nodes up to the top to determine the bit String each time a character is encoded, OR pre-generating a mapping directly from each possible character to its bit String. This project uses the latter approach.
A third approach, searching the entire tree from the top down every time a character needs encoding, would be extremely inefficient.

You'll have to finish the recursive setbitstrings methods of the Huffman node subclasses. A parent node is responsible for propagating the recursive call to its children so that they too can be added to the Map, providing those children with the correct bit Strings representing the branches taken down to that point. A leaf node is responsible for adding itself to the Map.

Once the Map from characters to their bit Strings is finished, encoding is quite straightforward, so the encode function has been completed for you. The hard part is recursively traversing the tree and getting each character into the Map in the first place.

Decoding methods

You must complete the decode method of the Huffman node subclasses. Decoding works perfectly well without a Map, instead starting at the top of the tree and reading input bits one by one to decide which branch to take at each point. That's exactly what this method should do for the parent nodes, using the provided CharArrayIterator to advance through the different bit characters. Once a leaf is reached, no further bits are required; the character has been found.

Again, once the recursive tree methods are complete, the rest is very straightforward, so the non-recursive static decode method has been completed for you already.

A note about chars

If we were writing professional compression software, instead of converting each character to multiple '0' and '1' chars (which take up the same amount of memory as any other character, after all), we'd be converting them to "raw binary" 0s and 1s (which really do take only one bit each to store) before writing them to a file, creating an actual reduction in file size. However, the purpose of this project is to demonstrate the compression techniques, so to make it as easy as possible to see the results of your encoding, we'll keep them as chars. Remember to treat them this way in the code!

OPTIONAL ADDITIONS

That concludes the sections of code you are required to complete. If you'd like some more challenges, here are two additional features you might try. They look more complicated than they are; they shouldn't take too much code, and they allow you to do some pretty interesting things!
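Before moving on to the optional features, it may help to see the recursive pattern behind the required encoding and decoding methods in miniature. The sketch below is not the starter code and should not be read as its design: the class names (NodeSketch, Node, Parent, Leaf), the method names (collectCodes, decodeOne, decodeAll), and the one-element int array used as a moving position are all assumptions made for illustration, standing in for HuffmanNode, its subclasses, setbitstrings, decode, and the provided CharArrayIterator. The tree is built by hand so that 'a' is 0, 'b' is 10, and 'c' is 11.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical miniature of a Huffman node hierarchy; names are illustrative only.
class NodeSketch {

    abstract static class Node {
        // Record the bit String of every leaf at or below this node, given the path so far.
        abstract void collectCodes(String prefix, Map<Character, String> codes);

        // Read bits starting at pos[0], advance pos[0] past them, and return one character.
        abstract char decodeOne(char[] bits, int[] pos);
    }

    static class Leaf extends Node {
        final char symbol;

        Leaf(char symbol) { this.symbol = symbol; }

        @Override
        void collectCodes(String prefix, Map<Character, String> codes) {
            codes.put(symbol, prefix); // the prefix is the complete path to this leaf
        }

        @Override
        char decodeOne(char[] bits, int[] pos) {
            return symbol; // a leaf needs no further bits
        }
    }

    static class Parent extends Node {
        final Node zero; // child reached by a '0' bit
        final Node one;  // child reached by a '1' bit

        Parent(Node zero, Node one) { this.zero = zero; this.one = one; }

        @Override
        void collectCodes(String prefix, Map<Character, String> codes) {
            zero.collectCodes(prefix + "0", codes);
            one.collectCodes(prefix + "1", codes);
        }

        @Override
        char decodeOne(char[] bits, int[] pos) {
            char bit = bits[pos[0]++]; // consume one bit, then let the chosen child continue
            return (bit == '0' ? zero : one).decodeOne(bits, pos);
        }
    }

    // Hand-built tree: 'a' on the 0 side of the root, 'b' and 'c' in a sub-tree on the 1 side.
    static Node sampleTree() {
        return new Parent(new Leaf('a'), new Parent(new Leaf('b'), new Leaf('c')));
    }

    // Non-recursive driver: keep decoding characters from the root until the bits run out.
    // Assumes the root is a Parent, so every decoded character consumes at least one bit.
    static String decodeAll(Node root, String bitString) {
        char[] bits = bitString.toCharArray();
        int[] pos = {0};
        StringBuilder out = new StringBuilder();
        while (pos[0] < bits.length) {
            out.append(root.decodeOne(bits, pos));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Node root = sampleTree();
        Map<Character, String> codes = new TreeMap<>();
        root.collectCodes("", codes);
        System.out.println(codes);                        // {a=0, b=10, c=11}
        System.out.println(decodeAll(root, "010011010")); // abacab
    }
}
```

Notice that neither Parent method checks what kind of node its children are. The abstract declarations in Node let the same call work on either subclass, which is the polymorphism described in the tree node classes section.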
Gap check

Certain characters appear in some passages and not others. A properly generated Huffman tree can always be used to encode the passage it was generated from, but a tree can't be used to fully encode a passage for which it's missing characters (although it could just skip those characters).

One way to handle this is to have your tree-generating function, after generating the "frequency count", add any missing characters to the queue/tree with a frequency of 0. This way, they'll be at the bottom of the tree (the way generation works, they should all end up in one big sub-tree "hanging off" the bottom), but they will still get encodings, even if they're really long ones.

This could conceivably allow two people to agree on a "standard passage" to always use for tree generation. They could then encode ANY messages they like using that tree and send them to each other in compressed form. Decoding the messages would use that same standard tree.

In our project, some letters may not be present in some passages. This is especially true of the less common capital letters. Additionally, the following non-alphabetic characters appear in some of the provided passages and not others:

   '!'  '"'  '('  ')'  '/'  ':'  ';'  '?'

Of these eight characters, the Angelou passage contains four, the Tolkien passage contains five, and the mystery passage contains four.

The standard tree was generated by looping through the following character groups and adding any characters from them to the tree-generating queue if necessary:

   All capital letters
   All lowercase letters
   Characters with ASCII values 32-34: ' ' '!' and '"'
   Characters with ASCII values 39-41: '\'' '(' and ')'
   Characters with ASCII values 44-47: ',' '-' '.' and '/'
   These characters with inconvenient ASCII values: '\n' ':' ';' '?'

The standard tree is therefore capable of encoding and decoding ANY of the three passages provided, although it might use more bits to do so than a tree built specifically for a given passage.

By the way, we could have also included the digits and certain other punctuation chars in this group, but none of our sample passages include those characters, so we've chosen not to worry about them.
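The character groups above map naturally onto simple loops over character values. The following is a minimal sketch of that idea, not the starter code's gapcheck method: the class and method names (GapFillSketch, countFrequencies, fillRange, fillGaps) and the choice of a Map from Character to Integer are assumptions made for this example, and the real method's signature may differ.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of filling gaps in a frequency map; names are illustrative only.
class GapFillSketch {

    // Count how often each character occurs in the text.
    static Map<Character, Integer> countFrequencies(char[] text) {
        Map<Character, Integer> freq = new TreeMap<>();
        for (char c : text) {
            freq.merge(c, 1, Integer::sum);
        }
        return freq;
    }

    // Add every character from first to last (inclusive) that is missing, with frequency 0.
    static void fillRange(Map<Character, Integer> freq, char first, char last) {
        for (char c = first; c <= last; c++) {
            freq.putIfAbsent(c, 0); // existing counts are left untouched
        }
    }

    // Walk the character groups listed for the standard tree.
    static void fillGaps(Map<Character, Integer> freq) {
        fillRange(freq, 'A', 'Z');
        fillRange(freq, 'a', 'z');
        fillRange(freq, (char) 32, (char) 34); // ' ' '!' '"'
        fillRange(freq, (char) 39, (char) 41); // '\'' '(' ')'
        fillRange(freq, (char) 44, (char) 47); // ',' '-' '.' '/'
        for (char c : new char[] {'\n', ':', ';', '?'}) {
            freq.putIfAbsent(c, 0);
        }
    }

    public static void main(String[] args) {
        Map<Character, Integer> freq = countFrequencies("aab".toCharArray());
        fillGaps(freq);
        System.out.println(freq.get('a')); // 2
        System.out.println(freq.get('Z')); // 0
        System.out.println(freq.size());   // 66
    }
}
```

Because a char is a number behind the scenes, the loop variable can be a char and be incremented like an int, so no long list of individual characters is needed for the contiguous groups.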
Your mission, should you choose to accept it, is to complete the gapcheck method to fill in any missing characters in the frequency map so that the generated tree will include those characters too. The most obvious way to do this is to create a long array of all the characters you want to double-check for, then loop through them all. However, there are cleverer ways to set up your loops to avoid having to list out every single character to check in your code. Remember, each character is represented by a number behind the scenes, so you can loop through them if you know what order they occur in.

Bit representation of the tree itself

When files are sent or stored in encoded form, they are very hard to decode unless the decoder knows exactly which tree was used in the first place. Some sort of standard passage could be used to always build the same tree (as described above), but this can be inconvenient to communicate and may result in inefficient compression, since the standard tree isn't custom-built for each different encoded file.

One solution is to encode the tree itself in binary and include it at the beginning of the file in a standardized form. One straightforward standard is to start at the top and write a 0 for each parent node (which always has 2 children following it), and a 1 for each leaf, immediately followed by the 8-digit ASCII character code (in binary) for that leaf.

In fact, that is precisely the standard used to encode the tree provided to you in this project, and the genstdtree methods decode any bit String using that standard (although they read the bit String from the constant variable rather than from a file).

For example, a small tree with only three leaves could be encoded as follows (spaces added for clarity):

   0 1 0110 0001 0 1 0110 0010 1 0110 0011

Reading from left to right, the pieces are:

   0             root (a parent node)
   1 0110 0001   left/"0" child of root: a leaf
   0             right/"1" child of root (itself an entire sub-tree): a parent node
   1 0110 0010   left/"0" child in sub-tree: a leaf
   1 0110 0011   right/"1" child in sub-tree: a leaf

Note that following each 1 identifying a leaf node is the 8-bit ASCII code for 'a', 'b', or 'c'. So this represents a Huffman tree where 'a' is encoded as 0 (the only leaf on that side of the root), 'b' is encoded as 10, and 'c' as 11 (the two leaves on the other side of the root).
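To check your reading of the example above, here is a minimal sketch that walks the same bit String and reports the code each character ends up with. It is an illustration only: the names (TreeBitsSketch, readCodes, codesFrom) are hypothetical, and unlike the provided genstdtree methods, which build actual node objects, this sketch skips the tree and records the codes directly.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of reading the tree bit format; names are illustrative only.
class TreeBitsSketch {

    // Read one node starting at pos[0]. A '0' marks a parent, whose two sub-trees follow;
    // a '1' marks a leaf, followed by 8 bits holding the character code.
    static void readCodes(String bits, int[] pos, String prefix, Map<Character, String> codes) {
        char marker = bits.charAt(pos[0]++);
        if (marker == '1') {
            String eightBits = bits.substring(pos[0], pos[0] + 8);
            pos[0] += 8;
            char symbol = (char) Integer.parseInt(eightBits, 2); // binary text to a number
            codes.put(symbol, prefix);
        } else {
            readCodes(bits, pos, prefix + "0", codes); // left/"0" child comes first
            readCodes(bits, pos, prefix + "1", codes); // then the right/"1" child
        }
    }

    // Strip the spaces that were added for readability, then read from the root.
    static Map<Character, String> codesFrom(String spacedBits) {
        String bits = spacedBits.replace(" ", "");
        Map<Character, String> codes = new TreeMap<>();
        readCodes(bits, new int[] {0}, "", codes);
        return codes;
    }

    public static void main(String[] args) {
        String example = "0 1 0110 0001 0 1 0110 0010 1 0110 0011";
        System.out.println(codesFrom(example)); // {a=0, b=10, c=11}
    }
}
```

The recursion mirrors the format itself: every 0 promises exactly two more nodes, so the reader never needs to know the size of the tree in advance.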

So, this technique allows ANY file using ANY tree to be encoded and communicated, as long as the receiving program is using the same standard to encode and decode its trees and files. Fortunately, this is much easier to standardize than having to settle on a single tree to use for every communication!

The project code for converting a tree to and from a bit String is actually nearly complete already. The decoding part must already be finished, or else the provided standard tree could not be built by the genstdtree method. And the translation of a character to its 8-bit ASCII code uses enough functions and operations you haven't used before that it has been provided for you in the HuffmanLeaf class. The only thing left to do is fill in the buildbitrep function in the HuffmanParent class, call bitrep() on the root of the tree from the main method, and print the result.

(You could go even further and write the bit representation to the beginning of each encoded file. But you'll have to do that part on your own. Also, mysterypassage.txt was not encoded this way, so make sure you don't try to decode it using that format!)

Display (already finished)

The recursive display function has already been completed in the node classes, but you may learn something from looking at the definition and working out how it operates!