CS02b Project 2 String compression with Huffman trees


PROJECT OVERVIEW

We've discussed how characters can be encoded into bits for storage in a computer. ASCII (7-8 bits per character) and Unicode (16+ bits per character) are two related standardized encodings used by Java and other computer systems. However, an encoding can use fewer bits (less memory) when the "alphabet" (set of all possible characters) is small: fewer characters need representation, so fewer unique bit combinations (character codes) are required.

Huffman trees are a method of producing an encoding system that uses few bits for the most commonly used characters, and more bits for rarely used characters, saving bits overall. This is one form of "compression", minimizing the memory required to store data.

For this project, you will complete a class whose primary functions are to:

1. build a Huffman tree based on a text passage,
2. encode a text passage into a "bit String" (String of 1s and 0s) using an encoding from the tree, and
3. decode a bit String using the encoding represented by the tree.

Your starter code also comes with a function that builds a "standard" Huffman Tree for you. Note that even small variations in Huffman Trees will result in different encodings for some characters, which is why this standardized tree is provided for you to test your decode function in a way that PRECISELY matches the original encoding.

This project also gives you the opportunity to see and manage a larger amount of code than most of our assignments, as well as practice working within and understanding a code base produced by another person. This is a valuable skill in programming: most advanced work is the result of multiple complex pieces of code produced by multiple people, working together!

Required Questions

Once your program is functioning, you should use it and write some additional output statements in key locations in your code to help you answer the following questions:

1. Consider encoding the passage from I Know Why the Caged Bird Sings (cagedbirdpassage.txt).
   a. How many unique characters are present in that passage? (In this case we mean text characters, not story characters!)
   b. If all characters were assigned unique codes using the same number of bits, how many bits would be required for each? (See our class example where we assigned unique 3-bit codes to the characters from "a man, a plan, a canal, panama". Why did we need 3 bits?)
   c. How many bits total would be required for the entire passage, using the number of bits from part (b) for each character?
   d. How many bits do you save by instead encoding the passage using your Huffman tree generated from the same text?
2. Consider encoding the passage from The Hobbit (thehobbitpassage.txt).
   a. How many bits does it take to encode it using a Huffman tree generated using its own text?
   b. How many bits does it take to encode using the provided "standard" Huffman tree, which was generated from a different text passage?
   c. How do you explain this difference?
3. Using the standard Huffman tree provided, what's encoded in mysterypassage.txt?

DETAILED INSTRUCTIONS

Drag and drop the provided file, HuffmanTree.java, into the src folder for your CS02b project within the Eclipse package explorer. Or, if you like, you can make a brand new Java Project for this assignment and put your file in that src folder. If Eclipse asks, choose "copy", which allows Eclipse to manage the file in your workspace folder without worrying about what you do with the original.

Save the following text files in your project folder (but outside your src folder):

   thehobbitpassage.txt
   cagedbirdpassage.txt
   mysterypassage.txt

This does not need to be done through the Eclipse package explorer, although if you want them to show up there, you'll need to select your project folder and choose "refresh" from the menu (which is likely hotkeyed to F5).

Get to know the provided source code. An overview is provided below. Look at the different sections and make sure you have a clear picture of the different options, variables and functions.

Complete the code responsible for the three major tasks: tree building, encoding, and decoding. You can do these in any order and test them individually if you like. The standard tree can be used for both encoding and decoding without building your own, and the mysterypassage.txt file is already encoded using that standard tree if you would like to decode it without first encoding your own files.

Thoroughly test your code. You may uncomment or add your own println statements to give feedback throughout your program. One thing you might like to do is create a very small text file, for instance containing a single line with just a few characters, to test the basics before trying it on real English text. With a small enough file, you can manually check a generated tree or encoding for correctness. And once your encoding and decoding are working, you should be able to encode a file, decode it using the same tree, and confirm that the original text comes back out.

When your code is working (or perhaps in the process of testing), activate the standard tree and decode the mystery passage. If the result is not intelligible, your decode functions are probably not correct!

Answer questions 1 and 2 above in a multi-line comment at the end of your source code. You may run your program multiple times for this, changing which files are encoded/decoded as necessary. You may also switch between using the provided standard tree or your own built tree. Feel free to add any extra output statements to give you additional information as your code runs. There is no specific console output required as long as it's clear what your code is doing, the three primary operations work correctly, and you've answered the questions.

Submit the following via e-mail:

1. your HuffmanTree.java source code, with a comment at the bottom answering questions 1 and 2
2. your decoded.txt file showing the decoded version of mysterypassage.txt (you may rename the decoded file if you like)

Optionally, feel free to experiment and make additions to the code. My complete version of this project includes two extensions that are unfinished in your starter code:

1. the gapcheck function, which "fills in gaps" in a frequency map so the resulting tree is more flexible.
2. the buildbitrep function of the HuffmanParent class, which is part of a set of functions that can translate the tree itself to a bit String of 0s and 1s.
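
For the fixed-width comparison in question 1(b), the number of bits per character follows from the number of unique characters: it is the smallest b for which 2^b is at least the alphabet size. The sketch below works this out for the class example only. The class and method names (FixedWidthSketch, uniqueChars, fixedWidthBits) are hypothetical and are not part of the provided HuffmanTree.java.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class for illustration; not part of the starter code.
class FixedWidthSketch {

    // Counts the distinct characters in a passage.
    static int uniqueChars(String text) {
        Set<Character> seen = new HashSet<>();
        for (char c : text.toCharArray()) {
            seen.add(c);
        }
        return seen.size();
    }

    // Smallest number of bits b such that 2^b >= alphabetSize.
    // (For an alphabet of size 1 this returns 0.)
    static int fixedWidthBits(int alphabetSize) {
        int bits = 0;
        while ((1 << bits) < alphabetSize) {
            bits++;
        }
        return bits;
    }

    public static void main(String[] args) {
        String text = "a man, a plan, a canal, panama";
        int unique = uniqueChars(text);
        int bits = fixedWidthBits(unique);
        System.out.println(unique + " unique characters, " + bits
                + " bits each, " + (bits * text.length()) + " bits total");
    }
}
```

The class example has 8 unique characters (a, m, n, p, l, c, the comma, and the space), which is why 3 bits were enough: 2^3 = 8. At 30 characters long, the whole phrase takes 90 bits at that fixed width.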

CODE OVERVIEW

As mentioned, there are a few distinct sections and processes going on in the code. This section describes the overall plan, but you should look at the code itself to get a sense of how it all fits together; it's thoroughly commented. By the way, this may be a good time to reopen that "outline" panel in Eclipse we've kept closed/minimized this whole time!

Terminology note: The term "prefix" is used in several places in this code, meaning "the first part of a String". In this program, prefixes are often added to one or two characters at a time, for instance while working your way recursively down a Huffman tree. Upon reaching a leaf, the "prefix" represents the path taken through the parent nodes to get there.

Tree node classes

These only appear at the end, but understanding them is key to working with the tree structure, so you may want to look at them first. The tree is represented by interconnected HuffmanNode objects. Since HuffmanNode is abstract, each node must be one of the two subclasses, HuffmanParent or HuffmanLeaf. HuffmanParent nodes keep track of the structure of the tree, while HuffmanLeaf nodes store the actual characters at the end of each series of branches. The two subclasses have appropriate (different) definitions for the same recursive methods to enable tree operations (see below). Because HuffmanNode declares these methods, polymorphism is used to call them recursively from parent nodes, regardless of whether a parent's children are HuffmanLeaf nodes or are themselves HuffmanParent nodes with even more nodes below them. Thus, the recursive calls travel easily down the tree and into the leaves.

Constants

These static final variables at the top of the HuffmanTree class cannot be changed once the program has begun. In an application produced for the general public, these variables would probably be replaced with configuration files, a user interface, or some other way to specify exactly what the program is supposed to be doing without having to change the source code. In our case, these variables serve our needs just fine. Take a look at what each is supposed to do and where each is used.

One variable you might like to add to this section is a file name of your own so that you can easily test your trees on any piece of text you like. You can create a new file in Eclipse through the "new" options.

To answer the questions, you will definitely need to change which files are encoded and decoded by changing the file selections represented in these constants. One thing you might like to do while testing is set your program to decode the very same file that was just generated during encoding (ENCODE_OUT_F).

Main method

This long method is already finished, although there are a few lines where System.out.println function calls may be commented in or out, and you may add your own println statements as well. This may be useful while testing and answering questions. For the most part you shouldn't have to change existing code in this method; instead, focus on making the individual methods work correctly when called FROM the main method.

Tree building methods

You will need to complete the genfrequencymap and maptotree methods for a tree to be built correctly. The first creates a Map from Characters to their frequencies based on a char[], and the second uses that Map to build the actual Huffman tree. These are both used by the gentree method, which is already complete but may be a useful place to put some additional output statements.

Two (overloaded) genstdtree methods are provided and already completed, which generate a standard tree from the included bit String instead of building a brand new one. You don't need to change these, although it may be interesting to see how they work (see the last section of this document for how trees are encoded as bits).

Encoding methods

As we discussed, there are two main approaches to encoding a sequence of characters: mapping each possible character to its node within the tree ahead of time, and following the chain of parent nodes up to the top to determine the bit String each time a character is encoded, OR pre-generating a mapping directly from each possible character to its bit String. This project uses the latter approach.
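
The two tree-building steps and the pre-generated character-to-bit-String map described above can be sketched end to end as follows. This is a minimal stand-alone illustration under stated assumptions, not the starter code: it uses a single simplified Node class rather than the HuffmanParent/HuffmanLeaf hierarchy, it assumes a java.util.PriorityQueue is an acceptable way to pick the two lowest-frequency nodes, it assumes non-empty input, and every name in it (HuffmanPipelineSketch, frequencies, buildTree, fillCodes, encode) is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical stand-alone sketch; names do not match the starter code.
class HuffmanPipelineSketch {

    // Simplified node: a leaf when both children are null.
    static class Node {
        final char symbol;
        final int weight;
        final Node left;
        final Node right;

        Node(char symbol, int weight, Node left, Node right) {
            this.symbol = symbol;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Step 1: count how often each character occurs.
    static Map<Character, Integer> frequencies(char[] text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text) {
            freq.merge(c, 1, Integer::sum);
        }
        return freq;
    }

    // Step 2: repeatedly merge the two lowest-weight nodes until one remains.
    static Node buildTree(Map<Character, Integer> freq) {
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.weight, b.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        while (queue.size() > 1) {
            Node first = queue.poll();
            Node second = queue.poll();
            queue.add(new Node('\0', first.weight + second.weight, first, second));
        }
        return queue.poll();
    }

    // Step 3: record each leaf's path ("0" = left, "1" = right) in a map.
    // A tree consisting of a single leaf gets the one-bit code "0".
    static void fillCodes(Node node, String prefix, Map<Character, String> codes) {
        if (node.isLeaf()) {
            codes.put(node.symbol, prefix.isEmpty() ? "0" : prefix);
        } else {
            fillCodes(node.left, prefix + "0", codes);
            fillCodes(node.right, prefix + "1", codes);
        }
    }

    // Step 4: encode by looking up every character in the pre-generated map.
    static String encode(String text) {
        Map<Character, String> codes = new HashMap<>();
        fillCodes(buildTree(frequencies(text.toCharArray())), "", codes);
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            bits.append(codes.get(c));
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        String text = "a man, a plan, a canal, panama";
        System.out.println(encode(text).length() + " bits");
    }
}
```

Ties between equal frequencies may be broken differently from one implementation to the next, so the individual codes can differ from those of another correct tree even though the total encoded length is the same. That is the reason the project supplies a standard tree for decoding mysterypassage.txt.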
A third approach, searching the entire tree from the top down every time a character needs encoding, would be extremely inefficient.

You'll have to finish the recursive setbitstrings methods of the Huffman node subclasses. A parent node is responsible for propagating the recursive call to its children so that they too can be added to the Map, providing those children with the correct bit Strings representing the branches taken down to that point. A leaf node is responsible for adding itself to the Map.

Once the Map from characters to their bit Strings is finished, encoding is quite straightforward, so the encode function has been completed for you. The hard part is recursively traversing the tree and getting each character into the map in the first place.

Decoding methods

You must complete the decode method of the Huffman node subclasses. Decoding works perfectly well without a Map, instead starting at the top of the tree and reading input bits one by one to decide which branch to take at each point. That's exactly what this method should do for the parent nodes, using the provided CharArrayIterator to advance through the different bit characters. Once a leaf is reached, no further bits are required; the character has been found.

Again, once the recursive tree methods are complete, the rest is very straightforward, so the non-recursive static decode method has been completed for you already.

A note about chars

If we were writing professional compression software, instead of converting each character to multiple '0' and '1' chars (which take up the same amount of memory as any other character, after all), we'd be converting them to "raw binary" 0s and 1s (which really do only take one bit to store) before writing them to a file, creating an actual reduction in file size. However, the purpose of this project is to demonstrate the compression techniques, so to make it as easy as possible to see the results of your encoding, we'll keep them as chars. Remember to treat them this way in the code!

OPTIONAL ADDITIONS

That concludes the sections of code you are required to complete. If you'd like some more challenges, here are two additional features you might try. They actually look more complicated than they are, they shouldn't take too much code, and they allow you to do some pretty interesting things!
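
Looking back at the required recursive methods, the division of labour between parent and leaf nodes can be illustrated with a small hand-built tree. The sketch below is an assumption-laden illustration, not the starter code: SketchNode, SketchParent, and SketchLeaf only imitate the roles of HuffmanNode, HuffmanParent, and HuffmanLeaf, their method signatures are guesses, and BitCursor is a hypothetical stand-in for the provided CharArrayIterator, whose real API may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; signatures are assumptions, not the starter code's.
class NodeSketchDemo {

    // Stand-in for the provided CharArrayIterator (real API may differ).
    static class BitCursor {
        private final char[] bits;
        private int pos = 0;

        BitCursor(String bits) {
            this.bits = bits.toCharArray();
        }

        boolean hasNext() {
            return pos < bits.length;
        }

        char next() {
            return bits[pos++];
        }
    }

    static abstract class SketchNode {
        // Adds the code of every leaf at or below this node to the map.
        abstract void setBitStrings(String prefix, Map<Character, String> codes);

        // Consumes bits until a leaf is reached and returns its character.
        abstract char decode(BitCursor in);
    }

    static class SketchLeaf extends SketchNode {
        final char symbol;

        SketchLeaf(char symbol) {
            this.symbol = symbol;
        }

        // A leaf adds itself: the prefix is the full path taken to reach it.
        void setBitStrings(String prefix, Map<Character, String> codes) {
            codes.put(symbol, prefix);
        }

        // A leaf needs no further bits; the character has been found.
        char decode(BitCursor in) {
            return symbol;
        }
    }

    static class SketchParent extends SketchNode {
        final SketchNode zero;
        final SketchNode one;

        SketchParent(SketchNode zero, SketchNode one) {
            this.zero = zero;
            this.one = one;
        }

        // A parent passes the call down, extending the prefix per branch.
        void setBitStrings(String prefix, Map<Character, String> codes) {
            zero.setBitStrings(prefix + "0", codes);
            one.setBitStrings(prefix + "1", codes);
        }

        // A parent reads one bit to choose which branch to follow.
        char decode(BitCursor in) {
            return in.next() == '0' ? zero.decode(in) : one.decode(in);
        }
    }

    // Hand-built tree in which 'a' = 0, 'b' = 10, and 'c' = 11.
    static SketchNode exampleTree() {
        return new SketchParent(new SketchLeaf('a'),
                new SketchParent(new SketchLeaf('b'), new SketchLeaf('c')));
    }

    // Non-recursive driver: restart at the root for each character.
    static String decodeAll(SketchNode root, String bits) {
        BitCursor in = new BitCursor(bits);
        StringBuilder out = new StringBuilder();
        while (in.hasNext()) {
            out.append(root.decode(in));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> codes = new HashMap<>();
        exampleTree().setBitStrings("", codes);
        System.out.println(codes);
        System.out.println(decodeAll(exampleTree(), "010110"));
    }
}
```

Neither recursive method ever asks what kind of node a child is; the call on a child simply lands in the leaf or the parent version, which is the polymorphism the node classes rely on.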
Gap check

Certain characters appear in some passages and not others. A properly generated Huffman tree can always be used to encode the passage it was generated from, but a tree can't be used to fully encode a passage for which it's missing characters (although it could just skip those characters). One way to handle this is to have your tree-generating function, after generating the "frequency count", add any missing characters to the queue/tree with a frequency of 0. This way, they'll be at the bottom of the tree (the way generation works, they should all end up in one big sub-tree "hanging off" the bottom), but they will still get encodings, even if they're really long ones.

This could conceivably allow two people to agree on a "standard passage" to always use for tree generation. They could then encode ANY messages they like using that tree and send them to each other in compressed form. Decoding the messages would use that same standard tree.

In our project, some letters may not be present in some passages. This is especially true of the less common capital letters. Additionally, the following non-alphabetic characters appear in some of the provided passages and not others:

   '!' '"' '(' ')' '/' ':' ';' '?'

(the Angelou passage contains four of these, the Tolkien passage five, and the mystery passage four).

The standard tree was generated by looping through the following character groups and adding any characters from them to the tree-generating queue if necessary:

   All capital letters
   All lowercase letters
   Characters with ASCII values 32-34: ' ' '!' and '"'
   Characters with ASCII values 39-41: '\'' '(' and ')'
   Characters with ASCII values 44-47: ',' '-' '.' and '/'
   These characters with inconvenient ASCII values: '\n' ':' ';' '?'

The standard tree is therefore capable of encoding and decoding ANY of the three passages provided, although it might use more bits to do so than a tree built specifically for a given passage. By the way, we could have also included the digits and certain other punctuation chars in this group, but none of our sample passages include those characters, so we've chosen not to worry about them.
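
Because a char is a number behind the scenes, the character groups listed above can be covered with a few range loops. The following is a minimal sketch, assuming the frequency map is a Map<Character, Integer>; the class and method names (GapCheckSketch, fillRange, fillGaps) are hypothetical, and the starter code's gapcheck method may have a different signature.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of gap filling; not the starter code's gapcheck method.
class GapCheckSketch {

    // Adds every character from first to last (inclusive) with frequency 0,
    // leaving any character that was already counted untouched.
    static void fillRange(Map<Character, Integer> freq, char first, char last) {
        for (char c = first; c <= last; c++) {
            freq.putIfAbsent(c, 0);
        }
    }

    // Covers the same character groups the handout lists for the standard tree.
    static void fillGaps(Map<Character, Integer> freq) {
        fillRange(freq, 'A', 'Z');              // all capital letters
        fillRange(freq, 'a', 'z');              // all lowercase letters
        fillRange(freq, (char) 32, (char) 34);  // ' ' '!' '"'
        fillRange(freq, (char) 39, (char) 41);  // '\'' '(' ')'
        fillRange(freq, (char) 44, (char) 47);  // ',' '-' '.' '/'
        for (char c : new char[] {'\n', ':', ';', '?'}) {
            freq.putIfAbsent(c, 0);             // the scattered leftovers
        }
    }

    public static void main(String[] args) {
        Map<Character, Integer> freq = new HashMap<>();
        freq.put('e', 5); // pretend 'e' was counted five times in the passage
        fillGaps(freq);
        System.out.println(freq.size() + " characters in the map");
    }
}
```

Map.putIfAbsent only inserts when the key is missing, so real counts from the passage are never overwritten by the zero placeholders.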
Your mission, should you choose to accept it, is to complete the gapcheck method to fill in any missing characters in the frequency map so that the generated tree will include those characters too. The most obvious way to do this is to create a long array of all the characters you want to double check for, then loop through them all. However, there are cleverer ways to set up your loops to avoid having to list out every single character to check in your code. Remember, each character is represented by a number behind the scenes, so you can loop through them if you know what order they occur in.

Bit representation of the tree itself

When files are sent or stored in encoded form, they are very hard to decode unless the decoder knows exactly which tree was used in the first place. Some sort of standard passage could be used to always build the same tree (as described above), but this can be inconvenient to communicate and may result in inefficient compression since the standard tree isn't custom built for each different encoded file.

One solution is to encode the tree itself in binary and include it at the beginning of the file in a standardized form. One straightforward standard is to start at the top and write a 0 for each parent node (which always has 2 children following it), and a 1 for each leaf, immediately followed by the 8-digit ASCII character code (in binary) for that leaf. In fact, that is precisely the standard used to encode the tree provided to you in this project, and the genstdtree methods decode any bit String using that standard (although it obviously reads the bit String from the constant variable rather than a file).

For example, a small tree with only three leaves could be encoded as follows (spaces added for clarity):

   0 1 0110 0001 0 1 0110 0010 1 0110 0011

Reading the pieces from left to right:

   0             root
   1 0110 0001   left/"0" child of root
   0             right/"1" child of root (itself an entire sub-tree): the parent node in the sub-tree
   1 0110 0010   left/"0" child in sub-tree
   1 0110 0011   right/"1" child in sub-tree

Note that following each 1 identifying a leaf node is the 8-bit ASCII code for 'a', 'b', or 'c'. So this represents a Huffman tree where 'a' is encoded as 0 (the only leaf on that side of the root), 'b' is encoded as 10, and 'c' as 11 (the two leaves on the other side of the root).
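
To check the worked example, the format can be read back recursively: a 0 means "a parent, so read two more nodes", and a 1 means "a leaf, so read 8 more bits as its character". The sketch below is a hypothetical stand-alone reader, not the provided genstdtree code; rather than building node objects, it records each leaf's code directly, and the names TreeBitsSketch, readNode, and codesFrom are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of reading the tree format; not the genstdtree code.
class TreeBitsSketch {
    private final String bits;
    private int pos = 0;

    TreeBitsSketch(String bits) {
        // Spaces are only there for readability, so strip them first.
        this.bits = bits.replace(" ", "");
    }

    // Reads one node starting at pos and records the code of each leaf below it.
    private void readNode(String prefix, Map<Character, String> codes) {
        char marker = bits.charAt(pos++);
        if (marker == '1') {
            // Leaf: the next 8 bits are the character's ASCII code in binary.
            char symbol = (char) Integer.parseInt(bits.substring(pos, pos + 8), 2);
            pos += 8;
            codes.put(symbol, prefix);
        } else {
            // Parent: its "0" child follows first, then its "1" child.
            readNode(prefix + "0", codes);
            readNode(prefix + "1", codes);
        }
    }

    // Returns the character codes implied by a tree written in this format.
    static Map<Character, String> codesFrom(String treeBits) {
        TreeBitsSketch reader = new TreeBitsSketch(treeBits);
        Map<Character, String> codes = new LinkedHashMap<>();
        reader.readNode("", codes);
        return codes;
    }

    public static void main(String[] args) {
        // Prints {a=0, b=10, c=11}
        System.out.println(codesFrom("0 1 0110 0001 0 1 0110 0010 1 0110 0011"));
    }
}
```

No length field or end marker is needed: since every parent has exactly two children, the reader always knows when the tree is complete.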

So, this technique allows ANY file using ANY tree to be encoded and communicated, as long as the receiving program is using the same standard to encode and decode its trees and files. Fortunately, this is much easier to standardize than having to settle on a single tree to use for every communication!

The project code for converting a tree to and from a bit String is actually nearly complete already. Obviously, the decoding part is already finished, or else the provided standard tree could not be built by the genstdtree method. And the translation of a character to its 8-bit ASCII code uses enough functions and operations you haven't used before that it has been provided for you in the HuffmanLeaf class. The only thing left to do is fill in the buildbitrep function in the HuffmanParent class, call bitrep() on the root of the tree from the main method, and print the result.

(Actually, you could go even further and write it to the beginning of each encoded file. But you'll have to do that part on your own. Also, mysterypassage.txt was not encoded this way, so make sure you don't try to decode it using that format!)

Display (already finished)

The recursive display function has already been completed in the node classes, but you may learn something from looking at the definition and working out how it operates!