Data Structures and Algorithms

Data Structures and Algorithms CS245-2015S-P2 Huffman Codes Project 2 David Galles Department of Computer Science University of San Francisco

P2-0: Text Files All files are represented as binary digits including text files Each character is represented by an integer code ASCII American Standard Code for Information Interchange Text file is a sequence of binary digits which represent the codes for each character.

P2-1: ASCII Each character can be represented as an 8-bit number. ASCII for a = 97 = 01100001; ASCII for b = 98 = 01100010. A text file is a sequence of 1s and 0s which represent the ASCII codes for the characters in the file. The file aba is 97, 98, 97: 011000010110001001100001
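To make the bit pattern above concrete, here is a minimal Java sketch that prints the 8-bit code of each character in a string. The class and method names (AsciiBits, toBits) are invented for this illustration and are not part of the project.

```java
class AsciiBits {
    // Returns the concatenated 8-bit binary codes of the characters in s.
    // Assumes every character fits in 8 bits (value 0..255).
    static String toBits(String s) {
        StringBuilder bits = new StringBuilder();
        for (char c : s.toCharArray()) {
            String b = Integer.toBinaryString(c);
            // Left-pad with zeros so each character takes exactly 8 bits.
            for (int i = b.length(); i < 8; i++) {
                bits.append('0');
            }
            bits.append(b);
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBits("aba")); // 011000010110001001100001
    }
}
```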

P2-2: ASCII Each character in ASCII is represented as 8 bits We need 8 bits to represent all possible character combinations (including control characters, and unprintable characters) Breaking up file into individual characters is easy Finding the kth character in a file is easy

P2-3: ASCII ASCII is not terribly efficient All characters require 8 bits Frequently used characters require the same number of bits as infrequently used characters We could be more efficient if frequently used characters required fewer than 8 bits, and less frequently used characters required more bits

P2-4: Representing Codes as Trees Want to encode only 4 characters: a, b, c, d (instead of 256 characters) How many bits are required for each code, if each code has the same length?

P2-5: Representing Codes as Trees Want to encode only 4 characters: a, b, c, d (instead of 256 characters) How many bits are required for each code, if each code has the same length? 2 bits are required, since there are 4 possible options to distinguish

P2-6: Representing Codes as Trees Want to encode only 4 characters: a, b, c, d Pick the following codes: a: 00 b: 01 c: 10 d: 11 We can represent these codes as a tree Characters are stored at the leaves of the tree Code is represented by path to leaf

P2-7: Representing Codes as Trees a: 00, b: 01, c: 10, d: 11
(root)
  0: (interior node)
    0: a
    1: b
  1: (interior node)
    0: c
    1: d

P2-8: Representing Codes as Trees a: 01, b: 00, c: 11, d: 10
(root)
  0: (interior node)
    0: b
    1: a
  1: (interior node)
    0: d
    1: c

P2-9: Prefix Codes If no code is a prefix of any other code, then decoding the file is unambiguous. If all codes are the same length, then no code will be a prefix of any other code (trivially) We can create variable length codes, where no code is a prefix of any other code

P2-10: Variable Length Codes Variable length code example: a: 0, b: 100, c: 101, d: 11 Decoding examples: 100 10011 01101010010011
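The three decoding examples above can be worked through mechanically. The following is a sketch only, assuming the code table is held in a map from code string to character; the names PrefixDecode, decode, and exampleCodes are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class PrefixDecode {
    // Decodes a string of '0'/'1' characters using a prefix code.
    // Because no code is a prefix of another, the first match is the only match.
    static String decode(String bits, Map<String, Character> codes) {
        StringBuilder out = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            current.append(bit);
            Character c = codes.get(current.toString());
            if (c != null) {
                out.append(c);
                current.setLength(0); // start collecting the next code
            }
        }
        if (current.length() != 0) {
            throw new IllegalArgumentException("trailing bits do not form a code");
        }
        return out.toString();
    }

    // The code from the slide: a: 0, b: 100, c: 101, d: 11
    static Map<String, Character> exampleCodes() {
        Map<String, Character> codes = new HashMap<String, Character>();
        codes.put("0", 'a');
        codes.put("100", 'b');
        codes.put("101", 'c');
        codes.put("11", 'd');
        return codes;
    }

    public static void main(String[] args) {
        Map<String, Character> codes = exampleCodes();
        System.out.println(decode("100", codes));            // b
        System.out.println(decode("10011", codes));          // bd
        System.out.println(decode("01101010010011", codes)); // adacaabd
    }
}
```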

P2-11: Prefix Codes & Trees Any prefix code can be represented as a tree a: 0, b: 100, c: 101, d: 11
(root)
  0: a
  1: (interior node)
    0: (interior node)
      0: b
      1: c
    1: d

P2-12: File Length If we use the code: a:00, b:01, c:10, d:11 How many bits are required to encode a file of 20 characters?

P2-13: File Length If we use the code: a:00, b:01, c:10, d:11 How many bits are required to encode a file of 20 characters? 20 characters * 2 bits/character = 40 bits

P2-14: File Length If we use the code: a:0, b:100, c:101, d:11 How many bits are required to encode a file of 20 characters?

P2-15: File Length If we use the code: a:0, b:100, c:101, d:11 How many bits are required to encode a file of 20 characters? It depends upon the number of a's, b's, c's and d's in the file

P2-16: File Length If we use the code: a:0, b:100, c:101, d:11 How many bits are required to encode a file of: 11 a's, 2 b's, 2 c's, and 5 d's?

P2-17: File Length If we use the code: a:0, b:100, c:101, d:11 How many bits are required to encode a file of: 11 a's, 2 b's, 2 c's, and 5 d's? 11*1 + 2*3 + 2*3 + 5*2 = 33 < 40
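The arithmetic above is a sum of (count * code length) over the characters. A small sketch, with the hypothetical names FileLength and encodedBits, compares the fixed-length and variable-length codes for the same 20-character file:

```java
class FileLength {
    // Total encoded size in bits: sum over characters of count * code length.
    static int encodedBits(int[] counts, int[] codeLengths) {
        int total = 0;
        for (int i = 0; i < counts.length; i++) {
            total += counts[i] * codeLengths[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] counts = {11, 2, 2, 5};          // a, b, c, d
        int[] fixedLengths = {2, 2, 2, 2};     // a:00, b:01, c:10, d:11
        int[] variableLengths = {1, 3, 3, 2};  // a:0, b:100, c:101, d:11
        System.out.println(encodedBits(counts, fixedLengths));    // 40
        System.out.println(encodedBits(counts, variableLengths)); // 33
    }
}
```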

P2-18: Decoding Files We can use variable length codes to encode a text file. Given the encoded file, and the tree representation of the codes, it is easy to decode the file. Example, using the tree for a: 0, b: 100, c: 101, d: 11: 0111001010011 decodes to adbcaad

P2-19: Decoding Files We can use variable length codes to encode a text file. Given the encoded file, and the tree representation of the codes, it is easy to decode the file. Finding the kth character in the file is more tricky

P2-20: Decoding Files We can use variable length codes to encode a text file. Given the encoded file, and the tree representation of the codes, it is easy to decode the file. Finding the kth character in the file is more tricky. Need to decode the first (k-1) characters in the file, to determine where the kth character is in the file
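Tree-based decoding can be sketched as a walk from the root that restarts at every leaf. This is an illustration under assumptions: the Node class, the use of a '0'/'1' string for the encoded file, and the names TreeDecode, decode, and exampleTree are all invented here, and the tree is assumed to have at least two leaves.

```java
class TreeDecode {
    static class Node {
        final char symbol;     // meaningful only at leaves
        final Node zero, one;  // children for bit 0 and bit 1; both null at a leaf

        Node(char symbol) { this.symbol = symbol; this.zero = null; this.one = null; }
        Node(Node zero, Node one) { this.symbol = '\0'; this.zero = zero; this.one = one; }
        boolean isLeaf() { return zero == null; }
    }

    // Follows one edge per bit; on reaching a leaf, emits its character
    // and returns to the root.
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.zero : cur.one;
            if (cur.isLeaf()) {
                out.append(cur.symbol);
                cur = root;
            }
        }
        return out.toString();
    }

    // Tree for a: 0, b: 100, c: 101, d: 11
    static Node exampleTree() {
        Node bc = new Node(new Node('b'), new Node('c'));
        return new Node(new Node('a'), new Node(bc, new Node('d')));
    }

    public static void main(String[] args) {
        System.out.println(decode(exampleTree(), "0111001010011")); // adbcaad
    }
}
```

With this representation there is no way to jump straight to the kth character: the only option is to decode from the start and take character k of the result, which is the cost the slide describes.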

P2-21: File Compression We can use variable length codes to compress files Select an encoding such that frequently used characters have short codes, less frequently used characters have longer codes Write out the file using these codes (If the codes are dependent upon the contents of the file itself, we will also need to write out the codes at the beginning of the file for decoding)

P2-22: File Compression We need a method for building codes such that: Frequently used characters are represented by leaves high in the code tree Less Frequently used characters are represented by leaves low in the code tree Characters of equal frequency have equal depths in the code tree

P2-23: Huffman Coding For each code tree, we keep track of the total number of times the characters in that tree appear in the input file We start with one code tree for each character that appears in the input file We combine the two trees with the lowest frequency, until all trees have been combined into one tree

P2-24: Huffman Coding Example: If the letters a-e have the frequencies: a: 100, b: 20, c: 15, d: 30, e: 1 Trees: a:100 b:20 c:15 d:30 e:1

P2-25: Huffman Coding Example: If the letters a-e have the frequencies: a: 100, b: 20, c: 15, d: 30, e: 1 Trees (an interior node is written as :weight(left right)): a:100 b:20 :16(c:15 e:1) d:30

P2-26: Huffman Coding Example: If the letters a-e have the frequencies: a: 100, b: 20, c: 15, d: 30, e: 1 Trees: a:100 :36(b:20 :16(c:15 e:1)) d:30

P2-27: Huffman Coding Example: If the letters a-e have the frequencies: a: 100, b: 20, c: 15, d: 30, e: 1 Trees: a:100 :66(:36(b:20 :16(c:15 e:1)) d:30)

P2-28: Huffman Coding Example: If the letters a-e have the frequencies: a: 100, b: 20, c: 15, d: 30, e: 1 Trees: :166(a:100 :66(:36(b:20 :16(c:15 e:1)) d:30))
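The combining step can be sketched with java.util.PriorityQueue, which hands back the lowest-weight tree on each poll. This is one possible shape for the code, not the project's required design; the names HuffmanBuild, Node, build, depthOf, and exampleFrequencies are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class HuffmanBuild {
    static class Node {
        final int weight;        // total frequency of the characters in this tree
        final char symbol;       // meaningful only at leaves
        final Node left, right;  // both null at a leaf

        Node(char symbol, int weight) {
            this.symbol = symbol; this.weight = weight; this.left = null; this.right = null;
        }
        Node(Node left, Node right) {
            this.symbol = '\0'; this.weight = left.weight + right.weight;
            this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null; }
    }

    // Starts with one single-leaf tree per character, then repeatedly merges
    // the two lowest-weight trees. Returns null for an empty frequency table.
    static Node build(Map<Character, Integer> freq) {
        PriorityQueue<Node> pq =
            new PriorityQueue<Node>((x, y) -> Integer.compare(x.weight, y.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            pq.add(new Node(e.getKey(), e.getValue()));
        }
        while (pq.size() > 1) {
            Node first = pq.poll();
            Node second = pq.poll();
            pq.add(new Node(first, second));
        }
        return pq.poll();
    }

    // Depth of the leaf holding symbol (equal to its code length), or -1 if absent.
    static int depthOf(Node n, char symbol, int depth) {
        if (n.isLeaf()) {
            return n.symbol == symbol ? depth : -1;
        }
        int d = depthOf(n.left, symbol, depth + 1);
        return d >= 0 ? d : depthOf(n.right, symbol, depth + 1);
    }

    static Map<Character, Integer> exampleFrequencies() {
        Map<Character, Integer> freq = new HashMap<Character, Integer>();
        freq.put('a', 100); freq.put('b', 20); freq.put('c', 15);
        freq.put('d', 30); freq.put('e', 1);
        return freq;
    }

    public static void main(String[] args) {
        Node root = build(exampleFrequencies());
        System.out.println(root.weight);            // 166
        System.out.println(depthOf(root, 'a', 0));  // 1
        System.out.println(depthOf(root, 'e', 0));  // 4
    }
}
```

When several trees share the same weight, the queue may break ties in any order, so the resulting tree can differ in shape from a hand-drawn one while giving the same total encoded length.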

P2-29: Huffman Coding Example: If the letters a-e have the frequencies: a: 10, b: 10, c: 10, d: 10, e: 10 Trees: a:10 b:10 c:10 d:10 e:10

P2-30: Huffman Coding Example: If the letters a-e have the frequencies: a: 10, b: 10, c: 10, d: 10, e: 10 Trees (an interior node is written as :weight(left right)): :20(a:10 b:10) c:10 d:10 e:10

P2-31: Huffman Coding Example: If the letters a-e have the frequencies: a: 10, b: 10, c: 10, d: 10, e: 10 Trees: :20(a:10 b:10) :20(c:10 d:10) e:10

P2-32: Huffman Coding Example: If the letters a-e have the frequencies: a: 10, b: 10, c: 10, d: 10, e: 10 Trees: :20(a:10 b:10) :30(:20(c:10 d:10) e:10)

P2-33: Huffman Coding Example: If the letters a-e have the frequencies: a: 10, b: 10, c: 10, d: 10, e: 10 Trees: :50(:20(a:10 b:10) :30(:20(c:10 d:10) e:10))

P2-34: Huffman Trees & Tables Once we have a Huffman tree, decoding a file is straightforward, but encoding a file requires a bit more information. Given just the tree, finding the code for a character can be difficult... What would we like to have, to help with encoding?

P2-35: Encoding Tables
Tree (an interior node is written as :weight(left right)): :166(a:100 :66(:36(b:20 :16(c:15 e:1)) d:30))
Character  Code
a          0
b          100
c          1010
d          11
e          1011

P2-36: Creating Encoding Table Traverse the tree Keep track of the path during the traversal When a leaf is reached, store the path in the table
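The traversal described above can be sketched as a recursive walk that carries the path so far as a string of 0s and 1s. The names CodeTable, Node, fill, buildTable, and exampleTree are hypothetical; the example tree is the one built from a: 100, b: 20, c: 15, d: 30, e: 1.

```java
import java.util.Map;
import java.util.TreeMap;

class CodeTable {
    static class Node {
        final char symbol;       // meaningful only at leaves
        final Node left, right;  // both null at a leaf

        Node(char symbol) { this.symbol = symbol; this.left = null; this.right = null; }
        Node(Node left, Node right) { this.symbol = '\0'; this.left = left; this.right = right; }
        boolean isLeaf() { return left == null; }
    }

    // Appends 0 when going left and 1 when going right; at a leaf, the
    // accumulated path is that character's code.
    static void fill(Node n, String path, Map<Character, String> table) {
        if (n.isLeaf()) {
            table.put(n.symbol, path);
            return;
        }
        fill(n.left, path + "0", table);
        fill(n.right, path + "1", table);
    }

    // Note: a tree consisting of a single leaf would get the empty code "";
    // a real compressor needs to treat that case specially.
    static Map<Character, String> buildTable(Node root) {
        Map<Character, String> table = new TreeMap<Character, String>();
        fill(root, "", table);
        return table;
    }

    // :166(a :66(:36(b :16(c e)) d))
    static Node exampleTree() {
        Node n16 = new Node(new Node('c'), new Node('e'));
        Node n36 = new Node(new Node('b'), n16);
        Node n66 = new Node(n36, new Node('d'));
        return new Node(new Node('a'), n66);
    }

    public static void main(String[] args) {
        System.out.println(buildTable(exampleTree()));
        // {a=0, b=100, c=1010, d=11, e=1011}
    }
}
```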

P2-37: Huffman Coding To compress a file using Huffman coding: Read in the file, count the occurrences of each character, and build a frequency table. Build the Huffman tree from the frequencies. Build the Huffman codes from the tree. Print the Huffman tree to the output file (for use in decompression). Print out the code for each character of the input file

P2-38: Huffman Coding To uncompress a file using Huffman coding: Read in the Huffman tree from the input file. Read the rest of the input file bit by bit, traversing the Huffman tree as you go. When a leaf is reached, write the character at that leaf to the output file and return to the root

P2-39: Magic Numbers We only want to uncompress files that we actually compressed ourselves. Write a Magic Number at the beginning of the compressed file. For this project, use 0x4846 (ASCII "HF"). When uncompressing, first check that the magic number is correct
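A magic-number check amounts to comparing the first two bytes of the file against 'H' (0x48) and 'F' (0x46). The sketch below works on a plain byte array rather than the course's BinaryFile class, and the names MagicNumber and hasMagic are made up for the example.

```java
class MagicNumber {
    static final int MAGIC = 0x4846; // 'H' (0x48) followed by 'F' (0x46)

    // Returns true if data begins with the two magic bytes.
    static boolean hasMagic(byte[] data) {
        if (data.length < 2) {
            return false;
        }
        // Mask with 0xFF so the bytes are treated as unsigned values.
        int found = ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
        return found == MAGIC;
    }

    public static void main(String[] args) {
        System.out.println(hasMagic(new byte[] {'H', 'F', 0x01})); // true
        System.out.println(hasMagic(new byte[] {'F', 'H'}));       // false
    }
}
```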

P2-40: Binary Files
public BinaryFile(String filename, char readorwrite)
public boolean EndOfFile()
public char readchar()
public void writechar(char c)
public boolean readbit()
public void writebit(boolean bit)
public void close()

P2-41: Binary Files readbit Read a single bit readchar Read a single character (8 bits)

P2-42: Binary Files writebit Writes out a single bit writechar Writes out a single (8 bit) character

P2-43: Binary Files If we write to a binary file: char, bit, bit, bit, bit And then read from the file: bit, bit, bit, bit, char What will we get out?

P2-44: Binary Files If we write to a binary file: char, bit, bit, bit, bit And then read from the file: bit, bit, bit, bit, char What will we get out? Garbage!
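The reason for the garbage is that a bit stream has no internal markers: reads must use the same order and widths as the writes. The sketch below shows this with an in-memory stand-in for the course's BinaryFile class. BitBuffer and its most-significant-bit-first layout are assumptions made for the illustration, not a description of how BinaryFile is implemented.

```java
import java.util.ArrayList;
import java.util.List;

class BitOrderDemo {
    // Hypothetical in-memory bit stream; characters are stored as 8 bits,
    // most significant bit first.
    static class BitBuffer {
        private final List<Boolean> bits = new ArrayList<Boolean>();
        private int pos = 0;

        void writeBit(boolean b) { bits.add(b); }
        void writeChar(char c) {
            for (int i = 7; i >= 0; i--) {
                writeBit(((c >> i) & 1) == 1);
            }
        }
        boolean readBit() { return bits.get(pos++); }
        char readChar() {
            int v = 0;
            for (int i = 0; i < 8; i++) {
                v = (v << 1) | (readBit() ? 1 : 0);
            }
            return (char) v;
        }
    }

    // Writes: char 'a', then bits 1, 0, 1, 1.
    static BitBuffer written() {
        BitBuffer buf = new BitBuffer();
        buf.writeChar('a');
        buf.writeBit(true); buf.writeBit(false); buf.writeBit(true); buf.writeBit(true);
        return buf;
    }

    // Reads in the same order as the writes: the character comes back intact.
    static char rightOrder() {
        return written().readChar();
    }

    // Reads four bits first, then a character: the character straddles
    // the original boundary and comes back as something else.
    static char wrongOrder() {
        BitBuffer buf = written();
        for (int i = 0; i < 4; i++) {
            buf.readBit();
        }
        return buf.readChar();
    }

    public static void main(String[] args) {
        System.out.println((int) rightOrder()); // 97, the code for 'a'
        System.out.println((int) wrongOrder()); // 27, not the character written
    }
}
```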

P2-45: Printing out Trees To print out Huffman trees: Print out nodes in pre-order traversal. Need a way of denoting which nodes are leaves and which nodes are interior nodes (Huffman trees are full: every node has 0 or 2 children). For each interior node, print out a 0 (single bit). For each leaf, print out a 1, followed by 8 bits for the character at the leaf
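The pre-order format can be sketched with a writer and a matching reader. To stay self-contained, the bits are collected in a string of '0' and '1' characters instead of being written through BinaryFile; the names TreeCodec, write, read, and exampleTree are invented for the example.

```java
class TreeCodec {
    static class Node {
        final char symbol;       // meaningful only at leaves
        final Node left, right;  // both null at a leaf

        Node(char symbol) { this.symbol = symbol; this.left = null; this.right = null; }
        Node(Node left, Node right) { this.symbol = '\0'; this.left = left; this.right = right; }
        boolean isLeaf() { return left == null; }
    }

    // Pre-order: a leaf is 1 followed by its 8-bit character;
    // an interior node is 0 followed by its left and right subtrees.
    static void write(Node n, StringBuilder out) {
        if (n.isLeaf()) {
            out.append('1');
            for (int i = 7; i >= 0; i--) {
                out.append(((n.symbol >> i) & 1) == 1 ? '1' : '0');
            }
        } else {
            out.append('0');
            write(n.left, out);
            write(n.right, out);
        }
    }

    // Rebuilds the tree; pos[0] is the index of the next unread bit.
    // Works because every interior node has exactly two children.
    static Node read(String bits, int[] pos) {
        if (bits.charAt(pos[0]++) == '1') {
            int value = Integer.parseInt(bits.substring(pos[0], pos[0] + 8), 2);
            pos[0] += 8;
            return new Node((char) value);
        }
        Node left = read(bits, pos);
        Node right = read(bits, pos);
        return new Node(left, right);
    }

    // Tree for a: 0, b: 100, c: 101, d: 11
    static Node exampleTree() {
        Node bc = new Node(new Node('b'), new Node('c'));
        return new Node(new Node('a'), new Node(bc, new Node('d')));
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        write(exampleTree(), out);
        System.out.println(out.length()); // 39 bits: 3 interior nodes + 4 leaves * 9
    }
}
```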

P2-46: Compression? Is it possible that Huffman compression would not compress the file? Is it possible that Huffman compression could actually make the file larger? How?

P2-47: Compression? What happens if all the characters have the same frequency? What does the tree look like? What can we say about the lengths of the codes for each character? What does that mean for the file size?

P2-48: Compression? What happens if all the characters have the same frequency? With all 256 possible characters present, all leaves are at the same depth in the tree (that is, 8). Each code will have a length of 8. The encoded file will be the same size as the original file, plus the size required to encode the tree

P2-49: Compression! What to do? Calculate the size of the input file. Calculate the size that the compressed file would be. If the compressed file is larger than the input file, don't compress

P2-50: Compression! Given the frequency array, how large is the input file?

P2-51: Compression! Given the frequency array, how large is the input file? $\sum_c \mathrm{freq}(c) \cdot \mathrm{len}(c)$ with $\mathrm{len}(c) = 8$ for every character, that is, (# of characters in the input file) * 8 bits

P2-52: Compression! Given the frequency array & codes for each compressed element, how large is the compressed file?

P2-53: Compression! Given the frequency array & codes for each compressed element, how large is the compressed file? $\sum_c \mathrm{freq}(c) \cdot \mathrm{len}(c)$ + size of tree + 4 bytes (32 bits) of file overhead (an implementation detail of the BinaryFile class) + 2 bytes (16 bits) for the Magic Number. Note: file sizes need to be a multiple of 8 bits...
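The size comparison can be sketched as two small functions. The tree size used here follows from the pre-order format in P2-45 (1 bit per interior node, 9 bits per leaf, and a full tree with n leaves has n - 1 interior nodes); rounding the total up to a multiple of 8 bits is an assumption about how the padding is applied. The names SizeEstimate, inputBits, and compressedBits are hypothetical.

```java
class SizeEstimate {
    // Size of the uncompressed input in bits: 8 bits per character.
    static long inputBits(int[] freq) {
        long chars = 0;
        for (int f : freq) {
            chars += f;
        }
        return chars * 8;
    }

    // Estimated size of the compressed file in bits.
    // freq[i] and codeLen[i] describe the same character; characters with
    // frequency 0 do not appear in the tree.
    static long compressedBits(int[] freq, int[] codeLen) {
        long body = 0;
        int leaves = 0;
        for (int i = 0; i < freq.length; i++) {
            if (freq[i] > 0) {
                body += (long) freq[i] * codeLen[i];
                leaves++;
            }
        }
        // Interior nodes cost 1 bit each, leaves cost 1 + 8 bits each.
        long tree = (leaves == 0) ? 0 : (leaves - 1) + 9L * leaves;
        long total = body + tree + 32 + 16; // file overhead + magic number
        return (total + 7) / 8 * 8;         // round up to a whole number of bytes
    }

    public static void main(String[] args) {
        int[] freq = {100, 20, 15, 30, 1};  // a, b, c, d, e
        int[] codeLen = {1, 3, 4, 2, 4};    // a:0, b:100, c:1010, d:11, e:1011
        System.out.println(inputBits(freq));               // 1328
        System.out.println(compressedBits(freq, codeLen)); // 384
    }
}
```

For this example the code bits come to 284, the tree to 49, and the fixed overhead to 48, giving 381 bits before padding.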

P2-54: Command Line Arguments public static void main(String args[]) The args parameter holds the input parameters. For the command java MyProgram arg1 arg2 arg3: args.length = 3, args[0] = arg1, args[1] = arg2, args[2] = arg3

P2-55: Calling Huffman
java Huffman (-c|-u) [-v] [-f] infile outfile
(-c|-u) stands for either -c (for compress), or -u (for uncompress). [-v] stands for an optional -v flag (for verbose). [-f] stands for an optional -f flag (for force compress). infile is the input file. outfile is the output file
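One way to read these arguments is sketched below. It assumes the flags appear in exactly the order shown in the usage line (mode first, then -v, then -f), which the slides do not state explicitly; the class name HuffmanArgs and its fields are invented for the example.

```java
class HuffmanArgs {
    static final String USAGE = "usage: java Huffman (-c|-u) [-v] [-f] infile outfile";

    boolean compress;  // true for -c, false for -u
    boolean verbose;   // -v given
    boolean force;     // -f given
    String infile;
    String outfile;

    // Parses args, assuming the flag order of the usage line.
    static HuffmanArgs parse(String[] args) {
        if (args.length < 3) {
            throw new IllegalArgumentException(USAGE);
        }
        HuffmanArgs a = new HuffmanArgs();
        int i = 0;
        if (args[i].equals("-c")) {
            a.compress = true;
        } else if (args[i].equals("-u")) {
            a.compress = false;
        } else {
            throw new IllegalArgumentException(USAGE);
        }
        i++;
        if (i < args.length && args[i].equals("-v")) { a.verbose = true; i++; }
        if (i < args.length && args[i].equals("-f")) { a.force = true; i++; }
        // Exactly two arguments must remain: the input and output files.
        if (args.length - i != 2) {
            throw new IllegalArgumentException(USAGE);
        }
        a.infile = args[i];
        a.outfile = args[i + 1];
        return a;
    }

    public static void main(String[] args) {
        HuffmanArgs a = parse(new String[] {"-c", "-v", "in.txt", "out.huff"});
        System.out.println(a.compress + " " + a.verbose + " " + a.force
            + " " + a.infile + " " + a.outfile);
        // true true false in.txt out.huff
    }
}
```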