CMPSCI 240 Reasoning Under Uncertainty Homework 4

Prof. Hanna Wallach
Assigned: February 24, 2012
Due: March 2, 2012

For this homework, you will be writing a program to construct a Huffman coding scheme. Recall that a Huffman code is a prefix code that uses variable-length bit strings to encode a set of mutually exclusive, disjoint events (in this homework, events are of the form "A = the outcome is character A") based on a probability distribution over those events. Your program will also calculate the entropy of the event distribution and the information rate of the Huffman code you produce.

Building a Huffman Tree

The goal of Huffman coding is to assign a unique bit string to every possible event, such that more probable (i.e., more frequent) events have shorter code words (bit strings) while less probable events have longer code words. This is a desirable property because it reduces the average number of bits used to represent each event (and hence any message consisting of some sequence of these events). (As an example, Morse code also has this property: the most common letters in English, E and T, have the shortest possible codes, a single dot and a single dash.)

Huffman coding produces variable-length codes because different events (in this case corresponding to different characters) are encoded using bit strings of different lengths. (Contrast this with, e.g., ASCII, which represents each character using an eight-bit code word and is therefore a fixed-length code.) Huffman codes are always prefix codes, which means that no code word is a prefix of any other code word. Thus, when reading a bit string encoding of some message (i.e., sequence of events), it is always unambiguous where one code word ends and the next begins.

A Huffman code for a specific distribution of events can be obtained by building a binary tree (often called a Huffman tree). Here's an example of how the Huffman coding algorithm works. Suppose you want to encode a text that contains only the characters A, B, C, D, E, and F. The probability distribution of the corresponding events A, B, C, D, E, F is as follows:

    A : 0.30
    B : 0.30
    C : 0.13
    D : 0.12
    E : 0.10
    F : 0.05
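As a minimal sketch, the example distribution above can be written down in Java as a map from characters to probabilities; this matches the Map<Character, Double> representation (named charProbs) used by the skeleton code described later in this handout. The class name ExampleDistribution here is just a placeholder for the sketch.

    import java.util.HashMap;
    import java.util.Map;

    public class ExampleDistribution {
        public static void main(String[] args) {
            // The example distribution as a Map<Character, Double>, the same
            // representation the skeleton code's charProbs parameter uses.
            Map<Character, Double> charProbs = new HashMap<Character, Double>();
            charProbs.put('A', 0.30);
            charProbs.put('B', 0.30);
            charProbs.put('C', 0.13);
            charProbs.put('D', 0.12);
            charProbs.put('E', 0.10);
            charProbs.put('F', 0.05);
            System.out.println(charProbs);
        }
    }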

The first step is to find the two events with the smallest probabilities. These are E and F (with probabilities 0.10 and 0.05, respectively). We group events E and F as the children of a new node corresponding to the event E ∪ F, whose probability is P(E ∪ F) = P(E) + P(F) = 0.15, since E and F are disjoint events. We also place a zero on the left branch and a one on the right branch:

        EF: 0.15
       0 /    \ 1
        E      F

We then find the next two lowest-probability events out of our remaining original events (A, B, C, D) and our new event E ∪ F. Now the two lowest-probability events are C and D, so we group them:

        CD: 0.25             EF: 0.15
       0 /    \ 1           0 /    \ 1
        C      D             E      F

Next, C ∪ D and E ∪ F are combined:

              CDEF: 0.40
             0 /      \ 1
        CD: 0.25       EF: 0.15
       0 /    \ 1     0 /    \ 1
        C      D       E      F

The last two steps combine A and B into A ∪ B, and then combine A ∪ B with C ∪ D ∪ E ∪ F. The final step results in a single binary tree with our original events A, B, C, D, E, F as leaves:

                   ABCDEF: 1.00
                  0 /        \ 1
           AB: 0.60           CDEF: 0.40
          0 /    \ 1         0 /      \ 1
           A      B     CD: 0.25       EF: 0.15
                       0 /    \ 1     0 /    \ 1
                        C      D       E      F

To determine the code word for any of A, B, C, D, E, F, we follow the path from the root of the tree to the leaf corresponding to that event, reading off the sequence of ones and zeros on the branches:

    Symbol   Probability   Code   Length
    A        0.30          00     2
    B        0.30          01     2
    C        0.13          100    3
    D        0.12          101    3
    E        0.10          110    3
    F        0.05          111    3

Note that a set of events and a corresponding probability distribution does not necessarily have a unique Huffman code: for example, in the first step above, when we combined events E and F, we could have associated E with the right branch and F with the left. The resulting Huffman code would have been different, but the lengths of the individual bit strings would have been the same.

Entropy and information rate

The information rate for any encoding of events is the weighted average of the lengths of the individual code words. For the Huffman code obtained above, the information rate is therefore

    0.3(2) + 0.3(2) + 0.13(3) + 0.12(3) + 0.1(3) + 0.05(3) = 2.4 bits.
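As a minimal sketch, this weighted average can be computed directly from the probability map and the code table. The signature below mirrors the informationRate method described at the end of this handout, but your implementation may be organized differently.

    import java.util.Map;

    // Sketch: information rate = sum over events of P(event) * length(code word).
    // Assumes charProbs and huffTable use the representations described later in
    // this handout (Map<Character, Double> and Map<Character, String>).
    public class InformationRateSketch {
        static double informationRate(Map<Character, Double> charProbs,
                                      Map<Character, String> huffTable) {
            double rate = 0.0;
            for (Map.Entry<Character, Double> entry : charProbs.entrySet()) {
                String codeWord = huffTable.get(entry.getKey());
                rate += entry.getValue() * codeWord.length();
            }
            return rate;  // 2.4 bits for the example code above
        }
    }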

The entropy for a set of mutually exclusive, disjoint events A_1, ..., A_n is

    H(A_1, ..., A_n) = Σ_{i=1}^{n} P(A_i) log_2(1 / P(A_i)).

The entropy of the probability distribution we've been examining is therefore

    H(A, B, C, D, E, F) = 0.3 log_2(1 / 0.3) + 0.3 log_2(1 / 0.3) + 0.13 log_2(1 / 0.13)
                          + 0.12 log_2(1 / 0.12) + 0.1 log_2(1 / 0.1) + 0.05 log_2(1 / 0.05)
                        ≈ 2.34 bits.

The entropy is a lower bound on the information rate. In other words, the information rate for a Huffman code can never be lower than the entropy of the corresponding probability distribution.

Writing the program

You are given skeleton code for this assignment. You must fill in three methods:

Map<Character, String> buildHuffmanTree(Map<Character, Double> charProbs)

This method builds a Huffman tree from a probability distribution over characters. The tree is returned as a mapping from characters to their bit strings in the Huffman code you create. To make this easier, you will not be constructing a tree data structure. Instead, you will keep track of the Huffman bit strings in a HashMap called huffCodes; these bit strings will be constructed one character (event) at a time, thereby mirroring the tree-building procedure described above.

We have given you a simple class, Event, that can help you figure out which events to merge at each step. An Event object represents a node or leaf in a Huffman tree and has two fields: chars and prob. For example, for the Event object corresponding to the E ∪ F node in the tree above, chars = EF and prob = 0.15. You will maintain a collection of these Event objects in a data structure called a priority queue (this data structure is built into Java). A priority queue is useful here because it provides very fast access to the lowest-probability Event. The algorithm outline is therefore as follows (a rough Java sketch of this loop appears after the walkthrough below):

    For each item in charProbs, construct an Event object and add it to the queue
    while the queue has more than one event in it do
        Pull the top two events off the queue; call them L and R
        Add a 0 to the beginning of the corresponding bit strings for every one of L's chars
        Add a 1 to the beginning of the corresponding bit strings for every one of R's chars
        Add a new Event to the queue constructed from the concatenation of L's and R's chars fields and the sum of their prob fields
    end while

To illustrate this algorithm, let's use the probability distribution from earlier. At the start of the algorithm, huffCodes is empty. We construct Event objects for the characters A through F and add them to the queue. We then pull the top two events from the queue (these will correspond to the characters E and F). We then update huffCodes to look like this:

    E 0
    F 1

and add a new Event with chars = EF and prob = 0.15 to the queue.

Next, we pull C and D off the queue. We update huffCodes to

    C 0
    D 1
    E 0
    F 1

and add a new Event with chars = CD and prob = 0.25 to the queue.

Next, we pull C ∪ D and E ∪ F off the queue. We update huffCodes by adding a 0 to the beginning of the code words for both C and D and a 1 to the beginning of E's and F's code words:

    C 00
    D 01
    E 10
    F 11

We add a new Event with chars = CDEF and prob = 0.4 to the queue.

Next, A and B are pulled off the queue. We update huffCodes to

    A 0
    B 1
    C 00
    D 01
    E 10
    F 11

and add a new Event with chars = AB and prob = 0.6 to the queue.

Next, we pull A ∪ B and C ∪ D ∪ E ∪ F off the queue. To update huffCodes, we add a 0 to the beginning of the code words for A and B and a 1 to the beginning of the code words for C, D, E, and F:

    A 00
    B 01
    C 100
    D 101
    E 110
    F 111

We add a new Event with chars = ABCDEF and prob = 1.0 to the queue. At this point, the algorithm terminates because there's only one event left in the queue.
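As promised above, here is a rough Java sketch of the tree-building loop. It is only an illustration of the outlined algorithm, not the official solution: it assumes Event has a constructor taking (chars, prob) and is ordered by probability in the priority queue, and it defines its own minimal stand-in Event class for that purpose; the Event class you are actually given may differ in these details.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    // Sketch only: a minimal stand-in for the Event class the skeleton provides.
    class Event implements Comparable<Event> {
        String chars;   // the characters (events) this node covers, e.g. "EF"
        double prob;    // the total probability of those events

        Event(String chars, double prob) {
            this.chars = chars;
            this.prob = prob;
        }

        // Lower probability = higher priority in the queue.
        public int compareTo(Event other) {
            return Double.compare(this.prob, other.prob);
        }
    }

    public class HuffmanSketch {
        static Map<Character, String> buildHuffmanTree(Map<Character, Double> charProbs) {
            Map<Character, String> huffCodes = new HashMap<Character, String>();
            PriorityQueue<Event> queue = new PriorityQueue<Event>();

            // One Event per character, and an (initially empty) bit string for each.
            for (Map.Entry<Character, Double> entry : charProbs.entrySet()) {
                queue.add(new Event(entry.getKey().toString(), entry.getValue()));
                huffCodes.put(entry.getKey(), "");
            }

            while (queue.size() > 1) {
                Event left = queue.poll();   // lowest-probability event
                Event right = queue.poll();  // next lowest-probability event

                // Prepend 0 to the code words of all characters under the left
                // event and 1 to those under the right event.
                for (char c : left.chars.toCharArray()) {
                    huffCodes.put(c, "0" + huffCodes.get(c));
                }
                for (char c : right.chars.toCharArray()) {
                    huffCodes.put(c, "1" + huffCodes.get(c));
                }

                // Merge the two events into a new one and put it back on the queue.
                queue.add(new Event(left.chars + right.chars, left.prob + right.prob));
            }
            return huffCodes;
        }
    }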

double entropy(Map<Character, Double> charProbs)

This method should return the entropy, as a double, for the provided probability distribution.

double informationRate(Map<Character, Double> charProbs, Map<Character, String> huffTable)

This method should return the information rate, as a double, for the provided probability distribution and Huffman code.

Turning in the Program

You will upload your program to your account on EdLab.
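Returning to the entropy method described above: as a final rough sketch (one possible organization, not the required implementation), the entropy formula from earlier can be applied directly. Note that Java's Math.log is the natural logarithm, so a change of base is needed to obtain log base 2.

    import java.util.Map;

    // Sketch: entropy H = sum over events of P(A_i) * log_2(1 / P(A_i)).
    // Math.log is the natural log, so divide by Math.log(2) to get log base 2.
    public class EntropySketch {
        static double entropy(Map<Character, Double> charProbs) {
            double h = 0.0;
            for (double p : charProbs.values()) {
                h += p * (Math.log(1.0 / p) / Math.log(2.0));
            }
            return h;  // about 2.34 bits for the example distribution
        }
    }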