CS201 Discussion 12 HUFFMAN + ERDOSNUMBERS


Huffman Compression Today, we'll walk through an example of Huffman compression. In Huffman compression, there are three steps: 1. Create a Huffman tree. 2. Get all the encodings from our Huffman tree. 3. Encode the text. The text example we'll work through is "go go gophers". Then, we'll have you practice your own example.

Creating the Huffman tree First, find the frequency of each character in the text we're trying to encode: go go gophers. Character Frequency: E 1, G 3, H 1, O 3, P 1, R 1, S 1, _ (space) 2

Creating the Huffman tree Then, take the frequency-character pairs and turn them into nodes: E:1, G:3, H:1, O:3, P:1, R:1, S:1, _:2

Creating the Huffman tree Then, repeat the following process: take the two smallest-weighted nodes and combine them as follows. Assign them a parent node that has weight equal to the sum of their weights. Have the one with smaller weight be the left child. (For the sake of making sure we always generate the same tree in the case of a tie, pick whatever comes earliest in the alphabet.) Here, E:1 and H:1 are combined under a parent of weight 2, leaving G:3, O:3, P:1, R:1, S:1, _:2, (E H):2.

Creating the Huffman tree Once we've combined these two nodes, we repeat the process, with this new node as one of our nodes. Next, P:1 and R:1 are combined under a parent of weight 2, leaving G:3, O:3, S:1, _:2, (E H):2, (P R):2.

Creating the Huffman tree Next, S:1 and _:2 are combined under a parent of weight 3, leaving G:3, O:3, (E H):2, (P R):2, (S _):3.

Creating the Huffman tree Next, (E H):2 and (P R):2 are combined under a parent of weight 4, leaving G:3, O:3, (S _):3, (E H P R):4.

Creating the Huffman tree Next, G:3 and O:3 are combined under a parent of weight 6, leaving (S _):3, (E H P R):4, (G O):6.

Creating the Huffman tree Next, (S _):3 and (E H P R):4 are combined under a parent of weight 7, leaving (G O):6 and (S _ E H P R):7.

Creating the Huffman tree Finally, the (G O):6 node and the (S _ E H P R):7 node are combined under a root of weight 13. At this point, we've connected all our nodes, so we've created our Huffman tree! Now, your turn to practice: try creating a Huffman tree for "abracadabra". Feel free to double-check with a partner, but keep in mind there are multiple valid trees!
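The combining loop above can be sketched in Java with a priority queue. This is an illustrative sketch only, not the assignment's required structure: the Node class and buildTree name are invented for this example, and the slide's alphabetical tie-break rule is noted but not implemented.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Hypothetical sketch of the node-combining process from the slides.
public class HuffmanBuild {
    static class Node {
        int weight;
        char ch;          // only meaningful for leaves
        Node left, right;
        Node(int w, char c) { weight = w; ch = c; }
        Node(Node l, Node r) { weight = l.weight + r.weight; left = l; right = r; }
    }

    static Node buildTree(String text) {
        // Step 1: count frequencies.
        Map<Character, Integer> freq = new TreeMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);
        // Step 2: one node per character, ordered by weight. (A real solution
        // would add a tie-break field; PriorityQueue alone doesn't guarantee
        // the alphabetical rule from the slide.)
        PriorityQueue<Node> pq = new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet())
            pq.add(new Node(e.getValue(), e.getKey()));
        // Step 3: repeatedly combine the two smallest-weighted nodes.
        while (pq.size() > 1) {
            Node smaller = pq.poll();   // smaller weight becomes the left child
            Node larger = pq.poll();
            pq.add(new Node(smaller, larger));
        }
        return pq.poll();
    }

    public static void main(String[] args) {
        Node root = buildTree("go go gophers");
        System.out.println(root.weight);  // root weight = total characters = 13
    }
}
```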

Huffman Tree For abracadabra o a: 5 o b: 2 o c: 1 o d: 1 o r: 2 http://www.mathcs.emory.edu/~cheung/courses//syllabus/compression/huffman.html

Finding the encoding To find the encodings: the path from the root of this tree to any character is the encoding, if we replace left with 0 and right with 1. (Note, at this point we don't need the frequencies anymore.) This gives G 00, O 01, S 100, _ 101, E 1100, H 1101, P 1110, R 1111.

Finding the encoding From this tree, we get the following table of encodings. Note that characters with higher frequencies have shorter encodings. Now, find the encodings from your tree for "abracadabra"; this property should also be true. Character Encoding: E 1100, G 00, H 1101, O 01, P 1110, R 1111, S 100, _ (space) 101
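The path-to-leaf rule above can be written as a short recursive walk. The Node class and method names are invented for this sketch; the hand-built tree is the final "go go gophers" tree from these slides.

```java
import java.util.Map;
import java.util.TreeMap;

// Reading the code table off a tree: the root-to-leaf path, with left = 0 and
// right = 1, is the character's encoding.
public class HuffmanCodes {
    static class Node {
        char ch;
        Node left, right;
        Node(char c) { ch = c; }
        Node(Node l, Node r) { left = l; right = r; }
    }

    static void collect(Node n, String path, Map<Character, String> out) {
        if (n.left == null && n.right == null) {  // leaf: record its path
            out.put(n.ch, path);
            return;
        }
        collect(n.left, path + "0", out);
        collect(n.right, path + "1", out);
    }

    static Map<Character, String> goGophersCodes() {
        Node root = new Node(
            new Node(new Node('g'), new Node('o')),
            new Node(new Node(new Node('s'), new Node(' ')),
                     new Node(new Node(new Node('e'), new Node('h')),
                              new Node(new Node('p'), new Node('r')))));
        Map<Character, String> codes = new TreeMap<>();
        collect(root, "", codes);
        return codes;
    }

    public static void main(String[] args) {
        Map<Character, String> codes = goGophersCodes();
        System.out.println(codes.get('g'));  // 00
        System.out.println(codes.get('e'));  // 1100
    }
}
```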

Encoding the text Now for the simplest part: to actually encode the text into a bitstream, just replace each character with its encoding: go go gophers becomes 00 01 101 00 01 101 00 01 1110 1101 1100 1111 100. Character Encoding: E 1100, G 00, H 1101, O 01, P 1110, R 1111, S 100, _ (space) 101
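The substitution step is a few lines of Java. The code table below is the one from the "go go gophers" tree in these slides; the class and method names are invented for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Encoding is a straight character-by-character substitution.
public class HuffmanEncode {
    static String encode(String text, Map<Character, String> codes) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray())
            bits.append(codes.get(c));  // replace each character with its code
        return bits.toString();
    }

    static String demo() {
        Map<Character, String> codes = new HashMap<>();
        codes.put('g', "00");   codes.put('o', "01");   codes.put(' ', "101");
        codes.put('s', "100");  codes.put('e', "1100"); codes.put('h', "1101");
        codes.put('p', "1110"); codes.put('r', "1111");
        return encode("go go gophers", codes);
    }

    public static void main(String[] args) {
        System.out.println(demo().length());  // 37 bits, versus 13 * 8 = 104 in ASCII
    }
}
```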

Encoding abracadabra a b r a c a d a b r a With one valid set of encodings (a 0, r 10, b 110, c 1110, d 1111), this becomes 0 110 10 0 1110 0 1111 0 110 10 0. (Remember, there are multiple valid trees, so your encodings may differ.) http://www.mathcs.emory.edu/~cheung/courses//syllabus/compression/huffman.html

Compression visualized "go go gophers" compressed (37 bits): 0001101000110100011110110111001111100 "go go gophers" uncompressed (8-bit ASCII, 104 bits): 01100111 01101111 00100000 01100111 01101111 00100000 01100111 01101111 01110000 01101000 01100101 01110010 01110011 "abracadabra" compressed with one valid encoding (23 bits): 01101001110011110110100 "abracadabra" uncompressed (8-bit ASCII, 88 bits): 01100001 01100010 01110010 01100001 01100011 01100001 01100100 01100001 01100010 01110010 01100001

HuffmanDecoding Review To start thinking about the decoding process for Huffman, we'll look at the HuffmanDecoding APT. Note that in HuffmanDecoding the encodings are given to us. In the actual assignment, you'll have to find the encodings by recreating the tree, using what we'll call a header, but we'll delay thinking about that for now.

HuffmanDecoding The encodings for Huffman have the prefix property: no encoding is a prefix of another encoding. So, we know that if 01 is an encoding, 011 can't be an encoding, or vice versa. The APT also guarantees this as part of the input. Our Huffman tree creation process ensures this, because all the nodes representing characters are leaves, and by definition one leaf can't be on the path to another.

HuffmanDecoding pseudocode
Create a map from Strings to Characters
For i from 0 to dictionary.length - 1:
    Map the i-th encoding to 'A' + i
currentEncoding = ""
output = ""
For j from 0 to archive.length - 1:
    currentEncoding += archive.charAt(j)
    If map has currentEncoding as a key:
        output += map.get(currentEncoding)
        currentEncoding = ""
Return output
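The pseudocode above translates almost line for line into Java. This is a sketch; the class name is invented, and the toy dictionary in main is made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Grow currentEncoding one bit at a time and emit a character whenever it
// matches a dictionary entry. As in the APT, the i-th encoding stands for
// the character 'A' + i.
public class HuffmanDecoding {
    static String decode(String archive, String[] dictionary) {
        Map<String, Character> map = new HashMap<>();
        for (int i = 0; i < dictionary.length; i++)
            map.put(dictionary[i], (char) ('A' + i));
        StringBuilder output = new StringBuilder();
        String currentEncoding = "";
        for (int j = 0; j < archive.length(); j++) {
            currentEncoding += archive.charAt(j);
            if (map.containsKey(currentEncoding)) {
                output.append(map.get(currentEncoding));
                currentEncoding = "";
            }
        }
        return output.toString();
    }

    public static void main(String[] args) {
        // A = "0", B = "10", C = "11": a tiny prefix-free code, made up here
        System.out.println(decode("010110", new String[]{"0", "10", "11"}));  // ABCA
    }
}
```

The prefix property is what makes the greedy "emit as soon as we see a match" approach correct: no shorter match can steal the front of a longer code.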

Huffman decoding with a tree Suppose we did not have a map of encodings to characters, but instead a Huffman tree. (Again, we'll ignore how we actually get the tree from the encoded information for now.) Then, rather than search in our map, we could instead start at the root, and move left or right on our Huffman tree based on what the next bit of our archive is. If we reach a leaf node, we add the character it represents to our output, and return to the root.
current = root
For j from 0 to archive.length - 1:
    Go left or right based on archive.charAt(j)
    If current is a leaf node:
        output += current's character value
        current = root
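The tree-walk version can be sketched like this. The Node class is invented for the sketch, and the toy tree in demo matches the A/B/C example codes (A = 0, B = 10, C = 11).

```java
// The same decoding done with a tree walk instead of a map lookup: each bit
// moves us down the tree, and hitting a leaf emits a character and restarts
// at the root.
public class HuffmanTreeDecoding {
    static class Node {
        char ch;
        Node left, right;
        Node(char c) { ch = c; }
        Node(Node l, Node r) { left = l; right = r; }
        boolean isLeaf() { return left == null && right == null; }
    }

    static String decode(String archive, Node root) {
        StringBuilder output = new StringBuilder();
        Node current = root;
        for (int j = 0; j < archive.length(); j++) {
            current = (archive.charAt(j) == '0') ? current.left : current.right;
            if (current.isLeaf()) {
                output.append(current.ch);
                current = root;  // next character starts from the top again
            }
        }
        return output.toString();
    }

    static String demo() {
        // Tree for A = "0", B = "10", C = "11" (same toy code as before)
        Node root = new Node(new Node('A'), new Node(new Node('B'), new Node('C')));
        return decode("010110", root);
    }

    public static void main(String[] args) {
        System.out.println(demo());  // ABCA
    }
}
```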

ErdosNumbers: The Problem http://www.cs.duke.edu/csed/newapt/erdosnumbers.html

Outline Today in discussion, we'll go over the general problem-solving strategy from the first discussion, tailored to graph problems: Work through an example. Figure out how we worked through that example. Figure out how to turn the input into a graph. Figure out how to solve the problem given the graph. Write code to turn the input into a graph in Java. Write code to solve the problem in Java.

Examples
{"KLEITMAN LANDER", "ERDOS KLEITMAN"} gives {"ERDOS 0", "KLEITMAN 1", "LANDER 2"}
{"ERDOS A", "A B", "B AA C"} gives {"A 1", "AA 3", "B 2", "C 3", "ERDOS 0"}
{"ERDOS B", "A B C", "B A E", "D F"} gives {"A 2", "B 1", "C 2", "D", "E 2", "ERDOS 0", "F"}
Try working through the following two by hand:
{"ERDOS A", "A B C", "B D E", "C F G"}?
{"ERDOS A B", "A B C", "B C D", "E F"}?

Examples
{"KLEITMAN LANDER", "ERDOS KLEITMAN"} gives {"ERDOS 0", "KLEITMAN 1", "LANDER 2"}
{"ERDOS A", "A B", "B AA C"} gives {"A 1", "AA 3", "B 2", "C 3", "ERDOS 0"}
{"ERDOS B", "A B C", "B A E", "D F"} gives {"A 2", "B 1", "C 2", "D", "E 2", "ERDOS 0", "F"}
And the two to work through by hand:
{"ERDOS A", "A B C", "B D E", "C F G"} gives {"A 1", "B 2", "C 2", "D 3", "E 3", "ERDOS 0", "F 3", "G 3"}
{"ERDOS A B", "A B C", "B C D", "E F"} gives {"A 1", "B 1", "C 2", "D 2", "E", "ERDOS 0", "F"}

One possible (general) strategy Start with a table of Erdos numbers, with ERDOS = 0. Then, at each step, simultaneously update as many as possible until you reach a step where no updates can be made. E.g. for {"ERDOS A B", "A B C", "B C D", "E F"}:
Step 1: ERDOS 0 (every other author unknown)
Step 2: A 1, B 1, ERDOS 0
Step 3: A 1, B 1, C 2, D 2, ERDOS 0 (E and F are never updated, since they never co-publish with anyone connected to ERDOS)
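The table-updating strategy above can be written out directly: keep sweeping over the papers until a full sweep changes nothing. This is a sketch of the naive approach, not the recommended solution; the class and variable names are invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Repeated-sweep version of the table strategy.
public class ErdosTable {
    static Map<String, Integer> numbers(String[] papers) {
        Map<String, Integer> num = new HashMap<>();
        num.put("ERDOS", 0);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String paper : papers) {
                String[] authors = paper.split(" ");
                // Smallest known Erdos number among this paper's authors.
                int best = Integer.MAX_VALUE;
                for (String a : authors)
                    if (num.containsKey(a)) best = Math.min(best, num.get(a));
                if (best == Integer.MAX_VALUE) continue;  // nobody reachable yet
                for (String a : authors) {
                    Integer cur = num.get(a);
                    if (cur == null || cur > best + 1) {
                        num.put(a, best + 1);
                        changed = true;
                    }
                }
            }
        }
        return num;  // authors absent from the map are unreachable
    }

    public static void main(String[] args) {
        Map<String, Integer> n = numbers(new String[]{"ERDOS A B", "A B C", "B C D", "E F"});
        System.out.println(n.get("A") + " " + n.get("C"));  // 1 2
    }
}
```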

Implementation You could implement the strategy on the previous slide using a double for loop, and it would solve the APT correctly. However, this would run in O(n²), which might time out the tester (and also defeats the purpose of the exercise, which is graph practice). Instead, we'll talk about implementing this strategy by turning the input into a graph.

Turning ErdosNumbers into a graph Given our input for the ErdosNumbers APT (a series of co-publishers of some number of papers), what should our vertices represent? What should our edges represent? I.e., given a set of co-publishers, how can we determine the set of edges we want to use? (This is a good step-by-step approach for turning any problem into a graph problem.) Try drawing the graph for the following set of co-publishers. Also figure out what the adjacency list would look like. {"A", "A B", "AA BA", "BA M", "M B", "B BA", "ERDOS BA"}

You should have gotten, for {"A", "A B", "AA BA", "BA M", "M B", "B BA", "ERDOS BA"}:
Vertex: Neighbors
A: B
AA: BA
B: A, BA, M
BA: AA, B, ERDOS, M
ERDOS: BA
M: B, BA
Now, using the graph, can you quickly find the Erdos numbers?

Breadth-first search Because each vertex's Erdos number is just the minimum distance from the ERDOS vertex to that vertex, and breadth-first search (BFS) finds these distances efficiently, we'll implement our solution using BFS. Recall the following pseudo-code for BFS:
Create a queue of vertices Q
Q.push(start vertex)
While Q is not empty:
    curr = Q.pop
    if we've visited curr, continue
    mark curr as visited
    do something at curr
    push all of curr's neighbors into Q
For this problem, what is the start vertex? What is the "do something" we will do at curr?

Marking nodes as visited This is our customized version of BFS for the ErdosNumbers problem. All we have left to do is find a way to define which vertices are visited.
dist = map of strings to ints
dist["ERDOS"] = 0
Create a queue of vertices Q
Q.push("ERDOS")
While Q is not empty:
    curr = Q.pop
    if we've visited curr, continue
    mark curr as visited
    for each of curr's neighbors:
        if neighbor is not a key in dist:
            dist[neighbor] = dist[curr] + 1
    push all of curr's neighbors into Q

Finalized pseudo-code Once we have constructed the graph, the following pseudo-code will correctly determine all distances:
dist = map of strings to ints
dist["ERDOS"] = 0
Create a queue of vertices Q
Create a set of visited vertices S
Q.push("ERDOS")
While Q is not empty:
    curr = Q.pop
    if curr in S: continue
    S.add(curr)
    for each of curr's neighbors:
        if neighbor is not a key in dist:
            dist[neighbor] = dist[curr] + 1
    push all of curr's neighbors into Q
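The finalized pseudo-code above maps directly onto Java's standard collections. This sketch assumes the graph is stored as an adjacency list (a map from each vertex to its set of neighbors); the class and method names are invented for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// BFS from "ERDOS", recording each vertex's distance the first time it is seen.
public class ErdosBfs {
    static Map<String, Integer> distances(Map<String, Set<String>> graph) {
        Map<String, Integer> dist = new HashMap<>();
        dist.put("ERDOS", 0);
        Set<String> visited = new HashSet<>();
        Queue<String> q = new LinkedList<>();
        q.add("ERDOS");
        while (!q.isEmpty()) {
            String curr = q.remove();
            if (visited.contains(curr)) continue;
            visited.add(curr);
            for (String neighbor : graph.getOrDefault(curr, Set.of())) {
                if (!dist.containsKey(neighbor))
                    dist.put(neighbor, dist.get(curr) + 1);  // first visit = shortest distance
                q.add(neighbor);
            }
        }
        return dist;  // unreachable vertices simply never appear as keys
    }

    static Map<String, Integer> demo() {
        // ERDOS - A - B, a three-vertex path, made up for this example
        Map<String, Set<String>> g = new HashMap<>();
        g.put("ERDOS", Set.of("A"));
        g.put("A", Set.of("ERDOS", "B"));
        g.put("B", Set.of("A"));
        return distances(g);
    }

    public static void main(String[] args) {
        System.out.println(demo().get("B"));  // 2
    }
}
```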

Implementing in Java Now that we have specified how to do everything (short of turning our distance map into the final output, which should be relatively simple) in a generic sense, let us specify how to do it within Java. All the data structures we use in BFS on the previous slide (sets, queues, maps) are well-defined within Java, except for the graph itself. Graphs are most commonly stored as adjacency lists: a data structure which, when passed a vertex, returns a list of neighboring vertices. What would be the best way to represent an adjacency list in Java?

Constructing the graph Our code for constructing the graph will probably look like this:
HashMap<String, HashSet<String>> myGraph = new HashMap<String, HashSet<String>>();
for (String publication : pubs) {
    String[] authors = publication.split(" ");  // split pubs into authors
    for (String author : authors) {
        if (!myGraph.containsKey(author))
            myGraph.put(author, new HashSet<String>());
        // add each other author in authors to myGraph.get(author)
        for (String other : authors)
            if (!other.equals(author))
                myGraph.get(author).add(other);
    }
}
We can then quickly access all of curr's neighbors using myGraph.get(curr). Because we use a HashSet to keep track of neighbors, we will never have duplicate edges in our graph. However, we might sometimes have two authors who co-published multiple papers, which could correspond to duplicate edges. Why would this not affect the correctness of our solution?

A brief note on efficiency In a graph with n vertices and m edges, BFS runs in O(n + m), a substantial improvement on the O(n²) result we would get using a double for loop. In the worst case, where every vertex is connected to every other vertex, m is O(n²); but when there are fewer edges, O(n + m) is a strictly better bound than O(n²).

Finish the APT! At this point we've discussed everything you should need to know to finish the APT. Spend the rest of discussion writing your solution.