CSE 421 Greedy: Huffman Codes


Yin Tat Lee

Compression Example

100k-character file, 6-letter alphabet: a 45%, b 13%, c 12%, d 16%, e 9%, f 5%.

File size:
  ASCII, 8 bits/char: 800k bits
  2^3 > 6, so 3 bits/char suffice: 300k bits

Why compress? Storage, transmission.

Compression Example

100k-character file: a 45%, b 13%, c 12%, d 16%, e 9%, f 5%.

File size:
  ASCII, 8 bits/char: 800k bits
  3 bits/char: 300k bits
  Better: 2.52 bits/char on average (74%*2 + 26%*4): 252k bits, e.g.:
    a 00   b 01   d 10   c 1100   e 1101   f 1110
  Optimal?

Why not a 00, b 01, d 10, c 110, e 1101, f 1110? Because decoding becomes ambiguous: 1101110 could split as 110|1110 = cf or as 1101|110 = ec.
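As a quick sanity check (an illustrative Python snippet, not from the slides), the average code length and total file size work out as claimed:

    # Average length of the variable-length code above.
    freq = {'a': 0.45, 'b': 0.13, 'c': 0.12, 'd': 0.16, 'e': 0.09, 'f': 0.05}
    code = {'a': '00', 'b': '01', 'd': '10', 'c': '1100', 'e': '1101', 'f': '1110'}

    avg = sum(freq[ch] * len(code[ch]) for ch in freq)
    print(avg)            # 2.52 bits/char
    print(avg * 100_000)  # 252000.0 bits for the 100k-char file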

Data Compression

Binary character code ("code"): each k-bit source character maps to a unique code word (e.g. k = 8). Compression algorithm: concatenate the code words of successive k-bit characters of the source.

Fixed/variable-length codes: are all code words required to have equal length?

Prefix codes: no code word is a prefix of another (which guarantees unique decoding).
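To see why the prefix property gives unique decoding, here is a small decoder sketch (my own illustration, not from the slides): scan bits left to right; the first code word matched must be the right one, since no code word extends another.

    # Decoding a prefix code by scanning bits.
    def decode(bits, code):
        inverse = {word: ch for ch, word in code.items()}
        out, cur = [], ''
        for b in bits:
            cur += b
            if cur in inverse:   # safe: no code word is a prefix of another
                out.append(inverse[cur])
                cur = ''
        return ''.join(out)

    code = {'a': '00', 'b': '01', 'd': '10', 'c': '1100', 'e': '1101', 'f': '1110'}
    print(decode('11001110', code))  # 'cf', with no ambiguity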

Prefix Codes = Trees

A prefix code is exactly a binary tree with the characters at its leaves: label each left edge 0 and each right edge 1; a character's code word is the label sequence on its root-to-leaf path. [Figure: two such code trees for a 45%, b 13%, c 12%, d 16%, e 9%, f 5%, internal nodes labeled with subtree frequencies.]

Greedy Idea #1

Put the most frequent letter under the root, then recurse. [Figure: tree under construction, root weight 100.]

Greedy Idea #1

Top down: put the most frequent letter under the root, then recurse on the rest.

Too greedy: it produces a completely unbalanced tree.
  .45*1 + .16*2 + .13*3 + ... = 2.34 bits/char
Not too bad here, but imagine if all frequencies were ~1/6: (1+2+3+4+5+5)/6 = 3.33 bits/char, worse than the fixed 3-bit code. [Figure: unbalanced tree with a at depth 1, d at depth 2, b at depth 3, and so on.]

Greedy Idea #2

Top down: divide the letters into 2 groups with ~50% of the weight in each; recurse (the Shannon-Fano code). Here the groups are {a, f} and {b, c, d, e}, 50% each.

Again, not terrible: 2*.5 + 3*.5 = 2.5 bits/char. But this tree can easily be improved! (How?) [Figure: a and f at depth 2; b, c, d, e at depth 3.]
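A minimal sketch of this top-down splitting idea (illustrative Python; the split rule below, cutting at the first point where the running weight reaches half, is one reasonable choice and not necessarily the slide's, so it can yield a different tree than the figure):

    # Shannon-Fano sketch. items: (char, freq) pairs sorted by descending
    # frequency; assumes at least 2 letters overall.
    def shannon_fano(items):
        if len(items) == 1:
            return {items[0][0]: ''}
        total, running, split = sum(f for _, f in items), 0, len(items) - 1
        for i, (_, f) in enumerate(items):
            running += f
            if 2 * running >= total:                # weight has reached ~half
                split = min(i + 1, len(items) - 1)  # keep both sides nonempty
                break
        left = shannon_fano(items[:split])
        right = shannon_fano(items[split:])
        code = {c: '0' + w for c, w in left.items()}
        code.update({c: '1' + w for c, w in right.items()})
        return code

    freqs = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
    items = sorted(freqs.items(), key=lambda kv: -kv[1])
    print(shannon_fano(items))

On this input the rule above happens to split {a, d} from {b, c, e, f}, giving 2.39 bits/char, slightly better than the 2.5 of the 50/50 grouping in the figure but still not optimal.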

Greedy Idea #3

Bottom up: group the least frequent letters near the bottom. [Figure: e:9 and f:5 merged first under a node of weight 14.]

[Figure: bottom-up construction, steps (a)-(f): (a) start with f:5, e:9, c:12, b:13, d:16, a:45; (b) merge f and e into 14; (c) merge c and b into 25; (d) merge 14 and d into 30; (e) merge 25 and 30 into 55; (f) merge 55 and a into the root, 100.]

Resulting cost: .45*1 + .41*3 + .14*4 = 2.24 bits per char.

Huffman's Algorithm (1952)

Algorithm:
  insert each letter into a priority queue, keyed by frequency
  while queue length > 1 do
    remove the smallest 2; call them x, y
    make a new node z with children x, y and f(z) = f(x) + f(y)
    insert z into the queue

Analysis: O(n log n).

Goal: minimize cost(T) = Σ_{c ∈ C} freq(c) · depth(c), where T is the tree and C is the alphabet (the leaves).

Correctness: ???
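A direct, runnable rendering of this pseudocode (a sketch in Python; heapq plays the priority queue, and the tuple-based tree and tie-break counter are my own choices):

    import heapq

    def huffman(freqs):
        # freqs: {char: frequency}; returns {char: code word}.
        # Heap entries are (frequency, tie-break, node); a node is either a
        # character (leaf) or a (left, right) pair (internal node).
        # Sketch assumes at least 2 symbols.
        heap = [(f, i, c) for i, (c, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            fx, _, x = heapq.heappop(heap)                # remove smallest 2
            fy, _, y = heapq.heappop(heap)
            heapq.heappush(heap, (fx + fy, tie, (x, y)))  # f(z) = f(x) + f(y)
            tie += 1
        code = {}
        def walk(node, prefix):
            if isinstance(node, tuple):
                walk(node[0], prefix + '0')
                walk(node[1], prefix + '1')
            else:
                code[node] = prefix
        walk(heap[0][2], '')
        return code

    freqs = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
    code = huffman(freqs)
    print(code)
    print(sum(freqs[c] * len(w) for c, w in code.items()) / 100)  # 2.24 bits/char

The tie-break counter only keeps Python from comparing unlike node types when frequencies collide; it has no effect on optimality.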

Correctness Strategy

The optimal solution may not be unique, so we cannot prove that greedy gives the only possible answer. Instead, we show that greedy's solution is as good as any.

How: an exchange argument. Identify inversions: pairs of nodes whose swap cannot increase the cost of the tree. [Figure: the tree from Idea #2.]

Defn: A pair of leaves x, y is an inversion if depth(x) ≥ depth(y) and freq(x) ≥ freq(y).

Claim: If we flip an inversion, the cost never increases.

Why? All other things being equal, it is better to give the more frequent letter the shorter code.

  cost before: d_x f_x + d_y f_y
  cost after:  d_x f_y + d_y f_x
  difference:  (d_x f_x + d_y f_y) - (d_x f_y + d_y f_x) = (d_x - d_y)(f_x - f_y) ≥ 0

I.e., non-negative cost savings.
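A tiny numeric instance of the savings formula (values chosen by me for illustration):

    # Swapping an inversion: x is deeper and more frequent than y.
    dx, dy = 4, 2          # depth(x) >= depth(y)
    fx, fy = 0.16, 0.05    # freq(x) >= freq(y)
    before = dx * fx + dy * fy
    after  = dx * fy + dy * fx
    print(before - after)         # 0.22
    print((dx - dy) * (fx - fy))  # 0.22, matching the formula; always >= 0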

General Inversions

Define the frequency of an internal node to be the sum of the frequencies of the leaves in its subtree. We can then generalize the definition of inversion to any pair of nodes, and the associated claim still holds: exchanging an inverted pair of nodes (and their associated subtrees) cannot raise the cost of the tree. Proof: same as before. [Figure: the same example tree, internal nodes labeled with frequencies.]

Correctness Strategy

Lemma: Any prefix code tree T can be converted to a Huffman tree H via inversion exchanges.

Corollary: The Huffman tree is optimal.

Proof: Apply the lemma to any optimal tree T = T_1. The lemma only exchanges inversions, which never increase cost. So cost(T_1) ≥ cost(T_2) ≥ ... ≥ cost(H).

[Figure: the Huffman construction of H, steps (a)-(d), shown beside an arbitrary prefix tree T and the tree T' obtained from T by one inversion exchange; repeated exchanges turn T into H.]

Lemma: prefix tree T → Huffman tree H via inversions

Induction hypothesis: At the k-th iteration of Huffman, every node in the queue is (the root of) a subtree of T (after inversions).

Base case: All nodes in the queue are leaves of T.

Inductive step: Huffman extracts A and B from the queue.

Case 1: A and B are siblings in T. Then their newly created parent node in H corresponds to their parent in T. (This uses the induction hypothesis.)

Lemma: prefix tree T → Huffman tree H via inversions (cont.)

Induction hypothesis: At the k-th iteration of Huffman, every node in the queue is (the root of) a subtree of T (after inversions).

Case 2: A and B are not siblings in T. WLOG depth(A) ≥ depth(B) in T, and let C be A's sibling. Note that B cannot overlap C:
  If B = C, we are in Case 1.
  If B were a subtree of C, then depth(B) > depth(A), contradicting depth(A) ≥ depth(B).
  If C were a subtree of B, then A and B would overlap, but they are disjoint subtrees in the queue.
Now depth(C) = depth(A) ≥ depth(B), and freq(C) ≥ freq(B) because Huffman picks the 2 minimum-frequency nodes. So (C, B) is an inversion. Swapping C and B yields a tree T' with cost(T') ≤ cost(T) in which A and B are siblings, so the induction hypothesis is maintained. [Figure: subtrees A, C, B before the swap; A, B, C after.]

Data Compression

Huffman is optimal, BUT we still might do better!
  Huffman encodes fixed-length blocks. What if we vary them?
  Huffman uses one encoding throughout a file. What if the characteristics change?
  What if the data has structure? E.g. raster images, video, ...
  Huffman is lossless. Is that necessary?
LZ77, JPG, MPEG, ...

The rest are just for fun. (I learnt compression this morning.)

Adaptive Huffman Coding

Often, data comes from a stream, so it is difficult to know the frequencies at the beginning. There are multiple ways to update the Huffman tree on the fly.

FGK (Faller-Gallager-Knuth):
  A special external node, the 0-node, identifies newly arriving characters.
  Maintain the tree in sorted order (by frequency).
  When a frequency increases by 1, it can create inverted pairs; in that case, swap nodes, subtrees, or both. (Exercise: figure out exactly how!)

LZ77

  a a c a a c a b c a b a b a c
  [dictionary (previously coded)] [cursor] [lookahead buffer]

The dictionary and buffer windows are of fixed length and slide with the cursor.

Repeat:
  Output (p, l, c), where
    p = position (relative to the cursor) of the longest match that starts in the dictionary
    l = length of the longest match
    c = next char in the buffer beyond the longest match
  Advance the window by l + 1.

Theory: LZ77 is optimal if the window size tends to infinity and the string is generated by a Markov chain. [WZ94]
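Here is a minimal encoder sketch of this scheme (my own illustrative Python; the brute-force match search and the convention p = 0 for "no match" are assumptions). Run on the example string from the next slide, it reproduces the triples shown there:

    # LZ77 encoder sketch: emit (p, l, c) triples as described above.
    def lz77_encode(s, dict_size=6, buf_size=4):
        out, cur = [], 0
        while cur < len(s):
            best_p, best_l = 0, 0
            for p in range(max(0, cur - dict_size), cur):  # match starts in dictionary
                l = 0
                # The match may run past the cursor into the buffer (overlap is fine);
                # stop so that the next char c always exists.
                while l < buf_size and cur + l < len(s) - 1 and s[p + l] == s[cur + l]:
                    l += 1
                if l > best_l:
                    best_p, best_l = cur - p, l            # p is relative to the cursor
            out.append((best_p, best_l, s[cur + best_l]))
            cur += best_l + 1                              # advance by l + 1
        return out

    print(lz77_encode("aacaacabcabaaac"))
    # [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]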

LZ77: Example

Dictionary size = 6, buffer size = 4. Input: a a c a a c a b c a b a a a c

  Step 1: cursor at char 1, no match in dictionary      → (_, 0, a)
  Step 2: cursor at char 2, match "a" at distance 1     → (1, 1, c)
  Step 3: cursor at char 4, match "aaca" at distance 3  → (3, 4, b)  (match extends past the cursor)
  Step 4: cursor at char 9, match "cab" at distance 3   → (3, 3, a)
  Step 5: cursor at char 13, match "aa" at distance 1   → (1, 2, c)

(In the original figure, each row highlights the dictionary, the cursor, the longest match, and the next character.)

gzip

1. Based on LZ77.
2. Adaptive Huffman codes the positions, lengths, and chars.
3. Non-greedy: possibly use a shorter match so that the next match is better.
4. ...

How to do even better? In general, compression is like prediction.

1. The entropy of English is 4.219 bits per letter (single-letter model).
2. A 3-letter model yields 2.77 bits per letter.
3. Human experiments suggest 0.6 to 1.3 bits per letter.

For example, a deep net can be used for prediction, compressing 1 GB of Wikipedia to about 160 MB (search "deepzip"; for comparison, gzip gives about 330 MB and Huffman 500-600 MB). We are reaching the limit of human compression! (8 bits × 0.16 ≈ 1.28 bits per symbol)

Ultimate Answer?

See Kolmogorov complexity.

Disclaimer: I don't recommend playing in such competitions unless you REALLY know what you are doing.