
Huffman Coding

Version of October 13, 2014

Outline

- Coding and Decoding
- The optimal source coding problem
- Huffman coding: a greedy algorithm
- Correctness

Encoding Example

Suppose we want to store a 100,000-character data file. The file contains only 6 characters, appearing with the following frequencies:

    Character:             a    b    c    d    e    f
    Frequency (in 1000s):  45   13   12   16   9    5

A binary code encodes each character as a binary string, or codeword, over some given alphabet Σ; a code is a set of codewords. For example, {000, 001, 010, 011, 100, 101} and {0, 101, 100, 111, 1101, 1100} are codes over the binary alphabet Σ = {0, 1}.

Encoding

Given a code C over some alphabet, it is easy to encode a message using C: just scan through the message, replacing each character by its codeword.

Example (message alphabet {a, b, c, d}):
- If the code is C1 = {a = 00, b = 01, c = 10, d = 11}, then bad is encoded as 01 00 11.
- If the code is C2 = {a = 0, b = 110, c = 10, d = 111}, then bad is encoded as 110 0 111.
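
The slides do not include an implementation, but this lookup-and-concatenate step is a one-liner. A minimal Python sketch (the dictionaries C1 and C2 and the name encode are mine, mirroring the codes above):

    # Minimal sketch: encode a message with a code table.
    C1 = {'a': '00', 'b': '01', 'c': '10', 'd': '11'}
    C2 = {'a': '0', 'b': '110', 'c': '10', 'd': '111'}

    def encode(message, code):
        """Replace each character by its codeword and concatenate."""
        return ''.join(code[ch] for ch in message)

    print(encode('bad', C1))  # 010011
    print(encode('bad', C2))  # 1100111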

Fixed-Length vs Variable-Length

In a fixed-length code every codeword has the same length; in a variable-length code, codewords may have different lengths.

Example:

    Character:             a    b    c    d    e     f
    Frequency (in 1000s):  45   13   12   16   9     5
    Fixed-length code:     000  001  010  011  100   101
    Variable-length code:  0    101  100  111  1101  1100

Note: since there are 6 characters, a fixed-length code must use at least 3 bits per codeword. The fixed-length code requires 300,000 bits to store the file. The variable-length code requires only (45·1 + 13·3 + 12·3 + 16·3 + 9·4 + 5·4)·1000 = 224,000 bits, saving a lot of space! The goal is to save space.
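
As a sanity check on the arithmetic, the two totals can be recomputed directly from the frequency table. A short sketch (the helper name total_bits is my own, not from the slides):

    # Recompute storage costs from the slide's frequency table (counts in 1000s).
    freq  = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
    fixed = {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'e': '100', 'f': '101'}
    var   = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}

    def total_bits(freq, code):
        """Sum of frequency * codeword length, scaled back up by 1000."""
        return 1000 * sum(f * len(code[ch]) for ch, f in freq.items())

    print(total_bits(freq, fixed))  # 300000
    print(total_bits(freq, var))    # 224000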

Decoding

C1 = {a = 00, b = 01, c = 10, d = 11}
C2 = {a = 0, b = 110, c = 10, d = 111}
C3 = {a = 1, b = 110, c = 10, d = 111}

Given an encoded message, decoding is the process of turning it back into the original message. A message is uniquely decodable if it can be decoded in only one way.

Example:
- Relative to C1, 010011 is uniquely decodable to bad.
- Relative to C2, 1100111 is uniquely decodable to bad.
- Relative to C3, 1101111 is not uniquely decodable: it could have encoded either bad or acad.

In fact, any message encoded using C1 or C2 is uniquely decodable. Unique decodability is needed in order for a code to be useful.
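
The ambiguity under C3 can be confirmed mechanically: bad and acad encode to the same bit string. A quick check in the same style as the encoding sketch above:

    # Two different messages collide under C3, so C3 is not uniquely decodable.
    C3 = {'a': '1', 'b': '110', 'c': '10', 'd': '111'}
    encode = lambda msg, code: ''.join(code[ch] for ch in msg)
    assert encode('bad', C3) == encode('acad', C3) == '1101111'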

Prefix-Codes

Fixed-length codes are always uniquely decodable. (Why? Codeword boundaries fall at fixed positions, every k bits.) We saw before that fixed-length codes do not always give the best compression, though, so we prefer to use variable-length codes.

Definition: A code is called a prefix (free) code if no codeword is a prefix of another one.

Example:
- {a = 0, b = 110, c = 01, d = 111} is not a prefix code (0 is a prefix of 01).
- {a = 0, b = 110, c = 10, d = 111} is a prefix code.

Prefix-Codes (cont'd)

Important Fact: Every message encoded by a prefix-free code is uniquely decodable.

Because no codeword is a prefix of any other, we can always find the first codeword in a message, peel it off, and continue decoding.

Example, with code {a = 0, b = 110, c = 10, d = 111}:

    01101100 = 0 | 110 | 110 | 0 = abba

We are therefore interested in finding good (best-compression) prefix-free codes.
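
The peel-off argument is itself a decoding algorithm. A minimal sketch, assuming a prefix-free code and a valid encoding (the name decode and the buffer-based scan are my own choices):

    # Decode under a prefix-free code by repeatedly stripping the unique
    # codeword that begins the remaining bit string.
    code = {'a': '0', 'b': '110', 'c': '10', 'd': '111'}

    def decode(bits, code):
        inverse = {cw: ch for ch, cw in code.items()}
        out, buf = [], ''
        for bit in bits:
            buf += bit
            if buf in inverse:          # prefix-freeness: at most one match
                out.append(inverse[buf])
                buf = ''
        assert buf == '', 'input was not a valid encoding'
        return ''.join(out)

    print(decode('01101100', code))  # abba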


The Optimal Source Coding Problem

Huffman Coding Problem: Given an alphabet A = {a_1, ..., a_n} with frequency distribution f(a_i), find a binary prefix code C for A that minimizes the number of bits

    B(C) = \sum_{i=1}^{n} f(a_i) L(c_i)

needed to encode a message of \sum_{i=1}^{n} f(a_i) characters, where c_i is the codeword for encoding a_i, and L(c_i) is the length of the codeword c_i.

Example Problem

Suppose we want to store messages made up of 4 characters a, b, c, d with frequencies 60, 5, 30, 5 (percent) respectively. What are the fixed-length code and the prefix-free code that use the least space?

Solution:

    Character:          a   b    c   d
    Frequency:          60  5    30  5
    Fixed-length code:  00  01   10  11
    Prefix code:        0   110  10  111

To store these 100 characters, (1) the fixed-length code requires 100·2 = 200 bits, while (2) the prefix code uses only 60·1 + 5·3 + 30·2 + 5·3 = 150 bits, a 25% saving.

Remark: We will see later that this is the optimal (lowest-cost) prefix code.

Correspondence between Binary Trees and Prefix Codes

There is a 1-1 correspondence between leaves and characters. The left edge out of each node is labeled 0; the right edge is labeled 1. The binary string on the path from the root to a leaf is the codeword associated with the character at that leaf.

[Figure: two code trees. Left: the complete tree of depth 2 for the fixed-length code a = 00, b = 01, c = 10, d = 11. Right: the skewed tree for the prefix code a = 0, c = 10, b = 110, d = 111.]
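
The tree-to-code direction is easy to make concrete. In this sketch the tree representation (nested tuples with characters at the leaves) is my own choice; the function reads each codeword off the 0/1 edge labels along the root-to-leaf path:

    # Derive a prefix code from a binary tree: a leaf is a character,
    # an internal node is a (left, right) pair; edges are labeled 0/1.
    tree = ('a', ('c', ('b', 'd')))  # the right-hand tree from the figure

    def codewords(node, prefix=''):
        if isinstance(node, str):        # leaf: path so far is the codeword
            return {node: prefix}
        left, right = node
        return {**codewords(left, prefix + '0'),
                **codewords(right, prefix + '1')}

    print(codewords(tree))  # {'a': '0', 'c': '10', 'b': '110', 'd': '111'}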

Minimum-Weight External Pathlength Problem

d_i, the depth of leaf a_i, is equal to L(c_i), the length of the codeword associated with that leaf. So B(T), the weighted external pathlength of tree T, is the same as B(C), the number of bits needed to encode a message with the corresponding code C:

    \sum_{i=1}^{n} f(a_i) d_i = \sum_{i=1}^{n} f(a_i) L(c_i)

Definition (Minimum-Weight External Pathlength Problem): Given weights f(a_1), ..., f(a_n), find a tree T with n leaves labeled a_1, ..., a_n that has minimum weighted external path length.

The Huffman encoding problem is equivalent to the minimum-weight external pathlength problem.


Huffman Coding

Let S be the original set of message characters.

1. (Greedy idea) Pick the two smallest-frequency characters x, y from S. Create a subtree that has these two characters as leaves, and label its root z.
2. Set frequency f(z) = f(x) + f(y). Remove x, y from S and add z to S: S' = S ∪ {z} − {x, y}. Note that |S| has just decreased by one.

Repeat this procedure, called a merge, with the new alphabet S', until only one character is left. The resulting tree is the Huffman code tree. It encodes the optimal (minimum-cost) prefix code for the given frequency distribution.

Example of Huffman Coding

Let S = {a/20, b/15, c/5, d/15, e/45} be the original character set with its corresponding frequency distribution.

1. The first Huffman coding step merges c and d (it could also have merged c and b), producing a node n1 with frequency 5 + 15 = 20. Now S = {a/20, b/15, n1/20, e/45}.

2. The algorithm merges a and b (it could also have merged n1 and b), producing n2 with frequency 35. Now S = {n2/35, n1/20, e/45}.

[Figure: after step 1, subtree n1/20 with children c/5 (edge 0) and d/15 (edge 1); after step 2, additionally subtree n2/35 with children a/20 (edge 0) and b/15 (edge 1).]

Example of Huffman Coding, Continued

3. The algorithm merges n1 and n2, producing n3 with frequency 20 + 35 = 55. Now S = {n3/55, e/45}.

4. The algorithm merges e and n3 into the root n4/100 and finishes.

[Figure: final tree n4/100 with children n3/55 (edge 0) and e/45 (edge 1); n3 has children n2/35 (edge 0) and n1/20 (edge 1); n2 has children a/20 and b/15; n1 has children c/5 and d/15.]

The Huffman code is: a = 000, b = 001, c = 010, d = 011, e = 1.

Huffman Coding Algorithm

Given character set S with frequency distribution {f(a) : a ∈ S}, the binary Huffman tree is constructed using a priority queue Q of nodes, with frequencies as keys.

    Huffman(S)
        n = |S|; Q = S                      // the future leaves
        for i = 1 to n − 1 do               // why n − 1? each merge removes one node
            z = new node
            left[z] = Extract-Min(Q)
            right[z] = Extract-Min(Q)
            f[z] = f[left[z]] + f[right[z]]
            Insert(Q, z)
        end
        return Extract-Min(Q)               // root of the tree

Running time is O(n log n), as each of the O(n) priority-queue operations takes time O(log n).
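
A runnable counterpart of this pseudocode, sketched with Python's heapq standing in for the priority queue (the tie-breaking counter is my addition, needed so that entries with equal frequencies never compare the nodes themselves):

    import heapq
    from itertools import count

    def huffman(freq):
        """Build a Huffman code from {char: frequency}; returns {char: codeword}."""
        tick = count()  # tie-breaker for equal frequencies
        q = [(f, next(tick), ch) for ch, f in freq.items()]  # the future leaves
        heapq.heapify(q)
        for _ in range(len(freq) - 1):       # n - 1 merges
            f1, _, left = heapq.heappop(q)   # Extract-Min
            f2, _, right = heapq.heappop(q)  # Extract-Min
            heapq.heappush(q, (f1 + f2, next(tick), (left, right)))
        _, _, root = q[0]
        code = {}
        def walk(node, prefix=''):
            if isinstance(node, str):
                code[node] = prefix or '0'   # degenerate one-character alphabet
            else:
                walk(node[0], prefix + '0')
                walk(node[1], prefix + '1')
        walk(root)
        return code

    print(huffman({'a': 20, 'b': 15, 'c': 5, 'd': 15, 'e': 45}))

With these tie-breaks the run merges c with b first, so it prints a different code than the slides derive ({'e': '0', 'c': '100', 'b': '101', 'd': '110', 'a': '111'}); both codes cost 210 per 100 characters, illustrating that optimal prefix codes need not be unique.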

Correctness

Lemma 1

Lemma (1): An optimal prefix code tree must be full, i.e., every internal node has exactly two children.

Proof: If some internal node had only one child, then we could simply get rid of this node and replace it with its unique child. This would decrease the total cost of the encoding.

Lemma 2

Lemma (2): Let T be a prefix code tree and T' the tree obtained by swapping two leaves x and b in T. If f(x) ≤ f(b) and d(x) ≤ d(b), then B(T') ≤ B(T); i.e., swapping a lower-frequency character downward in T does not increase T's cost.

[Figure: trees T and T'; in T the leaf x sits above the deeper leaf b, and T' is T with x and b exchanged.]

Lemma 2, Proof

[Figure: T and T' as on the previous slide.]

Proof:

    B(T') = B(T) − f(x)d(x) − f(b)d(b) + f(x)d(b) + f(b)d(x)
          = B(T) + (f(x) − f(b)) · (d(b) − d(x))
          ≤ B(T),

since f(x) − f(b) ≤ 0 and d(b) − d(x) ≥ 0.

Lemma 3

Lemma (3): Consider the two characters x and y with the smallest frequencies. There is an optimal code tree in which these two letters are sibling leaves at the deepest level of the tree.

Proof: Let T be any optimal prefix code tree, and let b and c be two siblings at the deepest level of the tree (such siblings must exist because T is full, by Lemma 1). Assume without loss of generality that f(x) ≤ f(b) and f(y) ≤ f(c). If necessary, swap x with b and then swap y with c. The proof follows from Lemma 2.

[Figure: three trees. T with x and y somewhere above the deepest siblings b, c; T' after swapping x with b; T'' after additionally swapping y with c, leaving x and y as deepest siblings.]

Lemma 4

Lemma (4): Let T be a prefix code tree and x, y two sibling leaves. Let T' be obtained from T by removing x and y, naming their parent z, and setting f(z) = f(x) + f(y). Then B(T) = B(T') + f(x) + f(y).

Proof:

    B(T) = B(T') − f(z)d(z) + f(x)(d(z) + 1) + f(y)(d(z) + 1)
         = B(T') − (f(x) + f(y))d(z) + (f(x) + f(y))(d(z) + 1)
         = B(T') + f(x) + f(y).

Huffman Codes are Optimal

Theorem: The prefix code tree (hence the code) produced by the Huffman coding algorithm is optimal.

Proof (by induction on n, the number of characters):

Base case n = 2: a tree with two leaves, which is obviously optimal.

Induction hypothesis: Huffman's algorithm produces an optimal tree in the case of n − 1 characters.

Induction step: Consider the case of n characters. Let H be the tree produced by Huffman's algorithm. We need to show that H is optimal.

Huffman Codes are Optimal, Continued

Induction step (cont'd): Following the way Huffman's algorithm works, let x, y be the two characters chosen by the algorithm in the first step; they are sibling leaves in H. Let H' be obtained from H by (i) removing x and y, (ii) labelling their parent z, and (iii) setting f(z) = f(x) + f(y).

H is a tree for S; H' is a tree for S' = S ∪ {z} − {x, y}. In fact, H' is the tree produced by Huffman's algorithm for S', so by the induction hypothesis, H' is optimal for S'. By Lemma 4, B(H) = B(H') + f(x) + f(y).

Huffman Codes are Optimal, Continued

Induction step (cont'd): By Lemma 3, there exists some optimal tree T for S in which x and y are sibling leaves. Let T' be obtained from T by removing x and y, naming their parent z, and setting f(z) = f(x) + f(y); T' is also a prefix code tree for S'. By Lemma 4, B(T) = B(T') + f(x) + f(y). Hence

    B(H) = B(H') + f(x) + f(y)
         ≤ B(T') + f(x) + f(y)    (H' is optimal for S')
         = B(T).

Therefore, H must be optimal!