Text Compression through Huffman Coding


Huffman codes are a very effective technique for compressing data: they usually produce savings between 20% and 90%.

Preliminary example

We are given a 100,000-character text (maybe a book, or a long report) and we want to store it on a computer hard disk. Information is stored on disks as sequences of zeroes and ones.

Simplest option: use (extended) ASCII codes. Encode each character in a two-byte code, like the one that Java uses. In this way the resulting file will be (forgetting about spaces and carriage returns) 200,000 bytes long!

Slightly more careful option: use a reduced-size code, covering only the characters that actually occur in the text (this will not save anything in the worst case).

More careful option: compute the character frequencies first, then associate shorter sequences of bits to characters that occur more frequently and longer sequences to characters that occur less frequently. [Slide figure: a table of character frequencies, data from http://www.anujseth.com/crypto/history.html, achieving average length 4.227.]

In general such codes, also known as variable-length codes, may give significant savings on the amount of space needed to store a given (very long) text.

Terminology

We want to define a code, i.e. a mapping from an alphabet (the characters of our text) to words, that is sequences or strings, over another alphabet (here the binary alphabet {0, 1}), that minimises the length of the encoded string.

We consider only codes in which no codeword is also a prefix of some other codeword. Such codes are called prefix codes. It is possible to show that the optimal data compression achievable by any code can always be achieved with a prefix code, so there is no loss of generality in restricting attention to prefix codes. An example prefix code:

symbol   codeword
a        0
b        10
c        110
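As a quick sanity check, here is a minimal Python sketch (my own illustration, not from the slides; the sample text and the helper name encode are assumptions) showing that the example code above already beats a fixed-length encoding on a text with skewed frequencies:

    # Encode with the example prefix code {a: 0, b: 10, c: 110} and compare
    # against a fixed-length 2-bits-per-character encoding.
    code = {"a": "0", "b": "10", "c": "110"}

    def encode(text, code):
        # Concatenate the codeword of each character.
        return "".join(code[ch] for ch in text)

    text = "aaaabbc"                 # 'a' most frequent, 'c' least frequent
    encoded = encode(text, code)
    print(encoded, len(encoded))     # 00001010110 -> 11 bits
    print("fixed-length:", 2 * len(text), "bits")  # 14 bits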

Encoding/Decoding

Prefix codes are desirable because of their simple encoding/decoding procedures.

Encoding: given the source text, simply concatenate the codewords representing each character. With the code above, abababa becomes 0 10 0 10 0 10 0 (spaces have been added to give a clearer description of the encoding; they are NOT part of the encoded text).

Decoding: since no codeword is a prefix of any other codeword, the codeword that begins an encoded file is unambiguous. To decode:
1. identify the initial codeword;
2. translate it back to the original character;
3. remove the codeword from the file;
4. repeat the decoding process on the remainder of the encoded file.
For example, if the encoded sequence is 1101010 we can decode it as c b b.

Data Structure

To be efficient, the decoding process needs a convenient representation of the prefix code, so that the initial codeword can easily be picked off. A binary tree whose leaves are the given characters provides one such representation: we interpret the binary codeword for a character as the path from the root of the tree to that character, where 0 means "go to the left child" and 1 means "go to the right child".

Property

Claim: an optimal code for a file is always represented by a tree in which every non-leaf node has exactly two children. (Informal argument: if one of the children is missing then, in some sense, we are losing the opportunity to use shorter codewords for some of the symbols in the alphabet.)

So, if the text alphabet C has |C| characters, then the tree for an optimal prefix code has exactly |C| leaves, one for each letter of the alphabet, and |C| - 1 internal nodes. This is a simple graph-theoretic property of any tree whose internal nodes all have exactly two children.

Exercises
1. Write a decoder for the code given above (see the sketch below);
2. What is its time complexity?
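One possible answer to Exercise 1 is the following minimal Python sketch (the names Node and decode are my own, not from the slides), which walks the code tree one bit at a time. For Exercise 2: the time complexity is linear in the length of the encoded string, since each bit causes exactly one step down the tree.

    # Decoder sketch for the code {a: 0, b: 10, c: 110}.
    # Leaves hold characters; internal nodes have .left (bit 0) and
    # .right (bit 1) children. Invalid inputs are not handled.
    class Node:
        def __init__(self, char=None, left=None, right=None):
            self.char = char          # None for internal nodes
            self.left = left
            self.right = right

    # Hand-built tree for the example code.
    root = Node(left=Node("a"),
                right=Node(left=Node("b"),
                           right=Node(left=Node("c"))))

    def decode(bits, root):
        out = []
        node = root
        for bit in bits:
            node = node.left if bit == "0" else node.right
            if node.char is not None:   # reached a leaf: emit and restart
                out.append(node.char)
                node = root
        return "".join(out)

    print(decode("1101010", root))      # -> "cbb", in O(n) for n input bits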

Cost of a tree

Given a tree T corresponding to some prefix code, it is a simple matter to associate a cost function with it. For each character c, let f(c) denote the frequency of c and let d_T(c) denote the depth of c's leaf in T (note that d_T(c) is also the length of the codeword for c). The average codeword length is

$B(T) = \sum_{c \in C} f(c) \, d_T(c).$

We will use B(T) to represent the cost of the tree T. An example is given below.

Constructing a Huffman code

Huffman invented a greedy algorithm that constructs an optimal prefix code, called a Huffman code. The algorithm builds the tree corresponding to an optimal code in a bottom-up manner: it begins with a set of |C| leaves and performs a sequence of |C| - 1 merging operations to create the final tree.

In the pseudo-code that follows we assume that C is a set of characters and that each c in C is associated with an object with a defined frequency f(c). A priority queue Q, keyed on f, is used to identify the two least frequent objects to merge together. The result of the merger of two objects is a new object whose frequency is the sum of the frequencies of the two objects that were merged (who computes this? see the assignment to z.f below). The generic node z of the tree is an object containing two pointers, z.left and z.right, to the left and right children of z.

    HUFFMAN(C)
        n = |C|
        Q = C
        for i = 1 to n - 1
            z = ALLOCATE-NODE()
            z.left  = x = EXTRACT-MIN(Q)
            z.right = y = EXTRACT-MIN(Q)
            z.f = f(x) + f(y)
            INSERT(Q, z)
        return EXTRACT-MIN(Q)

Priority Queues

A priority queue is a data structure for maintaining a set S of elements, each with an associated value (or key). A priority queue supports the following operations:

INSERT(S, x): inserts the element x into S;
MIN(S): returns the element of S with minimal key;
EXTRACT-MIN(S): removes and returns the element of S with minimal key.

A priority queue can be implemented in many ways (exercise!). Different implementations lead to different complexity results.
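As a concrete illustration, here is a minimal runnable rendering of the same algorithm in Python (an assumed sketch, not the course's official code), using the standard library's heapq module as the priority queue; the frequencies in the example are the classic ones from CLRS:

    import heapq
    import itertools

    # Build a Huffman tree from a frequency table. Nodes are represented as
    # (frequency, tiebreak, symbol_or_None, left, right) tuples; the unique
    # tiebreak counter keeps heap comparisons away from the subtree fields.
    def huffman(freqs):
        counter = itertools.count()
        heap = [(f, next(counter), sym, None, None) for sym, f in freqs.items()]
        heapq.heapify(heap)
        for _ in range(len(freqs) - 1):      # |C| - 1 merging operations
            x = heapq.heappop(heap)          # least frequent object
            y = heapq.heappop(heap)          # second least frequent object
            z = (x[0] + y[0], next(counter), None, x, y)
            heapq.heappush(heap, z)
        return heap[0]                       # root of the code tree

    # Read the codewords off the tree: 0 = left child, 1 = right child.
    def codewords(node, prefix="", table=None):
        table = {} if table is None else table
        _, _, sym, left, right = node
        if sym is not None:
            table[sym] = prefix or "0"       # degenerate 1-symbol alphabet
        else:
            codewords(left, prefix + "0", table)
            codewords(right, prefix + "1", table)
        return table

    tree = huffman({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
    print(codewords(tree))   # e.g. a -> 0, the rest get longer codewords

The exact codewords depend on how ties between equal frequencies are broken, but the cost B(T) of the resulting tree is the same for any valid tie-breaking.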

Analysis

From the discussion above about the efficiency of elementary priority queue operations (when optimally implemented) it follows that HUFFMAN takes O(n log n) time on an alphabet of n characters. In general the running time is dominated by (a) the time to populate the data structure Q, plus (b) n multiplied by the time to extract the minimum element from Q.

To complete the analysis of this procedure we need to prove that it actually works!

Key properties

An optimisation problem can be solved optimally by a greedy algorithm if it has the following two features:

greedy-choice: an optimal solution can be reached by making a locally optimal choice at each step;
optimal-substructure: an optimal solution is formed by optimal solutions to subproblems.

Claim 1

Let C be an alphabet and let the frequency function f(c) be defined for each c in C. Let x and y be the two characters having the lowest frequencies. There exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit.

(Proof idea) Let T be a tree representing an arbitrary optimal prefix code. We show how to modify it into a new tree T'' (again representing an optimal prefix code) having x and y as sibling leaves of maximum depth. The codewords for x and y in T'' will then have the same length and differ only in the last bit.

Details

Given T, let a and b be any two characters that are sibling leaves of maximum depth in T. Without loss of generality assume f(a) <= f(b). Since x and y have the two lowest frequencies, it must also be f(x) <= f(a) and f(y) <= f(b) (otherwise x and y wouldn't be minimal). Now define T' from T by exchanging a with x, and T'' from T' by exchanging b with y. Finally compute B(T) - B(T''): most of the terms simplify, and what is left is non-negative (the computation is spelled out below). Therefore the transformation generates a new tree that is still optimal and, furthermore, x and y have the desired property in T''.
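To fill in the computation (a standard step of this exchange argument, written out here in LaTeX), consider the first exchange; the second one is symmetric:

$$
\begin{aligned}
B(T) - B(T') &= \sum_{c \in C} f(c)\,d_T(c) - \sum_{c \in C} f(c)\,d_{T'}(c) \\
&= f(x)\,d_T(x) + f(a)\,d_T(a) - f(x)\,d_T(a) - f(a)\,d_T(x) \\
&= \bigl(f(a) - f(x)\bigr)\bigl(d_T(a) - d_T(x)\bigr) \;\ge\; 0,
\end{aligned}
$$

since f(a) >= f(x) and a is a leaf of maximum depth, so d_T(a) >= d_T(x). Because T is optimal we also have B(T) <= B(T'), hence B(T') = B(T) and T' (and likewise T'') is again optimal.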

Claim 2

Let T be a tree representing an optimal prefix code over an alphabet C, and let the frequency function f(c) be defined for each c in C. Consider any two characters x and y appearing as sibling leaves in T, and let z be their parent. Then, considering z as a character with frequency f(z) = f(x) + f(y), the tree T' = T - {x, y} represents an optimal prefix code for the alphabet C' = (C - {x, y}) ∪ {z}.
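The cost relation behind Claim 2, written out in LaTeX (it is implicit in the claim's statement): since x and y are siblings at depth d_{T'}(z) + 1,

$$
B(T) = B(T') + f(x) + f(y).
$$

Hence a prefix-code tree for C' cheaper than T' would yield a tree for C cheaper than T, contradicting the optimality of T. Claim 1 (the greedy-choice property) and Claim 2 (optimal substructure) together imply, by induction on the size of the alphabet, that HUFFMAN returns an optimal prefix code.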