Data Compression Techniques


Part 1: Entropy Coding
Lecture 1: Introduction and Huffman Coding
Juha Kärkkäinen
31.10.2017

Introduction

Data compression deals with encoding information in as few bits as is possible or reasonable. The purpose is (usually) to reduce resource requirements: permanent storage space, transmission time/bandwidth, and processing time/space (more data fits in cache or main memory). Data compression can reduce the cost of dealing with data or enable things that would not otherwise be possible with the available resources. In addition, data compression involves reorganizing the data in a way that connects it to other fields: data mining and machine learning (see the Minimum Description Length (MDL) principle, for example) and text indexing (covered in this course).

Compressed data is often useful only for storage and transmission and needs to be decompressed before anything else can be done with it. Compression (encoding) is the process of transforming the original (uncompressed) data into compressed data. Decompression (decoding) is the reverse process of recovering the original data from the compressed data. Compression and decompression are often performed by different parties, and one must be aware of what information, apart from the compressed data itself, is available to both parties. It is often helpful to think of compression as communication, with the compressed data as the message to be sent. A recent trend is compressed data structures, which support certain operations on the data without decompression. They typically need more space than fully compressed data but much less space than a corresponding uncompressed data structure, and the operations are usually slower than on the uncompressed data structure.

There are two broad types of compression.

Lossy compression throws away some information: the data produced by decompression is similar to, but not an exact copy of, the original data. It is very effective for image, video and audio compression, where it removes unessential information such as noise and undetectable details. Throwing away more information improves compression but reduces the quality of the data.

Lossless compression keeps all information: the data produced by decompression is an exact copy of the original data. It removes redundant data, i.e., data that can be reconstructed from the data that is kept. Statistical techniques minimize the average amount of data.

The bit is the basic unit of information and of the size of data or computer storage. The International Electrotechnical Commission (IEC) recommends the following. Bit is abbreviated bit; the abbreviation b is best avoided to prevent confusion with B, which stands for byte (8 bits), another commonly used unit. The prefixes kilo (k), mega (M), giga (G), etc. refer to powers of 10. The corresponding binary prefixes referring to powers of two are called kibi (Ki), mebi (Mi), gibi (Gi), etc. (Note that the abbreviation for kibi is capitalized but the one for kilo is not.) For example, 8 kbit means 8000 bits and 8 KiB means 8192 bytes.

The level of compression is usually expressed in one of the following ways. Compression ratio is c/u, where c is the size of the compressed data and u is the size of the original, uncompressed data. Bit rate is the average number of bits per some basic unit of data, such as bits per character for text or bits per second for audio. For example, a bit rate of 3 bits/character on standard ASCII text means a compression ratio of 3/8 = 37.5 %.
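To make these definitions concrete, here is a minimal Python sketch; the function names and the example file sizes are my own illustration, not from the lecture.

def compression_ratio(c: int, u: int) -> float:
    """Compression ratio c/u, where c is the compressed and u the uncompressed size."""
    return c / u

def bit_rate(compressed_bytes: int, n_units: int) -> float:
    """Average number of bits per basic unit of data (e.g. per character)."""
    return 8 * compressed_bytes / n_units

# Hypothetical example: 1 MB of standard ASCII text compressed to 375 kB.
u, c = 1_000_000, 375_000
print(compression_ratio(c, u))  # 0.375, i.e. 37.5 %
print(bit_rate(c, u))           # 3.0 bits/character (ASCII stores one character per byte)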

About this course

This course covers only lossless compression techniques. Lossy compression techniques are quite specific to the type of data and often rely on a deep understanding of human senses. The focus is on text compression; however, many of the basic techniques and principles are general and applicable to other types of data. The course concentrates on practical algorithms and techniques. Some key concepts of information theory, the mathematical basis of data compression, will be introduced but not covered in detail. The course consists of three main parts: 1. Entropy coding, 2. Text compression, 3. Compressed data structures.

Variable-length encoding

Definition: Let Σ be the source alphabet and Γ the code alphabet. A code is defined by an injective mapping C : Σ → Γ⁺ (the nonempty strings over Γ). C(a) for a ∈ Σ is called the codeword for a. The code is extended to sequences over Σ by concatenation: C : Σ* → Γ* : s_1 s_2 ... s_n ↦ C(s_1)C(s_2)...C(s_n).

We will usually assume that Γ is the binary alphabet {0, 1}; then the code is called a binary code. Σ is an arbitrary set, possibly even an infinite set such as the set of natural numbers.
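As a concrete illustration (my own sketch, not part of the lecture; the toy code is hypothetical), a binary code can be represented as a dictionary of codewords and extended to sequences by concatenation:

def encode(code: dict[str, str], message: str) -> str:
    """Extend a symbol code to sequences over the source alphabet by concatenation."""
    return "".join(code[symbol] for symbol in message)

# A toy binary code over the source alphabet {a, b, c}.
toy_code = {"a": "0", "b": "10", "c": "11"}
print(encode(toy_code, "abca"))  # prints 010110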

Example: Morse code is a type of variable-length code.

Let us construct a binary code based on Morse code. First attempt: use zero for dot and one for dash.

A 01
B 1000
C 1010
D 100
...

However, when applied to sequences, we have a problem:

CAT 1010011
KIM 1010011
TETEETT 1010011
...

The problem is that the code has no representation for the longer spaces between letters. The code would work if we had some kind of delimiter to separate letters from each other, but the codes we are interested in here do not support such delimiters; such codes are sometimes called self-delimiting codes. If all the letter codes had the same length, there would be no need for a delimiter. The ASCII code is an example of such a fixed-length code. However, here we are more interested in variable-length codes.
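The ambiguity is easy to reproduce in Python (my own sketch; the letter-to-bits table just translates dots to 0 and dashes to 1 as above): several different messages encode to the same bit string, so this code is not uniquely decodable.

# Morse letters with dot = 0 and dash = 1, letter spaces simply dropped.
morse_attempt = {
    "A": "01", "C": "1010", "E": "0", "I": "00",
    "K": "101", "M": "11", "T": "1",
}

def encode(code: dict[str, str], message: str) -> str:
    return "".join(code[symbol] for symbol in message)

# All three messages collide on the bit string 1010011.
print(encode(morse_attempt, "CAT"))
print(encode(morse_attempt, "KIM"))
print(encode(morse_attempt, "TETEETT"))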

The kind of code we need can be formalized as follows.

Definition: A code C : Σ → Γ⁺ is uniquely decodable if for all S, T ∈ Σ*, S ≠ T implies C(S) ≠ C(T).

Determining whether a code is uniquely decodable is not always easy. Also, unique decodability does not necessarily mean that decoding is easy. However, there are easily recognizable and decodable subclasses. A fixed-length code, which maps every source symbol to a codeword of the same length, is always uniquely decodable. A much larger subclass consists of prefix codes.

Definition: A code C : Σ → Γ⁺ is a prefix code if for all a, b ∈ Σ, a ≠ b implies that C(a) is not a prefix of C(b).

Theorem (proof as exercise): Every prefix code is uniquely decodable.
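Both notions are easy to experiment with; the following Python sketch (my own, not from the slides) checks the prefix property directly and decodes a prefix-coded bit string greedily, emitting a symbol as soon as the bits read so far form a codeword.

def is_prefix_code(code: dict[str, str]) -> bool:
    """True if no codeword is a prefix of another codeword."""
    words = list(code.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )

def decode_prefix_code(code: dict[str, str], bits: str) -> str:
    """Decode a bit string produced by a prefix code, one codeword at a time."""
    inverse = {w: s for s, w in code.items()}
    message, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:            # a complete codeword has been read
            message.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("input is not a concatenation of codewords")
    return "".join(message)

toy_code = {"a": "0", "b": "10", "c": "11"}
print(is_prefix_code(toy_code))                # True
print(decode_prefix_code(toy_code, "010110"))  # abca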

Let us return to the Morse code example and define a binary prefix code based on the Morse code: encode dots and dashes with ones and the spaces with zeros. The number of zeros and ones represents the length in time: a dot is a single one, a dash is three ones, etc. The space between letters is included at the end of each codeword.

A 10111000
B 111010101000
C 11101011101000
...

CAB 1110101110100010111000111010101000

The three zeros at the end of each codeword act as a kind of delimiter, ensuring that the code is a prefix code.

The code on the previous frame is not ideal for compression because the encoded sequences tend to be quite long. A straightforward optimization is to use two ones for a dash and two zeros for a long space.

A 101100
B 1101010100
C 11010110100
...

CAB 110101101001011001101010100

The unnecessarily long codes are a form of redundancy, and removing redundancy is one way to achieve compression. In the original Morse code, the length of dashes and long spaces is a compromise between compression (for faster transmission) and avoiding errors. Some amount of redundancy is necessary if we want to be able to detect and correct errors. However, such error-correcting codes are not covered in this course.
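The construction generalizes to any dot/dash pattern. A small Python sketch of my own (the three-letter table is just an illustration) builds these codewords mechanically: a dot becomes 1, a dash becomes 11, marks inside a letter are separated by a single 0, and every codeword ends with the 00 that makes the code prefix-free.

morse_patterns = {"A": ".-", "B": "-...", "C": "-.-."}

def to_codeword(pattern: str) -> str:
    """Dot -> 1, dash -> 11, 0 between marks, 00 at the end of the letter."""
    marks = ["1" if mark == "." else "11" for mark in pattern]
    return "0".join(marks) + "00"

code = {letter: to_codeword(p) for letter, p in morse_patterns.items()}
print(code)  # {'A': '101100', 'B': '1101010100', 'C': '11010110100'}
print("".join(code[x] for x in "CAB"))  # 110101101001011001101010100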

Huffman coding

Let P be a probability distribution over the source alphabet Σ. We want to find a binary prefix code C for Σ that minimizes the average code length

∑_{a ∈ Σ} P(a) |C(a)|.

Such a code is called optimal.

The probability distribution P may come from a probabilistic model of the data, from a statistical study of the type of data (for example, letter frequencies in English), or from the actual symbol frequencies in the data to be encoded. If the probability distribution is known to the decoder, the decoder can construct the code the same way the encoder does. Otherwise, the code (or the distribution) must be included as part of the compressed data.

We assume that all probabilities are positive; a symbol with a zero probability should be removed from the alphabet.
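In code, the quantity being minimized is a one-liner. The sketch below (my own illustration with a hypothetical three-symbol distribution) compares a fixed-length code with a variable-length prefix code under the same distribution.

def average_code_length(P: dict[str, float], C: dict[str, str]) -> float:
    """Expected number of code bits per source symbol: sum of P(a) * |C(a)|."""
    return sum(P[a] * len(C[a]) for a in P)

P = {"a": 0.5, "b": 0.3, "c": 0.2}
fixed = {"a": "00", "b": "01", "c": "10"}  # fixed-length code
toy   = {"a": "0",  "b": "10", "c": "11"}  # shorter codeword for the most frequent symbol
print(average_code_length(P, fixed))  # 2.0 bits/symbol
print(average_code_length(P, toy))    # 1.5 bits/symbol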

Intuitively, to minimize the average code length, we should use shorter codewords for frequent symbols and longer codewords for rare symbols. Morse code has this property: the shortest codes are assigned to the most frequent letters in English.

The utility of this principle is easy to show formally. Let a, b ∈ Σ be two symbols that violate the principle, i.e., satisfy P(a) < P(b) and |C(a)| < |C(b)|. We can correct the violation by swapping the codewords of a and b. The resulting change in the average code length is

P(a)(|C(b)| - |C(a)|) + P(b)(|C(a)| - |C(b)|) = (P(a) - P(b))(|C(b)| - |C(a)|) < 0.

Thus we can reduce the average code length of any code that violates the principle.

An optimal code can be computed with the Huffman algorithm.

Algorithm Huffman
Input: Probability distribution P over Σ = {s_1, s_2, ..., s_σ}
Output: Optimal binary prefix code C for Σ
(1) for s ∈ Σ do C(s) ← ε
(2) T ← { {s_1}, {s_2}, ..., {s_σ} }
(3) while |T| > 1 do
(4)     Let A ∈ T be the set that minimizes P(A) = ∑_{s ∈ A} P(s)
(5)     T ← T \ {A}
(6)     for s ∈ A do C(s) ← 1C(s)
(7)     Let B ∈ T be the set that minimizes P(B) = ∑_{s ∈ B} P(s)
(8)     T ← T \ {B}
(9)     for s ∈ B do C(s) ← 0C(s)
(10)    T ← T ∪ {A ∪ B}
(11) return C
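For reference, here is a runnable Python version of the algorithm (a sketch of my own, not the course's official code). It follows the pseudocode line by line, using a binary heap keyed by probability as the priority queue; the extra counter only breaks ties so that sets are never compared directly.

import heapq
from itertools import count

def huffman_code(P: dict[str, float]) -> dict[str, str]:
    """Compute an optimal binary prefix code for the probability distribution P."""
    C = {s: "" for s in P}                              # line (1): empty codewords
    tie = count()                                       # tie-breaker for equal probabilities
    T = [(p, next(tie), [s]) for s, p in P.items()]     # line (2): singleton sets
    heapq.heapify(T)
    while len(T) > 1:                                   # line (3)
        p_a, _, A = heapq.heappop(T)                    # lines (4)-(5): smallest probability
        p_b, _, B = heapq.heappop(T)                    # lines (7)-(8): second smallest
        for s in A:                                     # line (6): prepend 1 on the A side
            C[s] = "1" + C[s]
        for s in B:                                     # line (9): prepend 0 on the B side
            C[s] = "0" + C[s]
        heapq.heappush(T, (p_a + p_b, next(tie), A + B))  # line (10): merged set
    return C                                            # line (11)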

Example (Huffman algorithm)

Initial           After Round 1     After Round 2      After Round 3
P({e}) = 0.3      P({e}) = 0.3      P({e}) = 0.3       P({a,o}) = 0.4
P({a}) = 0.2      P({a}) = 0.2      P({i,u,y}) = 0.3   P({e}) = 0.3
P({o}) = 0.2      P({o}) = 0.2      P({a}) = 0.2       P({i,u,y}) = 0.3
P({i}) = 0.1      P({u,y}) = 0.2    P({o}) = 0.2
P({u}) = 0.1      P({i}) = 0.1
P({y}) = 0.1

After Round 4           Final
P({e,i,u,y}) = 0.6      P({e,a,o,i,u,y}) = 1.0
P({a,o}) = 0.4

Huffman code: C(a) = 10, C(e) = 00, C(i) = 011, C(o) = 11, C(u) = 0100, C(y) = 0101

[Figure: Huffman tree with root 1.0; one subtree of weight 0.6 containing e (0.3) and the subtree {i, u, y} (0.3), which splits into i (0.1) and {u, y} (0.2) with leaves u (0.1) and y (0.1); the other subtree of weight 0.4 containing a (0.2) and o (0.2).]
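Running the huffman_code sketch from the previous frame on this distribution gives an optimal code; the individual codewords depend on how ties between equal probabilities are broken, but every optimal code has the same average length, here 2.5 bits per symbol.

P = {"e": 0.3, "a": 0.2, "o": 0.2, "i": 0.1, "u": 0.1, "y": 0.1}
C = huffman_code(P)
print(C)                                 # one optimal prefix code (tie-breaking may vary)
print(sum(P[s] * len(C[s]) for s in P))  # 2.5, the optimal average code length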

The resulting code is often called a Huffman code. The code can be represented as a trie of O(σ) nodes known as the Huffman tree; the algorithm can be seen as constructing the tree from the bottom up.

The time complexity of the algorithm is O(σ log σ): the collection of sets T can be stored in a priority queue so that the minimum elements are found quickly. The total number of operations on T is O(σ) and each operation needs O(log σ) time. The time complexity of constructing the code (lines (6) and (9)) is linear in the total length of the codewords, which is O(σ log σ). (Why?)

Linear time construction is possible if the input is given in order of probabilities and the output is the Huffman tree, not the code.

We will next prove the optimality of the Huffman code.

Lemma: Let Σ = {s_1, s_2, ..., s_(σ-1), s_σ} be an alphabet with a probability distribution P satisfying P(s_1) ≥ P(s_2) ≥ ... ≥ P(s_(σ-1)) ≥ P(s_σ) > 0. Then there exists an optimal binary prefix code C satisfying:
No codeword C(s_i) is longer than C(s_σ).
C(s_(σ-1)) is the same as C(s_σ) except that the last bit is different.

Proof. Let C be an optimal prefix code violating the conditions. The following operations modify C without increasing the average code length, and repeated application of them leads to a prefix code satisfying the conditions.
If |C(s_i)| > |C(s_σ)|, swap C(s_i) and C(s_σ) (see the earlier exchange argument).
If C(s_σ) = B1 (B0) and B0 (B1) is not a codeword, set C(s_σ) = B.
If C(s_σ) = B1 (B0) and C(s_i) = B0 (B1), swap C(s_i) and C(s_(σ-1)).

Lemma: Let Σ = {s_1, s_2, ..., s_(σ-1), s_σ} be an alphabet with a probability distribution P satisfying P(s_1) ≥ P(s_2) ≥ ... ≥ P(s_(σ-1)) ≥ P(s_σ) > 0. Let Σ' = {s_1, s_2, ..., s_(σ-2), s'} with a probability distribution P' such that P'(s') = P(s_(σ-1)) + P(s_σ) and P'(s) = P(s) for other s. Let C' be an optimal prefix code for Σ'. Let C be a code for Σ with C(s_(σ-1)) = C'(s')0, C(s_σ) = C'(s')1, and C(s) = C'(s) for other s. Then C is an optimal binary prefix code for Σ.

Proof. C is clearly a binary prefix code. Suppose C is not optimal, i.e., there exists a prefix code D for Σ with a smaller average code length. By the preceding lemma, we can assume that D(s_(σ-1)) = B0 and D(s_σ) = B1 for some bit string B. Thus we can construct a prefix code D' for Σ' by setting D'(s') = B and D'(s) = D(s) for other s. Let L, L', K and K' be the average code lengths of the codes C, C', D and D', respectively. Then L - L' = P'(s') = K - K', and thus K' = L' - (L - K) < L', which contradicts C' being optimal.

Theorem: The Huffman algorithm computes an optimal binary prefix code.

Proof. Each iteration of the main loop of the Huffman algorithm performs an operation that is essentially identical to the operation in the lemma on the preceding slide: remove the two symbols/sets with the lowest probabilities and add a new symbol/set representing the combination of the removed ones. The lemma shows that each iteration produces an optimal code provided that the following iterations produce an optimal code for the reduced alphabet. Since the code constructed in the last iteration is trivially optimal, the algorithm computes an optimal code.

The Huffman algorithm computes an optimal prefix code, but there are other algorithms for constructing prefix codes that can sometimes be useful. For example:

The Shannon-Fano algorithm is simpler to implement than the Huffman algorithm and usually produces an optimal or nearly optimal prefix code.

The package-merge algorithm constructs an optimal prefix code under the constraint that no codeword is longer than a given length L. Such a code is sometimes known as a length-limited Huffman code.

The Hu-Tucker algorithm constructs an optimal alphabetic prefix code, which has the property that the lexicographical order of the codewords is the same as the order of the symbols they encode.