David Rappaport, School of Computing, Queen's University, CANADA. Copyright, 1996 Dale Carnegie & Associates, Inc.


Data Compression There are two broad categories of data compression: lossless compression (e.g. GIF, gzip) and lossy compression (e.g. MP3, JPEG).

Lossless and Lossy Compression Lossless: an exact copy of the original data is obtained after decompression; structured data can typically be compressed to 40-60 percent of its original size. Lossy: original information content is lost; any data can be compressed, sometimes by 90% or more.

What is information? Compare three ways of saying the same thing: (1) The colour of the hat is red. The colour of the book is red. The colour of the table is red. (2) The colour of the hat is red. The ---- " ---- book ---- ". The ---- " ---- table ---- ". (3) The book, hat, and table are red.

Modelling Information The area of information theory explores the information content of data. We can predict the size of the content by modelling the data. Within a given model we can obtain lower bounds on the number of bits required to represent the data.

Modelling Information Consider the following message x_1, x_2, ..., x_12: 9 11 11 11 14 13 15 17 16 17 20 21. Using straightforward binary encoding we can store values in the range 0..21 using 5 bits per number.

Block Encoding A decimal number, say 1042, can be thought of as (1×1000)+(0×100)+(4×10)+(2×1) = (1×10^3)+(0×10^2)+(4×10^1)+(2×10^0). With 4 decimal digits we can store values [0..9999], i.e. 10^4 different values.

Block Encoding A binary number, say 1001, can be thought of as (1×2^3)+(0×2^2)+(0×2^1)+(1×2^0) = 9. With 4 binary bits we can store values [0..2^4 - 1], i.e. 2^4 = 16 different values. With 5 binary bits we can store 32 different values.

Modelling Information Or we can store: 0 2 2 2 5 4 6 8 7 8 11 12. A stored value n in our message actually represents the number n+9. Using binary encoding we can store 0..12 using 4 bits per number.

Modelling Information Or we can store differences: x_1 is stored as is, and each subsequent x_i is encoded as x_i - x_{i-1}, giving 9 2 0 0 3 -1 2 2 -1 1 3 1. The differences take 5 distinct values, which can be encoded with 3 bits.

Modelling Information (Plot: the message values x_i against the index i, fitted by the line y = x + 8.)

Modelling Information Let x_i denote the ith number in our message, and let y_i be set to i+8. We can encode e_i = x_i - y_i, giving: 0 1 0 -1 1 -1 0 1 -1 -1 1 1. There are 3 distinct values, so 2 bits per number suffice.
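As a concrete check, here is a minimal Python sketch (assuming the message values reconstructed above; the variable names are illustrative) that computes the bits per number needed under each model:

import math

# Message values as reconstructed above (a stored value n encodes n + 9).
xs = [9, 11, 11, 11, 14, 13, 15, 17, 16, 17, 20, 21]

# Model 1: store the raw values 0..21 -> 5 bits per number.
raw_bits = math.ceil(math.log2(max(xs) + 1))

# Model 2: subtract the minimum (9), leaving values 0..12 -> 4 bits per number.
offset = [x - min(xs) for x in xs]
offset_bits = math.ceil(math.log2(max(offset) + 1))

# Model 3: subtract the fitted line y_i = i + 8, leaving residuals in {-1, 0, 1} -> 2 bits.
residuals = [x - (i + 8) for i, x in enumerate(xs, start=1)]
residual_bits = math.ceil(math.log2(len(set(residuals))))

# (The difference model x_i - x_{i-1} from the slides gives 5 distinct values -> 3 bits.)
print(raw_bits, offset_bits, residual_bits)   # expected: 5 4 2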

Minimum Redundancy Coding Let's take a more organized approach with the next example: suppose we have a message consisting of 5 distinct characters that appear with the given frequencies: A-15, B-7, C-6, D-6, E-5.

First Attempt A-15, B-7, C-6, D-6, E-5. A--000, B--001, C--010, D--011, E--100. Three bits are needed to encode 5 distinct characters.

Variable Length Code A-15, B-7, C-6, D-6, E-5. A--0, B--1, C--00, D--01, E--10. Why won't this code work?

Variable Length Code A-15, B-7, C-6, D-6, E-5. A--0, B--1, C--00, D--01, E--10. Why won't this code work? Is 00 AA or C?

Second attempt A-15, B-7, C-6, D-6, E-5. A -- 01, B--000, C--001, D--111, E--100. This code can be unambiguously decoded. Why?

Shannon-Fano Encoding A-15, B-7, C-6, D-6, E-5. Split the sorted list of symbols into two parts with roughly equal total frequency. Assign 0 as the first bit of the left side and 1 as the first bit of the right side. Repeat on each side until every symbol is encoded.

Shannon-Fano Encoding A-15, B-7, C-6, D-6, E-5. The splits: (A B | C D E), then (A | B) and (C | D E), then (D | E).

Shannon-Fano encoding This can be represented as a binary tree with the symbols A, B, C, D, E at the leaves.
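A minimal Python sketch of the recursive splitting just described (the function name and list layout are my own, not from the slides):

def shannon_fano(symbols):
    """symbols: list of (symbol, frequency) pairs sorted by decreasing frequency.
    Returns a dict mapping each symbol to its code string."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(f for _, f in symbols)
    # Find the split point whose left-hand total is closest to half of the whole.
    running, split, best_diff = 0, 1, float("inf")
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(total - 2 * running)
        if diff < best_diff:
            best_diff, split = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:split]).items():
        codes[sym] = "0" + code       # left side gets a leading 0
    for sym, code in shannon_fano(symbols[split:]).items():
        codes[sym] = "1" + code       # right side gets a leading 1
    return codes

print(shannon_fano([("A", 15), ("B", 7), ("C", 6), ("D", 6), ("E", 5)]))
# expected: {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}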

Prefix Code A-00, B-01, C-10, D-110, E-111. No codeword is a prefix of any other codeword. Such a prefix code can be decoded with no ambiguity.

Shannon-Fano Code A-00, B-01, C-10, D-110, E-111. Consider the following string: 0001010010100101111110. If we are given the frequencies, the tree, or the codes, we can decode the message.

Shannon-Fano Code A-00, B-01, C-10, D-110, E-111. For example try to decode the following string: 0001010010100101111110

Prefix Code A-00, B-01, C-10, D-110, E-111. Answer: 00 01 01 00 10 10 01 01 111 110 decodes as A B B A C C B B E D.
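Prefix-code decoding is easy to sketch in Python, using the Shannon-Fano code above (the names are illustrative, not from the slides):

def decode_prefix(bits, codes):
    """Decode a bit string using a prefix code given as {symbol: codeword}."""
    reverse = {code: sym for sym, code in codes.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in reverse:          # a complete codeword has been read
            out.append(reverse[buffer])
            buffer = ""
    return "".join(out)

sf_codes = {"A": "00", "B": "01", "C": "10", "D": "110", "E": "111"}
print(decode_prefix("0001010010100101111110", sf_codes))   # expected: ABBACCBBED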

Huffman Algorithm Encoding Step 1. Determine the frequency of each symbol in the message. Each symbol can be thought of as a tree with a weight. A-15 B-7 C-6 D-6 E-5

Huffman Algorithm Encoding Step 2. While there is more than one tree create a single tree from the two trees of least weight.

Huffman Algorithm Encoding Step 2. Combine the two trees of least weight, D-6 and E-5, into a tree of weight 11. Remaining trees: A-15, B-7, C-6, (D, E)-11.

Huffman Algorithm Encoding Step 2. Next, combine B-7 and C-6 into a tree of weight 13. Remaining trees: A-15, (B, C)-13, (D, E)-11.

Huffman Algorithm Final Huffman Tree. The trees of weight 13 and 11 combine into a tree of weight 24, which then combines with A-15 to give the root of weight 39. The leaves are A-15, B-7, C-6, D-6, E-5.

Huffman Algorithm A Huffman code is obtained by following links from the root of the tree to each leaf, with a left link representing 0 and a right link 1.

Huffman Algorithm The resulting code: A - 0, B - 100, C - 101, D - 110, E - 111.
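A compact Python sketch of the two encoding steps, using a min-heap of weighted trees (the function name and tree representation are my own). Ties may be broken differently than in the slides, so the exact codewords can differ, but the code lengths, and hence the total cost, are the same:

import heapq
from itertools import count

def huffman_codes(freqs):
    """freqs: dict {symbol: frequency}. Returns {symbol: code string}."""
    tie = count()   # tie-breaker so the heap never has to compare trees
    heap = [(w, next(tie), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # the two trees of least weight
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + "0")      # left link = 0
            walk(node[1], prefix + "1")      # right link = 1
        else:
            codes[node] = prefix or "0"      # leaf: a symbol
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}))
# A gets a 1-bit code and B, C, D, E get 3-bit codes, as in the slides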

Storing the Huffman Tree To successfully decode the compressed file we need the Huffman tree, so it can be stored along with the compressed data. An alternative is to simply store the symbol counts and rebuild the tree in the decompression stage.

Huffman Algorithm We can compare the Huffman code with the Shannon-Fano code used in our previous example. Frequencies: A-15, B-7, C-6, D-6, E-5. Codes: Huffman A-0, B-100, C-101, D-110, E-111; Shannon-Fano A-00, B-01, C-10, D-110, E-111.

Huffman Algorithm Frequencies: A-15, B-7, C-6, D-6, E-5. Huffman (A-0, B-100, C-101, D-110, E-111): 1×15 + 3×(7 + 6 + 6 + 5) = 87 bits in total. Shannon-Fano (A-00, B-01, C-10, D-110, E-111): 2×(15 + 7 + 6) + 3×(6 + 5) = 89 bits in total.

Unresolved Can we do better than Huffman? How do we measure information content?

Entropy Recall our example: A-15, B-7, C-6, D-6, E-5. If we have a string of 39 symbols from {A, B, C, D, E}, the probability that an A occurs is P(A) = 15/39. Similarly: P(B) = 7/39, P(C) = 6/39, P(D) = 6/39, P(E) = 5/39.

Self-information Claude Shannon defined a quantity called self-information associated with a symbol in a message. The self-information of a symbol X is given by the formula: i(X) = log_b(1/P(X)) = -log_b P(X).

Log Function

Self-information Intuition: the higher the probability of a symbol's occurrence, the lower its information content. At the extreme, if P(X) = 1 then nothing is learned when receiving an X, since that is the only possibility.

Entropy Shannon also defined a quantity called entropy associated with a message, representing the average self-information per symbol in the message, expressed in radix b: H = Σ_i P(X_i) log_b(1/P(X_i)) = -Σ_i P(X_i) log_b P(X_i).

Entropy Entropy provides an absolute limit on the best possible lossless encoding or compression of any message, assuming that the message may be represented as a sequence of independent and identically distributed random variables.

Entropy The entropy of the message in our example works out to about 2.19 bits per symbol (i.e. b = 2). The Huffman code obtained for this example uses an average of 87/39 ≈ 2.23 bits per symbol. The Shannon-Fano code uses an average of 89/39 ≈ 2.28 bits per symbol.
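A quick Python check of these numbers (a sketch; the code lengths are taken from the two code tables above):

import math

freqs = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}
n = sum(freqs.values())   # 39

entropy = -sum((f / n) * math.log2(f / n) for f in freqs.values())

huffman_len = {"A": 1, "B": 3, "C": 3, "D": 3, "E": 3}
shannon_fano_len = {"A": 2, "B": 2, "C": 2, "D": 3, "E": 3}

def avg_length(lengths):
    # Average code length in bits per symbol, weighted by frequency.
    return sum(freqs[s] * lengths[s] for s in freqs) / n

print(round(entropy, 2), round(avg_length(huffman_len), 2), round(avg_length(shannon_fano_len), 2))
# expected: 2.19 2.23 2.28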

Entropy It can be shown that a Shannon-Fano encoding of a message S produces a code with average bit length SF(S) that satisfies: H(S) ≤ SF(S) ≤ H(S) + 1.

Entropy Let ṗ denote the probability of the least likely symbol in message S. It can be shown that a Huffman encoding of a message S produces a code with average bit length R(S) that satisfies: H(S) ≤ R(S) ≤ H(S) + ṗ + 0.086 ≤ H(S) + 0.586.

Entropy Let ṗ denote the probability of the least likely symbol in message S. Furthermore ṗ ≤ 1/m, where m is the number of symbols in the alphabet, so: H(S) ≤ R(S) ≤ H(S) + 1/m + 0.086.

Entropy For m = 2^8 (8-bit symbols), 1/m ≈ 0.0039, so: H(S) ≤ R(S) ≤ H(S) + 0.0039 + 0.086 = H(S) + 0.0899.

Arithmetic Encoding P(A) = 1/2, P(B) = 1/4, P(C) = P(D) = 1/8. Assign the symbols to sub-intervals of [0, 1): A: [0, 1/2), B: [1/2, 3/4), C: [3/4, 7/8), D: [7/8, 1).

Arithmetic Encoding P(A) = 1/2, P(B) = 1/4, P(C) = P(D) = 1/8. Record each symbol's low endpoint and range: A_lo = 0, B_lo = 1/2, C_lo = 3/4, D_lo = 7/8 and A_r = 1/2, B_r = 1/4, C_r = 1/8, D_r = 1/8.

Arithmetic Encoding
lo = 0; r = 1
while symbols remain:
    s = next symbol
    // adjust range
    lo = lo + r * s_lo
    r = r * s_r
After all symbols are read we are left with the interval lo .. lo + r. We can represent the entire message by any value, call it v, in that interval.
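A runnable Python sketch of this loop, using exact fractions so the intermediate values match the worked example below (the interval table is the one defined above; the function name is my own):

from fractions import Fraction as F

# (low endpoint, range) for each symbol, as defined above.
INTERVALS = {"A": (F(0), F(1, 2)), "B": (F(1, 2), F(1, 4)),
             "C": (F(3, 4), F(1, 8)), "D": (F(7, 8), F(1, 8))}

def arithmetic_encode(message):
    lo, r = F(0), F(1)
    for s in message:
        s_lo, s_r = INTERVALS[s]
        lo = lo + r * s_lo    # adjust range
        r = r * s_r
    return lo, r              # the message is any value in [lo, lo + r)

lo, r = arithmetic_encode("BAABCADA")
print(lo, lo + r)   # 4455/8192 8911/16384, i.e. the interval [8910/16384, 8911/16384)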

Arithmetic Encoding Encode the message BAABCADA. Start with the interval [0, 1), divided at 0, 1/2, 3/4, 7/8, 1 for A, B, C, D.
B: the interval becomes [1/2, 3/4), divided at 1/2, 5/8, 11/16, 23/32, 3/4.
A: the interval becomes [1/2, 5/8), divided at 1/2, 9/16, 19/32, 39/64, 5/8.
A: the interval becomes [1/2, 9/16), divided at 1/2, 17/32, 35/64, 71/128, 9/16.
B: the interval becomes [17/32, 35/64) = [272/512, 280/512), divided at 272/512, 276/512, 278/512, 279/512, 280/512.
C: the interval becomes [278/512, 279/512).
A: the interval becomes [556/1024, 557/1024).
D: the interval becomes [4455/8192, 4456/8192).
A: the interval becomes [8910/16384, 8911/16384).
The final message can be encoded as any value in the final interval.

Arithmetic Encoding The value 8910/16384, taken from the final interval [8910/16384, 8911/16384), is 0.1000101100111 in binary.

Arithmetic Encoding In general* the number of binary bits needed to encode a message using arithmetic encoding is -log2(r), where r is the width of the final interval. (*There are issues of finite-precision arithmetic that have to be resolved in a real-world implementation; in practice the statement is true whenever messages are sufficiently long.)
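A quick Python check of both claims for this example (a sketch that starts from the final interval computed above):

import math
from fractions import Fraction as F

lo, hi = F(8910, 16384), F(8911, 16384)   # final interval from the worked example
r = hi - lo
print(-math.log2(r))   # 14.0 -> about 14 bits for this 8-symbol message

# Binary expansion of the lower endpoint, as on the slide.
x, bits = lo, []
while x:
    x *= 2
    bits.append("1" if x >= 1 else "0")
    x -= int(x)
print("0." + "".join(bits))   # 0.1000101100111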

Arithmetic Decoding Decoding an arithmetically encoded message simply reverses the encoding process. The value 8910/16384 lies in B's interval [1/2, 3/4) = [8192/16384, 12288/16384), so B must be the first symbol in the message.

B's interval is then subdivided in the same proportions, with the A/B boundary at 10240/16384. The value 8910/16384 now lies in A's sub-interval [8192/16384, 10240/16384), so A must be the next symbol in the message, and so on.

Arithmetic Decoding
lo = 0; r = 1; val = value received
while fewer symbols have been decoded than the total number of symbols:
    find the symbol s such that s_lo <= (val - lo) / r < s_lo + s_r
    output s
    // adjust range
    lo = lo + r * s_lo
    r = r * s_r
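A runnable Python sketch of the decoder, using the same illustrative interval table as the encoder sketch (a real implementation also needs to know when to stop; here the symbol count is simply passed in):

from fractions import Fraction as F

INTERVALS = {"A": (F(0), F(1, 2)), "B": (F(1, 2), F(1, 4)),
             "C": (F(3, 4), F(1, 8)), "D": (F(7, 8), F(1, 8))}

def arithmetic_decode(val, num_symbols):
    lo, r = F(0), F(1)
    out = []
    for _ in range(num_symbols):
        target = (val - lo) / r                  # where val falls in the current range
        for s, (s_lo, s_r) in INTERVALS.items():
            if s_lo <= target < s_lo + s_r:      # find the matching symbol
                out.append(s)
                lo = lo + r * s_lo               # adjust range
                r = r * s_r
                break
    return "".join(out)

print(arithmetic_decode(F(8910, 16384), 8))   # expected: BAABCADA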

Performance Let p_k denote the probability of symbol s_k, and let f_k denote the frequency of symbol s_k. For a message of length n, f_k = n·p_k. Let m denote the size of the alphabet. The width r of the final interval is the product of the probabilities of the n symbols in the message: r = ∏_{k=1..m} p_k^{f_k}.

Performance We have: -log_2 ∏_{k=1..m} p_k^{f_k} = -Σ_{k=1..m} f_k log_2 p_k = -n Σ_{k=1..m} p_k log_2 p_k = n·H, expressing the number of binary bits needed to represent a message of n symbols. So the average number of bits per symbol matches the entropy bound exactly.
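A short Python check of this identity on the BAABCADA example (a sketch; the probabilities are the ones used throughout this section):

import math

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
message = "BAABCADA"
n = len(message)

# Width of the final interval = product of the probabilities of the message's symbols.
r = math.prod(probs[s] for s in message)
entropy = -sum(p * math.log2(p) for p in probs.values())

print(-math.log2(r), n * entropy)   # both 14.0: bits needed = n * H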

Arithmetic Encoding Very efficient and effective implementations of arithmetic encoding and variants of the algorithm have been developed. In general arithmetic encoding can achieve compression rates superior to Huffman encoding.

Epilogue Claude Shannon, Robert Fano, and Peter Elias were professors at MIT during roughly 1950-1970. Claude Shannon invented the field of information theory and concepts such as entropy. They gave us the Shannon-Fano algorithm and the Shannon-Fano-Elias algorithm (arithmetic encoding).

Epilogue David Huffman was a student at MIT in the early 1950s. Huffman's algorithm was developed as a course project. (Fano was Huffman's professor, and gave students a choice between writing a final exam or doing a project.)

Epilogue Many modern compressors use Huffman encoding, e.g. PKZIP, JPEG, and MP3. Very efficient versions of arithmetic encoding have been developed and implemented. Although the original JPEG specification had an option to use arithmetic encoding, Huffman coding is used due to patent issues. (It has been shown that arithmetic encoding can reduce a Huffman-coded JPEG by up to 25%!)

Epilogue These methods assume that every symbol comes from a sequence of independent and identically distributed random variables. In English text that is clearly not the case; for example, a q is almost always followed by u: queen, inquiry, queue, cheque, question. There are various compression schemes that can adapt to the highly non-independent sequence of symbols in messages, e.g. LZW encoding (GIF, UNIX compress).

Data Compression Resources http://www.ics.uci.edu/~dan/pubs/datacompression.html. See sections 1.1 Definitions, 3.1 Shannon-Fano Coding, 3.2 Static Huffman Coding, and 3.4 Arithmetic Coding.

People Resources http://en.wikipedia.org/wiki/Claude_Shannon http://en.wikipedia.org/wiki/David_A._Huffman http://en.wikipedia.org/wiki/Robert_Fano http://en.wikipedia.org/wiki/Peter_Elias