Compressing Data. Konstantin Tretyakov

Compressing Data Konstantin Tretyakov (kt@ut.ee) MTAT.03.238 Advanced April 26, 2012

Claude Elwood Shannon (1916-2001)

C. E. Shannon. A mathematical theory of communication. 1948

C. E. Shannon. The mathematical theory of communication. 1949

Shannon-Fano coding, Nyquist-Shannon sampling theorem, Shannon-Hartley theorem, Shannon's noisy channel coding theorem, Shannon's source coding theorem, rate-distortion theory. Ethernet, Wi-Fi, GSM, CDMA, EDGE, CD, DVD, BD, ZIP, JPEG, MPEG, ...

MTMS.02.040 Informatsiooniteooria (Information Theory, 3-5 EAP), Jüri Lember. 6.441 Information Theory: http://ocw.mit.edu/ https://www.coursera.org/courses/

Basic terms: Information, Code. Information, coding, code: can you code the same information differently? Why would you? What properties can you require from a coding scheme? Are they contradictory? Show 5 ways of coding the concept "number 42". What is the shortest way of coding this concept? How many bits are needed? Aha! Now define the term "code" once again.

Basic terms: Coding. Suppose we have a set of three concepts. Denote them as A, B and C. Propose a code for this set. Consider the following code: A → 0, B → 1, C → 01. What do you think about it? Define "variable-length code". Define "uniquely decodable code".
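
To see the problem concretely, here is a tiny Python sketch (the codewords are the ones from the slide; the small encode helper is mine) showing that this code is not uniquely decodable: the messages AB and C produce the same bit string.

    # Code from the slide: A -> 0, B -> 1, C -> 01.
    code = {"A": "0", "B": "1", "C": "01"}

    def encode(message):
        # Concatenate the codewords of the individual symbols.
        return "".join(code[symbol] for symbol in message)

    print(encode("AB"))  # 01
    print(encode("C"))   # 01 -- same bits, so the receiver cannot tell them apart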

Basic terms: Prefix-free. If we want to code a series of messages, what would be a great property for a code to have? Define "prefix-free code". For historical reasons those are more often referred to as "prefix codes". Find a prefix-free code for {A, B, C}. Is it uniquely decodable? Is every prefix-free code uniquely decodable? Is every uniquely decodable code prefix-free?

Prefix-free code.. can always be represented as a tree with symbols at the leaves.
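
As an illustration, a minimal Python sketch of this tree view (the particular code A → 0, B → 10, C → 11 is an assumed example, not taken from the slide): decoding a prefix-free code amounts to walking the tree from the root and emitting a symbol every time a leaf is reached.

    # Inner nodes are dicts keyed by the next bit, leaves are symbols.
    tree = {"0": "A", "1": {"0": "B", "1": "C"}}   # A -> 0, B -> 10, C -> 11

    def decode(bits):
        result, node = [], tree
        for bit in bits:
            node = node[bit]
            if isinstance(node, str):   # reached a leaf
                result.append(node)
                node = tree             # restart from the root
        return "".join(result)

    print(decode("0101100"))  # ABCAA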

Compression. Consider some previously derived code for {A, B, C}. Is it good for compression purposes? Define expected code length. Let event probabilities be as follows: A 0.50, B 0.25, C 0.25. Find the shortest possible prefix-free code.
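
A possible answer, checked numerically (the code A → 0, B → 10, C → 11 is a hand-derived candidate; the snippet merely evaluates the definition of expected code length):

    probs = {"A": 0.50, "B": 0.25, "C": 0.25}
    code = {"A": "0", "B": "10", "C": "11"}   # a candidate prefix-free code

    # Expected code length = sum over symbols of P(symbol) * codeword length.
    expected_length = sum(probs[s] * len(code[s]) for s in probs)
    print(expected_length)  # 0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits per symbol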

Compression & Prefix coding. Does the prefix-free property sacrifice code length? No! For each uniquely decodable code there exists a prefix code with the same codeword lengths.

Huffman code. Consider the following event probabilities: A 0.50, B 0.25, C 0.125, D 0.125, and some event sequence ADABAABACDABACBA. Replace all events C and D with a new event Z. Construct the optimal code for {A, B, Z}. Extend this code to a new code for {A, B, C, D}.

Huffman coding algorithm. Generalize the previous construction to obtain an optimal prefix-free code. Use Huffman coding to encode YAYBANANABANANA. Compare its efficiency to a straightforward 2-bit encoding. D. Huffman. A Method for the Construction of Minimum-Redundancy Codes, 1952.
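
A compact sketch of the construction in Python (a heap-based variant; tie-breaking among equal frequencies is an arbitrary implementation choice, so the exact codewords may differ from a hand-built tree, although the codeword lengths will match):

    import heapq
    from collections import Counter

    def huffman_code(freqs):
        # Repeatedly merge the two least frequent subtrees (Huffman, 1952).
        # Heap entries: (frequency, tie-breaker, {symbol: codeword-so-far}).
        heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            f0, _, code0 = heapq.heappop(heap)
            f1, _, code1 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in code0.items()}
            merged.update({s: "1" + c for s, c in code1.items()})
            heapq.heappush(heap, (f0 + f1, counter, merged))
            counter += 1
        return heap[0][2]

    text = "YAYBANANABANANA"
    code = huffman_code(Counter(text))
    encoded = "".join(code[ch] for ch in text)
    print(code)   # e.g. {'A': '0', 'N': '10', 'Y': '110', 'B': '111'}
    print(len(encoded), "bits, vs", 2 * len(text), "for a fixed 2-bit code")  # 27 vs 30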

Huffman coding in practice. Is just saving the result of Huffman coding to a file enough? What else should be done? How? Straightforward approach: dump the tree using a preorder traversal. Smarter approach: save only the code lengths. Wikipedia: Canonical Huffman Code. RFC 1951: DEFLATE Compressed Data Format Specification version 1.3, Section 3.2.2.
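
One way to realize the "save only code lengths" idea is a canonical Huffman code: both coder and decoder sort the symbols by (length, symbol) and assign consecutive binary values, so the lengths alone determine the codewords. A minimal sketch of the idea (the lengths below come from the YAYBANANABANANA example above; see RFC 1951 for the exact convention used by DEFLATE):

    def canonical_code(lengths):
        # Assign codewords from code lengths alone: symbols are sorted by
        # (length, symbol) and given consecutive binary values.
        code, prev_len, value = {}, 0, 0
        for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
            value <<= (length - prev_len)   # make room whenever the length grows
            code[sym] = format(value, f"0{length}b")
            value += 1
            prev_len = length
        return code

    # Only the lengths need to be stored in the file; the decoder rebuilds
    # exactly the same codewords from them.
    print(canonical_code({"A": 1, "N": 2, "Y": 3, "B": 3}))
    # {'A': '0', 'N': '10', 'B': '110', 'Y': '111'}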

Huffman code optimality. Consider an alphabet sorted by event (letter) probability, e.g. x1 → 0.42, x2 → 0.25, ..., x9 → 0.01, x10 → 0.01. Is there just a single optimal code for it, or several of them?

Huffman code optimality. Show that each optimal code has: l(x1) ≤ l(x2) ≤ ... ≤ l(x10). Show that there is at least one optimal code where x9 and x10 are siblings in the prefix tree. Let L be the expected length of the optimal code. Merge x9 and x10, and let L_s be the expected length of the resulting smaller code. Express L in terms of L_s. Complete the proof.

Huffman code in real life. Which of those use Huffman coding? DEFLATE (ZIP, GZIP), JPEG, PNG, GIF, MP3, MPEG-2. All of them do, as a post-processing step.

Shannon-Fano code. I randomly chose a letter from this probability distribution: A 0.45, B 0.35, C 0.125, D 0.125. You need to guess it in the smallest expected number of yes/no questions. Devise an optimal strategy.

Shannon-Fano code. Constructs a prefix code in a top-down manner: split the alphabet into two parts with as equal total probability as possible; construct a code for each part; prepend 0 to the codes of the first part and prepend 1 to the codes of the second part. Is Shannon-Fano the same as Huffman?
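
A recursive sketch of this top-down construction (the split heuristic, minimizing the difference between the two halves, is one common way to read "as equal probability as possible"; the example distribution is the power-of-two one from the Huffman slide, for which the result happens to be optimal):

    def shannon_fano(probs):
        items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

        def build(items):
            if len(items) == 1:
                return {items[0][0]: ""}
            # Find the split point that balances the two halves best.
            total, acc, split, best = sum(p for _, p in items), 0.0, 1, float("inf")
            for i in range(1, len(items)):
                acc += items[i - 1][1]
                if abs(total - 2 * acc) < best:
                    best, split = abs(total - 2 * acc), i
            left = {s: "0" + c for s, c in build(items[:split]).items()}
            right = {s: "1" + c for s, c in build(items[split:]).items()}
            return {**left, **right}

        return build(items)

    print(shannon_fano({"A": 0.50, "B": 0.25, "C": 0.125, "D": 0.125}))
    # {'A': '0', 'B': '10', 'C': '110', 'D': '111'}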

Shannon-Fano & Huffman. Shannon-Fano is not always optimal. Show that it is optimal, though, for letter probabilities of the form 1/2^k.

log(p) as amount of information. Let letter probabilities all be of the form p = 1/2^k. Show that for the optimal prefix code, the length of the codeword for a letter with probability p_i = 1/2^(k_i) is exactly k_i = log2(1/p_i) = -log2(p_i).

Why logarithms? Intuitively, we want a measure of information to be additive: receiving N equivalent events must correspond to N times the information in a single event. However, probabilities are multiplicative. Therefore, the most logical way to measure the information of an event is via the logarithm of its probability, i.e. log2(1/p).

The thing to remember. log2(1/p) is the information content of a single random event with probability p. For p of the form 2^(-k) it is exactly the number of bits needed to code this event using an optimal binary prefix-free code.

The thing to remember. log2(1/p) is the information content of a single random event with probability p. For p of the form 2^(-k) it is exactly the number of bits needed to code this event using an optimal binary prefix-free code. For other values of p the information content is not an integer. Obviously you can't use something like 2.5 bits to encode a symbol. However, for longer texts you can code multiple symbols at once, and in this case you can achieve an average coding rate of this number (e.g. 2.5) of bits per each occurrence of the corresponding event.
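
In code the definition is a one-liner; the probabilities below are arbitrary illustration values (the last one shows a non-integer information content, as discussed above):

    from math import log2

    for p in (0.5, 0.25, 0.125, 0.1):
        print(p, log2(1 / p))
    # 0.5 -> 1.0 bit, 0.25 -> 2.0 bits, 0.125 -> 3.0 bits,
    # 0.1 -> ~3.32 bits (not an integer)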

Expected codeword length. Let letter probabilities all be of the form p = 1/2^k. What is the expected code length for the optimal binary prefix-free code?

The thing to remember. For a given discrete probability distribution, the function H(p1, p2, ..., pn) = p1 · log2(1/p1) + ... + pn · log2(1/pn) is called the entropy of this distribution.
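
A direct translation into Python (the zero-probability guard is my addition, using the convention 0 · log2(1/0) = 0):

    from math import log2

    def entropy(probs):
        # H(p1, ..., pn) = sum of p_i * log2(1 / p_i), skipping zero terms.
        return sum(p * log2(1 / p) for p in probs if p > 0)

    print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
    print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits (uniform distribution)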

Meaning of entropy. The average codeword length L for both Huffman and Shannon-Fano codes satisfies: H(P) ≤ L < H(P) + 1.
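
A quick numeric check (the distribution is a hand-picked example, not from the slides; the code lengths 1, 2, 3, 3 are what the Huffman construction gives for it when worked out by hand):

    from math import log2

    probs   = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
    lengths = {"A": 1,   "B": 2,   "C": 3,   "D": 3}   # Huffman codeword lengths

    H = sum(p * log2(1 / p) for p in probs.values())
    L = sum(probs[s] * lengths[s] for s in probs)
    print(H, L, H + 1)   # roughly 1.85 <= 1.9 < 2.85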

Meaning of entropy. Shannon's Source Coding Theorem: a sequence of N events drawn from a distribution P can be losslessly represented as a sequence of N·H(P) bits for sufficiently large N. Conversely, it is impossible to losslessly represent the sequence using less than N·H(P) bits.

The things to remember. log2(1/p) is the information content of a single random event with probability p, measured in bits. H(P) is the expected information content for the distribution P, measured in bits.

The things to remember. log2(1/p) is the information content of a single random event with probability p, measured in bits. I.e. it is the expected number of bits necessary to optimally encode an event with such probability. H(P) is the expected information content for the distribution P, measured in bits. I.e. it is the expected number of bits necessary to optimally encode a single random event from this distribution.

Demonstrate an N-element distribution with zero entropy. Demonstrate an N-element distribution with maximal entropy. Define entropy for a continuous distribution p(x).

Is Huffman code good for coding: Images? Music? Text? None of them, because Huffman coding assumes an i.i.d. sequence, yet all of those have a lot of structure. What is it good for? It is good for coding random-like sequences.

Say we need to encode the text THREE SWITCHED WITCHES WATCH THREE SWISS SWATCH WATCH SWITCHES. WHICH SWITCHED WITCH WATCHES WHICH SWISS SWATCH WATCH SWITCH? Can we code this better than Huffman? Of course, if we use a dictionary. Can we build the dictionary adaptively from the data itself?

Lempel-Ziv-Welch algorithm. Say we want to code the string AABABBCAB. Start with a dictionary {0 → ""}. Scan the string from the beginning. Find the longest prefix present in the dictionary (0, ""). Read one more letter, A. Output the prefix id and this letter: (0, A). Append <current prefix><current letter> to the dictionary. New dictionary: {0 → "", 1 → "A"}. Finish the coding. Terry Welch, A Technique for High-Performance Data Compression, 1984.
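
A sketch of the coder as described on the slide, in Python (this pair-output formulation is really the LZ78 flavour of the Lempel-Ziv family; classic LZW emits only dictionary indices, but the dictionary-growing idea is the same):

    def lz_encode(text):
        dictionary = {"": 0}          # entry 0 is the empty string
        output, current = [], ""
        for letter in text:
            if current + letter in dictionary:
                current += letter     # keep extending the known prefix
            else:
                output.append((dictionary[current], letter))
                dictionary[current + letter] = len(dictionary)
                current = ""
        if current:                   # flush a trailing prefix with no new letter
            output.append((dictionary[current], ""))
        return output

    print(lz_encode("AABABBCAB"))
    # [(0, 'A'), (1, 'B'), (2, 'B'), (0, 'C'), (2, '')]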

LZW Algorithm Unpack the obtained code. Can we do smarter initialization? If we pack a long text, the dictionary may bloat. How do we handle it? In practice LZW coding is followed by Huffman (or a similar) coding.
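
For the "unpack" question, a matching decoder for the pair-output sketch above: it rebuilds the same dictionary while reading the pairs, so no dictionary needs to be transmitted.

    def lz_decode(pairs):
        dictionary = [""]             # grows exactly as in the coder
        text = []
        for prefix_id, letter in pairs:
            entry = dictionary[prefix_id] + letter
            text.append(entry)
            dictionary.append(entry)
        return "".join(text)

    print(lz_decode([(0, 'A'), (1, 'B'), (2, 'B'), (0, 'C'), (2, '')]))  # AABABBCAB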

Theorem LZW coding is asymptotically optimal. I.e. as the length of the string goes to infinity, the compression ratio approaches the best possible (given some conditions).

LZW and variations in real life. Which of those use variations of LZW? DEFLATE (ZIP, GZIP), JPEG, PNG, GIF, MP3, MPEG-2.

LZW and variations in real life. Which of those use variations of LZW? DEFLATE (ZIP, GZIP), JPEG, PNG, GIF, MP3, MPEG-2. Remember, LZW is aimed at text-like data with many repeating substrings. It is used in GIF after the run-length encoding step (which produces that kind of data). Not sure why PNG uses it, but probably for a similar reason.

Ideal compression? Given a string of bytes, what would be the theoretically best way to encode it?

Kolmogorov complexity The Kolmogorov complexity of a byte string is the length of the shortest program which outputs this string.

Kolmogorov complexity. Can we achieve Kolmogorov complexity when packing?

Kolmogorov complexity Theorem Kolmogorov complexity is not computable.

Summary. Thou shalt study Information Theory! The Huffman code is a length-wise optimal uniquely decodable code. log2(1/p) is the information content of an event. H(P) is the information content of a distribution. LZW is asymptotically optimal. Kolmogorov complexity is a fun (but practically useless) idea.