The impossible patent: an introduction to lossless data compression. Carlo Mazza



Plan Introduction Formalization Theorem A couple of good ideas

Introduction

What is data compression? Data compression is any procedure that reduces the size of information. It is used today in many applications, especially for digital data: generic file compression (ZIP, RAR, etc.), audio compression (MP3, AAC, FLAC, etc.), image compression (JPG, GIF, PNG, etc.), video compression (AVI, MP4, WMV, etc.).

(Very) Brief historical overview 1838: Morse code. 1940s: information theory (Shannon, Fano, Huffman). 1970s: LZW (Lempel, Ziv and Welch), Microsoft and Apple, email. 1980s: ARJ, PKZIP, LHarc, BBSs and newsgroups. 1990s: JPG, MP3, the web, browsers, Yahoo and Google. 2000s: H.264, AAC, MP4, M4V, the dot-com bubble, Facebook.

Screenshot of PKZIP 2.04g, created on February 15, 2007 using DOSBox

Different kinds of compression Lossless compression: ZIP, RAR, FLAC, PNG Lossy compression: MP3, JPG, MP4, AAC

Formalization

Lossless compression Lossless compression is compression that does not lose information, i.e., there is another operation, decompression, such that compressing and then decompressing a file gives back exactly the original file.

No loss of information SMS messages (English, then Italian texting slang): "hi m8, r u k? sry i 4gt 2 cal u lst nite. why dnt we go c movie 2nite? c u l8r" "c 6? xke nn ho bekkato ness1 in 3no? cmq c vdm + trd nel pom" Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht frist and lsat ltteer is at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.

Loss of information Jane S., a chief sub editor and editor, can always be found hard at work in her cubicle. Jane works independently, without wasting company time talking to colleagues. She never thinks twice about assisting fellow employees, and she always finishes given assignments on time. Often Jane takes extended measures to complete her work, sometimes skipping coffee breaks. She is a dedicated individual who has absolutely no vanity in spite of her high accomplishments and profound knowledge in her field. I firmly believe that Jane can be classed as a high-caliber employee, the type which cannot be dispensed with. Consequently, I duly recommend that Jane be promoted to executive management, and a proposal will be sent away as soon as possible.

Formalization We try to formalize the situation: let F be a file, a sequence of ones and zeros; let L(F) be the length of the file F; we want to find a procedure that from F yields another file G in such a way that L(G) ≤ L(F). How many files of length N are there? And how many of length at most N?

Compression as a function We think of compression as a function f from the set of files to itself such that L(f(F)) ≤ L(F). What properties do we need from this function for the compression to be lossless? The function f(F)=0 surely compresses but loses information; the function f(F)=F surely does not lose information but does not compress either. What is the property that distinguishes lossless from lossy compression?

Compression as a function As we said before, the compression is lossless if there is another operation which recovers the original file. The function f models lossless compression if there is another function g such that for every file F we have g(f(F))=F, that is (g ∘ f)(F)=F, that is g ∘ f = id. We say that f has a left inverse.

Left inverses and injective maps Theorem: A function f admits a left inverse if and only if it is injective. Proof: Say f is a map from X to Y. Suppose f is injective. Then every y in Y is the image of at most one x in X. We define the map g by sending every y which is hit back to its unique preimage x; every other y can go wherever it wants. It is clear that for every x in X, g(f(x))=x.

Left inverses and injective maps Proof (cont'd): Suppose now that f admits a left inverse, call it g. Suppose that f(x)=f(x′). Then g(f(x))=g(f(x′)), but x=g(f(x))=g(f(x′))=x′, and therefore x=x′, that is, f is injective. We have managed to translate an intuitive property ("losslessness") into a precise mathematical concept (injectivity).

Theorem

Limits of lossless compression WEB Technologies; Premier Research Corporation (MINC); the Hyper Space method of Matthew Burch, Pegasus Web Services Inc. (patent 7,096,360). Actually... Theorem: There is no perfect lossless compression.

Proof by contradiction Theorem: There is no function f such that L(f(F)) ≤ L(F) for every file F, with L(f(F)) < L(F) for at least one F. Proof: Suppose such a function exists. Let F be a file which is actually compressed and let G=f(F). Consider L(f(G)). If L(f(G))=L(G), then let H=f(G)=f(f(F)) and consider L(f(H)), and so on. Since f is injective, it cannot hit the same file twice.

Proof (continued) So the length will have to decrease eventually. But then we will eventually reach files of length one, from which we cannot go any further, which leads to a contradiction.

Schubfachprinzip Dirichlet's principle (1834), the pigeonhole principle: Let f be a function from a set A to a set B. If the number of elements of B is strictly less than that of A, then f is not injective.

Let's count Theorem: There is no function f which compresses at least one file without expanding any (i.e., L(f(F)) ≤ L(F) for all F and L(f(F)) < L(F) for at least one F). Proof: Let N be the minimal length of a file which is compressed. The files of length N-1 number 2^(N-1), and so the files of length at most N-1 number 2^(N-1) + 2^(N-2) + ... + 2^1 = 2^N - 2. By minimality of N, files of length at most N-1 are sent to files of length at most N-1; adding one compressed file of length N, f sends a set of size 2^N - 2 + 1 into a set of size 2^N - 2. By the pigeonhole principle, it cannot be injective.
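The counting argument can even be checked by brute force for tiny file lengths. The sketch below (function names are mine) enumerates every candidate compressor on bit strings of length at most 2 and confirms that none of them is injective while compressing at least one file and expanding none:

```python
from itertools import product

def files_up_to(n):
    """All non-empty bit strings of length at most n."""
    return ["".join(b) for k in range(1, n + 1) for b in product("01", repeat=k)]

files = files_up_to(2)   # ['0', '1', '00', '01', '10', '11']

def valid(f):
    # never expands, compresses at least one file, and is injective (lossless)
    never_longer = all(len(g) <= len(F) for F, g in f.items())
    compresses   = any(len(g) <  len(F) for F, g in f.items())
    injective    = len(set(f.values())) == len(f)
    return never_longer and compresses and injective

# exhaustive search over all 6^6 = 46656 candidate compressors
found = any(valid(dict(zip(files, images)))
            for images in product(files, repeat=len(files)))
print(found)   # False: no such compressor exists
```

The same search succeeds immediately if either requirement is dropped, which is exactly what the theorem says.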

Impossible compression So there is no universal compression function. Actually, looking at the proof, it is clear that if something gets compressed, something else must grow. So, if we have no good ideas, it is better to leave everything as is.

A couple of good ideas RLE and prefix codes

Run Length Encoding The Run Length Encoding (RLE) technique is one of the oldest compression algorithms: when a symbol repeats, we substitute the symbol and the number of its repetitions. aaaabbbcccdd -> 4a3b3c2d mathematics -> 1m1a1t1h1e1m1a1t1i1c1s It works badly for messages with few repetitions and very well for messages with lots of repetitions (fax).
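A minimal RLE sketch (function names are mine; single-digit run counts are assumed for simplicity):

```python
def rle_encode(s):
    """Replace each run of a repeated symbol with the run length
    followed by the symbol."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append(f"{j - i}{s[i]}")
        i = j
    return "".join(out)

def rle_decode(s):
    """Inverse of rle_encode (run counts assumed to be single digits)."""
    return "".join(s[k + 1] * int(s[k]) for k in range(0, len(s), 2))

print(rle_encode("aaaabbbcccdd"))  # 4a3b3c2d
print(rle_encode("mathematics"))   # 1m1a1t1h1e1m1a1t1i1c1s -- longer than the input!
print(rle_decode("4a3b3c2d"))      # aaaabbbcccdd
```

The second example shows the failure mode from the slide: with no repeated runs, RLE roughly doubles the message.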

ASCII encoding But we still need to encode the letters and frequencies in binary. In general, let's say we have a text message that we want to compress. The output will be a binary string, so we need to convert letters to binary numbers. One of the standards is ASCII, which assigns to each letter a 7-bit number (a string of 7 ones and zeros, so it encodes 2^7 = 128 symbols).
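A quick sketch of this fixed-width conversion (helper name is mine):

```python
def to_ascii_bits(text, width=7):
    # each character becomes a fixed-width binary string of its ASCII code
    return "".join(format(ord(c), f"0{width}b") for c in text)

bits = to_ascii_bits("abc")
print(bits)       # 110000111000101100011
print(len(bits))  # 21 = 7 bits * 3 letters
```

Because every symbol takes exactly `width` bits, the decoder can split the stream unambiguously without any separators.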

Dictionary Encoding We can choose a dictionary whose entries need not be single letters but may be longer strings. But we still need some kind of fixed length to be able to separate the frequencies from the symbols.

Exercise
011000010110000101100010011000110110000101100001011000100110001101100001011000010110001001100011
01100001011000010110001001100011

Reducing the number of bits Encoding mathematics in ASCII requires 7 bits × 11 letters = 77 bits. But mathematics has only 8 different letters, so 3 bits per letter are enough: 33 bits in total. We could use even fewer bits for the more frequent letters, e.g., a=0 m=1 t=10 h=11 e=100 i=101 c=110 s=111, so mathematics becomes 1010111001010101110111 (22 bits). But that string also decodes to iasaattihas.

Prefix codes We need to make sure that no codeword is a prefix of another codeword: a=0 b=1 c=10 doesn't work; a=0 b=10 c=11 works. Examples: international dialing prefixes (+1 USA, +39 Italy).
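The prefix condition is easy to test mechanically. A small sketch (function name is mine) that checks both codes from the slide:

```python
def is_prefix_free(codes):
    """True if no codeword is a prefix of another codeword, so the
    bit stream can be decoded greedily, symbol by symbol."""
    words = list(codes.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

print(is_prefix_free({"a": "0", "b": "1", "c": "10"}))   # False: "1" prefixes "10"
print(is_prefix_free({"a": "0", "b": "10", "c": "11"}))  # True
```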

Huffman coding We start with a frequency table of the letters. We produce a tree following these rules: create a tree for every letter, with weight equal to its frequency; create a new tree by joining the two trees with the least weights (and give it as weight the sum of the two weights); go on until there is only one tree. To read off the codes, we read the tree from the top down to the leaves.
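A compact sketch of this merge procedure using a binary heap (function name is mine; tie-breaking between equal weights may differ from the slides, but the code lengths come out the same):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Repeatedly merge the two lightest trees, prefixing '0' to the
    codes in one subtree and '1' to the codes in the other."""
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    counter = len(heap)  # unique tie-breaker so dicts are never compared
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes("assassins")
encoded = "".join(codes[c] for c in "assassins")
print(codes)
print(len(encoded))  # 15 bits in total
```

On assassins this reproduces the code lengths worked out below: 1 bit for s, 2 for a, and 3 each for i and n.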

Examples
1. aaaabbbccdd
   a. RLE: 4a3b2c2d -> 10000111010111011 (17 bits)
   b. Huffman: (tree on slide)
2. mathematics
   a. RLE: 1m1a1t1h1e1m1a1t1i1c1s (3 × 11 = 33 bits)
   b. Huffman: (tree on slide)

assassins: frequencies (5,s) (2,a) (1,i) (1,n). Merging the two lightest trees at each step: i(1) + n(1) -> 2; a(2) + (i,n)(2) -> 4; s(5) + 4 -> 9, the full tree.

So, in the end: s=0 a=10 i=110 n=111, and assassins = 100010001101110 (15 bits). Try sessions, sassafrasses, mummy, beekeeper, but not mathematics.

Advantages and disadvantages RLE: one can start compressing at once (there is no need to read the whole message to construct a frequency table). RLE: works especially well when there are few symbols and lots of repetitions. Huffman: works well when the frequencies are not close to each other (natural language). Huffman: works especially well when the frequencies are powers of two.

That s all folks!

(Very) Brief history of data compression 1838: Morse code. 1940s: information theory (Shannon, Fano, Huffman). 1970s: LZW (Lempel, Ziv and Welch), Microsoft, Apple. 1980s: ARJ, PKZIP, LHarc (BBSs and newsgroups). 1990s: JPG, MP3 (the web and browsers). 1994: Yahoo. 1998: Google. 2001: dot-com bubble. 2004: Facebook.