Compression Outline: Algorithms in the Real World. Lempel-Ziv Algorithms. LZ77: Sliding Window Lempel-Ziv

15-853: Algorithms in the Real World
Data Compression III

Compression Outline
Introduction: Lossy vs. Lossless, Benchmarks, ...
Information Theory: Entropy, etc.
Probability Coding: Huffman + Arithmetic Coding
Applications of Probability Coding: PPM + others
Lempel-Ziv Algorithms: LZ77, gzip, LZ78, compress (not covered in class)
Other Lossless Algorithms: Burrows-Wheeler
Lossy algorithms for images: JPEG, MPEG, ...
Compressing graphs and meshes: BBK

Lempel-Ziv Algorithms
LZ77 (Sliding Window)
  Variants: LZSS (Lempel-Ziv-Storer-Szymanski)
  Applications: gzip, Squeeze, LHA, PKZIP, ZOO
LZ78 (Dictionary Based)
  Variants: LZW (Lempel-Ziv-Welch), LZC
  Applications: compress, GIF, CCITT (modems), ARC, PAK
Traditionally LZ77 was better but slower, but the gzip version is almost as fast as any LZ78.

LZ77: Sliding Window Lempel-Ziv
[Figure: a string split at the cursor into the dictionary (previously coded text) and the lookahead buffer]
Dictionary and buffer "windows" are fixed length and slide with the cursor.
Repeat:
  Output (p, l, c) where
    p = position of the longest match that starts in the dictionary (relative to the cursor)
    l = length of longest match
    c = next char in buffer beyond longest match
  Advance window by l + 1
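The encoding loop above can be sketched in Python. This is a minimal brute-force sketch, not the slides' implementation: the function name `lz77_encode`, the default window sizes, the use of 0 for the position when there is no match, and the tie-breaking rule (keep the earliest of equally long matches) are all assumptions made for illustration.

```python
def lz77_encode(data, dict_size=6, buf_size=4):
    """Return a list of (p, l, c) triples for `data`.

    p is the distance back from the cursor to the start of the match
    (0 when there is no match), l is the match length, and c is the
    character that follows the match.
    """
    out = []
    cursor = 0
    while cursor < len(data):
        best_p, best_l = 0, 0
        # A match is at most buf_size long and must leave one character
        # after it to serve as the literal c.
        max_len = min(buf_size, len(data) - cursor - 1)
        window_start = max(0, cursor - dict_size)
        for start in range(window_start, cursor):
            length = 0
            # The match starts in the dictionary but may run past the
            # cursor into the lookahead buffer.
            while (length < max_len
                   and data[start + length] == data[cursor + length]):
                length += 1
            if length > best_l:
                best_p, best_l = cursor - start, length
        out.append((best_p, best_l, data[cursor + best_l]))
        cursor += best_l + 1  # advance the window by l + 1
    return out


triples = lz77_encode("aacaacabcabaaac")
```

The scan over every dictionary position costs time proportional to the window size at each step, which is why practical encoders index the window instead of searching it linearly.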

LZ77: Example
Dictionary (size = 6), buffer (size = 4). Each row shows the text with the cursor marked by "|", followed by the triple output at that step.

  | a a c a a c a b c a b a a a c      (_,0,a)
  a | a c a a c a b c a b a a a c      (1,1,c)
  a a c | a a c a b c a b a a a c      (3,4,b)
  a a c a a c a b | c a b a a a c      (3,3,a)
  a a c a a c a b c a b a | a a c      (1,2,c)

LZ77 Decoding
Decoder keeps the same dictionary window as the encoder.
For each message it looks it up in the dictionary and inserts a copy at the end of the string.
What if l > p? (Only part of the message is in the dictionary.)
E.g. dict = abcd, codeword = (2,9,e)
Simply copy from left to right:
  for (i = 0; i < length; i++)
    out[cursor+i] = out[cursor-offset+i]
Out = abcdcdcdcdcdce

LZ77 Optimizations used by gzip
LZSS: Output one of the following two formats:
  (0, position, length) or (1, char)
Uses the second format if length < 3.

  | a a c a a c a b c a b a a a c      (1,a)
  a | a c a a c a b c a b a a a c      (1,a)
  a a | c a a c a b c a b a a a c      (1,c)
  a a c | a a c a b c a b a a a c      (0,3,4)

Optimizations used by gzip (cont.)
1. Huffman code the positions, lengths and chars.
2. Non greedy: possibly use a shorter match so that the next match is better.
3. Use a hash table to store the dictionary.
   Hash keys are all strings of length 3 in the dictionary window.
   Find the longest match within the correct hash bucket.
   Puts a limit on the length of the search within a bucket.
   Within each bucket store in order of position.
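The left-to-right copy rule for the l > p case can be sketched in Python as follows. The function name `lz77_decode` and the `initial` parameter (text assumed to be already in the dictionary) are illustrative assumptions, not names from the slides.

```python
def lz77_decode(triples, initial=""):
    """Rebuild the text from (p, l, c) triples.

    `initial` is text assumed to be in the dictionary before decoding
    starts (hypothetical parameter, used to mirror the dict = abcd case).
    """
    out = list(initial)
    for p, l, c in triples:
        start = len(out) - p
        for i in range(l):
            # Copy one character at a time, left to right, so a match
            # with l > p can read characters written earlier in this
            # same copy.
            out.append(out[start + i])
        out.append(c)
    return "".join(out)


overlap = lz77_decode([(2, 9, "e")], initial="abcd")
```

Copying a whole slice at once would fail here because, when l > p, part of the source region does not exist until the copy is under way.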

The Hash Table
[Figure: the string a a c a a c a b c a b a a a c at positions 7 to 21, with hash buckets keyed by 3-character strings, each listing positions most recent first]
  aac: 19, 10, 7
  cab: 15, 12
  aca: 11, 8
  caa: 9

Theory behind LZ77
Sliding Window LZ is Asymptotically Optimal [Wyner-Ziv,94].
Will compress long enough strings to the source entropy as the window size goes to infinity.

$$H_n = \frac{1}{n} \sum_{X \in A^n} p(X) \log \frac{1}{p(X)}$$

$$H = \lim_{n \to \infty} H_n$$

Uses a logarithmic code (e.g. gamma) for the position.
Problem: "long enough" is really really long.

Comparison to Lempel-Ziv 78
Both LZ77 and LZ78 and their variants keep a "dictionary" of recent strings that have been seen.
The differences are:
  How the dictionary is stored (LZ78 is a trie)
  How it is extended (LZ78 only extends an existing entry by one character)
  How it is indexed (LZ78 indexes the nodes of the trie)
  How elements are removed

Lempel-Ziv Algorithms Summary
Adapts well to changes in the file (e.g. a Tar file with many file types within it).
Initial algorithms did not use probability coding and performed poorly in terms of compression. More modern versions (e.g. gzip) do use probability coding as a "second pass" and compress much better.
The algorithms are becoming outdated, but ideas are used in many of the newer algorithms.
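The hash-table lookup described for gzip (keys are the 3-character strings in the window, buckets ordered by position, bounded search per bucket) can be sketched in Python. This is a simplified illustration, not gzip's actual data structure: `index_trigrams`, `longest_match`, and the `max_chain` limit are hypothetical names, and positions here are 0-based.

```python
from collections import defaultdict


def index_trigrams(text, start, end, key_len=3):
    """Map each key_len-character string lying inside text[start:end]
    to the list of positions where it begins, newest position first."""
    buckets = defaultdict(list)
    for pos in range(start, end - key_len + 1):
        buckets[text[pos:pos + key_len]].insert(0, pos)
    return buckets


def longest_match(text, cursor, buckets, key_len=3, max_chain=4):
    """Search only the bucket for the next key_len characters, looking
    at no more than max_chain candidates. Returns (position, length),
    or (None, 0) when the bucket is empty."""
    key = text[cursor:cursor + key_len]
    best_pos, best_len = None, 0
    for pos in buckets.get(key, [])[:max_chain]:
        length = 0
        while (cursor + length < len(text)
               and text[pos + length] == text[cursor + length]):
            length += 1
        if length > best_len:
            best_pos, best_len = pos, length
    return best_pos, best_len


text = "aacaacabcabaaac"
window = index_trigrams(text, 0, 8)         # dictionary = text[0:8]
match = longest_match(text, 8, window)      # match for "cabaaac"
```

Because buckets list the newest position first, truncating the search at `max_chain` keeps the most recent (shortest-offset) candidates, trading some match length for a bounded search time.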

Compression Outline
Introduction: Lossy vs. Lossless, Benchmarks, ...
Information Theory: Entropy, etc.
Probability Coding: Huffman + Arithmetic Coding
Applications of Probability Coding: PPM + others
Lempel-Ziv Algorithms: LZ77, gzip, compress, ...
Other Lossless Algorithms: Burrows-Wheeler, ACB
Lossy algorithms for images: JPEG, MPEG, ...
Compressing graphs and meshes: BBK

Burrows-Wheeler
Currently near best "balanced" algorithm for text.
Breaks file into fixed-size blocks and encodes each block separately.
For each block:
  Sort each character by its full context. This is called the block sorting transform.
  Use move-to-front transform to encode the sorted characters.
The ingenious observation is that the decoder only needs the sorted characters and a pointer to the first character of the original sequence.

Burrows-Wheeler: Example
Let's encode: d1 e2 c3 o4 d5 e6
We've numbered the characters to distinguish them.
Context "wraps" around. Last char is most significant.

  Context   Char
  ecode6    d1
  coded1    e2
  odede2    c3
  dedec3    o4
  edeco4    d5
  decod5    e6

After sorting by context:

  Context   Output
  dedec3    o4
  coded1    e2
  decod5    e6
  odede2    c3
  ecode6    d1
  edeco4    d5

Burrows-Wheeler (Continued)
Theorem: After sorting, equal valued characters appear in the same order in the output as in the most significant position of the context.
Proof sketch: Since the chars have equal value in the most-significant position of the context, they will be ordered by the rest of the context, i.e. the previous chars. This is also the order of the output since it is sorted by the previous characters.
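The block sorting transform in this context-based form, followed by move-to-front, can be sketched in Python. The function names, the 1-based `start` row, and the explicit `alphabet` argument are assumptions for illustration; the sketch sorts whole contexts directly, which is simple but far slower than what a production block sorter would do.

```python
def bw_encode(block):
    """Sort each character by its wrapped-around preceding context,
    with the nearest preceding character most significant.

    Returns (output, start): the characters in sorted-context order and
    the 1-based row that holds the first character of the block.
    """
    n = len(block)

    def context_key(i):
        # Preceding characters of position i, nearest first, wrapping.
        return [block[(i - k) % n] for k in range(1, n)]

    order = sorted(range(n), key=context_key)
    output = "".join(block[i] for i in order)
    start = order.index(0) + 1
    return output, start


def move_to_front_encode(text, alphabet):
    """Replace each character by its index in a list that moves the
    character just seen to the front."""
    table = list(alphabet)
    codes = []
    for ch in text:
        idx = table.index(ch)
        codes.append(idx)
        table.insert(0, table.pop(idx))
    return codes


sorted_chars, start_row = bw_encode("decode")
codes = move_to_front_encode(sorted_chars, "cdeo")
```

On the block "decode" this yields the same character order as the sorted table (o, e, e, c, d, d); runs of equal characters in that order become zeros after move-to-front, which is what makes the result easy to compress with a probability coder.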

Burrows-Wheeler: Decoding
Consider dropping all but the last character of the context.

  Context   Output
  a         c
  a         b
  a         b
  b         a
  b         a
  c         a

What follows the underlined a?
What follows the underlined b?
What is the whole string?
Answer: b, a, abacab

Burrows-Wheeler: Decoding
What about now?

  Context   Output   Rank
  a         c        6
  a         a        1
  a         b        4
  b         b        5
  b         a        2
  c         a        3

Answer: cabbaa
Can also use the "rank". The rank is the position of a character if it were sorted using a stable sort.

Burrows-Wheeler Decode
  Function BW_Decode(In, Start, n)
    S = MoveToFrontDecode(In, n)
    R = Rank(S)
    j = Start
    for i = 1 to n do
      Out[i] = S[j]
      j = R[j]
Rank gives position of each char in sorted order.

Decode Example

  S     Rank(S)
  o4    6
  e2    4
  e6    5
  c3    1
  d1    2
  d5    3

Out: d1 e2 c3 o4 d5 e6
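The BW_Decode pseudocode can be turned into a short runnable Python sketch. Rows are 1-based to match the pseudocode; the explicit `alphabet` argument to the move-to-front decoder, and the choice of `start` as the row holding the first character of the original block (row 5 for the "decode" example, where d1 sits), are assumptions made for illustration.

```python
def move_to_front_decode(codes, alphabet):
    """Invert move-to-front: each code indexes a list that moves the
    character just emitted to the front."""
    table = list(alphabet)
    chars = []
    for idx in codes:
        ch = table.pop(idx)
        chars.append(ch)
        table.insert(0, ch)
    return "".join(chars)


def rank(s):
    """rank(s)[j] is the 1-based position character s[j] would take
    if s were sorted with a stable sort."""
    order = sorted(range(len(s)), key=lambda j: s[j])  # sorted() is stable
    r = [0] * len(s)
    for sorted_pos, j in enumerate(order, start=1):
        r[j] = sorted_pos
    return r


def bw_decode(codes, start, alphabet):
    """Follow the rank links from row `start` (1-based) for n steps."""
    s = move_to_front_decode(codes, alphabet)
    r = rank(s)
    out = []
    j = start
    for _ in range(len(s)):
        out.append(s[j - 1])
        j = r[j - 1]
    return "".join(out)


decoded = bw_decode([3, 3, 0, 2, 3, 0], start=5, alphabet="cdeo")
```

The stable sort is what makes the ranks correct: by the theorem on equal valued characters, the k-th occurrence of a character in the output corresponds to the k-th occurrence of that character among the sorted contexts.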

Overview of Text Compression
PPM and Burrows-Wheeler both encode a single character based on the immediately preceding context.
LZ77 and LZ78 encode multiple characters based on matches found in a block of preceding text.
Can you mix these ideas, i.e., code multiple characters based on immediately preceding context?
  BZ does this, but they don't give details on how it works (current best compressor).
  ACB also does this (close to best).

ACB (Associate Coder of Buyanovsky)
Keep dictionary sorted by context (the last character is the most significant).
  Find longest match for context.
  Find longest match for contents.
  Code:
    Distance between matches in the sorted order
    Length of contents match
Has aspects of Burrows-Wheeler and LZ77.

Example for the text "decode":

  Context   Contents
  dec       ode
  d         ecode
  decod     e
  de        code
  deco      de
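The context-sorted dictionary for the "decode" example can be built with a few lines of Python. This sketch covers only the dictionary ordering, not the matching and coding steps of ACB, whose details the slides do not give; the function name `context_sorted_dictionary` is a hypothetical one.

```python
def context_sorted_dictionary(text):
    """Split `text` at every interior position into (context, contents)
    and sort the pairs by context, comparing contexts from their last
    character backwards (last character most significant)."""
    entries = [(text[:i], text[i:]) for i in range(1, len(text))]
    # Reversing the context makes ordinary string comparison start at
    # the last character.
    entries.sort(key=lambda entry: entry[0][::-1])
    return entries


for context, contents in context_sorted_dictionary("decode"):
    print(f"{context:>6} | {contents}")
```

Sorting on the reversed context places entries with similar recent history next to each other, so the entry nearest the current context in sorted order is a natural place to start looking for a contents match.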