Lossless compression II


CSCI 470: Web Science
Keith Vertanen

Overview
- Flavors of compression
  - Static vs. dynamic vs. adaptive
- Lempel-Ziv-Welch (LZW)
  - Fixed-length codeword
  - Variable-length pattern
- Statistical coding
  - Arithmetic coding
  - Prediction by Partial Match (PPM)

Compression models
- Static
  - Predefined map for all text, e.g. ASCII, Morse code
  - Not optimal: different texts = different statistics
- Dynamic
  - Generate model based on text
  - Requires initial pass before compression can start
  - Must transmit the model (e.g. Huffman coding)
- Adaptive
  - More accurate modeling = better compression
  - Decoding must start from beginning (e.g. LZW)

LZW
- Lempel-Ziv-Welch (LZW)
  - LZW and its Lempel-Ziv relatives are the basis for many popular formats, e.g. PNG, 7zip, gzip, jar, PDF, compress, pkzip, GIF, TIFF
- Algorithm basics:
  - Maintain a table of fixed-length codewords for variable-length patterns in the input
  - Built progressively by compressor and expander
  - Table does not need to be transmitted!
  - Table is of fixed size
    - Entries for all single characters in alphabet
    - Entries for longer substrings encountered

Details: LZW compression example
- 7-bit ASCII characters, 8-bit codewords
  - For real use, 8-bit characters, ~12-bit codewords
- Alphabet: 128 characters + 128 longer strings
- Codeword: 2-digit hex value
  - 00-7F = single characters, e.g. 41 = A, 52 = R
  - 80 = end of file
  - 81-FF = longer strings, e.g. 81 = AB, 88 = ABR
- Employs lookahead character to add codewords

Compressing
- Find longest string s in table that is prefix of unscanned input
- Write codeword of s
- Scan one character c ahead
- Associate next free codeword with s + c

Input: A B R A C A D A B R A B R A B R A

The table starts with the initial set of single-character codewords (e.g. A 41, B 42, C 43, D 44, R 52).

Step  Match s  Codeword written  Lookahead c  New table entry
1     A        41                B            AB 81
2     B        42                R            BR 82
3     R        52                A            RA 83
4     A        41                C            AC 84
5     C        43                A            CA 85
6     A        41                D            AD 86
7     D        44                A            DA 87
8     AB       81                R            ABR 88
9     RA       83                B            RAB 89
10    BR       82                A            BRA 8A
11    ABR      88                A            ABRA 8B
12    A        41                (end)        (none)

Matches: A B R A C A D AB RA BR ABR A
Output:  41 42 52 41 43 41 44 81 83 82 88 41

- Input: 17 ASCII characters * 7 bits = 119 bits
- Output: 12 codewords * 8 bits = 96 bits
- Compression ratio: 96 / 119, about 81%
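The compression steps above can be sketched in self-contained Java. This is only an illustrative sketch, not the course's implementation: it assumes the 7-bit alphabet and 8-bit codewords of the example, uses a plain HashMap instead of a trie, and returns the codewords as a list rather than writing bits. The class and method names (LzwCompressSketch, compress, toHex) are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LzwCompressSketch {
    static final int R = 128;   // alphabet size: 7-bit ASCII, as in the example
    static final int L = 256;   // number of codewords: 2^8

    // Returns the codewords for input, ending with the EOF codeword R (hex 80).
    // Assumes every input character is below R.
    static List<Integer> compress(String input) {
        Map<String, Integer> table = new HashMap<>();
        for (int i = 0; i < R; i++) table.put("" + (char) i, i);
        int next = R + 1;       // R itself is reserved for EOF

        List<Integer> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            // Grow the match one character at a time while it is still in the table.
            // This finds the longest match because every prefix of an entry is also an entry.
            int end = pos + 1;
            while (end < input.length() && table.containsKey(input.substring(pos, end + 1))) end++;
            out.add(table.get(input.substring(pos, end)));
            // Add match + lookahead character if a lookahead exists and the table has room.
            if (end < input.length() && next < L) table.put(input.substring(pos, end + 1), next++);
            pos = end;
        }
        out.add(R);
        return out;
    }

    // Formats codewords as 2-digit hex values separated by spaces.
    static String toHex(List<Integer> codes) {
        StringBuilder sb = new StringBuilder();
        for (int c : codes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints 41 42 52 41 43 41 44 81 83 82 88 41 80
        System.out.println(toHex(compress("ABRACADABRABRABRA")));
    }
}
```

The output matches the trace, with the end-of-file codeword 80 appended. A hash map keeps the sketch short; the trie discussed next avoids re-hashing ever longer substrings.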

Compression data structure
- LZW uses two table operations:
  - Find longest-prefix match from current position
  - Add entry with character added to longest match
- Trie data structure, "retrieval"
  - Linked tree structure
  - Path in tree defines string

[Figure: trie for the LZW table after compressing ABRACADABRABRABRA. Single-character nodes include A 41, D 44, R 52; longer paths are AB 81, BR 82, RA 83, AC 84, CA 85, AD 86, DA 87, ABR 88, RAB 89, BRA 8A, ABRA 8B.]
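A minimal sketch of the two table operations on a trie, assuming one hash map of children per node. The course code uses a ternary search trie (TST) instead; the class LzwTrieSketch and its methods are hypothetical names for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

class LzwTrieSketch {
    // One trie node: the codeword of the string spelled by the path from the root
    // (-1 if that string has no codeword) and one child per next character.
    static final class Node {
        int code = -1;
        final Map<Character, Node> next = new HashMap<>();
    }

    final Node root = new Node();

    // Associates a codeword with a string, creating nodes along the path as needed.
    void put(String key, int code) {
        Node x = root;
        for (char ch : key.toCharArray()) x = x.next.computeIfAbsent(ch, k -> new Node());
        x.code = code;
    }

    // Returns the codeword stored for key, or -1 if key has none.
    int get(String key) {
        Node x = root;
        for (char ch : key.toCharArray()) {
            x = x.next.get(ch);
            if (x == null) return -1;
        }
        return x.code;
    }

    // Returns the length of the longest key that is a prefix of input starting at pos
    // (0 if no key matches).
    int longestPrefixLength(String input, int pos) {
        Node x = root;
        int best = 0;
        for (int i = pos; i < input.length(); i++) {
            x = x.next.get(input.charAt(i));
            if (x == null) break;
            if (x.code >= 0) best = i - pos + 1;
        }
        return best;
    }
}
```

Both operations walk a single path from the root, so their cost depends on the length of the match rather than on the number of table entries.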

Java implementation, LZW compression

    // BinaryStdIn, BinaryStdOut and TST come from the Sedgewick/Wayne algs4 library.
    private static final int R = 256;    // number of input chars
    private static final int L = 4096;   // number of codewords = 2^W
    private static final int W = 12;     // codeword width

    public static void compress() {
        String input = BinaryStdIn.readString();   // read input as string
        TST<Integer> st = new TST<Integer>();      // trie data structure
        for (int i = 0; i < R; i++)                // codewords for single chars
            st.put("" + (char) i, i);
        int code = R+1;                            // R is codeword for EOF

        while (input.length() > 0) {
            String s = st.longestPrefixOf(input);
            BinaryStdOut.write(st.get(s), W);      // write W-bit codeword for s
            int t = s.length();
            if (t < input.length() && code < L)
                st.put(input.substring(0, t + 1), code++);   // add new codeword
            input = input.substring(t);
        }
        BinaryStdOut.write(R, W);                  // write last codeword (EOF)
        BinaryStdOut.close();
    }

Expansion
- Write the current string val
- Read codeword x from input
- Set s to string for x
- Set next unassigned codeword to val + c, where c is 1st char in s
- Set val = s

Input: 41 42 52 41 43 41 44 81 83 82 88 41 80

The table starts with the initial set of single-character codewords (e.g. A 41, D 44, R 52).

Step  val (written)  x   s    c  New table entry  New val
1     A              42  B    B  AB 81            B
2     B              52  R    R  BR 82            R
3     R              41  A    A  RA 83            A
4     A              43  C    C  AC 84            C
5     C              41  A    A  CA 85            A
6     A              44  D    D  AD 86            D
7     D              81  AB   A  DA 87            AB
8     AB             83  RA   R  ABR 88           RA
9     RA             82  BR   B  RAB 89           BR
10    BR             88  ABR  A  BRA 8A           ABR
11    ABR            41  A    A  ABRA 8B          A
12    A              80  (end of file, stop)

Output: A B R A C A D AB RA BR ABR A

Expansion data structure
- For a W-bit codeword, look up string value
- An array of size 2^W

A tricky case: ABABABA

Input: A B A B A B A

Step  Match s  Codeword written  Lookahead c  New table entry
1     A        41                B            AB 81
2     B        42                A            BA 82
3     AB       81                A            ABA 83
4     ABA      83                (end)        (none)

Matches: A B AB ABA
Output:  41 42 81 83

Notice: right after adding the entry for "ABA", the subsequent text has "ABA" as a prefix, so codeword 83 is used immediately after it is created.

A tricky case: expansion of ABABABA

Input: 41 42 81 83 80

Step  val (written)  x   s    New table entry  New val
1     A              42  B    AB 81            B
2     B              81  AB   BA 82            AB
3     AB             83  ?    ABA 83           ABA
4     ABA            80  (end of file, stop)

- In step 3 we need to know the first character of 83 in order to enter 83 into the table!
- Fix: the first character of 83 must be the same as the first letter of the current string (81 = AB), i.e. "A". So 83 = AB + A = ABA.

Output: A B AB ABA

Java implementation, LZW expansion

    // BinaryStdIn and BinaryStdOut come from the Sedgewick/Wayne algs4 library.
    private static final int R = 256;    // number of input chars
    private static final int L = 4096;   // number of codewords = 2^W
    private static final int W = 12;     // codeword width

    public static void expand() {
        String[] st = new String[L];
        int i;                                   // next available codeword value

        // initialize symbol table with all 1-character strings
        for (i = 0; i < R; i++)
            st[i] = "" + (char) i;
        st[i++] = "";                            // (unused) lookahead for EOF

        int codeword = BinaryStdIn.readInt(W);
        String val = st[codeword];

        while (true) {
            BinaryStdOut.write(val);
            codeword = BinaryStdIn.readInt(W);
            if (codeword == R) break;
            String s = st[codeword];
            if (i == codeword) s = val + val.charAt(0);   // special case hack
            if (i < L) st[i++] = val + s.charAt(0);
            val = s;
        }
        BinaryStdOut.close();
    }
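For experimenting without the binary I/O library, the same expansion logic can be sketched over a list of codewords. This sketch assumes the 7-bit alphabet and 8-bit codewords of the worked examples and a stream that ends with the EOF codeword 80; LzwExpandSketch is a hypothetical name used only here.

```java
import java.util.Arrays;
import java.util.List;

class LzwExpandSketch {
    static final int R = 128;   // alphabet size: 7-bit ASCII, as in the example
    static final int L = 256;   // number of codewords: 2^8

    // Rebuilds the text from codewords; the list must end with the EOF codeword R.
    static String expand(List<Integer> codes) {
        String[] st = new String[L];
        int next;                                  // next unassigned codeword
        for (next = 0; next < R; next++) st[next] = "" + (char) next;
        st[next++] = "";                           // slot R is reserved for EOF

        StringBuilder out = new StringBuilder();
        int code = codes.get(0);
        if (code == R) return "";                  // empty message
        String val = st[code];
        for (int k = 1; ; k++) {
            out.append(val);
            code = codes.get(k);
            if (code == R) break;
            String s = st[code];
            // Tricky case: the codeword is the one about to be defined,
            // so its string must be val plus the first character of val.
            if (code == next) s = val + val.charAt(0);
            if (next < L) st[next++] = val + s.charAt(0);
            val = s;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The tricky case from the slides; prints ABABABA
        System.out.println(expand(Arrays.asList(0x41, 0x42, 0x81, 0x83, 0x80)));
    }
}
```

The expander always lags one table entry behind the compressor, which is why only the most recently created codeword can arrive before it is defined.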

LZW decisions
- How big of a symbol table?
  - How long is message?
  - Whole message similar model?
  - Many variations
- What to do when symbol table fills up?
  - Throw away and start over (e.g. GIF)
  - Throw away when not effective (e.g. Unix compress)
  - Many variations
- Put longer substrings in symbol table?
  - Many variations

Lempel-Ziv variants
- LZW variants
  - LZ77, published Lempel & Ziv, 1977, not patented
    - PNG
  - LZ78, published Lempel & Ziv in 1978, patented
  - LZW, Welch extension to LZ78, patented (expired 2003)
    - GIF, TIFF, Pkzip
  - Deflate = LZ77 variant + Huffman coding
    - 7zip, gzip, jar, PDF

Some experiments
- Compressing Moby Dick
- Using Java programs for LZW and Huffman
- Different dictionary sizes for LZW or resetting when full
- Compared to zip and gzip

Size       File
1,191,463  mobydick.txt
485,561    mobydick.gz
485,790    mobydick.zip
398,480    mobydick.xz
371,394    mobydick.bz2
667,651    mobydick.huff
597,060    mobydick.lzw
814,578    mobydick.huff.lzw
592,830    mobydick.lzw.huff
682,400    mobydick.lzw.reset
541,261    mobydick.lzw14
521,700    mobydick.lzw15
503,050    mobydick.lzw16
501,193    mobydick.lzw16.huff
521,700    mobydick.lzw17
514,393    mobydick.lzw18
571,548    mobydick.lzw20

Enter statistical coding
- Natural language quite predictable
  - ~1 bit of entropy per symbol
- Huffman coding still requires 1-bit min per symbol
  - We're forced to use an integral number of bits
- Dictionary-based (LZW and friends)
  - Memorizes sequences, but no notion of frequency
- Statistical coding
  - Use long, specific context for prediction
    - P(A | The_United_State_of_) = ?
  - Blend knowledge using contexts of different lengths
  - Adaptive: model updates and changes as text is seen
    - Often after every letter!
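To make the "integral number of bits" point concrete, here is a small sketch that computes Shannon entropy for a skewed two-symbol source. The probabilities 0.9 and 0.1 are a made-up example, not data from the lecture, and EntropySketch is a hypothetical name.

```java
class EntropySketch {
    // Shannon entropy in bits per symbol of a probability distribution.
    static double entropyBits(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0) h -= pi * Math.log(pi) / Math.log(2);
        }
        return h;
    }

    public static void main(String[] args) {
        double[] skewed = {0.9, 0.1};   // hypothetical two-symbol source
        // A Huffman code for two symbols must spend exactly 1 bit on each symbol,
        // whatever the probabilities are.
        System.out.printf("entropy = %.3f bits/symbol, Huffman = 1 bit/symbol%n",
                          entropyBits(skewed));
    }
}
```

For this source the entropy is roughly 0.47 bits per symbol, so a symbol-by-symbol Huffman code spends about twice what is needed. Arithmetic coding closes that gap by encoding the whole message as one number.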

MacKay, D.J.C. Information Theory, Inference, and Learning Algorithms.

Guess the phrase

A simple unigram model of English

Symbol  Probability
'       0.0071063
-       0.0000001
.       0.0000384
</s>    0.0368291
<sp>    0.1653810
a       0.0595248
b       0.0112864
c       0.0174441
d       0.0282733
e       0.0890307
f       0.0127512
g       0.0213974
h       0.0403836
i       0.0586443
j       0.0018080
k       0.0117826
l       0.0343399
m       0.0247835
n       0.0490316
o       0.0762119
p       0.0134453
q       0.0003078
r       0.0408972
s       0.0433802
t       0.0680194
u       0.0273347
v       0.0083669
w       0.0210079
x       0.0010829
y       0.0295698
z       0.0005395

From text to a real number
- Arithmetic coding
  - Message represented by real interval in [0, 1)
  - More precise interval = specifies more bits
- Example
  - e.g. [0.58272722, 0.58272724) = "it was the best of times"
  - Or any number in that interval, 0.58272723
- Alphabet = {a, e, i, o, u, !}
- Transmission = eaii!

Symbol  Probability  Range
a       0.2          [0.0, 0.2)
e       0.3          [0.2, 0.5)
i       0.1          [0.5, 0.6)
o       0.2          [0.6, 0.8)
u       0.1          [0.8, 0.9)
!       0.1          [0.9, 1.0)

Transmission = eaii!

After seeing  Range
(nothing)     [0.0, 1.0)
e             [0.2, 0.5)
a             [0.2, 0.26)
i             [0.23, 0.236)
i             [0.233, 0.2336)
!             [0.23354, 0.2336)

Each symbol narrows the current interval to that symbol's share of it, using the symbol ranges for the alphabet {a, e, i, o, u, !}.

Note: Encoder/decoder agree on symbol to terminate message, here we use !

Source: Bell, Cleary, Witten. Text Compression.
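The interval narrowing can be sketched as follows. This is an illustrative sketch only: it uses BigDecimal so the decimal arithmetic is exact, whereas practical arithmetic coders work incrementally with fixed-precision integers. The class ArithmeticIntervalSketch and its methods are hypothetical names; the symbol ranges are the ones from the lecture's example.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

class ArithmeticIntervalSketch {
    // Range [low, high) per symbol for the alphabet {a, e, i, o, u, !}.
    static final Map<Character, BigDecimal[]> RANGES = new HashMap<>();
    static {
        range('a', "0.0", "0.2");
        range('e', "0.2", "0.5");
        range('i', "0.5", "0.6");
        range('o', "0.6", "0.8");
        range('u', "0.8", "0.9");
        range('!', "0.9", "1.0");
    }

    static void range(char symbol, String low, String high) {
        RANGES.put(symbol, new BigDecimal[] { new BigDecimal(low), new BigDecimal(high) });
    }

    // Returns {low, high} of the interval that represents the message.
    // Assumes every character of message is in the alphabet above.
    static BigDecimal[] encode(String message) {
        BigDecimal low = BigDecimal.ZERO;
        BigDecimal high = BigDecimal.ONE;
        for (char c : message.toCharArray()) {
            BigDecimal[] r = RANGES.get(c);
            BigDecimal width = high.subtract(low);
            // Both new bounds are measured from the old low, so update high first.
            high = low.add(width.multiply(r[1]));
            low = low.add(width.multiply(r[0]));
        }
        return new BigDecimal[] { low, high };
    }

    public static void main(String[] args) {
        BigDecimal[] interval = encode("eaii!");
        System.out.println("[" + interval[0].stripTrailingZeros().toPlainString()
                + ", " + interval[1].stripTrailingZeros().toPlainString() + ")");
    }
}
```

For "eaii!" the bounds come out equal to 0.23354 and 0.2336, matching the last row of the table. Any number inside that interval identifies the message.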

Context modeling
- Prediction by Partial Match (PPM)
  - P(next symbol) depends on previous symbol(s)
  - Blending for dealing with 0-frequency problem
  - Easy to build as adaptive
    - Learn from the text as you go
    - No trie to send as in Huffman coding
  - 1984, but still competitive for text
  - Various variants, PPM-A/B/C/D/Z/*
  - Implemented in 7-zip, open source packages, PPM for XML, PPM for executables, ...

PPM
- An alphabet A with q symbols
- Order = how many previous symbols to use
  - 2 = condition on 2 previous symbols, P(x_n | x_{n-1}, x_{n-2})
  - 1 = condition on 1 previous symbol, P(x_n | x_{n-1})
  - 0 = no previous symbols, frequency of the current symbol only, P(x_n)
  - -1 = uniform over the alphabet, P(x_n) = 1/q
- For a given order:
  - Probability of next symbol based on counting occurrences given prior context
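Counting occurrences for one fixed order can be sketched like this. It is a simplified illustration with a single order and no escape or blending, and the text is scanned up front rather than adaptively; ContextModelSketch is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

class ContextModelSketch {
    // counts.get(context).get(symbol) = number of times symbol followed context
    final Map<String, Map<Character, Integer>> counts = new HashMap<>();

    // Scans text and counts every symbol under the `order` symbols that precede it.
    ContextModelSketch(int order, String text) {
        for (int i = order; i < text.length(); i++) {
            String context = text.substring(i - order, i);
            counts.computeIfAbsent(context, k -> new HashMap<>())
                  .merge(text.charAt(i), 1, Integer::sum);
        }
    }

    // Relative frequency of symbol after context.
    // Returns 0 for anything never seen: this is the zero-frequency problem.
    double probability(String context, char symbol) {
        Map<Character, Integer> seen = counts.get(context);
        if (seen == null) return 0.0;
        int total = 0;
        for (int n : seen.values()) total += n;
        return seen.getOrDefault(symbol, 0) / (double) total;
    }

    public static void main(String[] args) {
        ContextModelSketch order1 = new ContextModelSketch(1, "abracadabra");
        // In "abracadabra", 'a' is followed by b, c, d, b, so P(b | a) = 2/4.
        System.out.println(order1.probability("a", 'b'));
    }
}
```

With order 0 the context is the empty string, so the counts reduce to plain symbol frequencies.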

Escape probabilities
- Need weights to blend probabilities
- Compute based on "escape" probability
  - Allocate some mass in each order for when a lower-order model should make the prediction instead
- Method A: add one to the count
  - C_o: count of characters seen in this context
  - e_o = 1 / (C_o + 1)
- Method D:
  - u = number of unique characters seen in this context
  - e_o = (u / 2) / C_o
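The two escape formulas can be written out directly. The worked numbers assume a hypothetical context that has been seen 4 times with 3 distinct following characters; EscapeProbabilitySketch and its method names are made up for this example.

```java
class EscapeProbabilitySketch {
    // Method A: e = 1 / (C + 1), where C is the count of characters seen in the context.
    static double methodA(int count) {
        return 1.0 / (count + 1);
    }

    // Method D: e = (u / 2) / C, where u is the number of unique characters seen.
    static double methodD(int unique, int count) {
        return (unique / 2.0) / count;
    }

    public static void main(String[] args) {
        int count = 4;    // hypothetical: context seen 4 times
        int unique = 3;   // hypothetical: 3 distinct characters followed it
        System.out.println("Method A escape: " + methodA(count));           // 1/5
        System.out.println("Method D escape: " + methodD(unique, count));   // 1.5/4
    }
}
```

Method D gives more weight to the lower-order model when a context has been followed by many different characters, since such a context is more likely to be followed by something new.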

Bell, Cleary, Witten. Text Compression.

Lossless benchmarks

Year  Scheme           Bits/char
1967  ASCII            7.00
1950  Huffman          4.70
1977  LZ77             3.94
1984  LZMW             3.32
1987  LZH              3.30
1987  move-to-front    3.24
1987  LZB              3.18
1987  gzip             2.71
1988  PPMC             2.48
1994  PPM              2.34
1995  Burrows-Wheeler  2.29
1997  BOA              1.99
1999  RK               1.89

Data using Calgary corpus

Compressing Moby Dick

Size       File
1,191,463  mobydick.txt
485,561    mobydick.gz
485,790    mobydick.zip
398,480    mobydick.xz
371,394    mobydick.bz2
667,651    mobydick.huff
597,060    mobydick.lzw
814,578    mobydick.huff.lzw
592,830    mobydick.lzw.huff
682,400    mobydick.lzw.reset
541,261    mobydick.lzw14
521,700    mobydick.lzw15
503,050    mobydick.lzw16
501,193    mobydick.lzw16.huff
521,700    mobydick.lzw17
514,393    mobydick.lzw18
571,548    mobydick.lzw20
325,378    mobydick.ppmd
306,708    mobydick.zpaq

Summary
- Dictionary-based compression
  - LZW and variants
  - Memorizes sequences in data
- Statistical coding
  - Language model produces probabilities
  - Probability sequence defines point on [0, 1)
  - Use arithmetic coding to convert to bits
  - Prediction by Partial Match (PPM)