15-853: Algorithms in the Real World
Data Compression III

Compression Outline
  Introduction: Lossy vs. Lossless, Benchmarks, ...
  Information Theory: Entropy, etc.
  Probability Coding: Huffman + Arithmetic Coding
  Applications of Probability Coding: PPM + others
  Lempel-Ziv Algorithms: LZ77, gzip, LZ78, compress (not covered in class)
  Other Lossless Algorithms: Burrows-Wheeler
  Lossy algorithms for images: JPEG, MPEG, ...
  Compressing graphs and meshes: BBK

Lempel-Ziv Algorithms
  LZ77 (Sliding Window)
    Variants: LZSS (Lempel-Ziv-Storer-Szymanski)
    Applications: gzip, Squeeze, LHA, PKZIP, ZOO
  LZ78 (Dictionary Based)
    Variants: LZW (Lempel-Ziv-Welch), LZC
    Applications: compress, GIF, CCITT (modems), ARC, PAK
  Traditionally LZ77 was better but slower, but the gzip version is almost as fast as any LZ78.

LZ77: Sliding Window Lempel-Ziv
  [Figure: the input string, split at the cursor into the dictionary (previously coded text) on the left and the lookahead buffer on the right]
  Dictionary and buffer "windows" are fixed length and slide with the cursor.
  Repeat:
    Output (p, l, c) where
      p = position of the longest match that starts in the dictionary (relative to the cursor)
      l = length of the longest match
      c = next char in the buffer beyond the longest match
    Advance the window by l + 1
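The (p, l, c) loop above can be sketched in a few lines of Python. This is a minimal greedy sketch, not the slides' own code: the function name lz77_encode, the parameter names dict_size and buf_size, the use of None for "no match", and the rule that ties go to the nearest match are all assumptions made for illustration.

```python
def lz77_encode(text, dict_size=6, buf_size=4):
    """Greedy LZ77 sketch: return a list of (p, l, c) triples.

    p is the distance back from the cursor to the start of the longest match,
    l is the match length, and c is the character that follows the match.
    p is None when nothing matches (an assumption of this sketch).
    """
    out = []
    cursor = 0
    n = len(text)
    while cursor < n:
        best_p, best_l = None, 0
        # A match is at most buf_size long and must leave one character for c.
        max_len = min(buf_size, n - cursor - 1)
        for p in range(1, min(dict_size, cursor) + 1):
            l = 0
            # The match starts in the dictionary but may run past the cursor.
            while l < max_len and text[cursor - p + l] == text[cursor + l]:
                l += 1
            if l > best_l:
                best_p, best_l = p, l
        out.append((best_p, best_l, text[cursor + best_l]))
        cursor += best_l + 1  # advance the window by l + 1
    return out


triples = lz77_encode("aacaacabcabaaac", dict_size=6, buf_size=4)
# triples == [(None, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]
```

The inner loop compares against every dictionary position, so it is quadratic in the window size; it is meant to show the output format, not to be fast.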
LZ77: Example
  Input: a a c a a c a b c a b a a a c
  Dictionary (size = 6), buffer (size = 4)

  Output    Longest match   Next character
  (_,0,a)   (none)          a
  (1,1,c)   a               c
  (3,4,b)   aaca            b
  (3,3,a)   cab             a
  (1,2,c)   aa              c

LZ77 Decoding
  Decoder keeps the same dictionary window as the encoder.
  For each message it looks it up in the dictionary and inserts a copy at the end of the string.
  What if l > p? (Only part of the message is in the dictionary.)
  E.g. dict = abcd, codeword = (2,9,e)
  Simply copy from left to right:
    for (i = 0; i < length; i++)
      out[cursor+i] = out[cursor-offset+i];
  Out = abcdcdcdcdcdce

LZ77 Optimizations used by gzip
  LZSS: Output one of the following two formats
    (0, position, length) or (1, char)
  Uses the second format if length < 3.
  On the same input, the first four outputs are:
    (1,a)  (1,a)  (1,c)  (0,3,4)

Optimizations used by gzip (cont.)
  1. Huffman code the positions, lengths and chars.
  2. Non-greedy: possibly use a shorter match so that the next match is better.
  3. Use a hash table to store the dictionary.
     Hash keys are all strings of length 3 in the dictionary window.
     Find the longest match within the correct hash bucket.
     Puts a limit on the length of the search within a bucket.
     Within each bucket, store in order of position.
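The left-to-right copy used in LZ77 decoding can be tried out directly. The following is a small sketch under assumptions: the name lz77_decode and the initial parameter (text already sitting in the dictionary) are hypothetical, and None stands in for an empty position as in a triple with l = 0.

```python
def lz77_decode(triples, initial=""):
    """Decode (p, l, c) triples; initial is text already in the dictionary."""
    out = list(initial)
    for p, l, c in triples:
        # p is only meaningful when there is something to copy.
        start = len(out) - p if l else 0
        for i in range(l):
            # Copying one character at a time lets the source overlap the
            # destination, which is what makes l > p work.
            out.append(out[start + i])
        out.append(c)
    return "".join(out)


overlap = lz77_decode([(2, 9, "e")], initial="abcd")
# overlap == "abcdcdcdcdcdce"
```

Because each copied character is appended before the next one is read, a codeword such as (2,9,e) keeps re-reading characters it has just written.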
The Hash Table
  [Figure: positions 7 to 21 of the input window, with a hash table whose buckets list the positions at which each 3-character string occurs]

Theory behind LZ77
  Sliding Window LZ is Asymptotically Optimal [Wyner-Ziv, 94]
  Will compress long enough strings to the source entropy as the window size goes to infinity.

  $$H_n = \frac{1}{n} \sum_{X \in A^n} p(X) \log \frac{1}{p(X)}$$

  $$H = \lim_{n \to \infty} H_n$$

  Uses a logarithmic code (e.g. gamma) for the position.
  Problem: "long enough" is really, really long.

Comparison to Lempel-Ziv 78
  Both LZ77 and LZ78 and their variants keep a "dictionary" of recent strings that have been seen.
  The differences are:
    How the dictionary is stored (LZ78 is a trie)
    How it is extended (LZ78 only extends an existing entry by one character)
    How it is indexed (LZ78 indexes the nodes of the trie)
    How elements are removed

Lempel-Ziv Algorithms Summary
  Adapts well to changes in the file (e.g. a tar file with many file types within it).
  Initial algorithms did not use probability coding and performed poorly in terms of compression.
  More modern versions (e.g. gzip) do use probability coding as a "second pass" and compress much better.
  The algorithms are becoming outdated, but the ideas are used in many of the newer algorithms.
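The hash table of 3-character strings used to store the LZ77 dictionary can be sketched as follows. This is an illustration of the idea only, not gzip's implementation: the names insert_position and find_longest_match are hypothetical, and the window and max_chain values are arbitrary choices for the example.

```python
from collections import defaultdict

MIN_MATCH = 3  # key length: all strings of length 3 in the window


def insert_position(table, text, pos):
    """Record position pos under the 3-character string that starts there."""
    key = text[pos:pos + MIN_MATCH]
    if len(key) == MIN_MATCH:
        table[key].append(pos)  # appended in increasing order of position


def find_longest_match(table, text, cursor, window=32, max_chain=4):
    """Return (distance, length) of the longest match found, or (0, 0).

    window and max_chain are illustrative values, not gzip's settings.
    """
    key = text[cursor:cursor + MIN_MATCH]
    best_dist, best_len = 0, 0
    # Only the max_chain most recent candidates in the bucket are examined.
    for pos in reversed(table.get(key, [])[-max_chain:]):
        if cursor - pos > window:
            break  # older entries are even further away
        length = 0
        while (cursor + length < len(text)
               and text[pos + length] == text[cursor + length]):
            length += 1
        if length > best_len:
            best_dist, best_len = cursor - pos, length
    return best_dist, best_len


text = "aacaacabcabaaac"
table = defaultdict(list)
for pos in range(8):  # positions 0..7 have already been coded
    insert_position(table, text, pos)
match = find_longest_match(table, text, 8)
# match == (3, 3): "cab" was last seen 3 positions back
```

Capping the number of candidates examined per bucket trades a possibly shorter match for a bounded search time, which is the limit on search length mentioned above.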
Compression Outline
  Introduction: Lossy vs. Lossless, Benchmarks, ...
  Information Theory: Entropy, etc.
  Probability Coding: Huffman + Arithmetic Coding
  Applications of Probability Coding: PPM + others
  Lempel-Ziv Algorithms: LZ77, gzip, compress, ...
  Other Lossless Algorithms: Burrows-Wheeler, ACB
  Lossy algorithms for images: JPEG, MPEG, ...
  Compressing graphs and meshes: BBK

Burrows-Wheeler
  Currently near best "balanced" algorithm for text.
  Breaks the file into fixed-size blocks and encodes each block separately.
  For each block:
    Sort each character by its full context. This is called the block sorting transform.
    Use the move-to-front transform to encode the sorted characters.
  The ingenious observation is that the decoder only needs the sorted characters and a pointer to the first character of the original sequence.

Burrows-Wheeler: Example
  Let's encode: d1 e2 c3 o4 d5 e6
  We've numbered the characters to distinguish them.
  Context "wraps" around. Last char is most significant.

  Before sorting:
    Context   Char
    ecode6    d1
    coded1    e2
    odede2    c3
    dedec3    o4
    edeco4    d5
    decod5    e6

  After sorting by context:
    Context   Output
    dedec3    o4
    coded1    e2
    decod5    e6
    odede2    c3
    ecode6    d1
    edeco4    d5

Burrows-Wheeler (Continued)
  Theorem: After sorting, equal-valued characters appear in the same order in the output as in the most significant position of the context.
  Proof sketch: Since the chars have equal value in the most significant position of the context, they will be ordered by the rest of the context, i.e. the previous chars. This is also the order of the output, since it is sorted by the previous characters.
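The block sorting and move-to-front steps can be sketched on the "decode" example. This follows the convention used in these slides (each character is sorted by the characters that precede it, nearest first, wrapping around), which differs from the rotation-sorting presentation found elsewhere. The names bw_encode and move_to_front_encode, the 0-based start index, and the initial alphabet order "cdeo" are assumptions of the sketch.

```python
def bw_encode(block):
    """Block sorting sketch: return (output, start).

    output lists the characters ordered by their context; start is the
    0-based row that holds the first character of the original block.
    """
    n = len(block)

    def context_key(i):
        # Characters preceding position i, nearest first, wrapping around,
        # so the last character of the context is the most significant.
        return [block[(i - k) % n] for k in range(1, n)]

    order = sorted(range(n), key=context_key)
    output = "".join(block[i] for i in order)
    return output, order.index(0)


def move_to_front_encode(text, alphabet):
    """Replace each character by its index in a list that is reordered
    so the most recently seen character sits at the front."""
    table = list(alphabet)
    codes = []
    for ch in text:
        idx = table.index(ch)
        codes.append(idx)
        table.insert(0, table.pop(idx))
    return codes


sorted_chars, start = bw_encode("decode")
# sorted_chars == "oeecdd", start == 4 (the fifth row, holding d1)
codes = move_to_front_encode(sorted_chars, "cdeo")
# codes == [3, 3, 0, 2, 3, 0]
```

Repeated characters in the sorted output turn into zeros after move-to-front, which is what makes the result easy to compress with a probability coder.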
Burrows-Wheeler: Decoding
  Consider dropping all but the last character of the context.
    What follows the underlined a?
    What follows the underlined b?
    What is the whole string?

    Context   Output
    a         c
    a         b
    a         b
    b         a
    b         a
    c         a

  Answer: b, a, abacab

Burrows-Wheeler: Decoding
  What about now?

    Context   Output   Rank
    a         c        6
    a         a        1
    a         b        4
    b         b        5
    b         a        2
    c         a        3

  Answer: cabbaa (reading from the first row)
  Can also use the "rank". The rank is the position of a character if it were sorted using a stable sort.

Burrows-Wheeler Decode
  Function BW_Decode(In, Start, n)
    S = MoveToFrontDecode(In, n)
    R = Rank(S)
    j = Start
    for i = 1 to n do
      Out[i] = S[j]
      j = R[j]
  Rank gives the position of each char in sorted order.

Decode Example
    S     Rank(S)
    o4    6
    e2    4
    e6    5
    c3    1
    d1    2     <= Start
    d5    3

  Out (each decoded character, listed after the character that precedes it):
    e6  d1
    d1  e2
    e2  c3
    c3  o4
    o4  d5
    d5  e6
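BW_Decode translates almost line for line into Python. The sketch below uses 0-based indices where the pseudocode uses 1-based ones; the helper names rank, move_to_front_decode and bw_decode, and the alphabet order "cdeo" shared with the encoder, are assumptions made for the example.

```python
def move_to_front_decode(codes, alphabet):
    """Invert move-to-front: each code indexes into the current list."""
    table = list(alphabet)
    out = []
    for idx in codes:
        ch = table.pop(idx)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


def rank(s):
    """rank(s)[j] is the 0-based position of s[j] after a stable sort of s."""
    order = sorted(range(len(s)), key=lambda j: s[j])  # sorted() is stable
    r = [0] * len(s)
    for pos, j in enumerate(order):
        r[j] = pos
    return r


def bw_decode(codes, start, alphabet):
    s = move_to_front_decode(codes, alphabet)
    r = rank(s)
    out = []
    j = start
    for _ in range(len(s)):
        out.append(s[j])
        j = r[j]  # follow the rank pointer to the next character
    return "".join(out)


decoded = bw_decode([3, 3, 0, 2, 3, 0], start=4, alphabet="cdeo")
# decoded == "decode"
```

The stable sort matters: it is what keeps equal characters in the same relative order in the output and in the context column, as the theorem requires.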
Overview of Text Compression
  PPM and Burrows-Wheeler both encode a single character based on the immediately preceding context.
  LZ77 and LZ78 encode multiple characters based on matches found in a block of preceding text.
  Can you mix these ideas, i.e., code multiple characters based on the immediately preceding context?
    BZ does this, but they don't give details on how it works (current best compressor).
    ACB also does this (close to best).

ACB (Associate Coder of Buyanovsky)
  Keep the dictionary sorted by context (the last character is the most significant).
    Find the longest match for the context.
    Find the longest match for the contents.
    Code:
      Distance between the matches in the sorted order
      Length of the contents match
  Has aspects of Burrows-Wheeler and LZ77.

  Example for the string "decode":
    Context   Contents
    dec       ode
    d         ecode
    decod     e
    de        code
    deco      de
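The context-sorted dictionary for "decode" can be reproduced with a short sketch. It covers only how the dictionary is ordered, not how ACB codes the distance and length, which the slides do not spell out. The function name sorted_context_dictionary is hypothetical.

```python
def sorted_context_dictionary(text):
    """Split text at every interior position into (context, contents) and
    sort by context, comparing the last character of the context first."""
    entries = [(text[:i], text[i:]) for i in range(1, len(text))]
    # Reversing the context makes its last character the most significant.
    return sorted(entries, key=lambda entry: entry[0][::-1])


dictionary = sorted_context_dictionary("decode")
# dictionary == [("dec", "ode"), ("d", "ecode"), ("decod", "e"),
#                ("de", "code"), ("deco", "de")]
```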