CS 493: Algorithms for Massive Data Sets                    February 14, 2002

Dictionary-based compression                            Scribe: Tony Wirth

This lecture will explore two adaptive dictionary compression schemes: LZ77 and LZ78. We use the word adaptive because the dictionary is not an a priori model of the text, but changes as the text is processed. A compromise approach is to scan the whole of the text in advance and then construct a semi-static dictionary; often, though, the cost of sending that dictionary as a preamble is not worth the improvement in start-up compression. The aim of both LZ77 and LZ78, published by Jacob Ziv and Abraham Lempel, is to encode a new phrase by providing a reference to an already-seen phrase that is very similar. These methods are easy to implement; decoding uses little memory and is particularly quick.

LZ77

We can explain LZ77 most easily by looking at the decoding process. The decoder sees a sequence of triples, each consisting of a back-pointer, a length and a character. For example, consider the sequence

  <0, 0, a>, <0, 0, b>, <2, 1, a>, <3, 3, b>, <5, 5, b>, <1, 10, a>.

The first two triples simply encode the symbols a and b. The third triple says to look back two positions, copy a string of length 1 and then add a final a: that is the string aa. So far the decoded string is abaa. The fourth triple decodes to baab, giving abaabaab overall. The fifth decodes to abaabb, giving abaabaababaabb. Now we see something that at first glance is contradictory: a pointer one position back, but to a string of length ten. Consider that the decoder is writing the text into an array: it sets the copy-from pointer one space to the left of the copy-to pointer and moves the two along in step. Hence <1, 10, a> is a kind of run-length encoding for bbbbbbbbbba, and the final string is abaabaababaabbbbbbbbbbbba.

So how are these triples actually encoded? Firstly, the pointer is capped to a maximum offset of 8K (8192 positions), so it can be represented with 13 bits; this is a reasonable bound for English text. The phrase length is often limited to sixteen characters, hence a four-bit representation. Needless to say, neither the pointer nor the length distribution is uniform in practice: there is a bias towards smaller quantities, which therefore demand shorter codewords. In some implementations the new symbol is not included in every triple. Sample encoding and decoding algorithms are presented in Figure 1.

To produce a fast encoding algorithm, we need to consider how to accelerate the search for the longest phrase match in the window. Various data structures, such as a trie, a hash table or a binary search tree, could be useful. For instance, we could use pairs of characters as the search index, with each index entry assigned a linked list of pointers to occurrences of that pair. We might want to limit the length of these linked lists, at some cost in compression effectiveness.
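To make the copy-from/copy-to mechanics concrete, here is a minimal Python sketch of the decoder just described (an illustration, not part of the original notes). Copying one character at a time is exactly what makes the overlapping <1, 10, a> reference behave like run-length encoding.

    def lz77_decode(triples):
        # Each triple is (back-pointer, match length, next character).
        out = []
        for offset, length, char in triples:
            # Copy one character at a time, so a copy that overlaps the
            # region being written (offset < length) works correctly.
            for _ in range(length):
                out.append(out[-offset])
            out.append(char)
        return ''.join(out)

    triples = [(0, 0, 'a'), (0, 0, 'b'), (2, 1, 'a'),
               (3, 3, 'b'), (5, 5, 'b'), (1, 10, 'a')]
    print(lz77_decode(triples))  # abaabaababaabbbbbbbbbbbba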

To encode the text S[1 ... N] using the LZ77 method, with a sliding window of W characters:

  1. Set p ← 1.
  2. While more text remains:
     (a) Search for the longest match for S[p ...] in S[p − W ... p − 1].
         Suppose that the match occurs at position m and is of length l.
     (b) Output the triple <p − m, l, S[p + l]>.
     (c) Set p ← p + l + 1.

To decode the text S[1 ... N] using the LZ77 method, with a sliding window of W characters:

  1. Set p ← 1.
  2. For each triple <f, l, c> in the input:
     (a) Set S[p ... p + l − 1] ← S[p − f ... p − f + l − 1].
     (b) Set S[p + l] ← c.
     (c) Set p ← p + l + 1.

Figure 1: Encoding and decoding using LZ77.
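For illustration, a direct and deliberately brute-force Python transliteration of the encoding loop in Figure 1 might look as follows; this sketch is not from the original notes, and the linear window scan stands in for the indexed searches discussed above. The defaults reflect the 8K window and sixteen-character length cap.

    def lz77_encode(s, window=8192, max_len=16):
        # Brute-force version of the encoding loop in Figure 1; a real
        # implementation would index the window with a trie, hash table
        # or binary search tree.
        triples = []
        p = 0
        while p < len(s):
            best_m, best_l = p, 0
            for m in range(max(0, p - window), p):
                l = 0
                # The match may run past position p: this is what produces
                # the overlapping, run-length-style references.
                while (l < max_len and p + l < len(s) - 1
                       and s[m + l] == s[p + l]):
                    l += 1
                if l > best_l:
                    best_m, best_l = m, l
            # Emit <back-pointer, match length, next character>.
            triples.append((p - best_m if best_l else 0, best_l, s[p + best_l]))
            p += best_l + 1
        return triples

Together with the decoder sketched earlier, lz77_decode(lz77_encode(s)) reproduces s for any string s.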

GZIP

One of the best compression programs derived from the LZ77 scheme is GZIP. It uses a hash table with a three-symbol index, where the entries are linked to bounded lists of pointers. The user can select the point of compromise between throughput and compression. It is often useful to limit the list length: when there are long runs of some symbol, little is gained by having many entries. The lists can also be reordered to favour recent matches.

Instead of encoding triples, GZIP sends either a pointer-and-length pair or a new symbol. The method of doing this is a little unusual: one set of codewords is used for both match lengths and characters, and another for the pointer offsets. The match length is sent first, so the decoder can tell immediately whether it really is a match length, in which case a pointer will follow, or whether it is actually a new character. The user can instruct the GZIP encoder to scan ahead to see whether it might be better to encode a raw character instead of a pointer, so that compression of the following symbols is improved; once again, this comes at the cost of GZIP's throughput. Finally, the GZIP Huffman codes are not generated dynamically (and therefore slowly): the raw text is read in blocks of 64 kilobytes and codes are generated for each block. Although this means GZIP is not really a one-pass encoder, the blocks are small enough to fit into memory, or perhaps even cache.

LZ78

The LZ77 scheme allows us to point back to any substring, but places a limit on the size of the prior window. In contrast, LZ78 restricts the types of substrings that may be referred to, but not their locations. The hope is that we can avoid some redundancy by keeping only one copy of each prior phrase.

Again, it is instructive to study the decoding process. Imagine that the text abaababaa has already been seen. The LZ78 scheme breaks the text into the phrases a, b, aa, ba, baa. The idea is to split the new text into the longest phrase that has been seen already, plus one new character at the end. New phrases are encoded by stating the old phrase number and the new end symbol. The text so far would have been sent as <0, a> <0, b> <1, a> <2, a> <4, a>, where phrase 0 refers to the empty phrase. If the decoder then saw the input <4, b> <2, b> <7, b> <8, b>, it would print bab, then bb, then bbb and finally bbbb.

The encoder can parse the preceding text quickly by storing the phrases in a trie. However, the trie can grow quite large and may need to be destroyed every so often. To avoid poor start-up compression after such a reset, it is handy to have available a trie based on the last few hundred characters. Although LZ78 encoding is often faster than that of LZ77, decoding is more involved, as the LZ78 decoder also needs the parse trie.
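As with LZ77, the decoding side is the easiest to make concrete. Here is a minimal Python sketch of LZ78 decoding (an illustration, not part of the original notes), using the convention above that phrase 0 is the empty phrase:

    def lz78_decode(pairs):
        # Each pair (i, c) both emits and defines a new phrase: the
        # text of phrase i followed by the character c.
        phrases = ['']  # phrase 0 is the empty phrase
        out = []
        for i, c in pairs:
            phrases.append(phrases[i] + c)
            out.append(phrases[-1])
        return ''.join(out)

    # The running example, followed by the four further pairs:
    pairs = [(0, 'a'), (0, 'b'), (1, 'a'), (2, 'a'), (4, 'a'),
             (4, 'b'), (2, 'b'), (7, 'b'), (8, 'b')]
    print(lz78_decode(pairs))  # abaababaa + bab + bb + bbb + bbbb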

Table 1: Encoding and decoding times (relative to the encoding time of compress; smaller is faster) and compression effectiveness of a selection of programs on the Canterbury corpus.

  METHOD      Encoding time   Decoding time   Compression (bpc)
  cat              0.2             0.2              8.00
  compress         1.0             0.6              3.31
  GZIP-f           1.1             0.4              2.91
  GZIP-b           7.0             0.3              2.53
  bzip2            5.5             2.0              2.23
  PPM              5.3             5.9              2.11

LZW

LZW is the basis of the compress utility and is a variant of LZ78. Unlike LZ78, LZW does not send an extra character after the phrase number. Instead, it has all the single symbols entered as phrases from the start; phrases 0 to 127 are the ASCII symbols, for instance. Each new phrase is constructed from the one just coded plus the first character of the next phrase. If the encoder saw the stream abaababbaabaabaa, it would break it into a, b, a, ab, ab, ba, aba, abaa. The corresponding new phrases would be (ab, 128), (ba, 129), (aa, 130), (aba, 131), (abb, 132), (baa, 133), (abaa, 134), while the phrase codes sent to the decoder would be 97, 98, 97, 128, 128, 129, 131, 134.

You might ask: how does the decoder know what phrase 134 is, when it has not yet seen the last symbol of phrase 134? Well, it knows that phrase 134 is the same as phrase 131 with another character appended. Therefore it knows that the new text will be aba?, that is, phrase 131 plus one unknown character. But that tells it that the last character of phrase 134 is an a, as a is the first letter of this new phrase. Therefore phrase 134 is abaa.

Having seen both symbol-based and dictionary-based compression schemes, it is worth comparing the performance of the best implementations (Table 1). Note that bzip2 is an implementation of the Burrows-Wheeler transform that uses a sophisticated selection of Huffman codes, rather than arithmetic coding, to keep throughput high without compromising significantly on compression. There is little point in using compress, given that GZIP-f is just as quick and significantly more effective. The better-compressing version of GZIP suffers only in encoding speed: it actually decodes faster than GZIP-f, as it uses the same algorithm but has slightly less information to decode. So it seems that if compression is the ultimate aim, then PPM is still best; but for speed, go for GZIP.
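The phrase-134 reasoning above is the only subtle part of an LZW decoder: a code may refer to the one phrase whose final character has not yet been resolved. Here is a minimal Python sketch (an illustration, not part of the original notes), assuming phrases 0 to 127 are the ASCII symbols as described:

    def lzw_decode(codes):
        phrases = [chr(i) for i in range(128)]  # ASCII start-up dictionary
        prev = phrases[codes[0]]
        out = [prev]
        for code in codes[1:]:
            if code < len(phrases):
                cur = phrases[code]
            else:
                # The phrase referred to is still incomplete: it is the
                # previous phrase plus that phrase's first character.
                cur = prev + prev[0]
            # Define the next phrase: the previous phrase plus the first
            # character of the current one.
            phrases.append(prev + cur[0])
            out.append(cur)
            prev = cur
        return ''.join(out)

    print(lzw_decode([97, 98, 97, 128, 128, 129, 131, 134]))
    # abaababbaabaabaa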

A problem to consider

Let X be a random variable representing the number of tosses of a fair coin until a head appears. In this case the best set of codewords for X, that is, the set that minimizes the expected codeword length, is the set of toss sequences themselves, represented in binary (a quick numerical check of this claim appears after the reference below). For next class, work out what the codewords ought to be if the coin is not fair, but has some known probability of showing heads.

Much of this material was based on: I. H. Witten, A. Moffat and T. C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, second edition, Morgan Kaufmann, San Francisco, 1999, pp. 74–84.
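As promised, a quick numerical check of the fair-coin claim (an illustration, not part of the original notes): the expected length of the toss-sequence code equals the entropy of X, so no prefix code can do better.

    from math import log2

    # P(X = k) = 2**-k for k >= 1, and the codeword for X = k is the
    # k-bit toss sequence itself, so its length is exactly k bits.
    ks = range(1, 60)  # the tail beyond k = 60 is negligible
    entropy = sum(2**-k * -log2(2**-k) for k in ks)
    expected_length = sum(2**-k * k for k in ks)
    print(round(entropy, 6), round(expected_length, 6))  # 2.0 2.0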