Dictionary techniques

The final concept that we will mention in this chapter is dictionary techniques. Many modern compression algorithms rely on modified versions of various dictionary techniques. The basic idea is to exploit the symbol repetitions inside the source.

Let us start with a very basic dictionary technique which was literally designed to compress the entries of a text dictionary. In this technique, the cluster of letters that a word shares with the front part of the previous word is replaced by a number showing how many letters repeat. The following table shows a list of English dictionary entries and their compressed (so-called front compression) counterparts:

a             a
aardvark      1ardvark
aback         1back
abandon       3ndon
abandoning    7ing
abandonment   7ment
abasement     3sement
abash         4h
abate         3te
abated        5d
abbot         2bot
abbey         3ey
abbreviating  3reviating

Notice that the right column (the compressed form) is somewhat shorter than the original dictionary entries. In general, we encode symbols which do not appear anywhere before as they are, but symbols (or symbol sequences) which have occurred before are encoded only by a pointer to the previous occurrence.
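As a quick illustration, here is a minimal Python sketch of front compression (the function name front_compress is our own invention, not a standard routine); it assumes the word list is sorted, which is what makes neighboring entries share long prefixes:

    def front_compress(words):
        # Encode each word as the number of leading characters it shares
        # with the previous word, followed by its remaining suffix.
        out, prev = [], ""
        for w in words:
            n = 0
            while n < min(len(prev), len(w)) and prev[n] == w[n]:
                n += 1
            out.append((str(n) if n else "") + w[n:])
            prev = w
        return out

    print(front_compress(["a", "aardvark", "aback", "abandon", "abandoning"]))
    # -> ['a', '1ardvark', '1back', '3ndon', '7ing']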

Move-to-front coding

The front coding scheme in the previous example shows that if two consecutive English words share many letters at the front, an efficient compression results. The move-to-front coding algorithm (J. L. Bentley, 1986) tries to bring the more frequently occurring symbols to the front positions of a symbol list. The reason for changing the positions of the symbols in the list is that the first symbols in the list are represented with fewer bits than the last ones. For this purpose, we first have to form a list of binary representations which satisfies two conditions:

1. The earlier binary numbers should be shorter than the later ones.
2. The binary codes must be uniquely decodable.

A commonly used binary list is as follows:

 1  1
 2  010
 3  011
 4  00100
 5  00101
 6  00110
 7  00111
 8  0001000
 9  0001001
10  0001010
11  0001011
12  0001100
13  0001101
14  0001110
15  0001111
16  000010000

The above binary codewords are generated using a simple prefix technique (the prefix, shown in red in the original notes, is the leading run of zeros together with the first 1 bit). If the number of bits in the prefix is N, then the number of bits that follow the prefix is N-1. Using N-1 bits, we can generate 2^(N-1) different binary numbers. Here, for instance, when the prefix is 001, N=3, so N-1=2 and we can generate 4 codewords which have the prefix 001.

Exercise: In the continuation of the above list, how many numbers follow the prefix 00001? (Only one of them is shown in the list above.)
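Incidentally, this list coincides with the well-known Elias gamma code: write n in binary using N bits and prepend N-1 zeros. A tiny Python sketch (our own illustration) reproduces the table:

    def prefix_code(n):
        # For n >= 1: the N-bit binary form of n preceded by N-1 zeros.
        b = bin(n)[2:]
        return "0" * (len(b) - 1) + b

    for n in (1, 2, 3, 4, 8, 16):
        print(n, prefix_code(n))
    # 1 1, 2 010, 3 011, 4 00100, 8 0001000, 16 000010000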

Using this method, we have obtained a suitable ordering of uniquely decodable binary numbers. The next stage is to use this list to encode our symbols. The move-to-front technique is adaptive: it dynamically changes the binary representation of a symbol as new symbols arrive from the source. We try to maintain our alphabet (or symbol list) as a list where frequently occurring symbols are located near the front (and therefore get fewer bits).

Exercise: Let us perform move-to-front coding on the following text:

"the boy on my right is the right boy"

We will consider the words as our symbols. Step by step, this is what happens:

Initially, the list is empty and the counter is 0.
The first symbol is "the". It is not in the list, so we emit "0the" and insert it at the front. The list is {0:the}. Counter is 1.
The second symbol is "boy". It is not in the list, so we emit "1boy" and insert it at the front. The list is {0:boy, 1:the}. Counter is 2.
The next symbol is "on". It is not in the list, so we emit "2on" and insert it at the front. The list is {0:on, 1:boy, 2:the}. Counter is 3.
The next symbol is "my". It is not in the list, so we emit "3my" and insert it at the front. The list is {0:my, 1:on, 2:boy, 3:the}. Counter is 4.
The next symbol is "right". It is not in the list, so we emit "4right" and insert it at the front. The list is {0:right, 1:my, 2:on, 3:boy, 4:the}. Counter is 5.
The next symbol is "is". It is not in the list, so we emit "5is" and insert it at the front. The list is {0:is, 1:right, 2:my, 3:on, 4:boy, 5:the}. Counter is 6.
The next symbol is "the". This symbol is in the list at rank 5, so we emit just "5". Since "the" has just occurred again (the most recently seen symbol is the best candidate to recur soon), we move it to the front: {0:the, 1:is, 2:right, 3:my, 4:on, 5:boy}. Counter is 7.
The next symbol is "right". It is in the list at rank 2, so we emit "2" and move "right" to the front: {0:right, 1:the, 2:is, 3:my, 4:on, 5:boy}. Counter is 8.
The next symbol is "boy". It is in the list at rank 5, so we emit "5" and move "boy" to the front: {0:boy, 1:right, 2:the, 3:is, 4:my, 5:on}.

The overall compressed data is: {0the 1boy 2on 3my 4right 5is 5 2 5}. In this way, we not only expressed the repeating words with simple numbers, but also arranged for the smaller numbers to be used for them. The efficiency of this method becomes clearer on longer sources. Notice that the codebook we use is time-varying; this is a common property of most dictionary-based compression techniques.
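For reference, here is a small Python sketch of this word-based move-to-front encoder (a bare illustration of the example above; mapping the emitted ranks onto the binary codeword list is omitted):

    def mtf_encode(tokens):
        table, out = [], []
        for t in tokens:
            if t in table:
                i = table.index(t)
                out.append(str(i))               # known word: emit its rank only
                table.pop(i)
            else:
                out.append(str(len(table)) + t)  # new word: emit rank plus the word
            table.insert(0, t)                   # move (or insert) to the front
        return out

    text = "the boy on my right is the right boy"
    print(mtf_encode(text.split()))
    # -> ['0the', '1boy', '2on', '3my', '4right', '5is', '5', '2', '5']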

Lempel-Ziv data compression

A famous family of compression algorithms is named after Lempel and Ziv, who developed their successful dictionary techniques in 1977 and 1978. The 1977 algorithm is called LZ77 and the 1978 one LZ78. Somewhat surprisingly, LZ78 is the simpler algorithm, and therefore it found use first (its LZW variant is still used in the UNIX compress utility and the GIF image format). As computers improved, implementations of LZ77 became feasible, and it is still used in Windows-based compression utilities. These techniques differ from the other basic techniques in the following ways:

The number of encoded symbols and the bits per encoded symbol change continuously during compression (they are time-varying).
There is no a-priori knowledge about the probabilities (or other statistics) of the input source. The system is totally adaptive, and the adaptation is such that the average code length per symbol decreases as time evolves. (This behavior is called "universal coding".)
They are very commonly used.

The general Lempel-Ziv algorithm parses the input stream into sub-sequences that occur several times in the source, so the repeating patterns are encoded more efficiently than with the basic front-coding method above. (A figure in the original notes illustrates the parsing and codebook generation of an LZ78 coder.)

LZ77: The algorithm encodes a sequence of length N which has been generated using M distinct symbols. In order to describe the algorithm, let us make the following definitions:

Input stream: the sequence of symbols to be compressed;
Symbol: the basic data element in the input stream;
Coding position: the position of the symbol in the input stream that is currently being coded (the beginning of the lookahead buffer);
Lookahead buffer: the symbol sequence from the coding position to the end of the input stream;
Window: the window of size W contains the W characters before the coding position, i.e., the last W processed symbols;
P: a pointer to the match in the window, which also specifies the match length.

We will try to encode a sub-sequence of the input stream by locating an identical sequence earlier in the stream; the pointer P records where that earlier occurrence starts and how long the match is, and the search is restricted to the window of the last W symbols.

The algorithm searches the window for the longest match with the beginning of the lookahead buffer and outputs a pointer to that match. Since it is possible that not even a one-symbol match can be found, the output cannot consist of pointers alone. LZ77 solves this problem in the following way: after each pointer, it outputs the first symbol in the lookahead buffer after the match. If there is no match, it outputs a null pointer followed by the symbol at the coding position. We can summarize this with the following encoding algorithm:

1. Set the coding position to the beginning of the input stream;
2. find the longest match in the window for the lookahead buffer;
3. output the pair (P,S) with the following meaning: P is the pointer to the match in the window; S is the first symbol in the lookahead buffer that didn't match;
4. if the lookahead buffer is not empty, move the coding position (and the window) L+1 symbols forward and return to step 2.

Exercise: Encode the following input using LZ77:

Position (P):  1 2 3 4 5 6 7 8 9
Symbol (S):    A A B C B B A B C

The encoding proceeds step by step as given in the following table:

Step  Position  Match  Symbol  Output
1.    1         --     A       (0,0) A
2.    2         A      B       (1,1) B
3.    4         --     C       (0,0) C
4.    5         B      B       (2,1) B
5.    7         A B    C       (5,2) C

The table can be read as follows. "Step" indicates the number of the encoding step; a step completes each time the encoder emits an output, which with LZ77 happens on each pass through step 3 of the algorithm. "Position" indicates the coding position; the first character of the input stream has coding position 1. "Match" shows the longest match found in the window. "Symbol" shows the first symbol in the lookahead buffer after the match. "Output" is the emitted output in the format (B,L) S, where (B,L) gives the "beginning" and "length" of the pointer P to the match. This tells the decoder: "go back B symbols in the window and copy L symbols to the output"; S is the isolated symbol.

Let us decode the emitted codes (the last column of the above table):

(0,0) A : initially, output = {A}
(1,1) B : go back 1 symbol and copy 1 symbol, then emit B; output is now {A A B}
(0,0) C : go back 0 symbols and copy 0 symbols (which does nothing), then emit C; output = {A A B C}
(2,1) B : go back 2 symbols and copy 1 symbol, then emit B; output = {A A B C B B}
(5,2) C : go back 5 symbols and copy 2 symbols, then emit C; output = {A A B C B B A B C}
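A compact Python sketch of this (B,L) S scheme follows (a brute-force illustration, not an optimized coder; the window size W is a parameter we picked). On the exercise input it reproduces the table above:

    def lz77_encode(data, W=16):
        # Emit (B, L, S) triples: go back B symbols, copy L symbols, append S.
        i, out = 0, []
        while i < len(data):
            best_b, best_l = 0, 0
            for j in range(max(0, i - W), i):    # brute-force window search
                l = 0
                # keep one symbol in reserve so the explicit symbol S always exists
                while i + l < len(data) - 1 and data[j + l] == data[i + l]:
                    l += 1
                if l > best_l:
                    best_b, best_l = i - j, l
            out.append((best_b, best_l, data[i + best_l]))
            i += best_l + 1
        return out

    def lz77_decode(triples):
        out = []
        for b, l, s in triples:
            for _ in range(l):
                out.append(out[-b])              # copy from b symbols back
            out.append(s)
        return "".join(out)

    codes = lz77_encode("AABCBBABC")
    print(codes)               # [(0, 0, 'A'), (1, 1, 'B'), (0, 0, 'C'), (2, 1, 'B'), (5, 2, 'C')]
    print(lz77_decode(codes))  # AABCBBABC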

Notice that the encoder requires an extensive search for repeating characters to find the longest match, whereas the decoder is very simple in terms of computational complexity. Although the compression is very efficient, the encoding cost was a drawback on slow computers, and therefore another algorithm (LZ78) was proposed.

Exercise: For the input stream ALITOPUALGEL, find the following:

the distance (B) of a match in the text window;
the length (L) of the matched phrase;
the first symbol (S) in the look-ahead buffer that follows the phrase.

Click and watch the following flash animation (by Kemal Bayrakceken - in Turkish) illustrating LZ77 coding.

In practice, having to emit three fields (B, L, and S) for every match is also a source of redundancy. Let us see how LZ78 eliminates these problems.

LZ78: Once again, we need to define and clarify some of the terminology used here:

SymbolStream: a sequence of data to be encoded;
Symbol: the basic data element in the SymbolStream;
Prefix: a sequence of symbols that precedes one symbol;
String: the prefix together with the symbol it precedes;
Code word: a basic data element in the codestream, representing a string from the dictionary;
Codestream: the sequence of code words and symbols (the output of the encoding algorithm);
Dictionary: a table of strings; every string is assigned a code word according to its index number in the dictionary;
Current prefix: the prefix currently being processed in the encoding algorithm, denoted by P;
Current symbol: the symbol currently being processed in the encoding algorithm, generally the symbol preceded by the current prefix, denoted by S;
Current code word: the code word currently being processed in the decoding algorithm, denoted by W;
":=" means "assignment".

Using these definitions, we can now list the encoding algorithm:

1. Initially, the dictionary and P are empty;
2. S := the next symbol in the SymbolStream;
3. Is the string P+S present in the dictionary?
   if it is, P := P+S (extend P with S);
   if not,
      i. output two objects to the codestream: the code word corresponding to P (if P is empty, output a zero), and S, in the same form as it was read from the SymbolStream;
      ii. add the string P+S to the dictionary;
      iii. P := empty;
4. Are there more symbols in the SymbolStream?
   if yes, return to step 2;
   if not,
      i. if P is not empty, output the code word corresponding to P;
      ii. END.

The algorithm steps may look cluttered and difficult to comprehend, so let us explain them step by step. At the beginning of encoding the dictionary is empty. To explain the principle, consider a point within the encoding process when the dictionary already contains some strings. We start analyzing a new prefix in the SymbolStream, beginning with an empty prefix. If the corresponding string (the prefix plus the symbol after it, P+S) is present in the dictionary, the prefix is extended with that symbol. The extension is repeated until we obtain a string which is not present in the dictionary; this is a very clever way of searching for the longest previously seen string that repeats itself. At the point where the extended prefix no longer exists in the dictionary, we emit two outputs to the codestream: the code word that represents the prefix P, followed by the symbol S. Then we add the whole string P+S to the dictionary and start processing the next prefix in the SymbolStream.

Implementation note: A special case occurs when even the single-symbol string is not yet in the dictionary (for example, this always happens in the first encoding step). In that case we output a special code word that represents the empty string, followed by the symbol, and add the symbol to the dictionary.

The output of this algorithm is a sequence of codeword-symbol pairs (W,S). Each time a pair is emitted to the codestream, the string from the dictionary corresponding to W, extended with the symbol S, is added to the dictionary. Notice that when a new string is added, the dictionary already contains all the strings formed by removing characters from the end of the new one. For example, if ABBACB is added, then the dictionary already contains ABBAC, ABBA, ABB, AB, and A.
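Here is a short Python sketch of this encoder (our own minimal illustration, following the lecture's conventions: code word 0 denotes the empty prefix, and dictionary indices start at 1):

    def lz78_encode(data):
        dictionary, out, p = {}, [], ""
        for s in data:
            if p + s in dictionary:
                p += s                                 # step 3: extend the current prefix
            else:
                out.append((dictionary.get(p, 0), s))  # code word for P (0 if empty), then S
                dictionary[p + s] = len(dictionary) + 1
                p = ""
        if p:                                          # step 4.i: flush a pending prefix
            out.append((dictionary[p], ""))
        return out

    print(lz78_encode("ABBCBCABA"))
    # -> [(0, 'A'), (0, 'B'), (2, 'C'), (3, 'A'), (2, 'A')]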

Exercise: Encode the following input using LZ78:

Position (P):  1 2 3 4 5 6 7 8 9
Symbol (S):    A B B C B C A B A

The encoding proceeds step by step as given in the following table:

Step  Position  Dictionary addition  Output
1.    1         A                    (0,A)
2.    2         B                    (0,B)
3.    3         BC                   (2,C)
4.    5         BCA                  (3,A)
5.    8         BA                   (2,A)

Let us describe the operation of the example. The column "Step" indicates the number of the encoding step; each encoding step is completed when step 3.ii of the encoding algorithm is executed. The column "Position" indicates the current position in the input data. The column "Dictionary addition" shows which string has been added to the dictionary; the index of the string is equal to the step number. The column "Output" presents the output in the form (W,S); the output of each step decodes to the string that has been added to the dictionary.

The decoding is, again, quite simple:

Decode (0,A): emit A; the dictionary is {A}
Decode (0,B): emit B; the dictionary is {A, B}
Decode (2,C): the second entry in the dictionary is B, so emit BC; the dictionary is {A, B, BC}
Decode (3,A): the third entry in the dictionary is BC, so emit BCA; the dictionary is {A, B, BC, BCA}
Decode (2,A): the second entry in the dictionary is B, so emit BA; the dictionary is {A, B, BC, BCA, BA}

The overall emitted symbols are ABBCBCABA.
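The corresponding decoder can be written just as compactly (again a minimal Python sketch under the same conventions):

    def lz78_decode(pairs):
        dictionary, out = [""], []       # index 0 is the empty string
        for w, s in pairs:
            entry = dictionary[w] + s    # look up prefix W, append symbol S
            out.append(entry)
            dictionary.append(entry)     # the new string gets the next index
        return "".join(out)

    print(lz78_decode([(0, "A"), (0, "B"), (2, "C"), (3, "A"), (2, "A")]))
    # -> ABBCBCABA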

Important note: The emitted pairs are not necessarily represented literally as (0,A), (2,C), etc., during encoding; writing out the parentheses greatly reduces efficiency, so they are usually omitted. However, the codeword and the symbol (for example, in (2,C), 2 is the dictionary index and C is the symbol) must still be separable during decoding. Otherwise, how could you resolve 0A0B2C3A2A? The digits 2 and 3 could just as well be symbols themselves. In practice, the compressed data uses a syntax such as:

AB*2C*3A*2A

This eliminates the redundancy of the parentheses and the redundancy of writing an explicit index for symbols that were not yet in the dictionary (previously indicated by pairs like (0,A)). The symbol * followed by a number indicates the dictionary location. Of course, we assume that the symbol * does not exist in our source. If, exceptionally, a * does occur, it is easily handled by putting an escape sequence before it, i.e., we write **. This remains decodable: since the first * is not followed by a number, the decoder can conclude that the literal symbol * is meant.

Click and watch the following flash animation (by Ahmet Gurbuz) illustrating LZ78 coding.

The matlab script lz78.m interactively encodes an entered string using the LZ78 algorithm (by Serhan Yavuz, requires ispresent.m).

Concluding remarks:

1. There are variations of and improvements over the classical LZ77 and LZ78 algorithms; LZSS and LZW are perhaps the most popular ones. Indeed, LZW is one of the dominant compression algorithms: it is used in the UNIX compress utility and the GIF and TIFF image formats, while LZ77-based methods power many commercially available programs such as WinZip and PKZIP.
2. Although we have considered the lossless compression schemes (Huffman, arithmetic, RLE, LZ, etc.) individually, you should not forget that they can also be used inside lossy compression algorithms. Remember that we covered scalar and vector quantization in the previous chapter; those were the steps that introduced the loss into the coder. The result of quantization is a list of symbols from codebook entries. Keep in mind that the codebook can be considered your alphabet, and the symbol list generated by the quantizer a symbol stream which can be compressed using the lossless coders described here. The students are urged to remember the overall block diagram of a typical signal coding algorithm.

Check out a very nice JAVA applet which compresses an entered string using the LZW algorithm online. You can go to the original page of the above applet here.

Available links: You can find, literally, zillions of pages about the LZ compression algorithms. Here are a few:

1. Lempel-Ziv compression algorithms
2. Interactive LZW compression
3. Lempel-Ziv-Welch Compression (LZW)
4. Lempel-Ziv compression of a file
5. Lempel-Ziv file compression