Dictionary techniques

The final concept that we will mention in this chapter is dictionary techniques. Many modern compression algorithms rely on modified versions of various dictionary techniques. The basic idea is to exploit the symbol repetitions inside the source. Let us start with a very basic dictionary technique which was literally designed to compress text dictionary entries. In this technique, the cluster of letters that a word repeats from the front of the previous word is replaced by a number which shows the length of the repetition. The following table shows some English dictionary entries and their compressed counterparts (this is called front compression):

a             a
aardvark      1ardvark
aback         1back
abandon       3ndon
abandoning    7ing
abandonment   7ment
abasement     3sement
abash         4h
abate         3te
abated        5d
abbot         2bot
abbey         3ey
abbreviating  3reviating

Notice that the right column (the compressed form) is somewhat shorter than the original dictionary entries. In general, we encode symbols which do not appear anywhere before as they are, but symbols (or symbol sequences) which have occurred before are encoded as a pointer to their previous occurrence.
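The front-compression rule above can be sketched in a few lines of Python (a minimal illustration of the idea, not a historical implementation; the function names are our own):

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two words."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def front_compress(words):
    """Replace each word's shared front part with the match length."""
    out, prev = [], ""
    for w in words:
        n = common_prefix_len(prev, w)
        out.append(w if n == 0 else str(n) + w[n:])
        prev = w
    return out

words = ["a", "aardvark", "aback", "abandon", "abandoning",
         "abandonment", "abasement", "abash", "abate", "abated",
         "abbot", "abbey", "abbreviating"]
print(front_compress(words))
# → ['a', '1ardvark', '1back', '3ndon', '7ing', '7ment',
#    '3sement', '4h', '3te', '5d', '2bot', '3ey', '3reviating']
```

Note that each word is compared only against the immediately preceding word, which is exactly what makes this scheme effective for sorted dictionary entries.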
Move-to-front coding

The front coding scheme in the previous example shows that if two consecutive English words share many letters in their front parts, we obtain an efficient compression. The move-to-front coding algorithm (J. L. Bentley et al., 1986) tries to bring the more frequently occurring symbols to the front positions in a list of symbols. The reason for changing the positions of the symbols in the list is that the first symbols in the list need fewer bits to represent them than the last symbols. For this reason, we first have to form a list of binary representations which satisfies two conditions:

1. The first binary numbers should be shorter than the later ones.
2. The binary codes must be uniquely decodable.

A commonly used binary list is as follows:

1   1
2   010
3   011
4   00100
5   00101
6   00110
7   00111
8   0001000
9   0001001
10  0001010
11  0001011
12  0001100
13  0001101
14  0001110
15  0001111
16  000010000

The above binary codewords are generated using a simple prefix technique. Each codeword begins with a prefix of the form 0...01 (that is, 1, 01, 001, 0001, and so on). If the number of bits in the prefix is N, then the number of bits that follow the prefix is N-1. Using N-1 bits, we can generate 2^(N-1) different binary numbers. Here, for instance, when the prefix is 001, N=3, so N-1=2 and we can generate 4 codewords which have the prefix 001.

Exercise: In the continuation of the above list, how many numbers are there that follow the prefix 00001? (We have shown only one of them in the above list.)

Using this method, we have obtained a suitable ordering of uniquely decodable binary numbers. The next stage is to use this list to encode our symbols. The move-to-front technique is an adaptive one which dynamically changes the binary representation of a symbol as new symbols arrive from the source. We try to maintain our alphabet (or symbol list) as a list where frequently occurring symbols are located near the front (and therefore have fewer bits).

Exercise: Let us perform move-to-front coding on the following text:

"the boy on my right is the right boy"

We will consider the words as the symbols. Step by step, this is what happens:

Initially, the list is empty. Counter is 0.
First symbol is "the". It does not exist in the list, so we emit the code "0the". It comes directly to the front of the list, and the list becomes {0:the}. Counter is 1.
Second symbol is "boy". It does not exist in the list, so we emit the code "1boy". Since it occurred later than "the", it gets inserted at the top of the list (it moves to the front). The list is now {0:boy, 1:the}. Counter is 2.
Next symbol is "on". It does not exist in the list, so we emit the code "2on". Since it occurred later than "boy", it gets inserted at the top of the list (it moves to the front). The list is now {0:on, 1:boy, 2:the}. Counter is 3.
Next symbol is "my".
It does not exist in the list, so we emit the code "3my". Since it occurred later than "on", it gets inserted at the top of the list (it moves to the front). The list is now {0:my, 1:on, 2:boy, 3:the}. Counter is 4.
Next symbol is "right". It does not exist in the list, so we emit the code "4right". Since it occurred later than "my", it gets inserted at the top of the list (it moves to the front). The list is now {0:right, 1:my, 2:on, 3:boy,
4:the}. Counter is 5.
Next symbol is "is". It does not exist in the list, so we emit the code "5is". Since it occurred later than "right", it gets inserted at the top of the list (it moves to the front). The list is now {0:is, 1:right, 2:my, 3:on, 4:boy, 5:the}. Counter is 6.
Next symbol is "the". Now, this symbol exists in the list, and its rank is 5, so we emit the code "5". The occurrence of "the" has become more frequent than the others, so we move "the" to the front and the list becomes {0:the, 1:is, 2:right, 3:my, 4:on, 5:boy}. Counter is 7.
Next symbol is "right". This symbol exists in the list, and its rank is 2, so we emit the code "2". The occurrence of "right" is now as frequent as that of "the", but since "right" came later it has priority, so we move "right" to the front and the list becomes {0:right, 1:the, 2:is, 3:my, 4:on, 5:boy}. Counter is 8.
Next symbol is "boy". This symbol exists in the list, and its rank is 5, so we emit the code "5". For the same reason, we move "boy" to the front and the list becomes {0:boy, 1:right, 2:the, 3:is, 4:my, 5:on}.

The overall compressed data is: {0the 1boy 2on 3my 4right 5is 5 2 5}. In this way, we have not only expressed the repeating words with simple numbers, but also tried to use smaller numbers for them. The efficiency of this method becomes clearer when longer sources are used. Notice that the codebook we use is time varying. This is a common property of most dictionary-based compression techniques.

Lempel-Ziv data compression

A famous family of compression algorithms is named after Lempel and Ziv, who developed their successful dictionary techniques in 1977 and 1978. The first version is called LZ77 and the second one is called LZ78.
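Stepping back to the move-to-front example for a moment, the procedure is easy to express in code. The following minimal Python sketch (our own; the function names are hypothetical) reproduces the worked example, together with a small generator for the prefix-coded binary list given earlier:

```python
def prefix_codeword(n):
    """n-th entry of the binary codeword list (n >= 1): the binary
    form of n preceded by one fewer zeros than it has bits."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def mtf_encode(symbols):
    """Move-to-front coding: emit '<rank><symbol>' for a new symbol
    (rank = current list length) and '<rank>' for a known one."""
    lst, out = [], []
    for s in symbols:
        if s in lst:
            out.append(str(lst.index(s)))
            lst.remove(s)                  # ...then move it to the front
        else:
            out.append(str(len(lst)) + s)  # new symbol, emitted verbatim
        lst.insert(0, s)                   # the symbol moves to the front
    return out

print(mtf_encode("the boy on my right is the right boy".split()))
# → ['0the', '1boy', '2on', '3my', '4right', '5is', '5', '2', '5']
print(prefix_codeword(16))  # → '000010000'
```

In a complete coder, the emitted ranks would then be represented with the uniquely decodable binary list generated by `prefix_codeword`.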
Although LZ78 is the later algorithm, it is the simpler one, so it came into use first (its LZW variant is still used in the UNIX compress utility and the GIF image format). As computers improved, implementations of LZ77 became feasible, and it still underlies widely used compression utilities such as zip and gzip. These techniques differ from the other basic techniques in the following ways:

- The number of encoded symbols and the bits per encoded symbol change continuously during compression (they are time varying).
- There is no a-priori knowledge about the probabilities (or other statistics) of the input source. The system is totally adaptive.
- The adaptation is such that the average code length per symbol is minimized as time evolves. Note: this behavior is called "universal coding".
- They are very widely used.

The general Lempel-Ziv algorithm parses the input stream into sub-sequences that occur several times in the source. In this way, the repeating patterns become more
efficient to encode than with the basic methods, as illustrated in the figure above. As an example, the following parsing and codebook generation illustrates an LZ78 coder:

LZ77: The algorithm encodes a sequence of length N which has been generated using M distinct symbols. In order to describe the algorithm, let us make the following definitions:

Input stream: the sequence of symbols to be compressed;
Symbol: the basic data element in the input stream;
Coding position: the position of the symbol in the input stream that is currently being coded (the beginning of the lookahead buffer);
Lookahead buffer: the symbol sequence from the coding position to the end of the input stream;
Window: the window of size W contains the W characters before the coding position, i.e. the last W processed symbols;
P: a pointer which points to a match in the window and also specifies its length.

We will try to encode a sub-sequence of the input stream by locating the same sequence earlier in the input stream. The location of that earlier copy is given by the pointer P, and the search range is the window of size W. The algorithm searches the window for the longest match with the beginning of the lookahead buffer and outputs a pointer to that match. Since it is possible that not even a one-symbol match can be found, the output cannot consist of pointers alone. LZ77 solves this problem in the following way: after each pointer, it outputs the first symbol in the lookahead buffer after the match. If there is no match, it
outputs a null-pointer and then outputs the symbol at the coding position. We can summarize this with the following encoding algorithm:

1. Set the coding position to the beginning of the input stream;
2. Find the longest match in the window for the lookahead buffer;
3. Output the pair (P,S) with the following meaning: P is the pointer to the match in the window; S is the first symbol in the lookahead buffer that didn't match;
4. If the lookahead buffer is not empty, move the coding position (and the window) L+1 symbols forward and return to step 2.

Exercise: Encode the following input using LZ77:

Position (P)  1  2  3  4  5  6  7  8  9
Symbol (S)    A  A  B  C  B  B  A  B  C

The encoding is done step by step as given in the following table:

Step  Position  Match  Symbol  Output
1.    1         --     A       (0,0) A
2.    2         A      B       (1,1) B
3.    4         --     C       (0,0) C
4.    5         B      B       (2,1) B
5.    7         A B    C       (5,2) C

The behavior of the table can be described as follows. "Step" indicates the number of the encoding step; a step completes each time the encoder emits an output, which with LZ77 happens in each pass through step 3 of the described algorithm. "Position" indicates the coding position; the first character in the input stream has coding position 1. "Match" shows the longest match found in the window. "Symbol" shows the first symbol in the lookahead buffer after the match. "Output" represents the emitted output in the format (B,L) S, where (B,L) gives the "beginning" (backward distance) and "length" of the pointer P to the match. This gives the following instruction to the decoder: "Go back B symbols in the window and copy L symbols to the output". S is the isolated symbol.

Let us decode the emitted symbols (the last column of the above table):

(0,0) A : Initially, output = {A}
(1,1) B : Go back 1 symbol, copy 1 symbol to the output, then emit B; output is now {A A B}
(0,0) C : Go back 0 symbols and copy 0 symbols (this does nothing), then emit C; output = {A A B C}
(2,1) B : Go back 2 symbols and copy 1 symbol, then emit B; output = {A A B C B B}
(5,2) C : Go back 5 symbols and copy 2 symbols, then emit C; output = {A A B C B B A B C}

Notice that the encoder requires an extensive search for repeating characters to find the longest match. The decoder, however, is very simple in terms of computational complexity. Although the compression was very efficient, this search cost was a drawback on slow computers, so another algorithm (LZ78) was proposed.

Exercise: For the input stream ALITOPUALGEL, find the distance (B) of the match in the text window, the length (L) of the matched phrase, and the first symbol (S) in the lookahead buffer that follows the phrase.

Click and watch the flash animation (by Kemal Bayrakceken, in Turkish) illustrating LZ77 coding.

In practice, the need to emit three values for each coding step is also a source of redundancy. Let us see how LZ78 eliminates these problems:

LZ78: Once again, we need to define and clarify some of the terminology that we use here:

SymbolStream: a sequence of data to be encoded;
Symbol: the basic data element in the SymbolStream;
Prefix: a sequence of symbols that precedes one symbol;
String: the prefix together with the symbol it precedes;
Code word: a basic data element in the codestream; it represents a string from the dictionary;
Codestream: the sequence of code words and symbols (the output of the encoding algorithm);
Dictionary: a table of strings; every string is assigned a code word according to its index number in the dictionary;
Current prefix: the prefix currently being processed in the encoding algorithm; denoted by P;
Current symbol: a symbol determined in the encoding algorithm, generally the symbol preceded by the current prefix; denoted by S;
Current code word: the code word currently being processed in the decoding algorithm; denoted by W;
":=" means "assignment".
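Before listing LZ78's algorithm, the LZ77 encoder and decoder worked through above can be sketched in Python (a minimal illustration with an unbounded window; the function names are our own):

```python
def lz77_encode(data):
    """LZ77 encoding. Emits triples (B, L, S): go back B symbols,
    copy L symbols, then append the literal symbol S."""
    out, pos = [], 0
    while pos < len(data):
        best_len, best_dist = 0, 0
        for dist in range(1, pos + 1):         # candidate match positions
            length = 0
            # extend the match; leave at least one symbol for S
            while (pos + length < len(data) - 1 and
                   data[pos + length - dist] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, dist
        out.append((best_dist, best_len, data[pos + best_len]))
        pos += best_len + 1                    # move past match + literal
    return out

def lz77_decode(triples):
    """Invert lz77_encode: copy-back operations plus literals."""
    out = []
    for dist, length, sym in triples:
        for _ in range(length):
            out.append(out[-dist])             # one symbol at a time
        out.append(sym)
    return "".join(out)

print(lz77_encode("AABCBBABC"))
# → [(0, 0, 'A'), (1, 1, 'B'), (0, 0, 'C'), (2, 1, 'B'), (5, 2, 'C')]
```

A practical implementation would bound the window to W symbols and search it with a more efficient data structure; the exhaustive scan here mirrors the "extensive search" remark above.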
Using these definitions, we can now list the encoding algorithm:
1. Initially, the dictionary and P are empty;
2. S := next symbol in the SymbolStream;
3. Is the string P+S present in the dictionary?
   If it is, P := P+S (extend P with S);
   if not:
      i. output these two objects to the codestream: the code word corresponding to P (if P is empty, output a zero), and S, in the same form as input from the SymbolStream;
      ii. add the string P+S to the dictionary;
      iii. P := empty;
4. Are there more symbols in the SymbolStream?
   If yes, return to step 2;
   if not:
      i. if P is not empty, output the code word corresponding to P;
      ii. END.

The algorithm steps may look cluttered and difficult to comprehend, so let us also explain the algorithm step by step. At the beginning of encoding, the dictionary is empty. In order to explain the principle of encoding, consider a point within the encoding process when the dictionary already contains some strings. We start analyzing a new prefix in the SymbolStream, beginning with an empty prefix. If its corresponding string (the prefix plus the symbol after it, P+S) is present in the dictionary, the prefix is extended with that symbol. This extending is repeated until we get a string which is not present in the dictionary; this is a very clever way of searching for the longest sub-sequence that repeats itself. At the point where the extended string does not exist in the dictionary, we emit two outputs to the codestream: the code word that represents the prefix P, and then the symbol S. Then we add the whole string (P+S) to the dictionary and start processing the next prefix in the SymbolStream.

Implementation note: A special case occurs when not even the single current symbol is present in the dictionary (for example, this always happens in the first encoding step). In that case we output a special code word that represents an empty string (a zero), followed by the symbol, and add the symbol to the dictionary.

The output from this algorithm is a sequence of codeword-symbol pairs (W,S).
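The steps above can be sketched as follows (a minimal Python illustration in our own notation; index 0 is reserved for the empty prefix):

```python
def lz78_encode(data):
    """LZ78 encoding: emits pairs (W, S) where W is the dictionary
    index of the longest known prefix (0 = empty) and S is the next
    symbol."""
    dictionary = {}                 # string -> 1-based index
    out, prefix = [], ""
    for sym in data:
        if prefix + sym in dictionary:
            prefix += sym                          # extend the prefix
        else:
            out.append((dictionary.get(prefix, 0), sym))
            dictionary[prefix + sym] = len(dictionary) + 1
            prefix = ""
    if prefix:                                     # flush a trailing prefix
        out.append((dictionary[prefix], ""))
    return out

def lz78_decode(pairs):
    """Rebuild the dictionary while decoding."""
    dictionary = {0: ""}
    out = []
    for w, sym in pairs:
        s = dictionary[w] + sym
        dictionary[len(dictionary)] = s            # same index as the encoder
        out.append(s)
    return "".join(out)

print(lz78_encode("ABBCBCABA"))
# → [(0, 'A'), (0, 'B'), (2, 'C'), (3, 'A'), (2, 'A')]
```

Note how the decoder needs no searching at all: it simply re-derives the same dictionary from the pairs it receives.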
Each time a pair is emitted to the codestream, the string from the dictionary corresponding to W is extended with the symbol S and the resulting string is added to the dictionary. Notice that when a new string is added to the dictionary, the dictionary already contains all the substrings formed by removing characters from the end of the new string. For example, if ABBACB is added, then the dictionary must already contain the strings ABBAC, ABBA, ABB, AB, and A.

Exercise: Encode the following input using LZ78:

Position (P)  1  2  3  4  5  6  7  8  9
Symbol (S)    A  B  B  C  B  C  A  B  A

The encoding is done step by step as given in the following table:

Step  Position  Dictionary addition  Output
1.    1         A                    (0,A)
2.    2         B                    (0,B)
3.    3         BC                   (2,C)
4.    5         BCA                  (3,A)
5.    8         BA                   (2,A)

Let us describe the example operation. The column Step indicates the number of the encoding step; each encoding step is completed when the "if not" branch of step 3 in the encoding algorithm is executed. The column Position indicates the current position in the input data. The column Dictionary addition shows the string that has been added to the dictionary; the index of the string is equal to the step number. The column Output presents the output in the form (W,S); the output of each step decodes to the string that has been added to the dictionary.

The decoding is, again, quite simple:

Decode (0,A): emit A; the dictionary is {A}
Decode (0,B): emit B; the dictionary is {A, B}
Decode (2,C): the second entry in the dictionary is B, so emit BC; the dictionary is {A, B, BC}
Decode (3,A): the third entry in the dictionary is BC, so emit BCA; the dictionary is {A, B, BC, BCA}
Decode (2,A): the second entry in the dictionary is B, so emit BA; the dictionary is {A, B, BC, BCA, BA}

The overall decoded output is ABBCBCABA.

Important note: The emitted symbols during encoding are not necessarily represented like (0,A), (2,C), etc. The use of parentheses greatly reduces efficiency, so usually the parentheses are omitted. However, the dictionary index and the symbol (for example, in (2,C), 2 is the index and C is the symbol) must still be separable during decoding. Otherwise, how could you resolve 0A0B2C3A2A? The numbers like 2 and 3 could just as well be symbols themselves. In practice, the compressed data can use the following syntax: AB*2C*3A*2A. This eliminates the redundancy of the parentheses and the redundancy of indicating an index for symbols that do not yet exist in the dictionary (previously indicated by something like (0,A)). The symbol * followed by a number indicates the
dictionary location. Of course, we assume that the symbol * does not exist in our source. If, exceptionally, the * symbol does occur, this is easily overcome by putting an escape sequence before it, i.e., we write **. This way, ** is decodable: since the * is not followed by a number, the decoder can decide that we mean the literal symbol *.

Click and watch the flash animation (by Ahmet Gurbuz) illustrating LZ78 coding. The matlab script lz78.m interactively encodes an entered string using the LZ78 algorithm (by Serhan Yavuz, requires ispresent.m).

Concluding remarks:

1. There are variations of and improvements over the classical LZ77 and LZ78 algorithms; LZSS and LZW are perhaps the most popular ones. Indeed, LZW is one of the dominant compression algorithms, and it has been used in many commercially available programs such as WinZip, PKzip, etc.
2. Although we have considered the lossless compression schemes (Huffman, arithmetic, RLE, LZ, etc.) individually, you should not forget that they can also be used inside lossy compression algorithms. Remember that we covered scalar and vector quantization in the previous chapter; those were the steps that introduced the loss into the coder. The result of quantization was a list of symbols from codebook entries. Keep in mind that the codebook can be considered as your alphabet, and the symbol list generated by the quantizer can be considered as a symbol stream which can be compressed using the described lossless coders. The students are urged to remember the overall block diagram of a typical signal coding algorithm.

Check out a very nice JAVA applet which compresses an entered string using the LZW algorithm online. You can go to the original page of the applet here.

Available links: You can find a huge amount of information on the LZ compression algorithms. Here are a few links:

1. Lempel-Ziv compression algorithms
2. Interactive LZW compression
3.
Lempel-Ziv-Welch Compression (LZW)
4. Lempel-Ziv compression of a file
5. Lempel-Ziv file compression