A Hybrid Approach to Text Compression
Peter C. Gutmann
Computer Science, University of Auckland, New Zealand
pgut1@cs.aukuni.ac.nz

Timothy C. Bell
Computer Science, University of Canterbury, Christchurch 1, New Zealand

1 Introduction

Text compression schemes have sometimes been divided into two classes: symbolwise methods, which form a source model, typically using a finite context to predict symbols; and dictionary methods, which replace phrases (groups of symbols) in the input with a code. Symbolwise methods tend to give better compression because they form more accurate models of text, while dictionary methods tend to be faster because multiple symbols are coded at once. It is possible to decompose some dictionary methods into equivalent symbolwise methods (Langdon 1983, Bell & Witten in press). The decomposed method gives identical compression performance, but is slower because more coded symbols are transmitted. This decomposition is of interest primarily because it is helpful in making comparisons of the two methods.

In this paper we explore a hybrid approach based on the opposite of this decomposition: the predictions of a symbolwise method are grouped together so that several characters can be coded at once. The objective is to combine the good compression of symbolwise methods with the high speed of dictionary methods. The hybrid allows tradeoffs to be made in terms of compression speed, compression performance, and memory usage. More importantly, investigating a hybrid method gives extra insights into the relationship between dictionary and symbolwise methods, and reveals that they are more closely related than might be expected. The primary goal in the design of the hybrid method described here was to create a very fast system that is based on context modelling. We therefore begin by surveying techniques that have been used in the literature to achieve fast compression.
The current method of choice for very fast adaptive compressors is to use some variant of the LZ77 method (Ziv & Lempel 1977) in which the extent of the search for repeated strings is limited. In general this is accomplished by terminating the search after a predetermined number of potential matches have been checked. An extreme example of
this is LZRW1 (Williams 1991a), which hashes the next few characters of the input into a table of pointers that point back into the sliding window. A new phrase entering the window is added by overwriting any existing phrase that is stored at the same location in the hash table. Consequently, only the most recent occurrence of a phrase is stored, and even this may be lost if another phrase collides with it in the hash table. This very simple replacement strategy achieves very fast compression. The output is packed into 16-bit words to make coding even faster, with 12 bits of position information (corresponding to a window size of 4K characters), and 4 bits of length information.

LZRW2 (Williams 1991b) extends LZRW1 by storing a table of selected phrases instead of referencing the sliding window directly. The hash table entries point to a phrase table that contains pointers to the sliding window. Since the window size is no longer limited by the hash table size (the phrase table entries can point back an arbitrarily large distance), a much larger window is available, and the index can access 4K phrases instead of 4K characters. The price paid is that the decompressor has the extra overhead of maintaining the same hash table and phrase table used in the compressor, and an extra level of indirection is introduced.

LZRW3 is another refinement, which merges the hash and phrase table into one unified lookup table (Williams 1991b). In addition, LZRW3 variants store multiple pointers at each hash table location, with a commensurate decrease in the number of hash buckets so that the table is the same size. Although the reduced number of buckets leads to more collisions, the increased bucket size means that more strings can be searched for a given hash value than in the simpler versions. A bucket size of 4 or 8 phrases seems to be the best tradeoff for an overall table size of 4K entries.
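The LZRW1 matching loop described above can be sketched as follows. This is an illustrative simplification, not Williams' actual code: the hash function, token representation, and minimum match length are assumptions; only the 4K window, the single-pointer hash table with overwrite-on-collision, and the 12-bit offset / 4-bit length split come from the description above.

```python
WINDOW = 4096                      # 12-bit positions
HASH_SIZE = 4096
MIN_MATCH, MAX_MATCH = 3, 3 + 15   # a 4-bit length field stores (len - MIN_MATCH)

def lzrw1_hash(data, i):
    # Hash the next three characters into a table index (hypothetical hash).
    return ((data[i] << 8) ^ (data[i + 1] << 4) ^ data[i + 2]) % HASH_SIZE

def compress(data):
    table = [-1] * HASH_SIZE       # one pointer per bucket; collisions overwrite
    out = []                       # tokens: (is_match, payload)
    i = 0
    while i < len(data) - 2:
        h = lzrw1_hash(data, i)
        cand = table[h]
        table[h] = i               # the newest occurrence always overwrites
        length = 0
        if cand >= 0 and i - cand <= WINDOW:
            while (length < MAX_MATCH and i + length < len(data)
                   and data[cand + length] == data[i + length]):
                length += 1
        if length >= MIN_MATCH:
            out.append((True, (i - cand, length)))  # 12-bit offset, 4-bit length
            i += length
        else:
            out.append((False, data[i]))            # literal byte
            i += 1
    for j in range(i, len(data)):                   # trailing bytes as literals
        out.append((False, data[j]))
    return out
```

Because only the most recent occurrence survives in the table, a periodic input such as `b"abcabcabc"` yields three literals followed by a single copy token.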
Several strategies can be used to decide which entry in a bucket should be overwritten. Methods such as overwriting the least-recently used entry could be applied, but a particularly simple strategy that performs well is to simply overwrite a random entry. Rather than using a random number generator, a single counter can be maintained that is incremented each time a pointer is stored into a bucket. This achieves a combination of cyclic and random overwrite. Schemes based on this idea can outperform the standard Unix compress utility in terms of both compression and (generally) speed, while using an order of magnitude less memory. The compress program only outperforms LZRW3 methods on larger files in which its enormous dictionary is able to contain a more accurate model of the source statistics. Hash tables are currently widely used for Ziv-Lempel methods because they provide very fast searching for prior phrases. The number of references stored at each hash table location can be limited (saving storage and time), or this limit can be applied at search time by searching only the first few references (saving time but not storage). Collision resolution can be ignored if desired because the price paid is simply poorer compression, and not failure of the system. Hash tables can also be used for symbolwise methods, in this case to locate information about previous occurrences of contexts (Raita & Teuhola 1987). Again, the speed can be improved by ignoring collisions or limiting the extent of a search, giving a trade-off against the amount of compression.
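The cyclic replacement counter described above might look like this (a sketch; the class and method names are invented for illustration):

```python
BUCKET = 4  # pointers stored per hash value

class BucketTable:
    def __init__(self, nbuckets):
        self.slots = [[-1] * BUCKET for _ in range(nbuckets)]
        self.counter = 0          # single shared counter, no per-bucket state

    def insert(self, h, pos):
        # The counter cycles through the bucket slots across all inserts,
        # giving the combined cyclic/random overwrite without LRU book-keeping.
        self.slots[h][self.counter % BUCKET] = pos
        self.counter += 1

    def candidates(self, h):
        # Positions to try when searching for a match.
        return [p for p in self.slots[h] if p >= 0]
```

Inserting a fifth pointer into a 4-slot bucket overwrites the oldest entry only by coincidence of the counter's phase, which is the point: the policy is cheap, not exact.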
2 A hybrid symbolwise/dictionary method

[Figure 1: Index to prior text. The already-coded window is "abcdbcabdabddbc"; for each single-character context, up to k pointers index its prior occurrences, and an arrow marks the current coding position.]

Rather than maintain an explicit data structure of contexts and phrases, our hybrid method keeps a window of previously-coded text (see Figure 1). An index to the window is used to locate the phrases that are available in each context. The index could be a hash table of contexts; however, an even faster approach is to use a straight look-up table using a single symbol as the context. The look-up table contains a maximum of k pointers for each symbol, allowing k phrases to be stored for each context. This means that not all occurrences of a context will necessarily be indexed; for example, the earliest occurrence of the context b in Figure 1 is not indexed. Larger contexts could be used with a look-up table, but the cost in memory increases exponentially with the context size, and a hash table would be better for larger contexts because such an index would be very sparse.
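A minimal sketch of this index, assuming a 256-entry look-up table that keeps the k most recent pointers per single-character context (the function name and the simple FIFO update are assumptions; with k = 4 and the window of Figure 1, the earliest occurrence of context b is indeed dropped, as the text notes):

```python
K = 4  # pointers kept per context

def build_index(window):
    index = [[] for _ in range(256)]   # one slot list per single-byte context
    for i in range(len(window) - 1):
        ctx = window[i]
        bucket = index[ctx]
        bucket.append(i + 1)           # pointer to the text following the context
        if len(bucket) > K:
            bucket.pop(0)              # keep only the K most recent occurrences
    return index
```

Each pointer addresses the position just after a context occurrence, i.e. the start of a candidate phrase.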
At each coding step, the symbol that has just been encoded (an a in Figure 1) is used as the context. Previous occurrences of the context are located using the index, and the longest sequence of characters that has occurred in that context is located. In the example, the longest previous sequence is the second phrase indexed by a. This phrase is then identified in the output. This can be done using log2 k bits, since at most k prior occurrences of the context are indexed. Typically k is around eight, so only three bits are required to transmit the location. The number of characters that match is then transmitted. If the number of matching characters is zero, then an escape message is sent, and the next character is transmitted explicitly.

Decoding is very fast. The decoder maintains the same index structure, and simply copies the appropriate phrase from the current context. If suitable codes are used, this involves very few instructions for each symbol decoded. The above is a very general description of the hybrid method. In the following section we describe some specific implementations.

3 Variations of the hybrid method

The main aspects of the hybrid method described previously that are yet to be specified are the codes used for the output, and the method of updating the pointers. There are several components of the output that must be coded. The encoder must identify which of the k previous contexts is to be used. If we assume that each is equally likely then a simple code of log2 k bits is appropriate. Variable length codes could be used to favour more recent phrases, although this was not investigated because of the speed penalty of the extra book-keeping required. Coding the length of matches is more critical to compression performance, because shorter matches are considerably more likely than longer ones. One strategy is to limit the length of matches to, say, m symbols, and to code the length in log2 m bits.
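One coding step can be sketched as follows (the function name and token representation are invented; the search over at most k indexed occurrences and the escape on a zero-length match follow the description above):

```python
def encode_step(window, index, pos, max_len):
    """window[pos:] is still to be coded; window[pos - 1] is the context."""
    ctx = window[pos - 1]
    best_slot, best_len = -1, 0
    # Try each of the (at most k) indexed prior occurrences of the context
    # and keep the one whose following text matches the upcoming input longest.
    for slot, start in enumerate(index[ctx]):
        length = 0
        while (length < max_len and pos + length < len(window)
               and window[start + length] == window[pos + length]):
            length += 1
        if length > best_len:
            best_slot, best_len = slot, length
    if best_len == 0:
        return ('escape', window[pos])        # literal follows the escape
    return ('copy', best_slot, best_len)      # log2(k) + log2(m) bits to code
```

The decoder performs the mirror image: it looks up the same index, copies `best_len` characters from the identified occurrence, and so needs no search at all.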
We have also investigated a variable length code for this purpose. The best compression would be obtained by codes generated by Huffman's algorithm from sample length distributions, but this would incur a significant speed penalty.

When a context cannot be used for coding because the next symbol has never occurred in that context, an escape symbol must be transmitted. This can be represented by a single bit that is sent at each coding step, although this assumes that the probability of an escape occurring is 50%. Better compression can be obtained by transmitting the length of a match before identifying the phrase, and using a zero length to indicate an escape. If this is represented by a variable length code then a more appropriate length can be used. An advantage of single-bit flags is that they can conveniently be stored eight to a byte, which admits a faster implementation. An alternative that eliminates the need for escape codes is to always transmit the context symbol explicitly. This means that progress will be made at each coding step even if the match length is zero. Another possibility is for the escape code to switch to a literal mode, where symbols are transmitted explicitly until a second escape code switches back to
the context coding mode. This approach has been used for some Ziv-Lempel methods (e.g. Fiala and Greene 1989), but it is not suitable in this situation because escape symbols do not tend to occur in clusters. Another possibility is to use multi-bit flags at each coding step (Fenwick 1993). Such a flag could indicate more than one literal symbol, or could select from more than one representation of phrases.

A final possibility is to use a fast statistical coder (such as a table-driven Huffman coder) to represent the encoder output. This approach is used by the more successful Ziv-Lempel compression systems, which use a two-pass Huffman code to represent the output. A two-pass Huffman code gives similar compression to a single-pass adaptive one, but is many times faster for both encoding and decoding if a table-driven canonical code is used (e.g. Sieminski 1988). Other methods, such as arithmetic coding, could be used, but these tend to be slower and require relatively complex models to be maintained (Gutmann 1992).

The choice of method for updating pointers in the table of contexts also requires a compromise between compression and speed. In initial experiments we stored pointers to the k most recent occurrences of each context. The amount of book-keeping can be reduced by simply overwriting a randomly chosen pointer instead of the oldest one. The probability of consistently overwriting useful pointers is relatively low, and so this approach has relatively little effect on the compression performance. As with LZRW3, a suitable pseudo-random overwrite is achieved by a single counter that cycles through 1 to k to determine which pointer is to be replaced next.

4 Performance of the hybrid method

In this section we evaluate the effect of the different choices suggested in the previous section.
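Returning to the single-bit escape flags of the previous section: packing them eight to a byte can be done by reserving a flag byte in the output stream and filling in its bits as coding proceeds, so that no code crosses a byte boundary. A sketch (the class name and bit order are assumptions):

```python
class FlagWriter:
    def __init__(self):
        self.out = bytearray()
        self.flag_pos = -1   # byte index reserved for the current group of 8 flags
        self.nflags = 8      # forces allocation of a flag byte on the first flag

    def put_flag(self, bit):
        if self.nflags == 8:               # start a fresh flag byte
            self.flag_pos = len(self.out)
            self.out.append(0)
            self.nflags = 0
        if bit:                            # set the next bit in the reserved byte
            self.out[self.flag_pos] |= 1 << self.nflags
        self.nflags += 1

    def put_byte(self, b):
        # Literals and codewords are emitted between flags, byte-aligned.
        self.out.append(b)
```

Because the flag byte is written in place, the decoder can read eight flags at once and then consume the byte-aligned payloads that follow.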
In order to determine how the parameters k (the number of pointers stored for each context) and m (the maximum match length) should be chosen, a simple hybrid compressor was implemented with a one-bit escape code followed by an 8-bit representation of the next character. The k pointers for each context were maintained using the pseudo-random cyclic overwrite. The system was used to compress the files of the Calgary corpus (Bell et al. 1990). Figure 2 shows how the compression performance of this method depends on k and m. The graph shows the average (unweighted) compression over all the files in the corpus. Figure 2 shows that compression improves as the number of phrases stored in each context increases, although the returns are diminishing by the time k = 64. The disadvantage of increasing k is that it causes a corresponding increase in encoding time due to the overall increase in the number of strings to search for matches, and it also requires more memory. If k is large and encoding speed is a problem then a more complex strategy than the simple linear search of the k entries could be used. The compression performance is relatively insensitive to the maximum match length, m, provided that it is greater than 8. The degradation in compression for longer matches could be avoided by using variable length codes for the match lengths, and we explore a simple form of this later.
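The variable-length match-length codes mentioned above are not specified in the paper; one plausible choice, which gives the short lengths that dominate the shortest codewords, is an Elias-gamma-style code (an illustrative assumption, not the authors' code):

```python
def encode_length(n):
    """Encode n >= 1 as Elias gamma: a unary prefix of zeros giving the
    bit-length of n, followed by n in binary (the leading 1 is shared)."""
    bits = n.bit_length()
    return '0' * (bits - 1) + bin(n)[2:]

def decode_length(code):
    # Count the zeros to recover the bit-length, then read that many bits.
    zeros = len(code) - len(code.lstrip('0'))
    return int(code[zeros:zeros + zeros + 1], 2)
```

Length 1 costs a single bit, length 4 costs five bits, and the cost grows only logarithmically, which is why such codes avoid the degradation seen with a fixed log2 m field when m is large.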
[Figure 2: Compression of a simple hybrid coder (bits per character) averaged over the Calgary corpus, plotted against the maximum match length m for several values of k, the number of phrases per context.]

To simplify coding, it is convenient if the phrase identifier and the match length can fit into one byte, that is, log2 k + log2 m = 8. The best compression that satisfies this constraint occurs when k = 32 and m = 8, where the corpus files are compressed to 4.12 bits per character on average (that is, files are reduced to approximately half their original size). This is remarkably good performance considering the speed and simplicity of the scheme, particularly for decoding.

A multi-bit encoding has been evaluated, where a two-bit flag is sent at each coding step. Table 1 shows how the four values of the flag are interpreted. Two of the values correspond to the two values of the one-bit flag used previously; the other two values are used for shorter encodings of literals and codewords respectively. The short literal coding represents characters in six bits. These characters are selected from an adaptive list of the 64 most recently used symbols. The short codeword still has log2 k bits to choose the phrase, but has fewer bits to represent the length. This takes advantage of the high frequency of short lengths.

  Flag   Use
  00     8 bits: literal
  01     6 bits: literal from the 64 most recently used symbols
  10     8 bits: index + length
  11     6 bits: index + short length

Table 1: Interpretation of the two-bit flag

The size of the flags has been chosen so that they can conveniently be packed into bytes to enable processing to be fast (Fenwick 1993). Four flags are packed into one byte, and they are also stored in the two remaining bits of the 6-bit literals and codewords. Using two-bit flags is equivalent to a two-level coding method for characters and lengths, and so indicates the kind of improvement that can be expected from moving to variable length codes.
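The adaptive list of the 64 most recently used symbols behind the short-literal code can be sketched as follows; the paper does not give the update rule, so the move-to-front policy and the names here are assumptions:

```python
class MRUList:
    def __init__(self, size=64):
        self.size = size
        self.items = []              # most recently used symbol first

    def code(self, sym):
        """Return ('short', index) on a hit (a 6-bit code for size 64),
        or ('long', sym) on a miss (the full 8-bit literal is sent)."""
        if sym in self.items:
            i = self.items.index(sym)
            self.items.pop(i)
            self.items.insert(0, sym)   # move-to-front keeps the list adaptive
            return ('short', i)
        self.items.insert(0, sym)
        if len(self.items) > self.size:
            self.items.pop()            # evict the least recently used symbol
        return ('long', sym)
    ```

Since the decoder sees the same sequence of symbols, it can maintain an identical list and interpret the 6-bit indices without any side information.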
The use of two-bit flags achieves compression of 3.57 bits per character (bpc) averaged over the Calgary corpus. This compares with 3.83 bpc for the best parameters using single-bit flags. Table 2 shows comparable results achieved by other fast compression methods.

[Table 2: Compression performance of fast methods, averaged over the Calgary corpus. The entries include the hybrid method with 2-bit flags at 3.57 bpc and Unix compress; the remaining figures are illegible apart from a value of 2.70 bpc.]

The compression of the hybrid method is not quite as good as that of Unix compress, but it has the advantage that the output fits conveniently into bytes and so is able to operate faster. The Gzip method is one of the best Ziv-Lempel based methods currently available, implemented by GNU. The version reported here was set for best compression. It achieves superior compression to the hybrid method at the expense of a more extensive search for matching phrases, and using two passes to generate Huffman codes for the output.

The hybrid method described here is very fast for both encoding and decoding. Searching for matches involves evaluating just k matches, where k is typically between 8 and 64. Even faster coding is possible by reducing k. Decoding requires just two indirections to locate the phrase to be copied. Input and output are fast because no codes cross byte boundaries, and codes are easily inserted and extracted within bytes.

The memory requirements of the hybrid method are relatively low. Most of the memory is consumed by the window of prior characters and the index structure. A window of about 8 Kbytes is suitable (the experiments reported above used a window of 32K, which is slightly better). If k = 32 then the index uses 16 Kbytes of memory.

Analysing the output of the hybrid method reveals that the literal characters (i.e. escapes to zero-order) occur frequently, often more frequently than coded phrases.
Presumably this is because phrases tend to end when a novel character occurs, and so the first character at each coding step is less likely to have occurred in the first-order context. This suggests the possibility of always encoding a literal character, eliminating the need for the escape flag. Initial experiments have indicated that this degrades compression by just a few percent.

5 Conclusion

The idea of a hybrid between dictionary and symbolwise methods has several consequences. It demonstrates that the two approaches are more closely related than might be expected. The implementation described here suggests analogies between the two classes. For example, the escape symbol used by context coders performs a function that is analogous to that of the literal flag used by Ziv-Lempel methods. Characters that are difficult to predict cause a phrase boundary to occur in the hybrid method, indicating a
correlation between low probability symbols and phrase boundaries in Ziv-Lempel coders. The index that is used to locate previous occurrences of phrases for a Ziv-Lempel method is closely related to the data structure used by a context-based method for keeping statistics about the contexts. Ziv-Lempel coders use several different strategies to determine which phrases will be made available for coding; likewise, symbolwise models must determine which contexts are the most useful to store. These issues also get caught up with compromises caused by the choice of data structure used, such as a hash table, a trie (digital search tree), or a simple look-up table. These relationships raise the possibility of a very general model that includes the two approaches, and hybrid methods, as special cases. This in turn will help to formalise the continuum of tradeoffs between compression performance, memory usage, and speed.

In our investigation we have pursued speed rather than compression performance, and have created a context-based method that is extremely fast. Many other trade-offs between speed and compression performance are possible, and work is continuing on this. For example, using a Huffman code for the output is likely to give significantly better compression. Fast approximate arithmetic coders might also be used for this purpose.

Acknowledgements

The authors are grateful to Timo Raita and Ross Williams for helpful comments on this work.

References

Bell, T. C. A unifying theory and improvements for existing approaches to text compression. Ph.D. thesis, Department of Computer Science, University of Canterbury, New Zealand.

Bell, T. C., Cleary, J. G. and Witten, I. H. (1990) Text Compression. Prentice Hall, Englewood Cliffs, NJ.

Bell, T. C. and Witten, I. H. The relationship between greedy parsing and symbolwise text compression. Journal of the ACM, in press.

Fenwick, P. (1993) Ziv-Lempel coding with multi-bit flags. Proceedings of DCC '93, April 1993, p. 138.

Fiala, E. R. and Greene, D. H. (1989) Data compression with finite windows. Communications of the ACM, (4).

Gutmann, P. (1992) Practical dictionary/arithmetic data compression synthesis. MSc thesis, University of Auckland, February 1992.

Langdon, G. G. (1983) A note on the Ziv-Lempel model for compressing individual sequences. IEEE Transactions on Information Theory, (2).

Raita, T. and Teuhola, J. (1987) Predictive text compression by hashing. Proceedings of the Tenth Annual International ACM SIGIR Conference, New Orleans.

Raita, T. (1987) Generalized coding algorithms for predictive text compression. Report AY7, Department of Computer Science, University of Turku, Finland.

Sieminski, A. (1988) Fast decoding of Huffman codes. Information Processing Letters, 26(5), January 1988, p. 237.

Williams, R. N. (1991a) An extremely fast Ziv-Lempel data compression algorithm. Proceedings of DCC '91, p. 362.

Williams, R. N. (1991b) Notes on the LZRW3 algorithm. Posted to the Usenet comp.compression newsgroup, June 1991.

Ziv, J. and Lempel, A. (1977) A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23(3).
More information5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing
1. Introduction 2. Cutting and Packing Problems 3. Optimisation Techniques 4. Automated Packing Techniques 5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing 6.
More informationCompression. storage medium/ communications network. For the purpose of this lecture, we observe the following constraints:
CS231 Algorithms Handout # 31 Prof. Lyn Turbak November 20, 2001 Wellesley College Compression The Big Picture We want to be able to store and retrieve data, as well as communicate it with others. In general,
More informationImage Compression - An Overview Jagroop Singh 1
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 5 Issues 8 Aug 2016, Page No. 17535-17539 Image Compression - An Overview Jagroop Singh 1 1 Faculty DAV Institute
More informationDEFLATE COMPRESSION ALGORITHM
DEFLATE COMPRESSION ALGORITHM Savan Oswal 1, Anjali Singh 2, Kirthi Kumari 3 B.E Student, Department of Information Technology, KJ'S Trinity College Of Engineering and Research, Pune, India 1,2.3 Abstract
More informationLempel-Ziv-Welch (LZW) Compression Algorithm
Lempel-Ziv-Welch (LZW) Compression lgorithm Introduction to the LZW lgorithm Example 1: Encoding using LZW Example 2: Decoding using LZW LZW: Concluding Notes Introduction to LZW s mentioned earlier, static
More informationLIPT-Derived Transform Methods Used in Lossless Compression of Text Files
ROMANIAN JOURNAL OF INFORMATION SCIENCE AND TECHNOLOGY Volume 14, Number 2, 2011, 149 158 LIPT-Derived Transform Methods Used in Lossless Compression of Text Files Radu RĂDESCU Politehnica University of
More informationPREDICTIVE CODING WITH NEURAL NETS: APPLICATION TO TEXT COMPRESSION
PREDICTIVE CODING WITH NEURAL NETS: APPLICATION TO TEXT COMPRESSION J iirgen Schmidhuber Fakultat fiir Informatik Technische Universitat Miinchen 80290 Miinchen, Germany Stefan Heil Abstract To compress
More informationProgram Construction and Data Structures Course 1DL201 at Uppsala University Autumn 2010 / Spring 2011 Homework 6: Data Compression
Program Construction and Data Structures Course 1DL201 at Uppsala University Autumn 2010 / Spring 2011 Homework 6: Data Compression Prepared by Pierre Flener Lab: Thursday 17 February 2011 Submission Deadline:
More informationCHAPTER II LITERATURE REVIEW
CHAPTER II LITERATURE REVIEW 2.1 BACKGROUND OF THE STUDY The purpose of this chapter is to study and analyze famous lossless data compression algorithm, called LZW. The main objective of the study is to
More informationDigital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay
Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 29 Source Coding (Part-4) We have already had 3 classes on source coding
More informationA Quality of Service Decision Model for ATM-LAN/MAN Interconnection
A Quality of Service Decision for ATM-LAN/MAN Interconnection N. Davies, P. Francis-Cobley Department of Computer Science, University of Bristol Introduction With ATM networks now coming of age, there
More informationOn Additional Constrains in Lossless Compression of Text Files
ROMANIAN JOURNAL OF INFORMATION SCIENCE AND TECHNOLOGY Volume 18, Number 4, 2015, 299 311 On Additional Constrains in Lossless Compression of Text Files Radu RĂDESCU Politehnica University of Bucharest,
More informationIndexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton
Indexing Index Construction CS6200: Information Retrieval Slides by: Jesse Anderton Motivation: Scale Corpus Terms Docs Entries A term incidence matrix with V terms and D documents has O(V x D) entries.
More informationThe PackBits program on the Macintosh used a generalized RLE scheme for data compression.
Tidbits on Image Compression (Above, Lena, unwitting data compression spokeswoman) In CS203 you probably saw how to create Huffman codes with greedy algorithms. Let s examine some other methods of compressing
More informationT. Bell and K. Pawlikowski University of Canterbury Christchurch, New Zealand
The effect of data compression on packet sizes in data communication systems T. Bell and K. Pawlikowski University of Canterbury Christchurch, New Zealand Abstract.?????????? 1. INTRODUCTION Measurements
More informationAdaptive Compression of Graph Structured Text
Adaptive Compression of Graph Structured Text John Gilbert and David M Abrahamson Department of Computer Science Trinity College Dublin {gilberj, david.abrahamson}@cs.tcd.ie Abstract In this paper we introduce
More informationCS/COE 1501
CS/COE 1501 www.cs.pitt.edu/~lipschultz/cs1501/ Compression What is compression? Represent the same data using less storage space Can get more use out a disk of a given size Can get more use out of memory
More informationMemory Design. Cache Memory. Processor operates much faster than the main memory can.
Memory Design Cache Memory Processor operates much faster than the main memory can. To ameliorate the sitution, a high speed memory called a cache memory placed between the processor and main memory. Barry
More informationThe Effect of Flexible Parsing for Dynamic Dictionary Based Data Compression
The Effect of Flexible Parsing for Dynamic Dictionary Based Data Compression Yossi Matias Nasir Rajpoot Süleyman Cenk Ṣahinalp Abstract We report on the performance evaluation of greedy parsing with a
More informationEngineering Mathematics II Lecture 16 Compression
010.141 Engineering Mathematics II Lecture 16 Compression Bob McKay School of Computer Science and Engineering College of Engineering Seoul National University 1 Lossless Compression Outline Huffman &
More informationIndexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze
Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information
More informationV.2 Index Compression
V.2 Index Compression Heap s law (empirically observed and postulated): Size of the vocabulary (distinct terms) in a corpus E[ distinct terms in corpus] n with total number of term occurrences n, and constants,
More informationImproving LZW Image Compression
European Journal of Scientific Research ISSN 1450-216X Vol.44 No.3 (2010), pp.502-509 EuroJournals Publishing, Inc. 2010 http://www.eurojournals.com/ejsr.htm Improving LZW Image Compression Sawsan A. Abu
More informationOptimal Parsing. In Dictionary-Symbolwise. Compression Algorithms
Università degli Studi di Palermo Facoltà Di Scienze Matematiche Fisiche E Naturali Tesi Di Laurea In Scienze Dell Informazione Optimal Parsing In Dictionary-Symbolwise Compression Algorithms Il candidato
More informationStudy of LZ77 and LZ78 Data Compression Techniques
Study of LZ77 and LZ78 Data Compression Techniques Suman M. Choudhary, Anjali S. Patel, Sonal J. Parmar Abstract Data Compression is defined as the science and art of the representation of information
More informationHorn Formulae. CS124 Course Notes 8 Spring 2018
CS124 Course Notes 8 Spring 2018 In today s lecture we will be looking a bit more closely at the Greedy approach to designing algorithms. As we will see, sometimes it works, and sometimes even when it
More informationData Compression. An overview of Compression. Multimedia Systems and Applications. Binary Image Compression. Binary Image Compression
An overview of Compression Multimedia Systems and Applications Data Compression Compression becomes necessary in multimedia because it requires large amounts of storage space and bandwidth Types of Compression
More informationText Compression. General remarks and Huffman coding Adobe pages Arithmetic coding Adobe pages 15 25
Text Compression General remarks and Huffman coding Adobe pages 2 14 Arithmetic coding Adobe pages 15 25 Dictionary coding and the LZW family Adobe pages 26 46 Performance considerations Adobe pages 47
More informationLossless Image Compression having Compression Ratio Higher than JPEG
Cloud Computing & Big Data 35 Lossless Image Compression having Compression Ratio Higher than JPEG Madan Singh madan.phdce@gmail.com, Vishal Chaudhary Computer Science and Engineering, Jaipur National
More informationA novel lossless data compression scheme based on the error correcting Hamming codes
Computers and Mathematics with Applications 56 (2008) 143 150 www.elsevier.com/locate/camwa A novel lossless data compression scheme based on the error correcting Hamming codes Hussein Al-Bahadili Department
More informationCode Compression for RISC Processors with Variable Length Instruction Encoding
Code Compression for RISC Processors with Variable Length Instruction Encoding S. S. Gupta, D. Das, S.K. Panda, R. Kumar and P. P. Chakrabarty Department of Computer Science & Engineering Indian Institute
More informationDepartment of electronics and telecommunication, J.D.I.E.T.Yavatmal, India 2
IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY LOSSLESS METHOD OF IMAGE COMPRESSION USING HUFFMAN CODING TECHNIQUES Trupti S Bobade *, Anushri S. sastikar 1 Department of electronics
More informationCompression of Concatenated Web Pages Using XBW
Compression of Concatenated Web Pages Using XBW Radovan Šesták and Jan Lánský Charles University, Faculty of Mathematics and Physics, Department of Software Engineering Malostranské nám. 25, 118 00 Praha
More informationAN ANALYTICAL STUDY OF LOSSY COMPRESSION TECHINIQUES ON CONTINUOUS TONE GRAPHICAL IMAGES
AN ANALYTICAL STUDY OF LOSSY COMPRESSION TECHINIQUES ON CONTINUOUS TONE GRAPHICAL IMAGES Dr.S.Narayanan Computer Centre, Alagappa University, Karaikudi-South (India) ABSTRACT The programs using complex
More informationEastern Mediterranean University School of Computing and Technology CACHE MEMORY. Computer memory is organized into a hierarchy.
Eastern Mediterranean University School of Computing and Technology ITEC255 Computer Organization & Architecture CACHE MEMORY Introduction Computer memory is organized into a hierarchy. At the highest
More informationImage compression. Stefano Ferrari. Università degli Studi di Milano Methods for Image Processing. academic year
Image compression Stefano Ferrari Università degli Studi di Milano stefano.ferrari@unimi.it Methods for Image Processing academic year 2017 2018 Data and information The representation of images in a raw
More informationCSE 454. Index Compression Alta Vista PageRank
CSE 454 Index Compression Alta Vista PageRank 1 Review t 1 d i q Vector Space Representation Dot Product as Similarity Metric d j t 2 TF-IDF for Computing Weights w ij = f(i,j) * log(n/n i ) Where q =
More informationInformation Retrieval. Chap 7. Text Operations
Information Retrieval Chap 7. Text Operations The Retrieval Process user need User Interface 4, 10 Text Text logical view Text Operations logical view 6, 7 user feedback Query Operations query Indexing
More informationData Compression. Guest lecture, SGDS Fall 2011
Data Compression Guest lecture, SGDS Fall 2011 1 Basics Lossy/lossless Alphabet compaction Compression is impossible Compression is possible RLE Variable-length codes Undecidable Pigeon-holes Patterns
More informationA Research Paper on Lossless Data Compression Techniques
IJIRST International Journal for Innovative Research in Science & Technology Volume 4 Issue 1 June 2017 ISSN (online): 2349-6010 A Research Paper on Lossless Data Compression Techniques Prof. Dipti Mathpal
More informationChapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,
Introduction Chapter 5 Hashing hashing performs basic operations, such as insertion, deletion, and finds in average time 2 Hashing a hash table is merely an of some fixed size hashing converts into locations
More informationOPTIMIZATION OF LZW (LEMPEL-ZIV-WELCH) ALGORITHM TO REDUCE TIME COMPLEXITY FOR DICTIONARY CREATION IN ENCODING AND DECODING
Asian Journal Of Computer Science And Information Technology 2: 5 (2012) 114 118. Contents lists available at www.innovativejournal.in Asian Journal of Computer Science and Information Technology Journal
More informationOptimized Compression and Decompression Software
2015 IJSRSET Volume 1 Issue 3 Print ISSN : 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology Optimized Compression and Decompression Software Mohd Shafaat Hussain, Manoj Yadav
More informationDistributed source coding
Distributed source coding Suppose that we want to encode two sources (X, Y ) with joint probability mass function p(x, y). If the encoder has access to both X and Y, it is sufficient to use a rate R >
More informationInternational Journal of Advanced Research in Computer Science and Software Engineering
ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: Enhanced LZW (Lempel-Ziv-Welch) Algorithm by Binary Search with
More informationHashing. Hashing Procedures
Hashing Hashing Procedures Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements
More informationWIRE/WIRELESS SENSOR NETWORKS USING K-RLE ALGORITHM FOR A LOW POWER DATA COMPRESSION
WIRE/WIRELESS SENSOR NETWORKS USING K-RLE ALGORITHM FOR A LOW POWER DATA COMPRESSION V.KRISHNAN1, MR. R.TRINADH 2 1 M. Tech Student, 2 M. Tech., Assistant Professor, Dept. Of E.C.E, SIR C.R. Reddy college
More informationDictionary-Based Fast Transform for Text Compression with High Compression Ratio
Dictionary-Based Fast for Text Compression with High Compression Ratio Weifeng Sun Amar Mukherjee School of Electrical Engineering and Computer Science University of Central Florida Orlando, FL. 32816
More informationChapter 5 VARIABLE-LENGTH CODING Information Theory Results (II)
Chapter 5 VARIABLE-LENGTH CODING ---- Information Theory Results (II) 1 Some Fundamental Results Coding an Information Source Consider an information source, represented by a source alphabet S. S = { s,
More informationImage Compression for Mobile Devices using Prediction and Direct Coding Approach
Image Compression for Mobile Devices using Prediction and Direct Coding Approach Joshua Rajah Devadason M.E. scholar, CIT Coimbatore, India Mr. T. Ramraj Assistant Professor, CIT Coimbatore, India Abstract
More informationDictionary Based Compression for Images
Dictionary Based Compression for Images Bruno Carpentieri Abstract Lempel-Ziv methods were original introduced to compress one-dimensional data (text, object codes, etc.) but recently they have been successfully
More informationUnified VLSI Systolic Array Design for LZ Data Compression
Unified VLSI Systolic Array Design for LZ Data Compression Shih-Arn Hwang, and Cheng-Wen Wu Dept. of EE, NTHU, Taiwan, R.O.C. IEEE Trans. on VLSI Systems Vol. 9, No.4, Aug. 2001 Pages: 489-499 Presenter:
More informationOrdered Indices To gain fast random access to records in a file, we can use an index structure. Each index structure is associated with a particular search key. Just like index of a book, library catalog,
More informationLossy Color Image Compression Based on Singular Value Decomposition and GNU GZIP
Lossy Color Image Compression Based on Singular Value Decomposition and GNU GZIP Jila-Ayubi 1, Mehdi-Rezaei 2 1 Department of Electrical engineering, Meraaj Institue, Salmas, Iran jila.ayubi@gmail.com
More information