A Hybrid Approach to Text Compression
Peter C. Gutmann
Computer Science, University of Auckland, New Zealand
pgut1@cs.aukuni.ac.nz

Timothy C. Bell
Computer Science, University of Canterbury, Christchurch 1, New Zealand

1 Introduction

Text compression schemes have sometimes been divided into two classes: symbolwise methods, which form a source model, typically using a finite context to predict symbols; and dictionary methods, which replace phrases (groups of symbols) in the input with a code. Symbolwise methods tend to give better compression because they form more accurate models of text, while dictionary methods tend to be faster because multiple symbols are coded at once. It is possible to decompose some dictionary methods into equivalent symbolwise methods (Langdon 1983, Bell & Witten in press). The decomposed method gives identical compression performance, but is slower because more coded symbols are transmitted. This decomposition is of interest primarily because it is helpful in making comparisons of the two methods.

In this paper we explore a hybrid approach based on the opposite of this decomposition: the predictions of a symbolwise method are grouped together so that several characters can be coded at once. The objective is to combine the good compression of symbolwise methods with the high speed of dictionary methods. The hybrid allows tradeoffs to be made in terms of compression speed, compression performance, and memory usage. More importantly, investigating a hybrid method gives extra insights into the relationship between dictionary and symbolwise methods, and reveals that they are more closely related than might be expected. The primary goal in the design of the hybrid method described here was to create a very fast system that is based on context modelling. We therefore begin by surveying techniques that have been used in the literature to achieve fast compression.
The current method of choice for very fast adaptive compressors is to use some variant of the LZ77 method (Ziv & Lempel 1977) in which the extent of the search for repeated strings is limited. In general this is accomplished by terminating the search after a predetermined number of potential matches have been checked. An extreme example of
this is LZRW1 (Williams 1991a), which hashes the next few characters of the input into a table of pointers that point back into the sliding window. A new phrase entering the window is added by overwriting any existing phrase that is stored at the same location in the hash table. Consequently, only the most recent occurrence of a phrase is stored, and even this may be lost if another phrase collides with it in the hash table. This very simple replacement strategy achieves very fast compression. The output is packed into 16-bit words to make coding even faster, with 12 bits of position information (corresponding to a window size of 4K characters), and 4 bits of length information.

LZRW2 (Williams 1991b) extends LZRW1 by storing a table of selected phrases instead of referencing the sliding window directly. The hash table entries point to a phrase table that contains pointers to the sliding window. Since the window size is no longer limited by the hash table size (the phrase table entries can point back an arbitrarily large distance), a much larger window is available, and the index can access 4K phrases instead of 4K characters. The price paid is that the decompressor has the extra overhead of maintaining the same hash table and phrase table used in the compressor, and an extra level of indirection is introduced.

LZRW3 is another refinement, which merges the hash and phrase table into one unified lookup table (Williams 1991b). In addition, LZRW3 variants store multiple pointers at each hash table location, with a commensurate decrease in the number of hash buckets so that the table is the same size. Although the reduced number of buckets leads to more collisions, the increased bucket size means that more strings can be searched for a given hash value than in the simpler versions. A bucket size of 4 or 8 phrases seems to be the best tradeoff for an overall table size of 4K entries.
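The LZRW1 matching loop described above can be sketched as follows. This is an illustrative simplification, not Williams' actual code: the hash function, token representation, and minimum match length are assumptions; only the 4K window, the single-pointer hash table with overwrite-on-collision, and the 12-bit offset / 4-bit length split come from the description above.

```python
WINDOW = 4096                      # 12-bit positions
HASH_SIZE = 4096
MIN_MATCH, MAX_MATCH = 3, 3 + 15   # a 4-bit length field stores (len - MIN_MATCH)

def lzrw1_hash(data, i):
    # Hash the next three characters into a table index (hypothetical hash).
    return ((data[i] << 8) ^ (data[i + 1] << 4) ^ data[i + 2]) % HASH_SIZE

def compress(data):
    table = [-1] * HASH_SIZE       # one pointer per bucket; collisions overwrite
    out = []                       # tokens: (is_match, payload)
    i = 0
    while i < len(data) - 2:
        h = lzrw1_hash(data, i)
        cand = table[h]
        table[h] = i               # the newest occurrence always overwrites
        length = 0
        if cand >= 0 and i - cand <= WINDOW:
            while (length < MAX_MATCH and i + length < len(data)
                   and data[cand + length] == data[i + length]):
                length += 1
        if length >= MIN_MATCH:
            out.append((True, (i - cand, length)))  # 12-bit offset, 4-bit length
            i += length
        else:
            out.append((False, data[i]))            # literal byte
            i += 1
    for j in range(i, len(data)):                   # trailing bytes as literals
        out.append((False, data[j]))
    return out
```

Because only the most recent occurrence survives in the table, a periodic input such as `b"abcabcabc"` yields three literals followed by a single copy token.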
Several strategies can be used to decide which entry in a bucket should be overwritten. Methods such as overwriting the least-recently used entry could be applied, but a particularly simple strategy that performs well is to simply overwrite a random entry. Rather than using a random number generator, a single counter can be maintained that is incremented each time a pointer is stored into a bucket. This achieves a combination of cyclic and random overwrite. Schemes based on this idea can outperform the standard Unix compress utility in terms of both compression and (generally) speed, while using an order of magnitude less memory. The compress program only outperforms LZRW3 methods on larger files in which its enormous dictionary is able to contain a more accurate model of the source statistics. Hash tables are currently widely used for Ziv-Lempel methods because they provide very fast searching for prior phrases. The number of references stored at each hash table location can be limited (saving storage and time), or this limit can be applied at search time by searching only the first few references (saving time but not storage). Collision resolution can be ignored if desired because the price paid is simply poorer compression, and not failure of the system. Hash tables can also be used for symbolwise methods, in this case to locate information about previous occurrences of contexts (Raita & Teuhola 1987). Again, the speed can be improved by ignoring collisions or limiting the extent of a search, giving a trade-off against the amount of compression.
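The cyclic replacement counter described above might look like this (a sketch; the class and method names are invented for illustration):

```python
BUCKET = 4  # pointers stored per hash value

class BucketTable:
    def __init__(self, nbuckets):
        self.slots = [[-1] * BUCKET for _ in range(nbuckets)]
        self.counter = 0          # single shared counter, no per-bucket state

    def insert(self, h, pos):
        # The counter cycles through the bucket slots across all inserts,
        # giving the combined cyclic/random overwrite without LRU book-keeping.
        self.slots[h][self.counter % BUCKET] = pos
        self.counter += 1

    def candidates(self, h):
        # Positions to try when searching for a match.
        return [p for p in self.slots[h] if p >= 0]
```

Inserting a fifth pointer into a 4-slot bucket overwrites the oldest entry only by coincidence of the counter's phase, which is the point: the policy is cheap, not exact.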
2 A hybrid symbolwise/dictionary method

[Figure 1: Index to prior text. The already-coded window is "abcdbcabdabddbc"; for each single-character context, up to k pointers index its prior occurrences, and an arrow marks the current coding position.]

Rather than maintain an explicit data structure of contexts and phrases, our hybrid method keeps a window of previously-coded text (see Figure 1). An index to the window is used to locate the phrases that are available in each context. The index could be a hash table of contexts; however, an even faster approach is to use a straight look-up table using a single symbol as the context. The look-up table contains a maximum of k pointers for each symbol, allowing k phrases to be stored for each context. This means that not all occurrences of a context will necessarily be indexed; for example, the earliest occurrence of the context b in Figure 1 is not indexed. Larger contexts could be used with a look-up table, but the cost in memory increases exponentially with the context size, and a hash table would be better for larger contexts because such an index would be very sparse.
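A minimal sketch of this index, assuming a 256-entry look-up table that keeps the k most recent pointers per single-character context (the function name and the simple FIFO update are assumptions; with k = 4 and the window of Figure 1, the earliest occurrence of context b is indeed dropped, as the text notes):

```python
K = 4  # pointers kept per context

def build_index(window):
    index = [[] for _ in range(256)]   # one slot list per single-byte context
    for i in range(len(window) - 1):
        ctx = window[i]
        bucket = index[ctx]
        bucket.append(i + 1)           # pointer to the text following the context
        if len(bucket) > K:
            bucket.pop(0)              # keep only the K most recent occurrences
    return index
```

Each pointer addresses the position just after a context occurrence, i.e. the start of a candidate phrase.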
At each coding step, the symbol that has just been encoded (an a in Figure 1) is used as the context. Previous occurrences of the context are located using the index, and the longest sequence of characters that has occurred in that context is located. In the example, the longest previous sequence is the second phrase indexed by a. This phrase is then identified in the output. This can be done using log2 k bits, since at most k prior occurrences of the context are indexed. Typically k is around eight, so only three bits are required to transmit the location. The number of characters that match is then transmitted. If the number of matching characters is zero, then an escape message is sent, and the next character is transmitted explicitly.

Decoding is very fast. The decoder maintains the same index structure, and simply copies the appropriate phrase from the current context. If suitable codes are used, this involves very few instructions for each symbol decoded. The above is a very general description of the hybrid method. In the following section we describe some specific implementations.

3 Variations of the hybrid method

The main aspects of the hybrid method described previously that are yet to be specified are the codes used for the output, and the method of updating the pointers. There are several components of the output that must be coded. The encoder must identify which of the k previous contexts is to be used. If we assume that each is equally likely then a simple code of log2 k bits is appropriate. Variable length codes could be used to favour more recent phrases, although this was not investigated because of the speed penalty of the extra book-keeping required. Coding the length of matches is more critical to compression performance, because shorter matches are considerably more likely than longer ones. One strategy is to limit the length of matches to, say, m symbols, and to code the length in log2 m bits.
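One coding step can be sketched as follows (the function name and token representation are invented; the search over at most k indexed occurrences and the escape on a zero-length match follow the description above):

```python
def encode_step(window, index, pos, max_len):
    """window[pos:] is still to be coded; window[pos - 1] is the context."""
    ctx = window[pos - 1]
    best_slot, best_len = -1, 0
    # Try each of the (at most k) indexed prior occurrences of the context
    # and keep the one whose following text matches the upcoming input longest.
    for slot, start in enumerate(index[ctx]):
        length = 0
        while (length < max_len and pos + length < len(window)
               and window[start + length] == window[pos + length]):
            length += 1
        if length > best_len:
            best_slot, best_len = slot, length
    if best_len == 0:
        return ('escape', window[pos])        # literal follows the escape
    return ('copy', best_slot, best_len)      # log2(k) + log2(m) bits to code
```

The decoder performs the mirror image: it looks up the same index, copies `best_len` characters from the identified occurrence, and so needs no search at all.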
We have also investigated a variable length code for this purpose. The best compression would be obtained by codes generated by Huffman's algorithm from sample length distributions, but this would incur a significant speed penalty.

When a context cannot be used for coding because the next symbol has never occurred in that context, an escape symbol must be transmitted. This can be represented by a single bit that is sent at each coding step, although this assumes that the probability of an escape occurring is 50%. Better compression can be obtained by transmitting the length of a match before identifying the phrase, and using a zero length to indicate an escape. If this is represented by a variable length code then a more appropriate length can be used. An advantage of single-bit flags is that they can conveniently be stored eight to a byte, which admits a faster implementation. An alternative that eliminates the need for escape codes is to always transmit the context symbol explicitly. This means that progress will be made at each coding step even if the match length is zero. Another possibility is for the escape code to switch to a literal mode, where symbols are transmitted explicitly until a second escape code switches back to
the context coding mode. This approach has been used for some Ziv-Lempel methods (e.g. Fiala and Greene 1989), but it is not suitable in this situation because escape symbols do not tend to occur in clusters. Another possibility is to use multi-bit flags at each coding step (Fenwick 1993). Such a flag could indicate more than one literal symbol, or could select from more than one representation of phrases.

A final possibility is to use a fast statistical coder (such as a table-driven Huffman coder) to represent the encoder output. This approach is used by the more successful Ziv-Lempel compression systems, which use a two-pass Huffman code to represent the output. A two-pass Huffman code gives similar compression to a single-pass adaptive one, but is many times faster for both encoding and decoding if a table-driven canonical code is used (e.g. Sieminski 1988). Other methods, such as arithmetic coding, could be used, but these tend to be slower and require relatively complex models to be maintained (Gutmann 1992).

The choice of method for updating pointers in the table of contexts also requires a compromise between compression and speed. In initial experiments we stored pointers to the k most recent occurrences of each context. The amount of book-keeping can be reduced by simply overwriting a randomly chosen pointer instead of the oldest one. The probability of consistently overwriting useful pointers is relatively low, and so this approach has relatively little effect on the compression performance. As with LZRW3, a suitable pseudo-random overwrite is achieved by a single counter that cycles through 1 to k to determine which pointer is to be replaced next.

4 Performance of the hybrid method

In this section we evaluate the effect of the different choices suggested in the previous section.
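Returning to the single-bit escape flags of the previous section: packing them eight to a byte can be done by reserving a flag byte in the output stream and filling in its bits as coding proceeds, so that no code crosses a byte boundary. A sketch (the class name and bit order are assumptions):

```python
class FlagWriter:
    def __init__(self):
        self.out = bytearray()
        self.flag_pos = -1   # byte index reserved for the current group of 8 flags
        self.nflags = 8      # forces allocation of a flag byte on the first flag

    def put_flag(self, bit):
        if self.nflags == 8:               # start a fresh flag byte
            self.flag_pos = len(self.out)
            self.out.append(0)
            self.nflags = 0
        if bit:                            # set the next bit in the reserved byte
            self.out[self.flag_pos] |= 1 << self.nflags
        self.nflags += 1

    def put_byte(self, b):
        # Literals and codewords are emitted between flags, byte-aligned.
        self.out.append(b)
```

Because the flag byte is written in place, the decoder can read eight flags at once and then consume the byte-aligned payloads that follow.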
In order to determine how the parameters k (the number of pointers stored for each context) and m (the maximum match length) should be chosen, a simple hybrid compressor was implemented with a one-bit escape code followed by an 8-bit representation of the next character. The k pointers for each context were maintained using the pseudo-random cyclic overwrite. The system was used to compress the files of the Calgary corpus (Bell et al. 1990). Figure 2 shows how the compression performance of this method depends on k and m. The graph shows the average (unweighted) compression over all the files in the corpus. Figure 2 shows that compression improves as the number of phrases stored in each context increases, although the returns are diminishing by the time k = 64. The disadvantage of increasing k is that it causes a corresponding increase in encoding time due to the overall increase in the number of strings to search for matches, and it also requires more memory. If k is large and encoding speed is a problem then a more complex strategy than the simple linear search of the k entries could be used. The compression performance is relatively insensitive to the maximum match length, m, provided that it is greater than 8. The degradation in compression for longer matches could be avoided by using variable length codes for the match lengths, and we explore a simple form of this later.
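The variable-length match-length codes mentioned above are not specified in the paper; one plausible choice, which gives the short lengths that dominate the shortest codewords, is an Elias-gamma-style code (an illustrative assumption, not the authors' code):

```python
def encode_length(n):
    """Encode n >= 1 as Elias gamma: a unary prefix of zeros giving the
    bit-length of n, followed by n in binary (the leading 1 is shared)."""
    bits = n.bit_length()
    return '0' * (bits - 1) + bin(n)[2:]

def decode_length(code):
    # Count the zeros to recover the bit-length, then read that many bits.
    zeros = len(code) - len(code.lstrip('0'))
    return int(code[zeros:zeros + zeros + 1], 2)
```

Length 1 costs a single bit, length 4 costs five bits, and the cost grows only logarithmically, which is why such codes avoid the degradation seen with a fixed log2 m field when m is large.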
[Figure 2: Compression of a simple hybrid coder (bits per character) averaged over the Calgary corpus, plotted against the maximum match length m for several values of k, the number of phrases per context.]

To simplify coding, it is convenient if the phrase identifier and the match length can fit into one byte, that is, log2 k + log2 m = 8. The best compression that satisfies this constraint occurs when k = 32 and m = 8, where the corpus files are compressed to 4.12 bits per character on average (that is, files are reduced to approximately half their original size). This is remarkably good performance considering the speed and simplicity of the scheme, particularly for decoding.

A multi-bit encoding has been evaluated, where a two-bit flag is sent at each coding step. Table 1 shows how the four values of the flag are interpreted. Two of the values correspond to the two values of the one-bit flag used previously; the other two values are used for shorter encodings of literals and codewords respectively. The short literal coding represents characters in six bits. These characters are selected from an adaptive list of the 64 most recently used symbols. The short codeword still has log2 k bits to choose the phrase, but has fewer bits to represent the length. This takes advantage of the high frequency of short lengths.

  Flag   Use
  00     8 bits: literal
  01     6 bits: literal from the 64 most recently used symbols
  10     8 bits: index + length
  11     6 bits: index + short length

Table 1: Interpretation of the two-bit flag

The size of the flags has been chosen so that they can conveniently be packed into bytes to enable processing to be fast (Fenwick 1993). Four flags are packed into one byte, and they are also stored in the two remaining bits of the 6-bit literals and codewords. Using two-bit flags is equivalent to a two-level coding method for characters and lengths, and so indicates the kind of improvement that can be expected from moving to variable length codes.
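The adaptive list of the 64 most recently used symbols behind the short-literal code can be sketched as follows; the paper does not give the update rule, so the move-to-front policy and the names here are assumptions:

```python
class MRUList:
    def __init__(self, size=64):
        self.size = size
        self.items = []              # most recently used symbol first

    def code(self, sym):
        """Return ('short', index) on a hit (a 6-bit code for size 64),
        or ('long', sym) on a miss (the full 8-bit literal is sent)."""
        if sym in self.items:
            i = self.items.index(sym)
            self.items.pop(i)
            self.items.insert(0, sym)   # move-to-front keeps the list adaptive
            return ('short', i)
        self.items.insert(0, sym)
        if len(self.items) > self.size:
            self.items.pop()            # evict the least recently used symbol
        return ('long', sym)
    ```

Since the decoder sees the same sequence of symbols, it can maintain an identical list and interpret the 6-bit indices without any side information.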
The use of two-bit flags achieves compression of 3.57 bits per character (bpc) averaged over the Calgary corpus. This compares with 3.83 bpc for the best parameters using single-bit flags. Table 2 shows comparable results achieved by other fast compression methods.

[Table 2: Compression performance of fast methods, averaged over the Calgary corpus. The entries include the hybrid method with 2-bit flags at 3.57 bpc and Unix compress; the remaining figures are illegible apart from a value of 2.70 bpc.]

The compression of the hybrid method is not quite as good as that of Unix compress, but it has the advantage that the output fits conveniently into bytes and so is able to operate faster. The Gzip method is one of the best Ziv-Lempel based methods currently available, implemented by GNU. The version reported here was set for best compression. It achieves superior compression to the hybrid method at the expense of a more extensive search for matching phrases, and using two passes to generate Huffman codes for the output.

The hybrid method described here is very fast for both encoding and decoding. Searching for matches involves evaluating just k matches, where k is typically between 8 and 64. Even faster coding is possible by reducing k. Decoding requires just two indirections to locate the phrase to be copied. Input and output are fast because no codes cross byte boundaries, and codes are easily inserted and extracted within bytes.

The memory requirements of the hybrid method are relatively low. Most of the memory is consumed by the window of prior characters and the index structure. A window of about 8 Kbytes is suitable (the experiments reported above used a window of 32K, which is slightly better). If k = 32 then the index uses 16 Kbytes of memory.

Analysing the output of the hybrid method reveals that the literal characters (i.e. escapes to zero-order) occur frequently, often more frequently than coded phrases.
Presumably this is because phrases tend to end when a novel character occurs, and so the first character at each coding step is less likely to have occurred in the first-order context. This suggests the possibility of always encoding a literal character, eliminating the need for the escape flag. Initial experiments have indicated that this degrades compression by just a few percent.

5 Conclusion

The idea of a hybrid between dictionary and symbolwise methods has several consequences. It demonstrates that the two approaches are more closely related than might be expected. The implementation described here suggests analogies between the two classes. For example, the escape symbol used by context coders performs a function that is analogous to that of the literal flag used by Ziv-Lempel methods. Characters that are difficult to predict cause a phrase boundary to occur in the hybrid method, indicating a
correlation between low probability symbols and phrase boundaries in Ziv-Lempel coders. The index that is used to locate previous occurrences of phrases for a Ziv-Lempel method is closely related to the data structure used by a context-based method for keeping statistics about the contexts. Ziv-Lempel coders use several different strategies to determine which phrases will be made available for coding; likewise, symbolwise models must determine which contexts are the most useful to store. These issues also get caught up with compromises caused by the choice of data structure used, such as a hash table, a trie (digital search tree), or a simple look-up table. These relationships raise the possibility of a very general model that includes the two approaches, and hybrid methods, as special cases. This in turn will help to formalise the continuum of tradeoffs between compression performance, memory usage, and speed.

In our investigation we have pursued speed rather than compression performance, and have created a context-based method that is extremely fast. Many other trade-offs between speed and compression performance are possible, and work is continuing on this. For example, using a Huffman code for the output is likely to give significantly better compression. Fast approximate arithmetic coders might also be used for this purpose.

Acknowledgements

The authors are grateful to Timo Raita and Ross Williams for helpful comments on this work.

References

Bell, T. C. A unifying theory and improvements for existing approaches to text compression. Ph.D. thesis, Department of Computer Science, University of Canterbury, New Zealand.

Bell, T. C., Cleary, J. G. and Witten, I. H. (1990) Text Compression. Prentice Hall, Englewood Cliffs, NJ.

Bell, T. C. and Witten, I. H. The relationship between greedy parsing and symbolwise text compression. Journal of the ACM, in press.

Fenwick, P. (1993) Ziv-Lempel coding with multi-bit flags. Proceedings of DCC '93, April 1993, p. 138.

Fiala, E. R. and Greene, D. H. (1989) Data compression with finite windows. Communications of the ACM, (4).

Gutmann, P. (1992) Practical dictionary/arithmetic data compression synthesis. MSc thesis, University of Auckland, February 1992.

Langdon, G. G. (1983) A note on the Ziv-Lempel model for compressing individual sequences. IEEE Transactions on Information Theory, (2).

Raita, T. and Teuhola, J. (1987) Predictive text compression by hashing. Proceedings of the Tenth Annual International ACM SIGIR Conference, New Orleans.

Raita, T. (1987) Generalized coding algorithms for predictive text compression. Report AY7, Department of Computer Science, University of Turku, Finland.

Sieminski, A. (1988) Fast decoding of Huffman codes. Information Processing Letters, 26(5), January 1988, p. 237.

Williams, R. N. (1991a) An extremely fast Ziv-Lempel data compression algorithm. Proceedings of DCC '91, p. 362.

Williams, R. N. (1991b) Notes on the LZRW3 algorithm. Posted to the Usenet comp.compression newsgroup, June 1991.

Ziv, J. and Lempel, A. (1977) A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23(3).
More information5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing. 6. Meta-heuristic Algorithms and Rectangular Packing
1. Introduction 2. Cutting and Packing Problems 3. Optimisation Techniques 4. Automated Packing Techniques 5. Computational Geometry, Benchmarks and Algorithms for Rectangular and Irregular Packing 6.
More informationCompression. storage medium/ communications network. For the purpose of this lecture, we observe the following constraints:
CS231 Algorithms Handout # 31 Prof. Lyn Turbak November 20, 2001 Wellesley College Compression The Big Picture We want to be able to store and retrieve data, as well as communicate it with others. In general,
More informationImage Compression - An Overview Jagroop Singh 1
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 5 Issues 8 Aug 2016, Page No. 17535-17539 Image Compression - An Overview Jagroop Singh 1 1 Faculty DAV Institute
More informationDEFLATE COMPRESSION ALGORITHM
DEFLATE COMPRESSION ALGORITHM Savan Oswal 1, Anjali Singh 2, Kirthi Kumari 3 B.E Student, Department of Information Technology, KJ'S Trinity College Of Engineering and Research, Pune, India 1,2.3 Abstract
More informationLempel-Ziv-Welch (LZW) Compression Algorithm
Lempel-Ziv-Welch (LZW) Compression lgorithm Introduction to the LZW lgorithm Example 1: Encoding using LZW Example 2: Decoding using LZW LZW: Concluding Notes Introduction to LZW s mentioned earlier, static
More informationLIPT-Derived Transform Methods Used in Lossless Compression of Text Files
ROMANIAN JOURNAL OF INFORMATION SCIENCE AND TECHNOLOGY Volume 14, Number 2, 2011, 149 158 LIPT-Derived Transform Methods Used in Lossless Compression of Text Files Radu RĂDESCU Politehnica University of
More informationPREDICTIVE CODING WITH NEURAL NETS: APPLICATION TO TEXT COMPRESSION
PREDICTIVE CODING WITH NEURAL NETS: APPLICATION TO TEXT COMPRESSION J iirgen Schmidhuber Fakultat fiir Informatik Technische Universitat Miinchen 80290 Miinchen, Germany Stefan Heil Abstract To compress
More informationProgram Construction and Data Structures Course 1DL201 at Uppsala University Autumn 2010 / Spring 2011 Homework 6: Data Compression
Program Construction and Data Structures Course 1DL201 at Uppsala University Autumn 2010 / Spring 2011 Homework 6: Data Compression Prepared by Pierre Flener Lab: Thursday 17 February 2011 Submission Deadline:
More informationCHAPTER II LITERATURE REVIEW
CHAPTER II LITERATURE REVIEW 2.1 BACKGROUND OF THE STUDY The purpose of this chapter is to study and analyze famous lossless data compression algorithm, called LZW. The main objective of the study is to
More informationDigital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay
Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 29 Source Coding (Part-4) We have already had 3 classes on source coding
More informationA Quality of Service Decision Model for ATM-LAN/MAN Interconnection
A Quality of Service Decision for ATM-LAN/MAN Interconnection N. Davies, P. Francis-Cobley Department of Computer Science, University of Bristol Introduction With ATM networks now coming of age, there
More informationOn Additional Constrains in Lossless Compression of Text Files
ROMANIAN JOURNAL OF INFORMATION SCIENCE AND TECHNOLOGY Volume 18, Number 4, 2015, 299 311 On Additional Constrains in Lossless Compression of Text Files Radu RĂDESCU Politehnica University of Bucharest,
More informationIndexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton
Indexing Index Construction CS6200: Information Retrieval Slides by: Jesse Anderton Motivation: Scale Corpus Terms Docs Entries A term incidence matrix with V terms and D documents has O(V x D) entries.
More informationThe PackBits program on the Macintosh used a generalized RLE scheme for data compression.
Tidbits on Image Compression (Above, Lena, unwitting data compression spokeswoman) In CS203 you probably saw how to create Huffman codes with greedy algorithms. Let s examine some other methods of compressing
More informationT. Bell and K. Pawlikowski University of Canterbury Christchurch, New Zealand
The effect of data compression on packet sizes in data communication systems T. Bell and K. Pawlikowski University of Canterbury Christchurch, New Zealand Abstract.?????????? 1. INTRODUCTION Measurements
More informationAdaptive Compression of Graph Structured Text
Adaptive Compression of Graph Structured Text John Gilbert and David M Abrahamson Department of Computer Science Trinity College Dublin {gilberj, david.abrahamson}@cs.tcd.ie Abstract In this paper we introduce
More informationCS/COE 1501
CS/COE 1501 www.cs.pitt.edu/~lipschultz/cs1501/ Compression What is compression? Represent the same data using less storage space Can get more use out a disk of a given size Can get more use out of memory
More informationMemory Design. Cache Memory. Processor operates much faster than the main memory can.
Memory Design Cache Memory Processor operates much faster than the main memory can. To ameliorate the sitution, a high speed memory called a cache memory placed between the processor and main memory. Barry
More informationThe Effect of Flexible Parsing for Dynamic Dictionary Based Data Compression
The Effect of Flexible Parsing for Dynamic Dictionary Based Data Compression Yossi Matias Nasir Rajpoot Süleyman Cenk Ṣahinalp Abstract We report on the performance evaluation of greedy parsing with a
More informationEngineering Mathematics II Lecture 16 Compression
010.141 Engineering Mathematics II Lecture 16 Compression Bob McKay School of Computer Science and Engineering College of Engineering Seoul National University 1 Lossless Compression Outline Huffman &
More informationIndexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze
Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information
More informationV.2 Index Compression
V.2 Index Compression Heap s law (empirically observed and postulated): Size of the vocabulary (distinct terms) in a corpus E[ distinct terms in corpus] n with total number of term occurrences n, and constants,
More informationImproving LZW Image Compression
European Journal of Scientific Research ISSN 1450-216X Vol.44 No.3 (2010), pp.502-509 EuroJournals Publishing, Inc. 2010 http://www.eurojournals.com/ejsr.htm Improving LZW Image Compression Sawsan A. Abu
More informationOptimal Parsing. In Dictionary-Symbolwise. Compression Algorithms
Università degli Studi di Palermo Facoltà Di Scienze Matematiche Fisiche E Naturali Tesi Di Laurea In Scienze Dell Informazione Optimal Parsing In Dictionary-Symbolwise Compression Algorithms Il candidato
More informationStudy of LZ77 and LZ78 Data Compression Techniques
Study of LZ77 and LZ78 Data Compression Techniques Suman M. Choudhary, Anjali S. Patel, Sonal J. Parmar Abstract Data Compression is defined as the science and art of the representation of information
More informationHorn Formulae. CS124 Course Notes 8 Spring 2018
CS124 Course Notes 8 Spring 2018 In today s lecture we will be looking a bit more closely at the Greedy approach to designing algorithms. As we will see, sometimes it works, and sometimes even when it
More informationData Compression. An overview of Compression. Multimedia Systems and Applications. Binary Image Compression. Binary Image Compression
An overview of Compression Multimedia Systems and Applications Data Compression Compression becomes necessary in multimedia because it requires large amounts of storage space and bandwidth Types of Compression
More informationText Compression. General remarks and Huffman coding Adobe pages Arithmetic coding Adobe pages 15 25
Text Compression General remarks and Huffman coding Adobe pages 2 14 Arithmetic coding Adobe pages 15 25 Dictionary coding and the LZW family Adobe pages 26 46 Performance considerations Adobe pages 47
More informationLossless Image Compression having Compression Ratio Higher than JPEG
Cloud Computing & Big Data 35 Lossless Image Compression having Compression Ratio Higher than JPEG Madan Singh madan.phdce@gmail.com, Vishal Chaudhary Computer Science and Engineering, Jaipur National
More informationA novel lossless data compression scheme based on the error correcting Hamming codes
Computers and Mathematics with Applications 56 (2008) 143 150 www.elsevier.com/locate/camwa A novel lossless data compression scheme based on the error correcting Hamming codes Hussein Al-Bahadili Department
More informationCode Compression for RISC Processors with Variable Length Instruction Encoding
Code Compression for RISC Processors with Variable Length Instruction Encoding S. S. Gupta, D. Das, S.K. Panda, R. Kumar and P. P. Chakrabarty Department of Computer Science & Engineering Indian Institute
More informationDepartment of electronics and telecommunication, J.D.I.E.T.Yavatmal, India 2
IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY LOSSLESS METHOD OF IMAGE COMPRESSION USING HUFFMAN CODING TECHNIQUES Trupti S Bobade *, Anushri S. sastikar 1 Department of electronics
More informationCompression of Concatenated Web Pages Using XBW
Compression of Concatenated Web Pages Using XBW Radovan Šesták and Jan Lánský Charles University, Faculty of Mathematics and Physics, Department of Software Engineering Malostranské nám. 25, 118 00 Praha
More informationAN ANALYTICAL STUDY OF LOSSY COMPRESSION TECHINIQUES ON CONTINUOUS TONE GRAPHICAL IMAGES
AN ANALYTICAL STUDY OF LOSSY COMPRESSION TECHINIQUES ON CONTINUOUS TONE GRAPHICAL IMAGES Dr.S.Narayanan Computer Centre, Alagappa University, Karaikudi-South (India) ABSTRACT The programs using complex
More informationEastern Mediterranean University School of Computing and Technology CACHE MEMORY. Computer memory is organized into a hierarchy.
Eastern Mediterranean University School of Computing and Technology ITEC255 Computer Organization & Architecture CACHE MEMORY Introduction Computer memory is organized into a hierarchy. At the highest
More informationImage compression. Stefano Ferrari. Università degli Studi di Milano Methods for Image Processing. academic year
Image compression Stefano Ferrari Università degli Studi di Milano stefano.ferrari@unimi.it Methods for Image Processing academic year 2017 2018 Data and information The representation of images in a raw
More informationCSE 454. Index Compression Alta Vista PageRank
CSE 454 Index Compression Alta Vista PageRank 1 Review t 1 d i q Vector Space Representation Dot Product as Similarity Metric d j t 2 TF-IDF for Computing Weights w ij = f(i,j) * log(n/n i ) Where q =
More informationInformation Retrieval. Chap 7. Text Operations
Information Retrieval Chap 7. Text Operations The Retrieval Process user need User Interface 4, 10 Text Text logical view Text Operations logical view 6, 7 user feedback Query Operations query Indexing
More informationData Compression. Guest lecture, SGDS Fall 2011
Data Compression Guest lecture, SGDS Fall 2011 1 Basics Lossy/lossless Alphabet compaction Compression is impossible Compression is possible RLE Variable-length codes Undecidable Pigeon-holes Patterns
More informationA Research Paper on Lossless Data Compression Techniques
IJIRST International Journal for Innovative Research in Science & Technology Volume 4 Issue 1 June 2017 ISSN (online): 2349-6010 A Research Paper on Lossless Data Compression Techniques Prof. Dipti Mathpal
More informationChapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,
Introduction Chapter 5 Hashing hashing performs basic operations, such as insertion, deletion, and finds in average time 2 Hashing a hash table is merely an of some fixed size hashing converts into locations
More informationOPTIMIZATION OF LZW (LEMPEL-ZIV-WELCH) ALGORITHM TO REDUCE TIME COMPLEXITY FOR DICTIONARY CREATION IN ENCODING AND DECODING
Asian Journal Of Computer Science And Information Technology 2: 5 (2012) 114 118. Contents lists available at www.innovativejournal.in Asian Journal of Computer Science and Information Technology Journal
More informationOptimized Compression and Decompression Software
2015 IJSRSET Volume 1 Issue 3 Print ISSN : 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology Optimized Compression and Decompression Software Mohd Shafaat Hussain, Manoj Yadav
More informationDistributed source coding
Distributed source coding Suppose that we want to encode two sources (X, Y ) with joint probability mass function p(x, y). If the encoder has access to both X and Y, it is sufficient to use a rate R >
More informationInternational Journal of Advanced Research in Computer Science and Software Engineering
ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: Enhanced LZW (Lempel-Ziv-Welch) Algorithm by Binary Search with
More informationHashing. Hashing Procedures
Hashing Hashing Procedures Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements
More informationWIRE/WIRELESS SENSOR NETWORKS USING K-RLE ALGORITHM FOR A LOW POWER DATA COMPRESSION
WIRE/WIRELESS SENSOR NETWORKS USING K-RLE ALGORITHM FOR A LOW POWER DATA COMPRESSION V.KRISHNAN1, MR. R.TRINADH 2 1 M. Tech Student, 2 M. Tech., Assistant Professor, Dept. Of E.C.E, SIR C.R. Reddy college
More informationDictionary-Based Fast Transform for Text Compression with High Compression Ratio
Dictionary-Based Fast for Text Compression with High Compression Ratio Weifeng Sun Amar Mukherjee School of Electrical Engineering and Computer Science University of Central Florida Orlando, FL. 32816
More informationChapter 5 VARIABLE-LENGTH CODING Information Theory Results (II)
Chapter 5 VARIABLE-LENGTH CODING ---- Information Theory Results (II) 1 Some Fundamental Results Coding an Information Source Consider an information source, represented by a source alphabet S. S = { s,
More informationImage Compression for Mobile Devices using Prediction and Direct Coding Approach
Image Compression for Mobile Devices using Prediction and Direct Coding Approach Joshua Rajah Devadason M.E. scholar, CIT Coimbatore, India Mr. T. Ramraj Assistant Professor, CIT Coimbatore, India Abstract
More informationDictionary Based Compression for Images
Dictionary Based Compression for Images Bruno Carpentieri Abstract Lempel-Ziv methods were original introduced to compress one-dimensional data (text, object codes, etc.) but recently they have been successfully
More informationUnified VLSI Systolic Array Design for LZ Data Compression
Unified VLSI Systolic Array Design for LZ Data Compression Shih-Arn Hwang, and Cheng-Wen Wu Dept. of EE, NTHU, Taiwan, R.O.C. IEEE Trans. on VLSI Systems Vol. 9, No.4, Aug. 2001 Pages: 489-499 Presenter:
More informationOrdered Indices To gain fast random access to records in a file, we can use an index structure. Each index structure is associated with a particular search key. Just like index of a book, library catalog,
More informationLossy Color Image Compression Based on Singular Value Decomposition and GNU GZIP
Lossy Color Image Compression Based on Singular Value Decomposition and GNU GZIP Jila-Ayubi 1, Mehdi-Rezaei 2 1 Department of Electrical engineering, Meraaj Institue, Salmas, Iran jila.ayubi@gmail.com
More information