A General-Purpose Compression Scheme for Databases

Adam Cannane, Hugh E. Williams, Justin Zobel
Department of Computer Science, RMIT University, GPO Box 2476V, Melbourne 3001, Australia

Abstract

Current adaptive compression schemes such as gzip and compress are impractical for database compression as they do not allow random access to individual records. The sequitur scheme of Nevill-Manning and Witten also adaptively compresses data, achieving excellent compression but with significant main-memory requirements. A preliminary version of sequitur used a semi-static modeling approach to achieve slightly worse compression than the adaptive approach. We describe a new variant of the semi-static sequitur algorithm, ray, that reduces main-memory use and is a candidate for general-purpose compression and random access to databases. We show that ray achieves better compression than an efficient Huffman scheme and popular adaptive compression techniques.

Keywords: database compression, semi-static modeling

Proceedings of the 1998 Computer Science Postgraduate Students Conference, Royal Melbourne Institute of Technology, Melbourne, Australia, December 8, 1998.

1 Introduction

A general-purpose database is a collection of data, stored as text, images, sound, or binary information such as numbers. Users interact with a database by posing queries to retrieve individual records or sets of records within the collection. General-purpose databases place continuing demands on disk space as more data is stored online. An effective way of reducing the disk space occupied by a database is to apply a compression algorithm to the data.

One aim of compression is to reduce storage requirements [5]. For text databases, however, compression schemes can also allow retrieval of data to be faster than with uncompressed data, since the computational cost of decompression can be offset by reductions in disk seeking and transfer costs [6, 11]. Popular compression algorithms, such as gzip and compress, significantly reduce the storage space required by general-purpose database systems. However, these algorithms use adaptive techniques to compress the data. Adaptive compression schemes are impractical for use in database systems as they cannot both achieve good compression ratios and allow random access to individual records. Moreover, at present there is no efficient compression algorithm for general-purpose data held in database systems.

A compression algorithm for general-purpose database systems must address the problem of randomly accessing and individually decompressing records, while maintaining compact storage of data. The algorithm must also use a lossless compression technique, since data such as English text is stored in general-purpose database systems. Importantly, as users expect fast performance in response to queries, decompression must also be fast.

In this paper we describe an alternative compression technique to adaptive modeling that allows random access to data and atomic decompression of records. This semi-static modeling approach, which we call ray, is a variation of the sequitur algorithm of Nevill-Manning et al. [7, 8, 9]. Ray models repetition in sequences by progressively constructing a hierarchical grammar with multiple passes through the data. In contrast, the sequitur algorithm alters the grammar as each new symbol is sequentially inspected and, after a single pass of the data, codes the resultant interleaved grammar and data using an adaptive scheme.
The multiple-pass approach of ray uses statistics on character-pair repetition, that is, digram frequencies, to create rules in the grammar. With each pass through the data, the depth of the rule hierarchy is increased, higher-level redundancy is detected, and rules are substituted for candidate digrams to achieve further compression. After each pass, the current grammar is encoded and compression can optionally be stopped. In our experiments we have found that ray has practical main-memory requirements, and we believe a production implementation will improve this further. While our preliminary implementation is not especially fast, the multi-pass approach permits reductions in compression time, at the cost of affecting compression performance, by limiting the number of passes.

Each pass of the data improves the overall compression achieved and, in almost all cases, the compression achieved is better than that of the adaptive methods gzip and compress, and better than that of an efficiently implemented Huffman coding scheme.

This paper is organised as follows. Modeling data for compression is introduced in Section 2. General-purpose database compression is discussed in Section 3. The sequitur algorithm is introduced in Section 4. In Section 5 we describe our ray compression algorithm. We present our experimental results in Section 6 and our conclusions in Section 7.

2 Modeling Data for Compression

Modeling involves generating a representation of the distinct symbols in the data. A model stores information on how often each symbol occurred in the data, as a probability. It is used by an encoder to construct a code for each symbol, based on the probabilities. The same model is then used by the decoder to reproduce the original data from the compressed symbols.

There are three main types of models: static models, which remain the same for each symbol of the sequence; adaptive or dynamic models, which alter the model after each symbol is inspected; and semi-static models, which change during encoding but remain static during decoding.

Static models make assumptions about data that has not yet been seen; they are built by considering the average probability distribution derived from previously gathered statistics on similar data. Compression becomes inefficient when the model provides a probability distribution that is a poor approximation of each symbol's frequency in the data. A static model performs poorly when the data being compressed is dissimilar to the data used to first produce the model, and is unacceptable for general-purpose compression as it cannot approximate symbol probabilities accurately for the different types of data.

Adaptive models address the problem of poor probability distributions by recalculating the distribution after each new symbol of the data is inspected, taking advantage of the local properties of the data. Since the probability distribution for an adaptive model alters after each symbol, all preceding symbols need to be decoded to determine the model for any symbol in the sequence. A limitation of adaptive models is therefore that, to reconstruct the original sequence, the encoded symbols must be decoded from the beginning.

A compromise approach to modeling data is to use a semi-static model. A semi-static model requires two passes over the data: an initial pass to gather the statistics necessary to build the model, and a second pass to encode the symbols according to the model created during the first pass. The characteristics of semi-static modeling combine the benefits of adaptive and static models. A semi-static model makes good use of specific properties of the data while remaining static during decoding, allowing independent decompression. The disadvantage of a semi-static model is that two passes of the data are required and the model parameters need to be stored with the compressed data [6].
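To make the two-pass approach concrete, the following minimal sketch gathers symbol statistics over a whole collection and then encodes each record against the resulting fixed model. The helper names are hypothetical, and the fixed-width coder is only a placeholder for a minimum-redundancy code; this is an illustration, not the implementation of any scheme discussed in this paper.

    from collections import Counter

    def build_model(records):
        # Pass 1: gather global symbol statistics for the collection.
        freqs = Counter()
        for record in records:
            freqs.update(record)
        return freqs

    def assign_code(freqs):
        # Placeholder coder: a real system would derive a minimum-
        # redundancy (Huffman) code from freqs; fixed-width binary
        # keeps the sketch self-contained.
        width = max(1, (len(freqs) - 1).bit_length())
        return {s: format(i, f"0{width}b")
                for i, (s, _) in enumerate(freqs.most_common())}

    def encode_record(record, code):
        # Pass 2: the model is now fixed, so each record is encoded,
        # and can later be decoded, independently of all other records.
        return "".join(code[s] for s in record)

Because the model never changes after the first pass, any record can be decompressed in isolation, which is exactly the property adaptive schemes lack.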
3 General-purpose database compression

A database requires a compression scheme that allows efficient retrieval of the compressed data, as well as producing a saving in storage costs. A scheme that achieves optimal compression, yet is unreasonably slow during decompression, is not practical. Moreover, a lossless approach is required in applications where loss of the original data is unacceptable. We therefore restrict ourselves to lossless compression schemes for general-purpose database compression, as we are interested in investigating universal schemes that can be used for all data types.

Modern compression techniques are generally adaptive, which avoids the transmission of a model and allows data to be compressed in a single pass. Databases, however, are divided into records, documents, or other easily segmented components, and it is necessary that these components can be decompressed independently. In general, adaptive techniques are not effective for database applications, since they code data as a function of both the preceding symbols and the initial probability distribution [6]. Similarly, an adaptive code, such as arithmetic coding, is not practical for database compression as it is too slow [1]. An adaptive code would have to encode each record individually to maintain independent decompressibility, thereby limiting the compression achieved. To allow atomic decompression, as well as fast access to data, a practical approach is therefore to have a single model for the entire database.

Semi-static models use a single model for the entire database, allowing random access and fast decompression. However, semi-static modeling approaches require two passes over the data. This does not disadvantage such compression techniques for largely static databases, since the resources required for compression are less important, provided decompression and retrieval remain fast. A disadvantage, however, is that a semi-static scheme requires that the model be stored with the database. Semi-static approaches, such as efficient implementations of canonical Huffman coding using ternary trie structures [2, 3], permit fast random access to large databases of text.
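The canonical assignment used by such coders can be sketched as follows: given each symbol's code length (as produced by a standard Huffman construction over the gathered frequencies), codewords are assigned in sequence, shortest lengths first, so that the decoding table can be rebuilt from the lengths alone. This is an illustrative sketch under that assumption, not the huffword implementation.

    def canonical_codes(lengths):
        # lengths maps symbol -> code length in bits; the lengths must
        # satisfy the Kraft inequality (as Huffman lengths do).
        codes, code, prev_len = {}, 0, 0
        for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
            code <<= length - prev_len      # extend to the next code length
            codes[sym] = format(code, f"0{length}b")
            code += 1
            prev_len = length
        return codes

    # canonical_codes({"a": 1, "b": 2, "c": 2}) -> {"a": "0", "b": "10", "c": "11"}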

However, there are few techniques that work well for general-purpose data and use semi-static modeling. A recent compression scheme, sequitur [9], uses semi-static modeling and is likely to be a good candidate for adaptation to general-purpose compression of databases. The sequitur method identifies structure within the data, which is ideal for databases, as most are structured. We describe sequitur in the next section.

4 The Sequitur Algorithm

Sequitur [7, 8, 9] forms a grammar from a sequence, identifying repeated phrases in the input data. It has been shown that the detection of repeated phrases performs well as a compression scheme [9], and sequitur is likely to be adaptable as a compression scheme for databases since, first, it uses semi-static modeling, so individual parts of the database can be decoded independently and, second, it has been shown to achieve good compression for large collections. It is also likely that a special-purpose implementation of sequitur would offer both fast compression and reasonable main-memory requirements.

Compression with sequitur is achieved by removing repetitions, where repetitions are identical non-overlapping subsequences of the original input. Smaller repetitions often occur within matching subsequences; the smallest possible repetition detected by sequitur is two consecutive symbols, or a digram. By storing repeated digrams at the base of a hierarchy, a hierarchical structure is formed that can be used to identify longer repeated subsequences. This hierarchy for a sequence is represented as a grammar. Sequitur is a dictionary-based scheme, with each dictionary entry corresponding to a rule from the grammar. The dictionary is adaptive during the first pass over the data, as the model alters as sequitur sequentially inspects each symbol. Overlapping digrams are inspected by sequitur, where consecutive input characters are treated as a single symbol. Two simple constraints are enforced:

1. Digram uniqueness: no digram may appear in the grammar more than once.

2. Rule utility: a rule in the grammar must be referred to at least twice.

Digram uniqueness identifies repetition of two symbols by creating a dictionary entry that is referenced by both occurrences of the digram. Rule utility permits the identification of repetitions that are longer than two symbols, by eliminating unnecessary rules.

An example of the application of sequitur to the input string "rubdubrubdub" is shown in Figure 1. In this example, four of the eleven steps in the processing of the input are shown. The first step, the reading of the input "ru", creates the first unique digram and the first element of the grammar, rule 1. The second step shows processing after the reading of "rubdub" and the identification of four unique digrams; this step violates the digram uniqueness constraint through the addition of a second occurrence of "ub" and thereby creates rule 2. The third step shows similar processing after a violation of digram uniqueness, but also the elimination or unfolding of a rule through the violation of the rule utility constraint. The final step shown is the complete sequitur grammar and unique digram list for the input string; the result is the identification and coding of the two repeated subsequences "rubdub".
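As a rough illustration of how digram uniqueness drives grammar construction, the following naive sketch repeatedly extracts any digram that occurs twice without overlapping. Unlike the real, incremental sequitur algorithm it rescans the sequence on every substitution and does not enforce rule utility; on the example input it therefore keeps rules 3 and 4, which Figure 1's final grammar unfolds.

    def naive_digram_uniqueness(seq):
        # seq is a list of symbols; returns the final first rule and
        # the rule set. Rule names are the strings "2", "3", ... and
        # must not collide with input symbols.
        rules, next_rule = {}, 2
        changed = True
        while changed:
            changed = False
            seen = {}
            for i in range(len(seq) - 1):
                d = (seq[i], seq[i + 1])
                if d in seen and seen[d] < i - 1:     # non-overlapping repeat
                    name = str(next_rule)
                    next_rule += 1
                    rules[name] = list(d)
                    out, j = [], 0
                    while j < len(seq):               # substitute every occurrence
                        if tuple(seq[j:j + 2]) == d:
                            out.append(name)
                            j += 2
                        else:
                            out.append(seq[j])
                            j += 1
                    seq, changed = out, True
                    break
                seen.setdefault(d, i)
        return seq, rules

    # naive_digram_uniqueness(list("rubdubrubdub"))
    # -> (['5', '5'],
    #     {'2': ['u', 'b'], '3': ['r', '2'], '4': ['3', 'd'], '5': ['4', '2']})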
5 Ray

We describe in this section a new approach to compression for general-purpose databases that is based on the sequitur approach. This new algorithm, which we call ray, is a multi-pass adaptation of sequitur.

Our motivation in proposing a multi-pass approach was to develop a general-purpose compression scheme for databases. Sequitur identifies repeated digrams by identifying repetitions in previously processed data, requiring the entire first rule to be kept in main memory. In general, the majority of a sequitur grammar is contained within the first rule, and our multi-pass approach to reducing memory usage removes this need to maintain the first rule in memory. In ray we use frequency information to construct a grammar and maintain only the rules of the grammar in memory. In contrast, sequitur stores both the input, which is stored as the first rule, and the grammar in memory. Similarly to sequitur, however, the grammar derived using ray enforces the constraints of rule utility and digram uniqueness. We consider the grammar formed by ray to consist of two separate sections: the first rule, and all other rules, which we call the rule set. In addition, our approach to creating rules is different to that of sequitur: rather than selecting rules in a left-to-right single pass through the data, we use a multi-pass scheme that selects digrams to form rules based on their known frequency in the data.

One complete pass of the data in ray has three separate stages: statistics generation and digram selection, rule substitution, and grammar encoding. We describe each stage below.

5.1 Statistics generation and digram selection

Statistics on digram frequency are gathered from the digrams in the first rule. The frequencies are used to determine the digrams that are likely to produce a set of rules that will minimise space requirements.

[Figure 1 is a table showing, for each step, the sequence processed, the grammar, the digram list, and the constraint violated.]

Figure 1: Application of sequitur to the sequence "rubdubrubdub". In this example, four steps are shown. In the first step, the first digram ("ru") is identified and added to the grammar and the unique digram list. After reading six characters ("rubdub"), the digram uniqueness constraint is violated through the identification of a second occurrence of "ub". This causes a new rule, rule 2, to be created, with the two occurrences of "ub" in rule 1 substituted with references to rule 2; additionally, three digrams are formed that include the rule number 2. The third step shown illustrates both the violation of digram uniqueness and, after creating a new rule 4, a violation of the rule utility constraint, where rule 3 is only referenced once (in the body of rule 4); this results in the elimination or "unfolding" of rule 3. The final step shows the complete grammar for the input.

We use a simple heuristic to select the digrams that form rules: we select the digrams that occur most frequently in the data, assuming that these digrams will form the rules that offer the best compression. Two overlapping digrams, that is, consecutive digrams, span three symbols, sharing the middle symbol. Because two overlapping digrams can only be used to form one rule, our high-frequency selection heuristic will not necessarily result in the minimal possible space requirement. In particular, when overlapping digrams have the same frequency we select the left-most digram, an arbitrary choice that may not result in the minimum space requirement on subsequent passes. However, we have found that selecting the digram with the higher frequency works well in practice. The dominant digram of an overlapping pair, that is, the digram with the higher global frequency, is added to a set of candidate digrams that are used to form rules during rule substitution. Statistics are maintained on how frequently each digram was selected.
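The selection stage might be sketched as follows. The digram counting and the dominance rule for overlapping pairs (higher global frequency wins, left-most on ties) follow the description above, but the scan discipline and all names are our own illustrative assumptions rather than details taken from the paper.

    from collections import Counter

    def select_candidates(first_rule, min_freq=2):
        # Count every digram in the first rule, then walk the sequence
        # resolving each overlapping pair in favour of the dominant
        # digram; ties go to the left-most digram.
        symbols = list(first_rule)
        freq = Counter(zip(symbols, symbols[1:]))
        candidates = Counter()
        i = 0
        while i < len(symbols) - 1:
            left = (symbols[i], symbols[i + 1])
            right = (symbols[i + 1], symbols[i + 2]) if i + 2 < len(symbols) else None
            if right is not None and freq[right] > freq[left]:
                dominant, i = right, i + 1        # right digram dominates
            else:
                dominant = left                   # left digram wins ties
            if freq[dominant] >= min_freq:        # a rule must be usable twice
                candidates[dominant] += 1
            i += 2                                # step past the chosen digram
        return candidates

    # select_candidates("rubdubrubdub") -> Counter({('u', 'b'): 4}),
    # matching the first pass shown in Figure 2.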
5.2 Rule substitution

Selection of the digrams to be substituted by rules is similar to the selection of rule-candidate digrams. After gathering frequencies, the data is processed and rules are created. Each consecutive pair of digrams is considered for rule substitution in isolation during processing, based on the frequencies gathered in the first stage. Ray matches digrams in the first rule to rule-candidate digrams. The frequencies of matching rule-candidate digrams are used to enforce the rule utility constraint, ensuring that a rule will be referenced more than once. The frequencies also determine the best rule to substitute when consecutive digrams are both rule-candidate digrams and, as described above, priority is given to the digram with the higher rule-candidate frequency. In the case where both digrams are equally likely to produce the best rule formation, we arbitrarily choose the left-most digram. Ideally, selecting the digram with the higher frequency will allow large repetitive sequences to be identified by ray during later passes.

A new rule is created for rule-candidate digrams that adhere to the digram uniqueness constraint. Similarly to sequitur, the matching digram in the first rule is replaced by a reference to the new rule. When a rule for a candidate digram already exists, the matching digram is substituted with a reference to that existing rule. New rules are appended to any existing rules inferred from previous passes, and are only substituted into the first rule; the first rule is stored separately from the other rules of the grammar.

[Figure 2 is a table showing, for each pass, the digram frequencies in rule 1, the initial grammar, the rule-candidate digrams, and the resulting grammar.]

Figure 2: An example of applying ray to the sequence "rubdubrubdub". In the example, ray selects only the digram "ub" as a rule-candidate digram in the first pass. A new rule, rule 2, is formed and substituted when the digram "ub" is encountered. In the second pass, ray selects two digrams, "r2" and "d2", as rule-candidate digrams, and new rules are formed for each. The third pass identifies "34" as a rule-candidate digram that forms rule 5. Creation of rule 5 causes both rules 3 and 4 to become underused, and they are subsequently unfolded. A final pass leaves the grammar unchanged: there are no rule-candidate digrams and, hence, no new rules can be formed.

Compression is maintained by ensuring that a rule is never underused. If a rule of the grammar violates rule utility, it is unfolded by replacing its single reference with the contents of the rule. Rule utility is only enforced on rules created in the current pass, as rules created during earlier passes already satisfy this constraint. Previously existing rules are only checked for rule utility if a new rule contains a reference to them.

Figure 2 illustrates the application of the ray algorithm to the same input sequence used earlier in the sequitur example, "rubdubrubdub". The first column shows each unique digram in the first rule and its frequency. In the second column the grammar before rule substitution is given. A list of the digrams that can form rules is shown in the third column, and the resulting grammar, including any new rules created, is presented in the last column. Frequencies for normal or rule-candidate digrams immediately follow the digram to which they apply. The first pass identifies "ub" as the only rule-candidate digram, as a result of it being involved in every consecutive digram and having the highest frequency. A new rule, rule 2, is formed and substituted into the first rule, resulting in a grammar consisting of two rules. On the second pass, two new rules are created from the candidate digrams, leaving a grammar containing four rules. Another rule is created on the third pass, which results in the rule utility constraint being violated for both rules 3 and 4; the resulting grammar for this pass shows rule 5 containing the unfolded contents of rules 3 and 4. A final pass does not alter the grammar as no rule-candidate digrams were selected.

A single reference to an underused rule can remain in the first rule: when the second occurrence of a rule-candidate digram, for which a rule has already been created, shares a symbol with another rule-candidate digram whose frequency is greater, the second occurrence does not reference the rule, and the new rule becomes underused. Underused rules referenced in the first rule are removed and unfolded as the data is being encoded.

5.3 Encoding the grammar

At the end of a pass the grammar is encoded, so that compression can optionally be stopped. We use a minimum-redundancy code for all distinct symbols and rule references to allow decompression from any point in the input. Decompression only requires the rule set to be maintained in memory, and we therefore encode the rule set separately from the first rule.
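Since decompression needs only the rule set, expanding any stretch of the first rule is a simple recursive walk over rule references. The sketch below is a minimal illustration; the list-of-symbols grammar representation is our assumption, with the rule set shaped like the final grammar of Figure 2.

    def expand(symbols, rules):
        # Recursively expand rule references; terminal symbols pass
        # through unchanged, so any span of the first rule can be
        # decoded without touching the rest of the data.
        out = []
        for s in symbols:
            if s in rules:
                out.extend(expand(rules[s], rules))
            else:
                out.append(s)
        return out

    # "".join(expand(["5", "5"], {"5": ["r", "2", "d", "2"], "2": ["u", "b"]}))
    # -> "rubdubrubdub"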
The first rule is coded by replacing each symbol with a code. Any reference to an underused rule that is encountered while encoding the first rule is removed from the rule set, and the rule's contents are unfolded into the first rule. Our implementation efficiently codes the rule set by coding an integer value for the number of symbols in the rule, and then storing each symbol in the rule as its assigned code. The Huffman decoding table is also efficiently implemented, using canonical Huffman coding [10]. The distinct codewords needed by the coding table are stored using a parameterised integer coding scheme, Golomb coding [10]. A parameter, b, is determined by calculating a local Bernoulli model, which can be approximated as b ≈ 0.69 × avg(x), where x represents each coded value [10].
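A Golomb coder under this approximation can be sketched as follows. For simplicity the remainder is written in plain fixed-width binary, whereas a full implementation would use the truncated binary code described in [10].

    import math

    def golomb_parameter(values):
        # b ≈ 0.69 × avg(x) for a local Bernoulli model, as in [10].
        return max(1, round(0.69 * sum(values) / len(values)))

    def golomb_encode(x, b):
        # Code a positive integer x: quotient in unary, then the
        # remainder in fixed-width binary (a simplification).
        q, r = divmod(x - 1, b)
        unary = "1" * q + "0"
        if b == 1:
            return unary
        width = math.ceil(math.log2(b))
        return unary + format(r, f"0{width}b")

    # golomb_encode(7, 3) -> "110" + "00": quotient 2 in unary, remainder 0.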

[Figure 3 plots compression (bits per character) against the minimum frequency k for the fixed and variable versions of ray.]

Figure 3: Compression rate for ray on a large text file using two different approaches. The first is a fixed-frequency approach to ray that uses a value, k, as the minimum number of occurrences required before a rule can be created; different values of k were used for this approach. The alternative is a variable approach to ray that alters the value of k after each pass.

5.4 Heuristics for Rule Formation

Nevill-Manning and Witten have shown that inhibiting rule formation until a pre-determined number of occurrences of a digram has been inspected reduces the number of rules created and, in many cases, improves the compression achieved [7]. We have implemented a similar approach, only creating rules when a pre-determined frequency is reached, thereby reducing the amount of memory consumed by ray. We also investigated inhibiting rule formation by only substituting repeated sequences when the cost of storing a new rule was less than the current cost of storing the existing repetitions; this calculation was based on statistics gathered in the previous pass through the data, and we found that it results in worse compression than the simple fixed-frequency approach.

An alternative method that we experimented with varies the minimum number of occurrences, k, needed to form a rule on each pass through the data. We tried a simple approach that varies k according to the average frequency of distinct digrams. To satisfy the constraints of sequitur, the average needs to be greater than one. Rules are created for rule-candidate digrams whose frequency is greater than the average frequency, with the goal of removing around half of the repetitions on each pass. Figure 3 shows the difference in compression performance between the variable and fixed-frequency implementations of ray on the smalltrec text file, with different values of k for the fixed-frequency approach. Our simple method of varying the number of occurrences on each new pass achieved 2% better compression than the best of the fixed-frequency methods.

6 Results

In order to compare the compression performance of ray to that of other compression schemes, we constructed three test collections of text and weather data. The smallest file in our test collection, smalltrec, is 2.86 Mb of text data taken from the TREC collection; TREC is an ongoing international collaborative experiment in information retrieval sponsored by NIST and ARPA [4]. The weather data (weather) contains 20,175 records collected from each of 5 weather stations, where each station record contains 4 sets of 22 measurements (such as temperatures, elevations, rainfall, and humidity); in total, weather is 38.2 Mb in size. Comact, the largest file in the test collection, is text data taken from the Australian Commonwealth Acts. We have compared the compression performance of ray to that of the well-known adaptive compression schemes gzip and compress, and to huffword, a semi-static compression program that uses canonical Huffman coding and an efficient ternary trie [2, 3]. Because of the large files in our test collections, we were unable to experiment with sequitur.
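For reference, the bits-per-character figures reported in Table 1 can be computed from file sizes as in this small helper, assuming both sizes are measured in bytes:

    def bits_per_char(original_bytes, compressed_bytes):
        # Bits of compressed output per character (byte) of input;
        # e.g. compressing 2.86 Mb to about 0.9 Mb gives roughly 2.5 bpc.
        return 8 * compressed_bytes / original_bytes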

[Table 1 lists, for each of the comact, weather, and smalltrec collections, the compression achieved (bpc) and the decompression time (seconds) for ray, huffword, gzip, and compress.]

Table 1: Compression performance of various compression schemes on the test collection. The compression performance is represented as bits per character (bpc), and the time to decompress the files is given in seconds.

Table 1 shows the compression performance and the decompression speed of each scheme on the test collections. The compression results are presented in bits per character (bpc), and the decompression time in seconds. Ray achieves better compression than all of the other compression schemes on the test collections, except for huffword on smalltrec. For the largest file, comact, ray achieves almost 10% better compression than the other semi-static scheme, huffword, which is specific to text compression. However, our current implementation of ray is between 1.2 and 2.2 times as slow as huffword in decompressing data, although decompression time remains practical.

Running sequitur on comact would consume around 385 Mb of main memory, given that there are about 7 times fewer symbols in the hierarchy than in the input and that each symbol in the hierarchy consumes 20 bytes; we were unable to measure this in practice due to hardware limitations, and the calculation is based on an estimation technique for sequitur described by Nevill-Manning et al. [7]. In contrast, our preliminary version of ray uses approximately 225 Mb of main memory to compress the same file; we believe that a production implementation of ray, using more efficient memory structures, will use less than 200 Mb of main memory on the comact collection. Our current preliminary implementation of ray is slow during compression, due mainly to its multi-pass nature. However, as we mentioned earlier, speed is less important when compressing largely static databases.

7 Conclusion

We have described a practical compression scheme for databases. The scheme has reasonable decompression speed and achieves excellent compression. Moreover, our scheme allows random access to data and is not restricted to databases of text. Ray works well as a candidate compression scheme for general-purpose databases, and consumes less memory than sequitur requires. We believe that our preliminary implementation can be improved further to reduce memory requirements during both compression and decompression, with a likely improvement in compression times. Moreover, other improvements are possible in the techniques used to store and code both rules and data, and these are likely to result in further efficiency gains.

Acknowledgments

We thank Andrew Turpin from The University of Melbourne for valuable discussions and his implementation of huffword. This work was supported by the Australian Research Council and the Multimedia Database Systems group at RMIT University.

References

[1] T.C. Bell, A. Moffat, C.G. Nevill-Manning, I.H. Witten and J. Zobel. Data compression in full-text retrieval systems. Journal of the American Society for Information Science, Volume 44, Number 9, pages 508-531, October 1993.

[2] J.L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proc. of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 360-369, New Orleans, Louisiana, 5-7 January 1997.

[3] J. Clement, P. Flajolet and B. Vallee. The analysis of hybrid trie structures. In Proc. of the Ninth ACM-SIAM Symposium on Discrete Algorithms, pages 531-539, San Francisco, California, January 1998.

[4] D. Harman.
Overview of the second text retrieval conference (TREC-2). Information Processing & Management, Volume 31, Number 3, pages 271-289, 1995.

[5] D.A. Lelewer and D.S. Hirschberg. Data compression. Computing Surveys, Volume 19, Number 3, pages 261-296, September 1987.

[6] A. Moffat, J. Zobel and N. Sharman. Text compression for dynamic document databases. IEEE Transactions on Knowledge and Data Engineering, Volume 9, Number 2, pages 302-313, 1997.

[7] C.G. Nevill-Manning and I.H. Witten. Phrase hierarchy inference and compression in bounded space. In J.A. Storer and M. Cohn (editors), Proc. IEEE Data Compression Conference, pages 179-188, Snowbird, Utah, March 1998. IEEE Computer Society Press, Los Alamitos, California.

[8] C.G. Nevill-Manning, I.H. Witten and D.L. Maulsby. Compression by induction of hierarchical grammars. In J.A. Storer and M. Cohn (editors), Proc. IEEE Data Compression Conference, pages 244-253, Snowbird, Utah, March 1994. IEEE Computer Society Press, Los Alamitos, California.

[9] C.G. Nevill-Manning, I.H. Witten and D.R. Olsen, Jr. Compressing semi-structured text using hierarchical phrase identification. In J.A. Storer and M. Cohn (editors), Proc. IEEE Data Compression Conference, pages 63-72, Snowbird, Utah, April 1996. IEEE Computer Society Press, Los Alamitos, California.

[10] I.H. Witten, A. Moffat and T.C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York, 1994.

[11] J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software Practice and Experience, Volume 25, Number 8, pages 891-903, August 1995.


More information

Submitted for TAU97 Abstract Many attempts have been made to combine some form of retiming with combinational

Submitted for TAU97 Abstract Many attempts have been made to combine some form of retiming with combinational Experiments in the Iterative Application of Resynthesis and Retiming Soha Hassoun and Carl Ebeling Department of Computer Science and Engineering University ofwashington, Seattle, WA fsoha,ebelingg@cs.washington.edu

More information

Efficient Trie-Based Sorting of Large Sets of Strings

Efficient Trie-Based Sorting of Large Sets of Strings Efficient Trie-Based Sorting of Large Sets of Strings Ranjan Sinha Justin Zobel School of Computer Science and Information Technology RMIT University, GPO Box 2476V, Melbourne 3001, Australia {rsinha,jz}@cs.rmit.edu.au

More information

Job Re-Packing for Enhancing the Performance of Gang Scheduling

Job Re-Packing for Enhancing the Performance of Gang Scheduling Job Re-Packing for Enhancing the Performance of Gang Scheduling B. B. Zhou 1, R. P. Brent 2, C. W. Johnson 3, and D. Walsh 3 1 Computer Sciences Laboratory, Australian National University, Canberra, ACT

More information

Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines

Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines B. B. Zhou, R. P. Brent and A. Tridgell Computer Sciences Laboratory The Australian National University Canberra,

More information

Issues in Using Knowledge to Perform Similarity Searching in Multimedia Databases without Redundant Data Objects

Issues in Using Knowledge to Perform Similarity Searching in Multimedia Databases without Redundant Data Objects Issues in Using Knowledge to Perform Similarity Searching in Multimedia Databases without Redundant Data Objects Leonard Brown The University of Oklahoma School of Computer Science 200 Felgar St. EL #114

More information

A technique for adding range restrictions to. August 30, Abstract. In a generalized searching problem, a set S of n colored geometric objects

A technique for adding range restrictions to. August 30, Abstract. In a generalized searching problem, a set S of n colored geometric objects A technique for adding range restrictions to generalized searching problems Prosenjit Gupta Ravi Janardan y Michiel Smid z August 30, 1996 Abstract In a generalized searching problem, a set S of n colored

More information

Progressive Compression for Lossless Transmission of Triangle Meshes in Network Applications

Progressive Compression for Lossless Transmission of Triangle Meshes in Network Applications Progressive Compression for Lossless Transmission of Triangle Meshes in Network Applications Timotej Globačnik * Institute of Computer Graphics Laboratory for Geometric Modelling and Multimedia Algorithms

More information

Kalev Kask and Rina Dechter. Department of Information and Computer Science. University of California, Irvine, CA

Kalev Kask and Rina Dechter. Department of Information and Computer Science. University of California, Irvine, CA GSAT and Local Consistency 3 Kalev Kask and Rina Dechter Department of Information and Computer Science University of California, Irvine, CA 92717-3425 fkkask,dechterg@ics.uci.edu Abstract It has been

More information

Huffman Code Application. Lecture7: Huffman Code. A simple application of Huffman coding of image compression which would be :

Huffman Code Application. Lecture7: Huffman Code. A simple application of Huffman coding of image compression which would be : Lecture7: Huffman Code Lossless Image Compression Huffman Code Application A simple application of Huffman coding of image compression which would be : Generation of a Huffman code for the set of values

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 11 Coding Strategies and Introduction to Huffman Coding The Fundamental

More information

TREC-7 Experiments at the University of Maryland Douglas W. Oard Digital Library Research Group College of Library and Information Services University

TREC-7 Experiments at the University of Maryland Douglas W. Oard Digital Library Research Group College of Library and Information Services University TREC-7 Experiments at the University of Maryland Douglas W. Oard Digital Library Research Group College of Library and Information Services University of Maryland, College Park, MD 20742 oard@glue.umd.edu

More information

Research Article Does an Arithmetic Coding Followed by Run-length Coding Enhance the Compression Ratio?

Research Article Does an Arithmetic Coding Followed by Run-length Coding Enhance the Compression Ratio? Research Journal of Applied Sciences, Engineering and Technology 10(7): 736-741, 2015 DOI:10.19026/rjaset.10.2425 ISSN: 2040-7459; e-issn: 2040-7467 2015 Maxwell Scientific Publication Corp. Submitted:

More information

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

More information

Intro. To Multimedia Engineering Lossless Compression

Intro. To Multimedia Engineering Lossless Compression Intro. To Multimedia Engineering Lossless Compression Kyoungro Yoon yoonk@konkuk.ac.kr 1/43 Contents Introduction Basics of Information Theory Run-Length Coding Variable-Length Coding (VLC) Dictionary-based

More information

Category: Informational May DEFLATE Compressed Data Format Specification version 1.3

Category: Informational May DEFLATE Compressed Data Format Specification version 1.3 Network Working Group P. Deutsch Request for Comments: 1951 Aladdin Enterprises Category: Informational May 1996 DEFLATE Compressed Data Format Specification version 1.3 Status of This Memo This memo provides

More information

Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries

Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries Ranjan Sinha Justin Zobel Abstract Ongoing changes in computer performance are affecting the efficiency of string sorting algorithms.

More information

Creating Meaningful Training Data for Dicult Job Shop Scheduling Instances for Ordinal Regression

Creating Meaningful Training Data for Dicult Job Shop Scheduling Instances for Ordinal Regression Creating Meaningful Training Data for Dicult Job Shop Scheduling Instances for Ordinal Regression Helga Ingimundardóttir University of Iceland March 28 th, 2012 Outline Introduction Job Shop Scheduling

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

TR-CS The rsync algorithm. Andrew Tridgell and Paul Mackerras. June 1996

TR-CS The rsync algorithm. Andrew Tridgell and Paul Mackerras. June 1996 TR-CS-96-05 The rsync algorithm Andrew Tridgell and Paul Mackerras June 1996 Joint Computer Science Technical Report Series Department of Computer Science Faculty of Engineering and Information Technology

More information

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Lecture 10 (Chapter 7) ZHU Yongxin, Winson

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Lecture 10 (Chapter 7) ZHU Yongxin, Winson Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Lecture 10 (Chapter 7) ZHU Yongxin, Winson zhuyongxin@sjtu.edu.cn 2 Lossless Compression Algorithms 7.1 Introduction 7.2 Basics of Information

More information

IMAGE COMPRESSION TECHNIQUES

IMAGE COMPRESSION TECHNIQUES IMAGE COMPRESSION TECHNIQUES A.VASANTHAKUMARI, M.Sc., M.Phil., ASSISTANT PROFESSOR OF COMPUTER SCIENCE, JOSEPH ARTS AND SCIENCE COLLEGE, TIRUNAVALUR, VILLUPURAM (DT), TAMIL NADU, INDIA ABSTRACT A picture

More information

8 Integer encoding. scritto da: Tiziano De Matteis

8 Integer encoding. scritto da: Tiziano De Matteis 8 Integer encoding scritto da: Tiziano De Matteis 8.1 Unary code... 8-2 8.2 Elias codes: γ andδ... 8-2 8.3 Rice code... 8-3 8.4 Interpolative coding... 8-4 8.5 Variable-byte codes and (s,c)-dense codes...

More information

Compressing and Decoding Term Statistics Time Series

Compressing and Decoding Term Statistics Time Series Compressing and Decoding Term Statistics Time Series Jinfeng Rao 1,XingNiu 1,andJimmyLin 2(B) 1 University of Maryland, College Park, USA {jinfeng,xingniu}@cs.umd.edu 2 University of Waterloo, Waterloo,

More information

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Inverted Indexes Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Basic Concepts Inverted index: a word-oriented mechanism for indexing a text collection to speed up the

More information