A General-Purpose Compression Scheme for Databases

Adam Cannane, Hugh E. Williams, Justin Zobel
Department of Computer Science, RMIT University, GPO Box 2476V, Melbourne 3001, Australia

Abstract

Current adaptive compression schemes such as gzip and compress are impractical for database compression as they do not allow random access to individual records. The sequitur scheme of Nevill-Manning and Witten also adaptively compresses data, achieving excellent compression but with significant main-memory requirements. A preliminary version of sequitur used a semi-static modeling approach to achieve slightly worse compression than the adaptive approach. We describe a new variant of the semi-static sequitur algorithm, ray, that reduces main-memory use and is a candidate for general-purpose compression and random access to databases. We show that ray achieves better compression than an efficient Huffman scheme and popular adaptive compression techniques.

Keywords: database compression, semi-static modeling

Proceedings of the 1998 Computer Science Postgraduate Students Conference, Royal Melbourne Institute of Technology, Melbourne, Australia, December 8, 1998.

1 Introduction

A general-purpose database is a collection of data, stored as text, images, sound, or binary information such as numbers. Users interact with a database by posing queries to retrieve individual records or sets of records within the collection. General-purpose databases place continuing demands on disk space as more data is stored online. An effective way of reducing the disk space occupied by a database is to apply a compression algorithm to the data.

One aim of compression is to reduce storage requirements [5]. For text databases, however, compression schemes can also allow retrieval of data to be faster than with uncompressed data, since the computational cost of decompression can be offset by reductions in disk seeking and transfer costs [6, 11]. Popular compression algorithms, such as gzip and compress, significantly reduce the storage space required by general-purpose database systems. However, these algorithms use adaptive techniques to compress the data. Adaptive compression schemes are impractical for use in database systems as they cannot both achieve good compression ratios and allow random access to individual records. Moreover, at present there is no efficient compression algorithm for general-purpose data held in database systems.

A compression algorithm for general-purpose database systems must address the problem of randomly accessing and individually decompressing records, while maintaining compact storage of data. The algorithm must also use a lossless compression technique, since data such as English text is stored in general-purpose database systems. Importantly, as users expect fast performance in response to queries, decompression must also be fast.

In this paper we describe an alternative compression technique to adaptive modeling that allows random access to data and atomic decompression of records. This semi-static modeling approach, which we call ray, is a variation of the sequitur algorithm of Nevill-Manning et al. [7, 8, 9]. Ray models repetition in sequences by progressively constructing a hierarchical grammar with multiple passes through the data. In contrast, the sequitur algorithm alters the grammar as each new symbol is sequentially inspected and, after a single pass of the data, codes the resultant interleaved grammar and data using an adaptive scheme.
The multiple-pass approach of ray uses statistics on character-pair repetition, that is, digram frequencies, to create rules in the grammar. With each pass through the data, the depth of the rule hierarchy is increased, higher-level redundancy is detected, and rules are substituted for candidate digrams to achieve further compression. After each pass, the current grammar is encoded and compression can optionally be stopped. In our experiments we have found that ray has practical main-memory requirements, and we believe a production implementation will improve this further. While our preliminary implementation is not especially fast, the multi-pass approach permits reductions in compression time, at the cost of affecting compression performance, by limiting the number of passes.

Each pass of the data improves the overall compression achieved and, in almost all cases, the compression achieved is better than that of the adaptive methods gzip and compress, and better than that of an efficiently implemented Huffman coding scheme.

This paper is organised as follows. Modeling data for compression is introduced in Section 2. General-purpose database compression is discussed in Section 3. The sequitur algorithm is introduced in Section 4. In Section 5 we describe our ray compression algorithm. We present our experimental results in Section 6 and our conclusions in Section 7.

2 Modeling Data for Compression

Modeling involves generating a representation of the distinct symbols in the data. A model stores information on how often each symbol occurred in the data, as a probability. It is used by an encoder to construct a code for each symbol, based on the probabilities. The same model is then used by the decoder to reproduce the original data from the compressed symbols.

There are three main types of models: static models, which remain the same for each symbol of the sequence; adaptive or dynamic models, which alter the model after each symbol is inspected; and semi-static models, which change during encoding but remain static during decoding.

Static models make assumptions about data that has not yet been seen; they are built by considering the average probability distribution derived from previously gathered statistics on similar data. Compression becomes inefficient when the model provides a probability distribution that is a poor approximation of each symbol's frequency in the data. A static model performs poorly when the data being compressed is dissimilar to the data used to first produce the model, and is unacceptable for general-purpose compression as it cannot approximate symbol probabilities accurately for the different types of data.

Adaptive models address the problem of poor probability distributions by recalculating the distribution after each new symbol of the data is inspected, taking advantage of the local properties of the data. Since the probability distribution for an adaptive model alters after each symbol, all preceding symbols need to be decoded to determine the model for any symbol in the sequence. A limitation of adaptive models is therefore that, to reconstruct the original sequence, the encoded symbols must be decoded from the beginning.

A compromise approach to modeling data is to use a semi-static model. A semi-static model requires two passes over the data: an initial pass to gather the statistics necessary to build the model, and a second pass to encode the symbols according to the model created during the first pass. The characteristics of semi-static modeling combine the benefits of adaptive and static models. A semi-static model makes good use of specific properties of the data while remaining static during decoding, allowing independent decompression. The disadvantage of a semi-static model is that two passes of the data are required and the model parameters need to be stored with the compressed data [6].
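To make the two-pass approach concrete, the following minimal sketch gathers symbol statistics over a whole collection and then encodes each record against the resulting fixed model. The helper names are hypothetical, and the fixed-width coder is only a placeholder for a minimum-redundancy code; this is an illustration, not the implementation of any scheme discussed in this paper.

    from collections import Counter

    def build_model(records):
        # Pass 1: gather global symbol statistics for the collection.
        freqs = Counter()
        for record in records:
            freqs.update(record)
        return freqs

    def assign_code(freqs):
        # Placeholder coder: a real system would derive a minimum-
        # redundancy (Huffman) code from freqs; fixed-width binary
        # keeps the sketch self-contained.
        width = max(1, (len(freqs) - 1).bit_length())
        return {s: format(i, f"0{width}b")
                for i, (s, _) in enumerate(freqs.most_common())}

    def encode_record(record, code):
        # Pass 2: the model is now fixed, so each record is encoded,
        # and can later be decoded, independently of all other records.
        return "".join(code[s] for s in record)

Because the model never changes after the first pass, any record can be decompressed in isolation, which is exactly the property adaptive schemes lack.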
3 General-purpose database compression

A database requires a compression scheme that allows efficient retrieval of the compressed data, as well as producing a saving in storage costs. A scheme that achieves optimal compression, yet is unreasonably slow during decompression, is not practical. Moreover, a lossless approach is required in applications where loss of the original data is unacceptable. We therefore restrict ourselves to lossless compression schemes for general-purpose database compression, as we are interested in investigating universal schemes that can be used for all data types.

Modern compression techniques are generally adaptive, which avoids the transmission of a model and allows data to be compressed in a single pass. Databases, however, are divided into records, documents, or other easily segmented components, and it is necessary that these components can be decompressed independently. In general, adaptive techniques are not effective for database applications, since they code data as a function of both the preceding symbols and the initial probability distribution [6]. Similarly, an adaptive code, such as arithmetic coding, is not practical for database compression as it is too slow [1]. An adaptive code would have to encode each record individually to maintain independent decompressibility, thereby limiting the compression achieved. To allow atomic decompression, as well as fast access to data, a practical approach is therefore to have a single model for the entire database.

Semi-static models use a single model for the entire database, allowing random access and fast decompression. However, semi-static modeling approaches require two passes over the data. This does not disadvantage such compression techniques for largely static databases, since the resources required for compression are less important, provided decompression and retrieval remain fast. A disadvantage, however, is that a semi-static scheme requires that the model be stored with the database. Semi-static approaches, such as efficient implementations of canonical Huffman coding using ternary trie structures [2, 3], permit fast random access to large databases of text.
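The canonical assignment used by such coders can be sketched as follows: given each symbol's code length (as produced by a standard Huffman construction over the gathered frequencies), codewords are assigned in sequence, shortest lengths first, so that the decoding table can be rebuilt from the lengths alone. This is an illustrative sketch under that assumption, not the huffword implementation.

    def canonical_codes(lengths):
        # lengths maps symbol -> code length in bits; the lengths must
        # satisfy the Kraft inequality (as Huffman lengths do).
        codes, code, prev_len = {}, 0, 0
        for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
            code <<= length - prev_len      # extend to the next code length
            codes[sym] = format(code, f"0{length}b")
            code += 1
            prev_len = length
        return codes

    # canonical_codes({"a": 1, "b": 2, "c": 2}) -> {"a": "0", "b": "10", "c": "11"}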

However, there are few techniques that work well for general-purpose data and use semi-static modeling. A recent compression scheme, sequitur [9], uses semi-static modeling and is likely to be a good candidate for adaptation to general-purpose compression of databases. The sequitur method identifies structure within the data, which is ideal for databases, as most are structured. We describe sequitur in the next section.

4 The Sequitur Algorithm

Sequitur [7, 8, 9] forms a grammar from a sequence, identifying repeated phrases in the input data. It has been shown that the detection of repeated phrases performs well as a compression scheme [9], and sequitur is likely to be adaptable as a compression scheme for databases since, first, it uses semi-static modeling, so individual parts of the database can be decoded independently and, second, it has been shown to achieve good compression for large collections. It is also likely that a special-purpose implementation of sequitur would offer both fast compression and reasonable main-memory requirements.

Compression with sequitur is achieved by removing repetitions, where repetitions are identical non-overlapping subsequences of the original input. Smaller repetitions often occur within matching subsequences; the smallest possible repetition detected by sequitur is two consecutive symbols, or a digram. By storing repeated digrams at the base of a hierarchy, a hierarchical structure is formed that can be used to identify longer repeated subsequences. This hierarchy for a sequence is represented as a grammar. Sequitur is a dictionary-based scheme, with each dictionary entry corresponding to a rule from the grammar. The dictionary is adaptive during the first pass over the data, as the model alters as sequitur sequentially inspects each symbol. Overlapping digrams are inspected by sequitur, where consecutive input characters are treated as a single symbol. Two simple constraints are enforced:

1. Digram uniqueness: no digram may appear in the grammar more than once.

2. Rule utility: a rule in the grammar must be referred to at least twice.

Digram uniqueness identifies repetition of two symbols by creating a dictionary entry that is referenced by both occurrences of the digram. Rule utility permits the identification of repetitions that are longer than two symbols, by eliminating unnecessary rules.

An example of the application of sequitur to the input string "rubdubrubdub" is shown in Figure 1. In this example, four of the eleven steps in the processing of the input are shown. The first step, the reading of the input "ru", creates the first unique digram and the first element of the grammar, rule 1. The second step shows processing after the reading of "rubdub" and the identification of four unique digrams; this step violates the digram uniqueness constraint through the addition of a second occurrence of "ub" and thereby creates rule 2. The third step shows similar processing after a violation of digram uniqueness, but also the elimination or unfolding of a rule through the violation of the rule utility constraint. The final step shown is the complete sequitur grammar and unique digram list for the input string; the result is the identification and coding of the two repeated subsequences "rubdub".
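As a rough illustration of how digram uniqueness drives grammar construction, the following naive sketch repeatedly extracts any digram that occurs twice without overlapping. Unlike the real, incremental sequitur algorithm it rescans the sequence on every substitution and does not enforce rule utility; on the example input it therefore keeps rules 3 and 4, which Figure 1's final grammar unfolds.

    def naive_digram_uniqueness(seq):
        # seq is a list of symbols; returns the final first rule and
        # the rule set. Rule names are the strings "2", "3", ... and
        # must not collide with input symbols.
        rules, next_rule = {}, 2
        changed = True
        while changed:
            changed = False
            seen = {}
            for i in range(len(seq) - 1):
                d = (seq[i], seq[i + 1])
                if d in seen and seen[d] < i - 1:     # non-overlapping repeat
                    name = str(next_rule)
                    next_rule += 1
                    rules[name] = list(d)
                    out, j = [], 0
                    while j < len(seq):               # substitute every occurrence
                        if tuple(seq[j:j + 2]) == d:
                            out.append(name)
                            j += 2
                        else:
                            out.append(seq[j])
                            j += 1
                    seq, changed = out, True
                    break
                seen.setdefault(d, i)
        return seq, rules

    # naive_digram_uniqueness(list("rubdubrubdub"))
    # -> (['5', '5'],
    #     {'2': ['u', 'b'], '3': ['r', '2'], '4': ['3', 'd'], '5': ['4', '2']})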
5 Ray

We describe in this section a new approach to compression for general-purpose databases that is based on the sequitur approach. This new algorithm, which we call ray, is a multi-pass adaptation of sequitur.

Our motivation in proposing a multi-pass approach was to develop a general-purpose compression scheme for databases. Sequitur identifies repeated digrams by identifying repetitions in previously processed data, requiring the entire first rule to be kept in main memory. In general, the majority of a sequitur grammar is contained within the first rule, and our multi-pass approach to reducing memory usage removes this need to maintain the first rule in memory. In ray we use frequency information to construct a grammar and maintain only the rules of the grammar in memory. In contrast, sequitur stores both the input, which is stored as the first rule, and the grammar in memory. Similarly to sequitur, however, the grammar derived using ray enforces the constraints of rule utility and digram uniqueness. We consider the grammar formed by ray to consist of two separate sections: the first rule, and all other rules, which we call the rule set. In addition, our approach to creating rules is different to that of sequitur: rather than selecting rules in a left-to-right single pass through the data, we use a multi-pass scheme that selects digrams to form rules based on their known frequency in the data.

One complete pass of the data in ray has three separate stages: statistics generation and digram selection, rule substitution, and grammar encoding. We describe each stage below.

5.1 Statistics generation and digram selection

Statistics on digram frequency are gathered from the digrams in the first rule. The frequencies are used to determine the digrams that are likely to produce a set of rules that will minimise space requirements.

[Figure 1 is a table showing, for each step, the sequence processed, the grammar, the digram list, and the constraint violated.]

Figure 1: Application of sequitur to the sequence "rubdubrubdub". In this example, four steps are shown. In the first step, the first digram ("ru") is identified and added to the grammar and the unique digram list. After reading six characters ("rubdub"), the digram uniqueness constraint is violated through the identification of a second occurrence of "ub". This causes a new rule, rule 2, to be created, with the two occurrences of "ub" in rule 1 substituted with references to rule 2; additionally, three digrams are formed that include the rule number 2. The third step shown illustrates both the violation of digram uniqueness and, after creating a new rule 4, a violation of the rule utility constraint, where rule 3 is only referenced once (in the body of rule 4); this results in the elimination or "unfolding" of rule 3. The final step shows the complete grammar for the input.

We use a simple heuristic to select the digrams that form rules: we select the digrams that occur most frequently in the data, assuming that these digrams will form the rules that offer the best compression. Two overlapping digrams, that is, consecutive digrams, span three symbols, sharing the middle symbol. Because two overlapping digrams can only be used to form one rule, our high-frequency selection heuristic will not necessarily result in the minimal possible space requirement. In particular, when overlapping digrams have the same frequency we select the left-most digram, an arbitrary choice that may not result in the minimum space requirement on subsequent passes. However, we have found that selecting the digram with the higher frequency works well in practice. The dominant digram of an overlapping pair, that is, the digram with the higher global frequency, is added to a set of candidate digrams that are used to form rules during rule substitution. Statistics are maintained on how frequently each digram was selected.
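The selection stage might be sketched as follows. The digram counting and the dominance rule for overlapping pairs (higher global frequency wins, left-most on ties) follow the description above, but the scan discipline and all names are our own illustrative assumptions rather than details taken from the paper.

    from collections import Counter

    def select_candidates(first_rule, min_freq=2):
        # Count every digram in the first rule, then walk the sequence
        # resolving each overlapping pair in favour of the dominant
        # digram; ties go to the left-most digram.
        symbols = list(first_rule)
        freq = Counter(zip(symbols, symbols[1:]))
        candidates = Counter()
        i = 0
        while i < len(symbols) - 1:
            left = (symbols[i], symbols[i + 1])
            right = (symbols[i + 1], symbols[i + 2]) if i + 2 < len(symbols) else None
            if right is not None and freq[right] > freq[left]:
                dominant, i = right, i + 1        # right digram dominates
            else:
                dominant = left                   # left digram wins ties
            if freq[dominant] >= min_freq:        # a rule must be usable twice
                candidates[dominant] += 1
            i += 2                                # step past the chosen digram
        return candidates

    # select_candidates("rubdubrubdub") -> Counter({('u', 'b'): 4}),
    # matching the first pass shown in Figure 2.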
5.2 Rule substitution

Selection of the digrams to be substituted by rules is similar to the selection of rule-candidate digrams. After gathering frequencies, the data is processed and rules are created. Each consecutive pair of digrams is considered for rule substitution in isolation during processing, based on the frequencies gathered in the first stage. Ray matches digrams in the first rule to rule-candidate digrams. The frequencies of matching rule-candidate digrams are used to enforce the rule utility constraint, ensuring that a rule will be referenced more than once. The frequencies also determine the best rule to substitute when consecutive digrams are both rule-candidate digrams and, as described above, priority is given to the digram with the higher rule-candidate frequency. In the case where both digrams are equally likely to produce the best rule formation, we arbitrarily choose the left-most digram. Ideally, selecting the digram with the higher frequency will allow large repetitive sequences to be identified by ray during later passes.

A new rule is created for rule-candidate digrams that adhere to the digram uniqueness constraint. Similarly to sequitur, the matching digram in the first rule is replaced by a reference to the new rule. When a rule for a candidate digram already exists, the matching digram is substituted with a reference to that existing rule. New rules are appended to any existing rules inferred from previous passes, and are only substituted into the first rule; the first rule is stored separately from the other rules of the grammar.

[Figure 2 is a table showing, for each pass, the digram frequencies in rule 1, the initial grammar, the rule-candidate digrams, and the resulting grammar.]

Figure 2: An example of applying ray to the sequence "rubdubrubdub". In the example, ray selects only the digram "ub" as a rule-candidate digram in the first pass. A new rule, rule 2, is formed and substituted when the digram "ub" is encountered. In the second pass, ray selects two digrams, "r2" and "d2", as rule-candidate digrams, and new rules are formed for each. The third pass identifies "34" as a rule-candidate digram that forms rule 5. Creation of rule 5 causes both rules 3 and 4 to become underused, and they are subsequently unfolded. A final pass leaves the grammar unchanged: there are no rule-candidate digrams and, hence, no new rules can be formed.

Compression is maintained by ensuring that a rule is never underused. If a rule of the grammar violates rule utility, it is unfolded by replacing its single reference with the contents of the rule. Rule utility is only enforced on rules created in the current pass, as rules created during earlier passes already satisfy this constraint. Previously existing rules are only checked for rule utility if a new rule contains a reference to them.

Figure 2 illustrates the application of the ray algorithm to the same input sequence used earlier in the sequitur example, "rubdubrubdub". The first column shows each unique digram in the first rule and its frequency. In the second column the grammar before rule substitution is given. A list of the digrams that can form rules is shown in the third column, and the resulting grammar, including any new rules created, is presented in the last column. Frequencies for normal or rule-candidate digrams immediately follow the digram to which they apply. The first pass identifies "ub" as the only rule-candidate digram, as a result of it being involved in every consecutive digram and having the highest frequency. A new rule, rule 2, is formed and substituted into the first rule, resulting in a grammar consisting of two rules. On the second pass, two new rules are created from the candidate digrams, leaving a grammar containing four rules. Another rule is created on the third pass, which results in the rule utility constraint being violated for both rules 3 and 4; the resulting grammar for this pass shows rule 5 containing the unfolded contents of rules 3 and 4. A final pass does not alter the grammar as no rule-candidate digrams were selected.

A single reference to an underused rule can remain in the first rule: when the second occurrence of a rule-candidate digram, for which a rule has already been created, shares a symbol with another rule-candidate digram whose frequency is greater, the second occurrence does not reference the rule, and the new rule becomes underused. Underused rules referenced in the first rule are removed and unfolded as the data is being encoded.

5.3 Encoding the grammar

At the end of a pass the grammar is encoded, so that compression can optionally be stopped. We use a minimum-redundancy code for all distinct symbols and rule references to allow decompression from any point in the input. Decompression only requires the rule set to be maintained in memory, and we therefore encode the rule set separately from the first rule.
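Since decompression needs only the rule set, expanding any stretch of the first rule is a simple recursive walk over rule references. The sketch below is a minimal illustration; the list-of-symbols grammar representation is our assumption, with the rule set shaped like the final grammar of Figure 2.

    def expand(symbols, rules):
        # Recursively expand rule references; terminal symbols pass
        # through unchanged, so any span of the first rule can be
        # decoded without touching the rest of the data.
        out = []
        for s in symbols:
            if s in rules:
                out.extend(expand(rules[s], rules))
            else:
                out.append(s)
        return out

    # "".join(expand(["5", "5"], {"5": ["r", "2", "d", "2"], "2": ["u", "b"]}))
    # -> "rubdubrubdub"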
The first rule is coded by replacing each symbol with a code. Any reference to an underused rule that is encountered while encoding the first rule is removed from the rule set, and the rule's contents are unfolded into the first rule. Our implementation efficiently codes the rule set by coding an integer value for the number of symbols in the rule, and then storing each symbol in the rule as its assigned code. The Huffman decoding table is also efficiently implemented, using canonical Huffman coding [10]. The distinct codewords needed by the coding table are stored using a parameterised integer coding scheme, Golomb coding [10]. A parameter, b, is determined by calculating a local Bernoulli model, which can be approximated as b ≈ 0.69 × avg(x), where x represents each coded value [10].
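A Golomb coder under this approximation can be sketched as follows. For simplicity the remainder is written in plain fixed-width binary, whereas a full implementation would use the truncated binary code described in [10].

    import math

    def golomb_parameter(values):
        # b ≈ 0.69 × avg(x) for a local Bernoulli model, as in [10].
        return max(1, round(0.69 * sum(values) / len(values)))

    def golomb_encode(x, b):
        # Code a positive integer x: quotient in unary, then the
        # remainder in fixed-width binary (a simplification).
        q, r = divmod(x - 1, b)
        unary = "1" * q + "0"
        if b == 1:
            return unary
        width = math.ceil(math.log2(b))
        return unary + format(r, f"0{width}b")

    # golomb_encode(7, 3) -> "110" + "00": quotient 2 in unary, remainder 0.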

[Figure 3 plots compression (bits per character) against the minimum frequency k for the fixed and variable versions of ray.]

Figure 3: Compression rate for ray on a large text file using two different approaches. The first is a fixed-frequency approach to ray that uses a value, k, as the minimum number of occurrences required before a rule can be created; different values of k were used for this approach. The alternative is a variable approach to ray that alters the value of k after each pass.

5.4 Heuristics for Rule Formation

Nevill-Manning and Witten have shown that inhibiting rule formation until a pre-determined number of occurrences of a digram has been inspected reduces the number of rules created and, in many cases, improves the compression achieved [7]. We have implemented a similar approach, only creating rules when a pre-determined frequency is reached, thereby reducing the amount of memory consumed by ray. We also investigated inhibiting rule formation by only substituting repeated sequences when the cost of storing a new rule was less than the current cost of storing the existing repetitions; this calculation was based on statistics gathered in the previous pass through the data, and we found that it results in worse compression than the simple fixed-frequency approach.

An alternative method that we experimented with varies the minimum number of occurrences, k, needed to form a rule on each pass through the data. We tried a simple approach that varies k according to the average frequency of distinct digrams. To satisfy the constraints of sequitur, the average needs to be greater than one. Rules are created for rule-candidate digrams whose frequency is greater than the average frequency, with the goal of removing around half of the repetitions on each pass. Figure 3 shows the difference in compression performance between the variable and fixed-frequency implementations of ray on the smalltrec text file, with different values of k for the fixed-frequency approach. Our simple method of varying the number of occurrences on each new pass achieved 2% better compression than the best of the fixed-frequency methods.

6 Results

In order to compare the compression performance of ray to that of other compression schemes, we constructed three test collections of text and weather data. The smallest file in our test collection, smalltrec, is 2.86 Mb of text data taken from the TREC collection; TREC is an ongoing international collaborative experiment in information retrieval sponsored by NIST and ARPA [4]. The weather data (weather) contains 20,175 records collected from each of 5 weather stations, where each station record contains 4 sets of 22 measurements (such as temperatures, elevations, rainfall, and humidity); in total, weather is 38.2 Mb in size. Comact, the largest file in the test collection, is text data taken from the Australian Commonwealth Acts. We have compared the compression performance of ray to that of the well-known adaptive compression schemes gzip and compress, and to huffword, a semi-static compression program that uses canonical Huffman coding and an efficient ternary trie [2, 3]. Because of the large files in our test collections, we were unable to experiment with sequitur.
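For reference, the bits-per-character figures reported in Table 1 can be computed from file sizes as in this small helper, assuming both sizes are measured in bytes:

    def bits_per_char(original_bytes, compressed_bytes):
        # Bits of compressed output per character (byte) of input;
        # e.g. compressing 2.86 Mb to about 0.9 Mb gives roughly 2.5 bpc.
        return 8 * compressed_bytes / original_bytes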

[Table 1 lists, for each of the comact, weather, and smalltrec collections, the compression achieved (bpc) and the decompression time (seconds) for ray, huffword, gzip, and compress.]

Table 1: Compression performance of various compression schemes on the test collection. The compression performance is represented as bits per character (bpc), and the time to decompress the files is given in seconds.

Table 1 shows the compression performance and the decompression speed of each scheme on the test collections. The compression results are presented in bits per character (bpc), and the decompression time in seconds. Ray achieves better compression than all of the other compression schemes on the test collections, except for huffword on smalltrec. For the largest file, comact, ray achieves almost 10% better compression than the other semi-static scheme, huffword, which is specific to text compression. However, our current implementation of ray is between 1.2 and 2.2 times as slow as huffword in decompressing data, although decompression time remains practical.

Running sequitur on comact would consume around 385 Mb of main memory, given that there are about 7 times fewer symbols in the hierarchy than in the input and that each symbol in the hierarchy consumes 20 bytes; we were unable to measure this in practice due to hardware limitations, and the calculation is based on an estimation technique for sequitur described by Nevill-Manning et al. [7]. In contrast, our preliminary version of ray uses approximately 225 Mb of main memory to compress the same file; we believe that a production implementation of ray, using more efficient memory structures, will use less than 200 Mb of main memory on the comact collection. Our current preliminary implementation of ray is slow during compression, due mainly to its multi-pass nature. However, as we mentioned earlier, speed is less important when compressing largely static databases.

7 Conclusion

We have described a practical compression scheme for databases. The scheme has reasonable decompression speed and achieves excellent compression. Moreover, our scheme allows random access to data and is not restricted to databases of text. Ray works well as a candidate compression scheme for general-purpose databases, and consumes less memory than sequitur requires. We believe that our preliminary implementation can be improved further to reduce memory requirements during both compression and decompression, with a likely improvement in compression times. Moreover, other improvements are possible in the techniques used to store and code both rules and data, and these are likely to result in further efficiency gains.

Acknowledgments

We thank Andrew Turpin from The University of Melbourne for valuable discussions and his implementation of huffword. This work was supported by the Australian Research Council and the Multimedia Database Systems group at RMIT University.

References

[1] T.C. Bell, A. Moffat, C.G. Nevill-Manning, I.H. Witten and J. Zobel. Data compression in full-text retrieval systems. Journal of the American Society for Information Science, Volume 44, Number 9, pages 508-531, October 1993.

[2] J.L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proc. of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 360-369, New Orleans, Louisiana, 5-7 January 1997.

[3] J. Clement, P. Flajolet and B. Vallee. The analysis of hybrid trie structures. In Proc. of the Ninth ACM-SIAM Symposium on Discrete Algorithms, pages 531-539, San Francisco, California, January 1998.

[4] D. Harman.
Overview of the second text retrieval conference (TREC-2). Information Processing & Management, Volume 31, Number 3, pages 271-289, 1995.

[5] D.A. Lelewer and D.S. Hirschberg. Data compression. Computing Surveys, Volume 19, Number 3, pages 261-296, September 1987.

[6] A. Moffat, J. Zobel and N. Sharman. Text compression for dynamic document databases. IEEE Transactions on Knowledge and Data Engineering, Volume 9, Number 2, pages 302-313, 1997.

[7] C.G. Nevill-Manning and I.H. Witten. Phrase hierarchy inference and compression in bounded space. In J.A. Storer and M. Cohn (editors), Proc. IEEE Data Compression Conference, pages 179-188, Snowbird, Utah, March 1998. IEEE Computer Society Press, Los Alamitos, California.

[8] C.G. Nevill-Manning, I.H. Witten and D.L. Maulsby. Compression by induction of hierarchical grammars. In J.A. Storer and M. Cohn (editors), Proc. IEEE Data Compression Conference, pages 244-253, Snowbird, Utah, March 1994. IEEE Computer Society Press, Los Alamitos, California.

[9] C.G. Nevill-Manning, I.H. Witten and D.R. Olsen, Jr. Compressing semi-structured text using hierarchical phrase identification. In J.A. Storer and M. Cohn (editors), Proc. IEEE Data Compression Conference, pages 63-72, Snowbird, Utah, April 1996. IEEE Computer Society Press, Los Alamitos, California.

[10] I.H. Witten, A. Moffat and T.C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York, 1994.

[11] J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software Practice and Experience, Volume 25, Number 8, pages 891-903, August 1995.


More information

Submitted for TAU97 Abstract Many attempts have been made to combine some form of retiming with combinational

Submitted for TAU97 Abstract Many attempts have been made to combine some form of retiming with combinational Experiments in the Iterative Application of Resynthesis and Retiming Soha Hassoun and Carl Ebeling Department of Computer Science and Engineering University ofwashington, Seattle, WA fsoha,ebelingg@cs.washington.edu

More information

Efficient Trie-Based Sorting of Large Sets of Strings

Efficient Trie-Based Sorting of Large Sets of Strings Efficient Trie-Based Sorting of Large Sets of Strings Ranjan Sinha Justin Zobel School of Computer Science and Information Technology RMIT University, GPO Box 2476V, Melbourne 3001, Australia {rsinha,jz}@cs.rmit.edu.au

More information

Job Re-Packing for Enhancing the Performance of Gang Scheduling

Job Re-Packing for Enhancing the Performance of Gang Scheduling Job Re-Packing for Enhancing the Performance of Gang Scheduling B. B. Zhou 1, R. P. Brent 2, C. W. Johnson 3, and D. Walsh 3 1 Computer Sciences Laboratory, Australian National University, Canberra, ACT

More information

Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines

Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines B. B. Zhou, R. P. Brent and A. Tridgell Computer Sciences Laboratory The Australian National University Canberra,

More information

Issues in Using Knowledge to Perform Similarity Searching in Multimedia Databases without Redundant Data Objects

Issues in Using Knowledge to Perform Similarity Searching in Multimedia Databases without Redundant Data Objects Issues in Using Knowledge to Perform Similarity Searching in Multimedia Databases without Redundant Data Objects Leonard Brown The University of Oklahoma School of Computer Science 200 Felgar St. EL #114

More information

A technique for adding range restrictions to. August 30, Abstract. In a generalized searching problem, a set S of n colored geometric objects

A technique for adding range restrictions to. August 30, Abstract. In a generalized searching problem, a set S of n colored geometric objects A technique for adding range restrictions to generalized searching problems Prosenjit Gupta Ravi Janardan y Michiel Smid z August 30, 1996 Abstract In a generalized searching problem, a set S of n colored

More information

Progressive Compression for Lossless Transmission of Triangle Meshes in Network Applications

Progressive Compression for Lossless Transmission of Triangle Meshes in Network Applications Progressive Compression for Lossless Transmission of Triangle Meshes in Network Applications Timotej Globačnik * Institute of Computer Graphics Laboratory for Geometric Modelling and Multimedia Algorithms

More information

Kalev Kask and Rina Dechter. Department of Information and Computer Science. University of California, Irvine, CA

Kalev Kask and Rina Dechter. Department of Information and Computer Science. University of California, Irvine, CA GSAT and Local Consistency 3 Kalev Kask and Rina Dechter Department of Information and Computer Science University of California, Irvine, CA 92717-3425 fkkask,dechterg@ics.uci.edu Abstract It has been

More information

Huffman Code Application. Lecture7: Huffman Code. A simple application of Huffman coding of image compression which would be :

Huffman Code Application. Lecture7: Huffman Code. A simple application of Huffman coding of image compression which would be : Lecture7: Huffman Code Lossless Image Compression Huffman Code Application A simple application of Huffman coding of image compression which would be : Generation of a Huffman code for the set of values

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 11 Coding Strategies and Introduction to Huffman Coding The Fundamental

More information

TREC-7 Experiments at the University of Maryland Douglas W. Oard Digital Library Research Group College of Library and Information Services University

TREC-7 Experiments at the University of Maryland Douglas W. Oard Digital Library Research Group College of Library and Information Services University TREC-7 Experiments at the University of Maryland Douglas W. Oard Digital Library Research Group College of Library and Information Services University of Maryland, College Park, MD 20742 oard@glue.umd.edu

More information

Research Article Does an Arithmetic Coding Followed by Run-length Coding Enhance the Compression Ratio?

Research Article Does an Arithmetic Coding Followed by Run-length Coding Enhance the Compression Ratio? Research Journal of Applied Sciences, Engineering and Technology 10(7): 736-741, 2015 DOI:10.19026/rjaset.10.2425 ISSN: 2040-7459; e-issn: 2040-7467 2015 Maxwell Scientific Publication Corp. Submitted:

More information

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

More information

Intro. To Multimedia Engineering Lossless Compression

Intro. To Multimedia Engineering Lossless Compression Intro. To Multimedia Engineering Lossless Compression Kyoungro Yoon yoonk@konkuk.ac.kr 1/43 Contents Introduction Basics of Information Theory Run-Length Coding Variable-Length Coding (VLC) Dictionary-based

More information

Category: Informational May DEFLATE Compressed Data Format Specification version 1.3

Category: Informational May DEFLATE Compressed Data Format Specification version 1.3 Network Working Group P. Deutsch Request for Comments: 1951 Aladdin Enterprises Category: Informational May 1996 DEFLATE Compressed Data Format Specification version 1.3 Status of This Memo This memo provides

More information

Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries

Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries Ranjan Sinha Justin Zobel Abstract Ongoing changes in computer performance are affecting the efficiency of string sorting algorithms.

More information

Creating Meaningful Training Data for Dicult Job Shop Scheduling Instances for Ordinal Regression

Creating Meaningful Training Data for Dicult Job Shop Scheduling Instances for Ordinal Regression Creating Meaningful Training Data for Dicult Job Shop Scheduling Instances for Ordinal Regression Helga Ingimundardóttir University of Iceland March 28 th, 2012 Outline Introduction Job Shop Scheduling

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

TR-CS The rsync algorithm. Andrew Tridgell and Paul Mackerras. June 1996

TR-CS The rsync algorithm. Andrew Tridgell and Paul Mackerras. June 1996 TR-CS-96-05 The rsync algorithm Andrew Tridgell and Paul Mackerras June 1996 Joint Computer Science Technical Report Series Department of Computer Science Faculty of Engineering and Information Technology

More information

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Lecture 10 (Chapter 7) ZHU Yongxin, Winson

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Lecture 10 (Chapter 7) ZHU Yongxin, Winson Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Lecture 10 (Chapter 7) ZHU Yongxin, Winson zhuyongxin@sjtu.edu.cn 2 Lossless Compression Algorithms 7.1 Introduction 7.2 Basics of Information

More information

IMAGE COMPRESSION TECHNIQUES

IMAGE COMPRESSION TECHNIQUES IMAGE COMPRESSION TECHNIQUES A.VASANTHAKUMARI, M.Sc., M.Phil., ASSISTANT PROFESSOR OF COMPUTER SCIENCE, JOSEPH ARTS AND SCIENCE COLLEGE, TIRUNAVALUR, VILLUPURAM (DT), TAMIL NADU, INDIA ABSTRACT A picture

More information

8 Integer encoding. scritto da: Tiziano De Matteis

8 Integer encoding. scritto da: Tiziano De Matteis 8 Integer encoding scritto da: Tiziano De Matteis 8.1 Unary code... 8-2 8.2 Elias codes: γ andδ... 8-2 8.3 Rice code... 8-3 8.4 Interpolative coding... 8-4 8.5 Variable-byte codes and (s,c)-dense codes...

More information

Compressing and Decoding Term Statistics Time Series

Compressing and Decoding Term Statistics Time Series Compressing and Decoding Term Statistics Time Series Jinfeng Rao 1,XingNiu 1,andJimmyLin 2(B) 1 University of Maryland, College Park, USA {jinfeng,xingniu}@cs.umd.edu 2 University of Waterloo, Waterloo,

More information

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5

Inverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Inverted Indexes Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Basic Concepts Inverted index: a word-oriented mechanism for indexing a text collection to speed up the

More information