
Experiments in Compressing Wikipedia

A thesis presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Master of Science

Marco Wotschka

December 2013

© 2013 Marco Wotschka. All Rights Reserved.

This thesis titled

Experiments in Compressing Wikipedia

by MARCO WOTSCHKA

has been approved for the School of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by

David W. Juedes
Professor of Electrical Engineering and Computer Science

Dennis Irwin
Dean of the Russ College of Engineering and Technology

Abstract

WOTSCHKA, MARCO, M.S., December 2013, Computer Science
Experiments in Compressing Wikipedia (105 pp.)
Director of Thesis: David W. Juedes

Wikipedia contains a large amount of information on a variety of topics and continues to grow rapidly. With this growing collection of information, the need for improved lossless compression programs arises. This thesis investigates several lossless, general-purpose compression techniques, such as Burrows-Wheeler Compression (BWC) as well as Prediction by Partial Matching (PPM), and evaluates their performance on two benchmark files containing Wikipedia data. Improvements to BWC are suggested, outlined and evaluated. Furthermore, several preprocessing stages are introduced and tested. This thesis suggests an Improved Burrows-Wheeler Compression (IBWC) scheme, which utilizes a multi-threaded Burrows-Wheeler Transform (BWT), a Move-Fraction Transform (MF) as well as PPM, and which combines good compression outcomes with reasonable space and time requirements. It compresses the first 1 GB of Wikipedia to roughly 181 MB (44% better than gzip) and the first 100 MB of Wikipedia to 1.78 bits per character (38% better than gzip). Utilizing the BWT, this compression approach works particularly well on long inputs that contain frequent repetitions of long strings. Compression performance of the IBWC scheme is compared to gzip, bzip2 and PPM on two additional files, the complete genome of the model organism Caenorhabditis elegans and a collection of books obtained from Project Gutenberg. In both cases, IBWC provides compression performance similar to PPM.

Table of Contents

Abstract
List of Tables
List of Figures
List of Acronyms
1 Introduction
2 Preliminaries
  2.1 Lossless vs. lossy compression
  2.2 The structure of enwik8
3 General Purpose Compression Techniques
  3.1 Huffman coding
  3.2 Arithmetic Coding
  3.3 Prediction by Partial Matching (PPM)
  3.4 Context-Mixing (CM)
  3.5 Burrows-Wheeler Compression (BWC)
    3.5.1 The Burrows-Wheeler Transform (BWT)
    3.5.2 Time and space considerations of the Burrows-Wheeler Transform (BWT)
    3.5.3 Further processing steps of Burrows-Wheeler Compression (BWC)
4 Compressing Wikipedia
  4.1 Compression With Word-Based Huffman Coding (WBH)
  4.2 Compression using Burrows-Wheeler Compression (BWC) with unlimited block size
  4.3 Compression using Prediction by Partial Matching (PPM)
5 Improving Burrows-Wheeler Compression (BWC)
  5.1 Improving the Burrows-Wheeler Transform (BWT) stage
    5.1.1 Using a different lexicographic order
    5.1.2 Reflected Order Sorting
    5.1.3 Bijective Variants
  5.2 Improving the Global Structure Transform stage
    5.2.1 Move-One-From-Front Transform (M1FF)
    5.2.2 Move-Fraction Transform (MF)
  5.3 Improving Run-Length Encoding (RLE)
  5.4 Improving the entropy coding stage
  5.5 Eliminating NULL-Run-Length Encoding (RLE0) and further adjustments to the Move-Fraction Transform (MF)
  5.6 Repositioning individual steps and other post-BWT stages
  5.7 Results
6 Preprocessing
  6.1 Compression by separating structure from content (Splitting)
  6.2 Rabin-Karp Compression (RKC) as a Precompression Step
  6.3 Star Transform (ST)
  6.4 Shortened-Context Length-Preserving Transform (SCLPT)
  6.5 Word Replacing Transform (WRT)
  6.6 Results
7 Experimental Results on enwik9 using Improved Burrows-Wheeler Compression (IBWC)
8 Experimental Results on additional benchmark files
9 Summary of experimental results
Conclusions
References
Appendix: Code

List of Tables

3.1 Frequency distribution of an example file with fixed-length codes and Huffman codes
3.2 Partial interval assignments - Initialization
3.3 Partial interval assignments after one character
3.4 Rotations of the string "banana"
3.5 Sorted rotations of the string "banana"
3.6 Burrows-Wheeler Transform Reversal (UNBWT)
3.7 Burrows-Wheeler Transform Reversal (UNBWT), Step 2
3.8 Move-To-Front Transform (MTF) initialization of A
3.9 Move-To-Front Transform (MTF) A after processing the first symbol
3.10 Move-To-Front Transform (MTF) A after processing the third symbol
4.1 Compression performance of Word-Based Huffman Coding (WBH) on enwik8
4.2 Compression performance of Burrows-Wheeler Compression (BWC) with unlimited block size on enwik8
4.3 Compression performance of PPM on enwik8
Performance of Burrows-Wheeler Compression (BWC) with different settings for the individual stages
Compressor performance on enwik8 split into 12 files
Performance of Rabin-Karp Compression (RKC) followed by gzip
Compression performance of Improved Burrows-Wheeler Compression (IBWC) on enwik9 treated as ten blocks
Compression performance of Improved Burrows-Wheeler Compression (IBWC) on enwik9 treated as three blocks
Compression performance of splitting with Improved Burrows-Wheeler Compression (IBWC) on enwik9
Compression results on the files gut97 and c_elegans
Summary of approaches tested on enwik8
Summary of approaches tested on enwik9

List of Figures

2.1 Example of a short XML entry in enwik8
3.1 Huffman tree computation
3.2 Initial contents of the priority queue q
3.3 Contents of the priority queue after iteration 1
3.4 Contents of the priority queue after iteration 2
3.5 Contents of the priority queue after iteration 3
3.6 Completed Huffman tree
Files produced by separating structure from content (Splitting)
Example page entry after removing all non-XML from Figure 2.1

List of Acronyms

ARI     Arithmetic Coding
BWC     Burrows-Wheeler Compression
IBWC    Improved Burrows-Wheeler Compression
BWT     Burrows-Wheeler Transform
CM      Context-Mixing
GST     Global Structure Transform
LPT     Length-Preserving Transform
M1FF    Move-One-From-Front Transform
MF      Move-Fraction Transform
MSD     Most Significant Digit
MTF     Move-To-Front Transform
PPM     Prediction by Partial Matching
RK      Rabin-Karp Compressor
RKC     Rabin-Karp Compression
RLE     Run-Length Encoding
RLE0    NULL-Run-Length Encoding
SCLPT   Shortened-Context Length-Preserving Transform
ST      Star Transform
UNBWT   Burrows-Wheeler Transform Reversal
WBH     Word-Based Huffman Coding
WRT     Word Replacing Transform

1 Introduction

Most data traffic on the internet today is compressed in some form. Videos, images and voice, as well as text files such as HTML, JavaScript and cascading style sheets, often undergo some form of compression before arriving at their destination. As such, compression allows for faster data transfer as well as lowered storage requirements, and is vital to our modern society. Wikipedia is an important part of today's internet, as it provides an entry point for finding information on a wide variety of topics such as mathematics, politics, literature, medicine, art and computer science. Hence, there is interest in finding better ways to compress Wikipedia for storage and mirroring, as it currently features over four million articles in the English version alone and continues to grow at a rapid pace [51]. The information contained in Wikipedia is unique and includes XML, text (encoded in Unicode; for an introduction to Unicode, consult Salomon [46]), further markup, links to media of various types, mathematical equations and more. Because of this, we would expect Wikipedia to be very compressible in theory. However, general-purpose compressors such as gzip [20] and bzip2 [47] do not compress Wikipedia particularly well. The first 100 MB of its English version are compressed by these programs to roughly 36.6 MB and 29 MB, respectively. The contribution of this thesis is an exploration of techniques that can be used to achieve better compression on this type of information. The original motivation came from the Hutter Prize [24], a competition to determine which software can compress the first 100 MB of Wikipedia best. The current record for this benchmark is approximately 16 MB. On the first 1 GB of Wikipedia the record is roughly 127 MB [36]. Shannon estimates the information content of English to be between 0.6 and 1.3 bits per character [48], indicating that further compression of Wikipedia might be possible. While it is well known that certain information is very compressible, it is not clear what the limits for compressing Wikipedia are.

The information stored within it is obviously not random, but it does not appear to have long repetitions either and contains a significant amount of structural information. However, it is not immediately obvious how to exploit the redundancy within it. In this thesis, a series of experiments is described, intended to determine the best compression approaches for Wikipedia and similar data collections using the benchmark files enwik8 (the first 100 MB of Wikipedia) [24] and enwik9 (the first 1 GB of Wikipedia) [36]. The current best approach described in this thesis achieves compression to roughly 22.5 MB on enwik8 and 181 MB on enwik9. This currently places it at position 36 out of over 160 different compressors tested on the larger benchmark file enwik9 [36]. The approach combines a Burrows-Wheeler Transform (BWT) with unlimited block size, a Move-Fraction Transform (MF) and Prediction by Partial Matching (PPM) as its last stage. Choosing an unlimited block size during the BWT proves to be very helpful in improving compression. Furthermore, enough redundancy is still present after the MF to warrant the use of PPM for further gains in compression. This thesis is organized as follows: In Chapter 2 the reader is introduced to the concepts of lossy and lossless compression. While lossy compression has its place in today's world, the focus within this thesis will lie on lossless compression techniques. Furthermore, Chapter 2 discusses the benchmark files used throughout this thesis. Chapter 3 summarizes several lossless, general-purpose compression techniques. These include Huffman coding, arithmetic coding as well as Prediction by Partial Matching (PPM) and Burrows-Wheeler Compression (BWC). In Chapter 4 some of these techniques are tested on the benchmark file enwik8. Chapter 5 highlights several possible improvements to a Burrows-Wheeler Compression (BWC) scheme. Among those are a modified lexicographic sort order to be used during the Burrows-Wheeler Transform (BWT) step, as well as alternatives to the subsequent Move-To-Front Transform (MTF), Run-Length Encoding (RLE) and entropy coding steps.

Additionally, test results for an improved Burrows-Wheeler compressor are provided. In Chapter 6 several preprocessing techniques designed to improve compression outcomes are discussed and tested in conjunction with gzip, bzip2, BWC and PPM. Chapter 7 gives results of some of the approaches outlined in this thesis when applied to the benchmark file enwik9. The Improved Burrows-Wheeler Compression (IBWC) scheme developed in Chapter 5 is tested on two additional files, c_elegans and gut97, in Chapter 8. A summary of test results on enwik8 and enwik9 can be found in Chapter 9.

2 Preliminaries

The widespread availability of cloud services, music players, tablets and other portable devices with relatively limited storage capacity has increased the demand for efficient storage and transmission of information. This chapter discusses lossless and lossy compression. Lossless compression occurs when the compression process can be reversed to obtain an exact copy of the input. Lossy compression algorithms do not have this behavior, but are an option when some loss of information is acceptable. Additionally, the structure and content of enwik8 are discussed below. The file contains page entries of Wikipedia organized in XML. Some loss of information might in theory be acceptable when compressing enwik8. However, for this thesis only lossless compression techniques will be considered.

2.1 Lossless vs. lossy compression

Let c(f) be the result of compression of a file f using some compression algorithm c. Furthermore, let d(c(f)) be the result of the reversal of this compression process. If it is guaranteed that d(c(f)) = f for every string f, then compression is considered lossless. If d(c(f)) = f cannot be guaranteed for all files, compression is considered lossy. Lossy compression algorithms "concede a certain loss of accuracy in exchange for greatly increased compression [42]" by discarding information that is considered to be of lower importance. For instance, mp3 files are lossily compressed versions of music files. Other uses for lossy compression present themselves. Examples include image compression as well as the minification of cascading style sheets (CSS) and JavaScript files, a process in which variable names are shortened and whitespace is mostly eliminated. These compressed images and minified CSS and JavaScript files are still functional without needing to be decompressed prior to use. In addition, they consume less hard-drive space as well as transfer time than their uncompressed counterparts. Use of these compressed files on a web server results in faster page-load times for the user and reduced bandwidth consumption on both ends.

However, lossy compression is not an option when all information, regardless of how unimportant it may seem, must be reproducible during decompression. Lossless compression, on the other hand, is the process of compressing a file f using an algorithm c in such a way that an algorithm d can be used to decompress c(f) while satisfying d(c(f)) = f for every string f. In order to satisfy this constraint, lossless compression algorithms are by nature less efficient than lossy compression algorithms. The process of compressing a string f can be thought of as a one-to-one mapping of f to c(f). Hence, it is impossible to guarantee that |c(f)| < |f| for every string f, because a simple counting argument shows that not all strings can be mapped to a shorter string: there are 2^n bit strings of length n, but only 2^n - 1 nonempty strings of length less than n. If an algorithm made the claim that |c(f)| < |f| for every string f, then clearly it would also guarantee that |c(c(f))| < |c(f)| for every string f. Therefore, such an algorithm could be used recursively on its own output, guaranteeing that with every iteration the output is shortened by at least one byte, which is impossible. As a result, some mappings necessarily lead to scenarios in which |c(f)| > |f|. In such a case, compression of a file results in a larger compressed file. When compressing information, we aim to satisfy |c(f)| < |f| for those files that are of interest, while accepting that compressing less common files will result in file expansion. The compression techniques discussed and used throughout this thesis are lossless in nature. Chapter 3 introduces several of those lossless, general-purpose compression techniques, including Huffman coding, arithmetic coding, Prediction by Partial Matching (PPM) and Context-Mixing (CM) as well as Burrows-Wheeler Compression (BWC).

2.2 The structure of enwik8

The motivation for writing this thesis is the Hutter Prize [24]. A file enwik8, containing the first 10^8 bytes of the English Wikipedia dump as of March 3, 2006, is provided for benchmarking as part of the competition. A prize is awarded to a program which can create a compressed version of this file that beats the current record-holder. Limitations with respect to running time, memory usage as well as specifics of the testing platform are enforced. The goal of the Hutter Prize is to encourage the use of intelligent means to compress human knowledge. In addition to the benchmark file enwik8, a file enwik9 is provided by Mahoney [35]. It contains the first 10^9 bytes of the English Wikipedia dump as of March 3, 2006. Memory and running time limitations do not apply to this file as it is not part of the Hutter Prize. The purpose of this benchmark is to "encourage research in artificial intelligence and natural language processing (NLP)." [35] Enwik8 and enwik9 are UTF-8 encoded XML with primarily English text. The files contain a list of Wikipedia pages that follow the general structure shown in Figure 2.1. While all of the individual XML tags contain some text or numeric information, these passages are usually short, with text nodes being the only exception. These nodes may contain #REDIRECT statements like the one in Figure 2.1, but in many instances contain complete Wikipedia articles that are often lengthy and therefore of particular interest. Enwik8 is roughly 75% clean text and 25% markup of various forms such as tables, hypertext links, images, formatting and XML. The textual data present in the <text></text> nodes contains some URL-encoded XHTML tags such as "&lt;", "&amp;" and "&gt;". Hypertext links to internal articles are enclosed in double-corner brackets such as "[[pagetitle|anchortext]]". When anchor text and page title are identical, the vertical bar and anchor text are omitted. External links usually take the form "[URL anchortext]".

  <page>
    <title>AccessibleComputing</title>
    <id>10</id>
    <revision>
      <id> </id>
      <timestamp>T22:18:38Z</timestamp>
      <contributor>
        <username>Ams80</username>
        <id>7543</id>
      </contributor>
      <minor />
      <comment>fixing redirect</comment>
      <text xml:space="preserve">#REDIRECT [[Accessible_computing]]</text>
    </revision>
  </page>

Figure 2.1: Example of a short XML entry in enwik8 [24].

Several other uses of markup in the <text></text> nodes can be observed. For example, section headers are often of the form "== Section Header ==" and "=== Section Header ===". Furthermore, "{{" and "}}" occur in pairs and have structural meaning. Quotations are often surrounded by two instances of the string """. There are programs that have been specifically designed to compress XML. These include XMill [33] [34] and XWRT [49]. One approach that separates structure from content is discussed in Chapter 6.1.

3 General Purpose Compression Techniques

This chapter introduces the reader to several general-purpose lossless compression approaches that can be used to compress enwik8. Huffman coding [23] was one of the first efficient attempts at compression. Invented in 1952, it quickly became a standard and owes its continued use today to its simplicity and lack of protection by patents. Another breakthrough in compression was reached in 1977 when Ziv and Lempel published the LZ methods [54] [55], which to this day are commonly used in gzip as well as other programs and are popular due to their fast running times and the fact that they achieve better compression than Huffman coding. In 1979, Rissanen and Langdon [31] published their work on arithmetic coding, an improvement over Huffman coding. Arithmetic coding led to the development of more sophisticated compression methods such as Prediction by Partial Matching (PPM) [52]. The Burrows-Wheeler Transform (BWT) was introduced in 1994 [9]. While it is not technically a compression technique, it can facilitate better compression when used in conjunction with other algorithms. This chapter outlines some of these techniques.

3.1 Huffman coding

In 1952, Huffman [23] proposed a method for the construction of minimum-redundancy codes, now commonly referred to as Huffman coding. It is based on the observation that individual characters in English text occur with different frequencies. For example, the characters e and t are much more commonly used in English than the characters q or x. However, all characters are allocated the exact same amount of space, one byte or eight bits, using the standard flat encoding. The idea behind Huffman coding is to allocate less space to characters which occur frequently and more space to those that occur less often, in the hope that the savings incurred often for the frequent characters offset the encoding penalty for the less frequent symbols.

    Initialize priority queue q of tree nodes and frequency table f
    Determine character frequencies and output them
    for i <- 0 to 255 do
        // Create a tree node for every character that occurs
        if f[i] > 0 then
            tn <- TreeNode
            set left child of tn <- NULL
            set right child of tn <- NULL
            set frequency of tn <- f[i]
            insert tn into q
        end if
    end for
    while size of q > 1 do
        // Combine the two nodes with lowest frequency into a new node
        tnnew <- TreeNode
        set left child of tnnew <- q[0]
        set right child of tnnew <- q[1]
        set frequency of tnnew <- q[0].frequency + q[1].frequency
        remove q[0] and q[1] from q
        insert tnnew into q
    end while
    The remaining element in q is the root of the Huffman tree

Figure 3.1: Huffman tree computation
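The same construction can be expressed directly in C++. The sketch below is illustrative rather than the implementation from the appendix; the names HuffNode, buildTree and buildCodes are assumptions chosen for this example. It builds the tree with a std::priority_queue ordered by frequency and then derives the prefix-free codes by traversal, assigning 0 to left branches and 1 to right branches.

    #include <cstdint>
    #include <iostream>
    #include <memory>
    #include <queue>
    #include <string>
    #include <vector>

    // Hypothetical node type; the TreeNode of Figure 3.1 is analogous.
    struct HuffNode {
        uint64_t freq;
        int symbol;                       // -1 marks an internal node
        std::unique_ptr<HuffNode> left, right;
    };

    struct CmpByFreq {                    // lowest frequency = highest priority
        bool operator()(const HuffNode* a, const HuffNode* b) const {
            return a->freq > b->freq;
        }
    };

    // Build the Huffman tree from a 256-entry frequency table (cf. Figure 3.1).
    std::unique_ptr<HuffNode> buildTree(const std::vector<uint64_t>& f) {
        std::priority_queue<HuffNode*, std::vector<HuffNode*>, CmpByFreq> q;
        for (int i = 0; i < 256; ++i)
            if (f[i] > 0) q.push(new HuffNode{f[i], i, nullptr, nullptr});
        while (q.size() > 1) {            // repeatedly merge the two rarest nodes
            HuffNode* a = q.top(); q.pop();
            HuffNode* b = q.top(); q.pop();
            q.push(new HuffNode{a->freq + b->freq, -1,
                                std::unique_ptr<HuffNode>(a),
                                std::unique_ptr<HuffNode>(b)});
        }
        return std::unique_ptr<HuffNode>(q.empty() ? nullptr : q.top());
    }

    // Traverse the tree: left edges emit '0', right edges emit '1'.
    void buildCodes(const HuffNode* n, std::string prefix,
                    std::vector<std::string>& codes) {
        if (!n) return;
        if (n->symbol >= 0) { codes[n->symbol] = prefix; return; }
        buildCodes(n->left.get(),  prefix + '0', codes);
        buildCodes(n->right.get(), prefix + '1', codes);
    }

    int main() {
        std::vector<uint64_t> f(256, 0);
        for (char c : std::string("this is an example of a huffman tree"))
            f[(unsigned char)c]++;
        auto root = buildTree(f);
        std::vector<std::string> codes(256);
        buildCodes(root.get(), "", codes);
        for (int i = 0; i < 256; ++i)
            if (!codes[i].empty()) std::cout << char(i) << " -> " << codes[i] << '\n';
    }

Because the comparator orders the queue by ascending frequency, the two least frequent subtrees are always merged first, exactly as in the worked example that follows.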

The algorithm requires the gathering of frequency information for each character, the construction of a Huffman tree based on those frequencies and a tree traversal to determine the code words for the individual characters. In a second pass over the input, individual characters are then replaced with their Huffman codes and a dictionary is created. An important property of Huffman codes is that they are prefix-free [13], i.e., no code is a prefix of any other code. Hence, a message encoded using a Huffman coder can always be uniquely decoded as long as the decoder can construct the same Huffman tree from a given distribution table. This implies that the distribution information needs to be made available to the decoder, which can be done by simply prepending it to the Huffman-coded output. The construction of the Huffman tree is outlined in Figure 3.1.

Table 3.1: Frequency distribution of an example file with fixed-length codes and Huffman codes (symbols a, b, c, d and e; 285 characters in total)

The following paragraphs walk the reader through a simple Huffman coding example. Table 3.1 gives frequencies for the symbols a, b, c, d and e of a file containing a total of 285 characters. Since only five symbols are used in our input file, we could choose to encode each symbol as exactly 3 bits. An example of such a fixed-length code is given in the third column. Encoding a file in such a way uses 285 x 3 = 855 bits. The Huffman codes given in the last column produce a compressed file that uses only 616 bits for the message. A dictionary needs to be made available to the decoder, but is omitted in this example. Such a dictionary contains at most 256 frequencies for character-based approaches and its size is negligible on large inputs. Figure 3.2 shows the contents of the priority queue after its initialization. All nodes are trees corresponding to single characters. These trees are sorted by their frequencies in ascending order.

Figure 3.2: Initial contents of the priority queue q (single-character trees in ascending order of frequency: e, b, c, d, a)

Figure 3.3 shows the contents of the priority queue after the characters b and e have been combined to form a new tree. Note that the frequency for the root node of this new tree is the sum of the frequencies of its children.

Figure 3.3: Contents of the priority queue after iteration 1 (c, d, a, followed by the combined tree of e and b)

Figure 3.4 shows how the letter c has been combined with another node to form a new tree. The nodes corresponding to the characters d and a have been moved to the front of the priority queue as they are now the two root nodes with the lowest frequencies.

Figure 3.4: Contents of the priority queue after iteration 2 (d, a, followed by the tree combining c with the tree of e and b)

Figure 3.5 shows the forest after combining two nodes that have so far been untouched. The combined frequency of the symbols d and a is larger than the frequencies of the root nodes of the remaining trees in the priority queue. This new tree is, therefore, placed at the end of q.

Figure 3.5: Contents of the priority queue after iteration 3 (the tree containing c, e and b, followed by the new tree combining d and a)

Figure 3.6: Completed Huffman tree

Figure 3.6 shows the completed Huffman tree. The Huffman codes for the individual symbols can be obtained by traversing the tree. Left branches are usually assigned a 0-bit, right branches a 1-bit. The path from the root to the character b would be "left-left-right", equivalent to the bit sequence 001. A traversal of the Huffman tree allows us to obtain the codes given in Table 3.1. In a second pass over the input, symbols can now be replaced with their Huffman codes.

3.2 Arithmetic Coding

Just like Huffman coding, arithmetic coding [31], which can be traced back to Elias in the early 1960s [52], is a member of the family of entropy coders (see the comprehensive guide to arithmetic coding by Bodden, Clasen and Kneis [8] for additional details). Unlike Huffman coding, in which each symbol is encoded individually with a whole number of bits, arithmetic coding encodes the entire input as a number in the interval [0,1). This enables an arithmetic coder to use fractions of bits rather than a whole number of bits per symbol.

Beginning with a frequency table f and the interval s over [0,1), subintervals s_i are reserved for each symbol i of the input alphabet. These subintervals correspond in size to the relative frequency of their respective symbols i. Once a symbol is read and its subinterval determined, this subinterval serves as the new interval for the next iteration. In arithmetic coding, it is possible to begin with a first pass over the data, in which frequency information is gathered. As a result of such a strategy, one might determine that certain characters are never used; the relative frequency of such characters would equal 0, requiring no subinterval creation for such symbols. Similarly to Huffman coding, this frequency table would need to be passed on to the decoder. For an example of arithmetic coding, the frequency distribution in Table 3.2 was used. In the input file, the character a occurs four times, b occurs twice, c once and three occurrences of the character d are observed. The input consists of ten characters altogether. Using this information, we would reserve for each symbol the partial interval identified by columns three and four of this table.

Table 3.2: Partial interval assignments - Initialization

  symbol   frequency   lower bound   upper bound
  a        4           0.0           0.4
  b        2           0.4           0.6
  c        1           0.6           0.7
  d        3           0.7           1.0

The character a falls into the interval [0, 0.4), b is assigned the partial interval [0.4, 0.6), the portion [0.6, 0.7) is reserved for the character c and d falls into the partial interval [0.7, 1). While it is not important which subinterval is assigned to a symbol, it is important that the decoder assigns subintervals exactly the same way as the encoder.

Now assume that the first character found in the input is an a. Since an arithmetic coder encodes an entire message as one number, the final value is located in the partial interval [0, 0.4). Given the frequency distribution in Table 3.2, it is ensured that all possible strings beginning with a will fall into this subinterval. Predictions about upcoming characters are made in this fashion while always using the current interval as the basis for the assignment of partial intervals to individual symbols. In Table 3.3, the values for the next step are shown. All strings beginning with "aa" will fall into the interval [0, 0.16). One can continue in this fashion until the input is consumed.

Table 3.3: Partial interval assignments after one character

  symbol   frequency   lower bound   upper bound
  a        4           0.00          0.16
  b        2           0.16          0.24
  c        1           0.24          0.28
  d        3           0.28          0.40

Arithmetic coding exists in different forms. In an initial pass over its input a static arithmetic coder gathers frequency information and uses this information while performing the encoding during a second pass over the data. Probabilities of symbols are never modified. The example shown in this section is a static arithmetic coder. By initializing the frequency table in such a way that all symbols i have a frequency f_i = 1 and incrementing these frequencies as symbols are encountered, one can turn a static into an adaptive arithmetic coder. The advantage of an adaptive arithmetic coder lies in the fact that frequency information does not need to be made available to the decoder, eliminating this overhead.
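The interval-narrowing step described above can be sketched in a few lines of C++. This is a minimal illustration using the static model of Table 3.2, not a complete arithmetic coder: it tracks the interval with double precision for readability, whereas a practical coder works with renormalized integer ranges and must also emit bits and handle precision carefully. The input string used here is a hypothetical message chosen to match the frequencies of the table.

    #include <iostream>
    #include <string>
    #include <vector>

    // One alphabet symbol with its partial interval, as in Table 3.2.
    struct Range { char sym; double low, high; };

    int main() {
        // Static order-0 model for the 10-character example: a=4, b=2, c=1, d=3.
        std::vector<Range> model = {
            {'a', 0.0, 0.4}, {'b', 0.4, 0.6}, {'c', 0.6, 0.7}, {'d', 0.7, 1.0}};

        double low = 0.0, high = 1.0;                 // current interval [low, high)
        for (char c : std::string("aabddcadab")) {    // hypothetical input
            for (const Range& r : model) {
                if (r.sym == c) {
                    double width = high - low;
                    high = low + width * r.high;      // shrink the interval to the
                    low  = low + width * r.low;       // symbol's partial interval
                    break;
                }
            }
            std::cout << c << ": [" << low << ", " << high << ")\n";
        }
        // Any number inside the final interval identifies the whole message.
    }

After the first symbol the interval is [0, 0.4), and after "aa" it is [0, 0.16), matching Table 3.3.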

If a human were confronted with the task of predicting the next character in an English text file without knowing anything about the file or the symbols prior to the current character, the prediction would be based on the knowledge that e is the most commonly used alphabetic character in the English language. The best guess, given the limited amount of information, would be to predict the character e as the next symbol. If this same person were given the information that the last symbol encountered was a q, their prediction for the next symbol would likely change. In the English language, a q is often followed by a u. As humans we can take advantage of this knowledge and make better predictions. We no longer attempt to estimate the probability of symbol s_i being our next symbol, but instead try to predict the probability of s_i being our next symbol given s_{i-1}. More formally, let x and y be symbols from our alphabet Σ. We would like to be able to predict the conditional probabilities P(x|y) for all x, y in Σ, the probability that the next character is x given that the last character was y. We can expand this to not only include the last symbol in our predictions, but the last n symbols. An arithmetic coder considering n previous symbols is called an order-n arithmetic coder. A model using only the static frequency table while considering no previous characters when making a prediction about the next symbol is, therefore, an order-0 model. Arithmetic coding performs at least as well as Huffman coding [52], as it usually represents the encoded message with greater compactness. This is especially true for texts over small alphabets. Consider a binary alphabet (as one would use to encode black and white pixels on a fax) with a heavily skewed character distribution. Huffman coding assigns a one-bit code to each of the characters even though white pixels should occur much more frequently than black pixels. However, with a larger alphabet size Huffman coding comes closer to arithmetic coding in terms of its compression performance [8]. Arithmetic coding separates the model from the encoding process and modern implementations are fast. Another major advantage of arithmetic coding over Huffman coding is its ability to adapt to its input.

When processing the input, one can easily increment frequency counts of symbols, which in turn are used to partition the interval. To add this ability to a Huffman coder would require the recomputation of all Huffman codes at every character, which is a time-consuming process.

3.3 Prediction by Partial Matching (PPM)

Another lossless general-purpose compression technique is Prediction by Partial Matching (PPM) [12]. Like higher-order arithmetic coding, PPM is based on the prediction of symbols given their current context. Section 3.2 discussed how an arithmetic coder can use a frequency table to predict the next character and how a higher-order arithmetic coder can take context into consideration. PPM differs from higher-order arithmetic coding in one important way. While arithmetic coding of order n predicts the probabilities of symbols given the last n symbols, PPM can also make use of lower-order contexts if the current symbol has not previously been encountered in a higher-order context. This is achieved by introducing an additional character, referred to as the escape symbol. The frequency of this symbol in every context is positive. When the order-n model is unable to predict the next character based on the order-n context, it causes the arithmetic coder to encode the escape symbol. This character indicates to the decoder that the next-highest order was used to make a prediction. In this way, one can utilize the predictive power of higher-order models while falling back on lower-order models when no prediction can be made. One obvious drawback of this approach is that one does not know in advance which highest order will result in the best compression outcomes. If PPM fails to predict at order n, it encodes the escape symbol to indicate this failure and will attempt to use the model of order n-1 to predict the next symbol. The encoded escape symbol increases the size of the output, but does nothing to encode the next symbol. Frequent failure of a higher-order model to predict the next symbol results in many encodings of the escape symbol without contributing to better compression.

PPM allows us to separate modelling from encoding. While the modelling process attempts to predict the upcoming symbol accurately, the arithmetic coder uses the symbol probabilities it receives from the modelling stage in order to encode characters. The main focus in modern PPM approaches lies in improving the quality of probability prediction. It is important not to confuse encoding with compression. It is entirely possible to encode a message with Prediction by Partial Matching (PPM) (or arithmetic coding) in such a way that the encoded message is in fact longer than the input. In such a case the model is a bad predictor of the input. Therefore, it is important that PPM accurately predicts the next symbol and be implemented in such a way that it adapts to changes in its input in a timely fashion. Salomon [46] discusses several variants of PPM.

3.4 Context-Mixing (CM)

Context-Mixing (CM), which is similar to PPM, presents another example of a lossless compression approach. In PPM one attempts to predict the next character based on one higher-order context and falls back to a lower-order context when necessary. In this way, a prediction is made for P(x|y), where x is the next symbol and y is the context. In CM one attempts to estimate P(x|y, z) based on P(x|y) and P(x|z) by combining the predictions of two statistical models into one. It is not uncommon to use a weighted average of the values one has readily available. For instance, if P(x|y) = 0.2 and P(x|z) = 0.5 are given, the average (0.35) could be used as an estimate of P(x|y, z). If the confidence in the prediction of P(x|z) is higher than the confidence in the accuracy of P(x|y), it might receive a higher weight. This process is called Context-Mixing (CM) since predictions based on different contexts are mixed or combined into one prediction. It is important to note that there is no need for an escape symbol in CM.
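A minimal sketch of the weighted mixing just described is shown below, assuming fixed confidence weights; the function name mix is an illustrative choice. Practical context mixers adapt these weights as they observe how well each model predicts (for example with logistic mixing) rather than keeping them constant.

    #include <iostream>
    #include <vector>

    // Combine the predictions of several context models into one probability,
    // weighting each model by a confidence value.
    double mix(const std::vector<double>& p, const std::vector<double>& w) {
        double num = 0.0, den = 0.0;
        for (size_t i = 0; i < p.size(); ++i) { num += w[i] * p[i]; den += w[i]; }
        return den > 0.0 ? num / den : 0.0;
    }

    int main() {
        // P(x|y) = 0.2 and P(x|z) = 0.5, as in the example above.
        std::cout << mix({0.2, 0.5}, {1.0, 1.0}) << '\n';  // equal weights: 0.35
        std::cout << mix({0.2, 0.5}, {1.0, 3.0}) << '\n';  // trust P(x|z) more: 0.425
    }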

3.5 Burrows-Wheeler Compression (BWC)

The latest member in the arsenal of general-purpose, lossless data compression approaches is the Burrows-Wheeler Compression (BWC) scheme, which was introduced by Burrows and Wheeler in 1994 [9]. This multi-step compression scheme makes use of the Burrows-Wheeler Transform (BWT), which was introduced in the same technical report. The BWT reorders characters in the input in order to make it more amenable to compression. Its output contains long runs of characters, which Burrows and Wheeler processed further by means of a Move-To-Front Transform (MTF) and Run-Length Encoding (RLE) before using a Huffman coder as a final step. The following sections describe the individual steps in more detail.

3.5.1 The Burrows-Wheeler Transform (BWT)

During the BWT, all cyclic permutations of the input are sorted lexicographically. Once this step is completed, the last characters of the sorted strings are sent to the output. Additionally, the lexicographic position of the original string is provided. For illustration, the next example shows how the BWT is performed on the string "banana" (this example, or variants thereof, is found in many places in the literature; Wikipedia uses a slightly modified version of the string "banana" to illustrate the inner workings of the BWT). Table 3.4 shows the rotations of the input string. The original string "banana" is located in row 0. A rotation in row i is created by moving the last character of row i-1 to the front of row i and shifting all other characters to the right by one position. After all rotations have been generated, they are sorted lexicographically. The result of this sorting process is shown in Table 3.5, with the original string being located in row 3 of this matrix. The output of the BWT consists of the index of the row containing the original string as well as the last column (L) of our sorted matrix.

Table 3.4: Rotations of the string "banana"

  row   F         L
  0     b a n a n a
  1     a b a n a n
  2     n a b a n a
  3     a n a b a n
  4     n a n a b a
  5     a n a n a b

Table 3.5: Sorted rotations of the string "banana"

  row   F         L
  0     a b a n a n
  1     a n a b a n
  2     a n a n a b
  3     b a n a n a
  4     n a b a n a
  5     n a n a b a

Hence, the output becomes "3_nnbaaa" and the BWT of the string "banana" is complete. Since the BWT requires sorting of its input and needs this input to be available before it can generate output, it is not suitable for on-line use. The BWT is capable of handling context of unlimited length [10]. Note the repetitions present in the last column of Table 3.5. There are two instances of n and three instances of a grouped together. Consider that in the string "banana", an a is often preceded by an n and an n is often preceded by an a.

In the sorted matrix in Table 3.5, the n in column L of row 0 was followed by the a in column F of row 0. Similarly, the n in the last column of row 1 was followed by the a in that row's first column. Since the individual strings are sorted lexicographically, strings starting with a and ending with n will be grouped closely together. This process results in many long character runs in our output. Since sorting is not a reversible process, the question arises how the Burrows-Wheeler Transform Reversal (UNBWT) can reproduce the original string given just a row index and the last column of the transformation process. Since the BWT only changes the order of the characters in its input, the same characters appear in the output with a different arrangement. The output of the BWT can be used to populate the last column of the matrix. Furthermore, the last column can be sorted to obtain the first column.

Table 3.6: Burrows-Wheeler Transform Reversal (UNBWT)

  row   F   L
  0     a   n
  1     a   n
  2     a   b
  3     b   a
  4     n   a
  5     n   a

Table 3.6 shows the matrix after populating column F from the contents of column L by sorting its characters lexicographically. The original string is located in row 3 and begins with the character b. This occurrence of the letter b is the first occurrence of said letter in column F.

After finding the first occurrence of said letter in column L, which appears in row 2, a jump to column F in row 2 gives the next letter of the original string.

Table 3.7: Burrows-Wheeler Transform Reversal (UNBWT), Step 2

  row   F   L
  0     a   n
  1     a   n
  2     a   b
  3     b   a
  4     n   a
  5     n   a

Table 3.7 shows the matrix after having obtained the letter a in column F in row 2. This instance of the character a is the third occurrence of that letter in column F. After finding the third a in column L, a new row index (5) can be obtained. The character n found at column F in row 5 is the next character in the original string. By jumping back and forth, the original string can be rebuilt without having to reconstruct the entire matrix. Similarly to PPM, the BWT predicts symbols based on context. While in PPM the context is provided by preceding symbols, the BWT uses context provided by following symbols [29]. Kruse and Mukherjee [29] point out that the BWT discards a lot of statistical and structural information before the Global Structure Transform (GST) stage and that the BWT only considers context induced by adjacent symbols. The example strings "abc", "agc" and "azc" given in [29] strongly suggest context based on non-adjacent symbols, which the BWT is unable to model.
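The transform and its reversal can be sketched compactly in C++ for short inputs such as "banana". The sketch below is purely illustrative: it materializes and sorts all rotations explicitly, which is exactly what the next section argues against for large inputs, and it reverses the transform with the first/last-column correspondence described above, where a stable sort of positions by character plays the role of the cumulative frequency bookkeeping.

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <string>
    #include <utility>
    #include <vector>

    // Naive BWT: build all rotations explicitly, sort them, emit the last column
    // and the row index of the original string (fine for "banana", not for 100 MB).
    std::pair<std::string, size_t> bwt(const std::string& s) {
        std::vector<std::string> rot;
        for (size_t i = 0; i < s.size(); ++i)
            rot.push_back(s.substr(i) + s.substr(0, i));
        std::sort(rot.begin(), rot.end());
        std::string last;
        size_t row = 0;
        for (size_t i = 0; i < rot.size(); ++i) {
            last += rot[i].back();
            if (rot[i] == s) row = i;
        }
        return {last, row};
    }

    // UNBWT: idx[i] is the position in column L that corresponds to row i of
    // column F; jumping through idx rebuilds the original string front to back.
    std::string unbwt(const std::string& last, size_t row) {
        const size_t n = last.size();
        std::vector<size_t> idx(n);
        std::iota(idx.begin(), idx.end(), 0);
        // A stable sort of positions by character reproduces column F while
        // keeping equal characters in their original relative order.
        std::stable_sort(idx.begin(), idx.end(),
                         [&](size_t a, size_t b) { return last[a] < last[b]; });
        std::string out;
        for (size_t k = 0; k < n; ++k) {
            row = idx[row];          // jump to the matching occurrence in column L
            out += last[row];        // this equals the next character in column F
        }
        return out;
    }

    int main() {
        auto [last, row] = bwt("banana");
        std::cout << last << " " << row << '\n';   // nnbaaa 3
        std::cout << unbwt(last, row) << '\n';     // banana
    }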

3.5.2 Time and space considerations of the Burrows-Wheeler Transform (BWT)

Storing all rotations of the original string explicitly during the sorting process consumes a lot of space. For a string of length n, n rotations of length n would need to be stored, resulting in a space complexity of O(n^2), which is not feasible for large n. Instead of storing all rotations explicitly, it is more efficient to store the original string and n rotations as integers corresponding to their beginning locations in the original string. Hence, for an input of length n we would store n characters (the original string) and n integers, thus dramatically reducing memory requirements to O(n) (more specifically 5n bytes when using 32-bit integers), although further reductions in memory requirements can be achieved when considering that only log2(n) bits are needed to represent the rotations. Sorting strings is an integral part of the Burrows-Wheeler Transform (BWT). Since n strings of length n need to be sorted and a comparison of two strings can take up to n steps, the worst-case running time is O(n^2 log n), where n is the length of the string to which the Burrows-Wheeler Transform (BWT) is applied. Since n can grow large, programs using this transform such as bzip2 [47] usually introduce a block size that essentially treats a string of length n as multiple inputs of length b, thus reducing the running time and space requirements while sacrificing compression performance. With a block size of b, the time complexity becomes O(n b log b). It should, however, be noted that the comparison of two strings of length n often takes only one character comparison. If their first characters differ, the lexicographic order can be determined immediately. In practice, the observed running time of a BWT is, therefore, often faster than O(n^2 log n), while long string repetitions present in the input to the BWT would result in longer running times. Several suffix tree and suffix array construction algorithms [45] [26] have been introduced that allow for even speedier BWT implementations with linear time and space complexity. Unfortunately, these approaches often use more memory than 5n bytes, which necessitates the use of blocks even for smaller files such as enwik8.

The Burrows-Wheeler Transform Reversal (UNBWT) is linear in time and space. This may be surprising given that in order to obtain the first column of the matrix shown in Table 3.7 the last column had to be sorted, which is generally not linear in time. A comparison sort is, however, not necessary. When the input to the UNBWT is read into the last column of our matrix, a frequency table f can be updated for all possible bytes. This table is small, containing only 256 entries. Once the input is read, the first column of the matrix can be reconstructed by simply iterating over all elements of f in their already known lexicographic order and appending them to the first column as many times as they occurred in the last column. This process, known as counting sort, takes linear time. Two operations that are performed as part of the Burrows-Wheeler Transform Reversal (UNBWT) require closer attention. Once the next character in column F has been obtained, it is necessary to (1) find out how many instances of that same character appeared before the current instance. Also, once that number i is found, it is important to (2) quickly get the row index of the (i + 1)-th instance of that character in the last column. Step (1) can be done by keeping a cumulative frequency table cf. Since the current character is known, one can find out how many instances of that same character occurred before the current position by using the values from cf and f. Step (2) can be performed in constant time by using a data structure lt of size 256 that stores the character locations in the last column for each character. Hence, lt[c][i] provides the location of the i-th occurrence of character c in the last column. The extra space required to store this information is linear with respect to the length of the input.

3.5.3 Further processing steps of Burrows-Wheeler Compression (BWC)

Since the Burrows-Wheeler Transform (BWT) does not compress data but rearranges it instead, its output is sent to other algorithms for further processing. Burrows and Wheeler [9] proposed the use of the Move-To-Front Transform (MTF) followed by Run-Length Encoding (RLE), which provides some compression. Finally, a Huffman coder was used to compress the file further.

The long runs of characters in the output of the BWT can be thought of as local frequency anomalies. The Move-To-Front Transform (MTF) is another example of a transform that does not compress but rather makes the output of the BWT stage more compressible by converting these local frequency anomalies into a global frequency anomaly. It replaces a symbol a with the number of distinct other symbols that have been encountered since the last occurrence of a. In order to accomplish this, the MTF maintains an array A of the symbols in the alphabet. When processing its input, it looks up the position of the current symbol in A, outputs its array index and moves the character to the front of A. As a result, long character repetitions, which can be expected from a BWT step, are transformed into long sequences of zeros. Consider the string s = "ttaaaacccctttgaaacc", the input to the Move-To-Front Transform (MTF). Assume that the alphabet contains only the four letters present in s. The Move-To-Front Transform (MTF) might initialize A alphabetically as shown in Table 3.8.

Table 3.8: Move-To-Front Transform (MTF) initialization of A

  symbol   a   c   g   t
  index    0   1   2   3

After reading s_0 ('t'), the position for that symbol (3) is sent to the output and the character s_0 is moved to the front of A, resulting in Table 3.9. When the symbol s_1 ('t') is processed, a 0 is sent to the output. Since the element t is already at the beginning of A, no move needs to be performed.

The next symbol s_2 is a, which is currently found at location 1 in A. Hence, a 1 is sent to the output and a is moved to the front of A, which is shown in Table 3.10.

Table 3.9: Move-To-Front Transform (MTF) A after processing the first symbol

  symbol   t   a   c   g
  index    0   1   2   3

Table 3.10: Move-To-Front Transform (MTF) A after processing the third symbol

  symbol   a   t   c   g
  index    0   1   2   3

Repeating this process produces the output "3 0 1 0 0 0 2 0 0 0 2 0 0 3 3 0 0 3 0" (spaces added for clarity), in which the zeros correspond to repetitions of their preceding symbols. While this example uses integers for clarity, the symbols output by the MTF would correspond to ASCII characters with values between 0 and 255. Given the long character repetitions in the output of the BWT step, the MTF would therefore produce output containing a significant amount of small characters. Specifically, the NULL character would be expected to occur often and should occur in long runs. The probabilities of higher symbols in the output of the MTF decrease monotonically, and probabilities of symbols change significantly based on the position of a symbol in the output of the MTF. This is why the immediate application of an entropy coder causes difficulties in the estimation of higher symbols [14] [2]. Balkenhol refers to this phenomenon as "pressure of runs" [6]. Burrows and Wheeler recommend that a Run-Length Encoding (RLE) step be used directly after the MTF. The long runs of characters present in the output of the MTF are replaced with simple instructions that indicate which symbol was repeated and how many times it was repeated.

This step removes long runs of characters and provides some compression. It is often used when Huffman coding is applied afterwards, since Huffman coders have difficulties representing high-probability symbols [3]. One way to implement RLE is to simply output single characters that are not repeated as is, while outputting characters that are repeated as two instances of said character followed by another symbol denoting how many additional instances of that character have been replaced. The output of the MTF as shown above ("3 0 1 0 0 0 2 0 0 0 2 0 0 3 3 0 0 3 0") could therefore be represented as "3 0 1 0 0 1 2 0 0 1 2 0 0 0 3 3 0 0 0 0 3 0", where each count following two identical symbols indicates the number of additional instances of those two identical preceding characters. As this example shows, it is entirely possible that the RLE step produces output that is longer than its input. However, the runs shown in this example were rather short and longer runs would be expected in real data. The entropy coding stage is the last step of the Burrows-Wheeler Compression (BWC) scheme and provides the majority of compression. While Burrows and Wheeler utilized a Huffman coder for this step, an arithmetic coder would be a preferable choice. Both of these techniques have been introduced earlier in this chapter.
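Both post-BWT steps are small enough to sketch directly. The following illustrative C++ (not the reference implementations by Nelson used later in this thesis) reproduces the example above: the MTF of "ttaaaacccctttgaaacc" over the alphabet {a, c, g, t}, followed by the simple two-symbols-plus-count RLE variant just described.

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // Move-To-Front over a small alphabet: emit the symbol's current index in A,
    // then move that symbol to the front of A.
    std::vector<int> mtf(const std::string& s, std::string A) {
        std::vector<int> out;
        for (char c : s) {
            int i = std::find(A.begin(), A.end(), c) - A.begin();
            out.push_back(i);
            A.erase(A.begin() + i);
            A.insert(A.begin(), c);
        }
        return out;
    }

    // Simple RLE: a run of k >= 2 equal symbols is coded as two copies of the
    // symbol followed by the count of additional copies (k - 2).
    std::vector<int> rle(const std::vector<int>& in) {
        std::vector<int> out;
        for (size_t i = 0; i < in.size();) {
            size_t j = i;
            while (j < in.size() && in[j] == in[i]) ++j;   // length of the run
            size_t len = j - i;
            if (len == 1) out.push_back(in[i]);
            else { out.push_back(in[i]); out.push_back(in[i]); out.push_back(int(len - 2)); }
            i = j;
        }
        return out;
    }

    int main() {
        auto m = mtf("ttaaaacccctttgaaacc", "acgt");   // A initialized as in Table 3.8
        for (int v : m) std::cout << v << ' ';         // 3 0 1 0 0 0 2 0 0 0 2 0 0 3 3 0 0 3 0
        std::cout << '\n';
        auto r = rle(m);
        for (int v : r) std::cout << v << ' ';         // 22 symbols: longer than the MTF output
        std::cout << '\n';
    }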

4 Compressing Wikipedia

As previously discussed, the file enwik8 (used throughout this thesis as a benchmark file) contains the first 10^8 bytes of the English Wikipedia dump as of March 3, 2006. Some information contained within it serves a structural purpose. This includes XML and URL-encoded XHTML tags. Furthermore, text, numeric data and additional markup are present. This chapter outlines several attempts at compressing the file enwik8. One approach utilizes a word-based Huffman coder. It treats the file enwik8 as a sequence of alternating words and non-words for which two separate Huffman trees are maintained. Compressing enwik8 in this fashion produces results that are better than those offered by gzip and slightly worse than those obtained by compressing enwik8 with bzip2. A further approach makes use of a slightly modified Burrows-Wheeler Compression (BWC) scheme. It treats the entire input as one block during the Burrows-Wheeler Transform (BWT) stage and replaces the Huffman coder originally suggested by Burrows and Wheeler [9] with an adaptive arithmetic coder of order 0. The results of this approach are rather encouraging: enwik8 can be compressed to a little less than 25 MB in this way. The third approach makes use of Prediction by Partial Matching (PPM). Several different highest orders were tested (this approach uses its highest-order context first in order to make a prediction and falls back to the next-highest order when failing to predict) and compression results compared. PPM fared best with a highest order of 6, resulting in a compressed file size just under 23.8 MB.

4.1 Compression With Word-Based Huffman Coding (WBH)

It was discussed in Chapter 3 how Huffman codes can be constructed for an alphabet containing 5 characters. One can easily expand this to an alphabet of 256 characters or even an alphabet of larger size, which should bring the compression performance of a Huffman coder closer to that of an arithmetic coder [8].

Define a word to be a maximal sequence of alphabetic characters [A-Z, a-z] and a non-word to be a maximal sequence of non-alphabetic characters. With this definition any input file will contain strictly alternating words and non-words. A frequency analysis can be performed by parsing words and non-words and incrementing their frequencies accordingly. Once the frequency analysis is complete, a dictionary containing these words and non-words can be made available to the decoder and two separate Huffman trees can be constructed, one on words and another on non-words. While each of those trees is prefix-free, it is entirely possible, in fact likely, that a code for a word is a prefix of a code for a non-word or vice versa. This does not pose a problem, since words and non-words strictly alternate.

Table 4.1: Compression performance of Word-Based Huffman Coding (WBH) on enwik8. Gzip and bzip2 were used with default parameters.

Table 4.1 shows the results of applying Word-Based Huffman Coding (WBH) to enwik8. The program produces three files: enwik8.wbh is the compressed file using Word-Based Huffman Coding (WBH), enwik8.words is a dictionary containing words and their frequencies, and enwik8.nonwords is a dictionary of non-words with their frequencies. Column two shows the file sizes in bytes, and columns three and four provide file sizes in bytes after running gzip and bzip2 with default parameters on the individual parts.
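The alternating word/non-word parse described above is straightforward to implement; a minimal sketch is given below. The function name countTokens and the use of std::map are illustrative assumptions, not the thesis's implementation; a real compressor would stream the file and use faster hash-based dictionaries.

    #include <cctype>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Split the input into strictly alternating words (maximal runs of [A-Za-z])
    // and non-words (maximal runs of everything else) and count frequencies for
    // the two separate Huffman models.
    void countTokens(const std::string& in,
                     std::map<std::string, int>& words,
                     std::map<std::string, int>& nonwords) {
        size_t i = 0;
        while (i < in.size()) {
            bool alpha = std::isalpha(static_cast<unsigned char>(in[i])) != 0;
            size_t j = i;
            while (j < in.size() &&
                   (std::isalpha(static_cast<unsigned char>(in[j])) != 0) == alpha)
                ++j;
            (alpha ? words : nonwords)[in.substr(i, j - i)]++;
            i = j;
        }
    }

    int main() {
        std::map<std::string, int> words, nonwords;
        countTokens("#REDIRECT [[Accessible computing]]", words, nonwords);
        for (auto& [w, n] : words)    std::cout << "word    '" << w << "' x" << n << '\n';
        for (auto& [w, n] : nonwords) std::cout << "nonword '" << w << "' x" << n << '\n';
    }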

Word-Based Huffman Coding (WBH) in combination with gzip reduces the size of enwik8 to just over 30 MB, while gzip on its own produces a compressed file of roughly 36.6 MB. Furthermore, Huffman coding is well-understood and easy to implement, relatively fast and patent-free. The Huffman coder used for this test was not an adaptive version, but used a two-pass strategy instead. As a result, it had to make the dictionaries of words and non-words available to the decoder. Horspool and Cormack [22] used an adaptive approach in their experiments. While this improved the compression performance slightly, the running time was affected negatively, since with every word and non-word that was read, the Huffman codes needed to be updated. Huffman coding is not the only suitable technique to be used on words. Other word-based compression algorithms have been suggested by Moffat [38] and Yugo et al. [25].

4.2 Compression using Burrows-Wheeler Compression (BWC) with unlimited block size

Chapter 3 outlined the inner workings of Burrows-Wheeler Compression (BWC) and introduced the notion of a block size to reduce running time and memory requirements. Burrows-Wheeler Compression (BWC) is used in slightly modified form in bzip2 [47], which allows a maximum block size of 900 KB. Burrows and Wheeler [9] state that a block size larger than a few million bytes provides little or no added compression while being detrimental to the running time. However, we should expect an increase in block size to provide some improvement with respect to compression performance. In the C++ implementation of the Burrows-Wheeler Transform (BWT) provided in the appendix, the entire input file was, therefore, treated as one block. As discussed in Chapter 3, the BWT has an average running time of O(n log n) with space requirements of 5n bytes in our implementation. Speedier implementations [45] [26] are possible. However, they consume more memory, which affects our ability to process the input in as few blocks as possible.

Therefore, in order to speed up the Burrows-Wheeler Transform (BWT) step, Most Significant Digit (MSD) radix sort [4] was used on the first two characters of each rotation. Hence, rotations beginning with "aa" were inserted into a different bucket than rotations beginning with "ab", and so forth. An insertion of a rotation into its corresponding bucket can be done in constant time when the rotation is created. Since the lexicographical order of strings in different buckets is known, only pairs of strings from the same bucket need to be compared. Furthermore, these buckets can be sorted independently from one another using std::sort in multiple threads that process the buckets in order of their size. We would expect such an approach to utilize multiple threads well. In our implementation, rotations were stored as integers corresponding to their starting positions, and a less-than operator was defined in such a way that it takes the starting location into consideration, allowing for stable sorting. Performing a Burrows-Wheeler Transform (BWT) on enwik8 in this fashion requires less than five minutes on a 32-bit version of the operating system Ubuntu on a machine with four AMD Phenom II 830 processor cores.
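The bucketing idea can be sketched as follows. This is an illustrative simplification, not the code from the appendix: rotations are stored as starting positions, bucketed by their first two bytes, and the buckets are handed to std::sort in a few worker threads (assigned round-robin here rather than by bucket size, and with a plain cyclic comparator instead of the tuned less-than operator described above).

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <thread>
    #include <vector>

    // Rotations are stored as starting positions; compare two rotations of s
    // lexicographically, character by character, using cyclic indexing.
    static bool rotLess(const std::string& s, uint32_t a, uint32_t b) {
        const size_t n = s.size();
        for (size_t k = 0; k < n; ++k) {
            unsigned char ca = s[(a + k) % n], cb = s[(b + k) % n];
            if (ca != cb) return ca < cb;
        }
        return a < b;                      // identical rotations: order by position
    }

    // Bucket rotations by their first two bytes, then sort the buckets
    // independently in a few worker threads.
    std::vector<uint32_t> sortRotations(const std::string& s, unsigned nThreads = 4) {
        const size_t n = s.size();
        std::vector<std::vector<uint32_t>> bucket(256 * 256);
        for (uint32_t i = 0; i < n; ++i) {
            unsigned c0 = (unsigned char)s[i], c1 = (unsigned char)s[(i + 1) % n];
            bucket[c0 * 256 + c1].push_back(i);      // O(1) insertion per rotation
        }
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < nThreads; ++t)
            pool.emplace_back([&, t] {
                for (size_t b = t; b < bucket.size(); b += nThreads)
                    std::sort(bucket[b].begin(), bucket[b].end(),
                              [&](uint32_t x, uint32_t y) { return rotLess(s, x, y); });
            });
        for (auto& th : pool) th.join();
        std::vector<uint32_t> order;                 // concatenating the buckets in
        for (auto& b : bucket)                       // key order yields the full sort
            order.insert(order.end(), b.begin(), b.end());
        return order;
    }

    int main() {
        std::string s = "banana";
        for (uint32_t i : sortRotations(s))
            std::cout << s.substr(i) + s.substr(0, i) << '\n';   // sorted rotations
    }

Because the two-byte bucket keys already respect lexicographic order, no merging step is needed: reading the buckets in key order produces the fully sorted list of rotations.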

For the Move-To-Front Transform (MTF) and Run-Length Encoding (RLE) steps as well as the final Arithmetic Coding (ARI) step, the reference implementations provided by Nelson [40] were used. The Move-To-Front Transform (MTF) was explained in Section 3.5.3. In the Run-Length Encoding (RLE) implementation, two consecutive characters with the same value flag a run. One additional byte is used to provide the number of additional instances of the same character. The arithmetic coder is an order-0 adaptive coder based on Witten et al. [52]. The results of running a Burrows-Wheeler Transform (BWT) with unlimited block size followed by a Move-To-Front Transform (MTF), Run-Length Encoding (RLE) and an Arithmetic Coding (ARI) step were encouraging. Table 4.2 shows an improvement of more than 4 MB over bzip2. Using an unlimited block size allows the BWT to group significantly more characters together, resulting in longer runs, which can be picked up by the Move-To-Front Transform (MTF) and Run-Length Encoding (RLE) steps.

Table 4.2: Compression performance of Burrows-Wheeler Compression (BWC) with unlimited block size on enwik8

For instance, the presence of many instances of "[[" and "]]" causes the BWT to group corner brackets together, as many rotations beginning with "[" also end in a "[". The same is true for the many instances of "{{" and "}}" as well as "==". One would also expect the English portion of enwik8 to produce long runs. Consider that one of the most common bigrams in English is "th". As such, many of the rotations that begin with h end with a t and, as a result, long runs of the character t would be expected. XML is another contributor to good compression. Figure 2.1 shows that Wikipedia pages end with a "</page>" tag. Sorting the rotations during the BWT results in the strings beginning with "/page>" being grouped together. Most of these rotations, if not all of them, should end in the exact same character ('<'), providing the following stages of the BWC scheme with very long runs. The same effect can be observed with other closing XML tags.

4.3 Compression using Prediction by Partial Matching (PPM)

PPM, which was discussed in Chapter 3, attempts to make predictions with respect to upcoming symbols based on the current context and past frequencies it has observed in the same context. For the PPM step a reference implementation provided by Nelson [39] was used. The implementation allows the user to choose the highest order via a command-line option.

The implementation allows the user to choose the highest order via a command-line option. This highest order is used to attempt a prediction of the next symbol. The implementation uses the simple escape strategy discussed above. Additionally, it monitors compression ratios and flushes the model whenever the local compression ratio starts to get notably worse. Specifically, the model divides all of its counts by two and thus gives a higher weight to newer statistics.

The best compression was achieved with a highest order of 6. This value appears to work well in practice [46]. Increasing the order further results in worse compression, as shown in Table 4.3. The highest order is listed in column one. Column two gives the size of the compressed file in bytes. The compression ratio is expressed as bits per character in the last column.

Table 4.3: Compression performance of PPM on enwik8
    highest order | size after PPM | bpc

Prediction by Partial Matching (PPM) uses its highest order model to make a prediction for the next symbol in the input. It may be the case that the current context has not been observed in the past, so that an informed prediction at a specific order may not be possible.

In these cases, the model will cause the arithmetic coder to output an escape symbol, which indicates to the decoder that the next-highest order model was used for the prediction instead. Many such failed predictions result in many additional encodings of the escape symbol (possibly at multiple levels in a row), and hence in worse compression. This effect can be observed when increasing the highest order from 6 to 7 previous symbols.
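To make the escape cascade concrete, the following toy program (a hypothetical illustration of ours, not Nelson's implementation) builds context counts for a short string and shows one symbol being predicted at the highest order while another escapes all the way down to a uniform fallback model.

    // Simplified PPM-style escape cascade: contexts are plain substrings and
    // counts are kept in maps; arithmetic coding is replaced by printed messages.
    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::string history = "the cat sat on the ";
        // counts[context][symbol]: how often `symbol` followed `context` so far
        std::map<std::string, std::map<char, int>> counts;
        for (size_t i = 0; i + 1 < history.size(); ++i)
            for (size_t k = 0; k <= 3 && k <= i + 1; ++k)
                counts[history.substr(i + 1 - k, k)][history[i + 1]]++;

        for (char next : std::string("cm")) {        // try to encode 'c', then 'm'
            std::cout << "encoding '" << next << "': ";
            int order = 3;
            for (; order >= 0; --order) {
                std::string ctx = history.substr(history.size() - order, order);
                auto it = counts.find(ctx);
                if (it != counts.end() && it->second.count(next)) break;
                std::cout << "escape(order " << order << ") ";  // unseen here
            }
            if (order >= 0) std::cout << "predicted at order " << order << "\n";
            else            std::cout << "fall back to a uniform model\n";
        }
    }

In this example, 'c' has been seen after the context "he " and is predicted immediately at the highest order, while 'm' has never been seen in any context and therefore triggers an escape at every level before the uniform fallback is used.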

5 Improving Burrows-Wheeler Compression (BWC)

Burrows and Wheeler [9] proposed to use a limited block size during the Burrows-Wheeler Transform (BWT) stage of their compression scheme. This step was followed by a Move-To-Front Transform (MTF), Run-Length Encoding (RLE) and a Huffman coder. The encouraging results presented in Chapter 4.2 confirm that the block size in the BWT should be chosen as large as possible when compressing Wikipedia or other homogeneous data sets [40] [46]. Additionally, Huffman coding should be replaced with arithmetic coding, as discussed in Chapter 3. In this chapter, several ways to improve on this slightly modified Burrows-Wheeler Compression (BWC) scheme are discussed. The current best BWC scheme improves compression by 9.6% over the approach discussed in Chapter 4.2, with a final size of bytes for enwik8.

5.1 Improving the Burrows-Wheeler Transform (BWT) stage

The literature [14] [3] [11] discusses several improvements to the Burrows-Wheeler Transform (BWT) stage. Some bijective variants of the BWT [21] [30] have been developed. However, the improvements to compression resulting from these variants appear to be generally small and not consistently positive [21]. Furthermore, Radescu [44] tested the performance of the BWT as a function of block size. Ferragina et al. [18] suggest a procedure to find a partitioning that lies within a certain margin of error of the optimal partitioning for the BWT, although it suffers from a time complexity that is unacceptable for larger inputs. This section outlines the potential benefits of using a different lexicographic sort order and reflected order sorting during the sorting process of the BWT stage [11].

Using a different lexicographic order

Often the terms "lexicographic order" and "alphabetic order" are used synonymously. However, alphabetic order is a special case of lexicographic order. The ASCII codes 97 through 122 represent the alphabetic characters a through z and are adjacent and organized in increasing order (just like their counterparts in the English alphabet). This results in a being considered smaller than b, as one would expect when using alphabetic order. Hence, a simple comparison of ASCII values is usually used for sorting strings. If the number of runs generated by the BWT can be minimized and the average run length increased, better compression will result. While Fenwick [16] indicates that anything other than a standard sort prevents recovery of the data in the reversal process, Chapin and Tate [11] suggest a modified lexicographic sort order in the sorting stage. Chapin [10] states that the symbols ? and ! are often found at the end of sentences, are usually followed by a space character or carriage return, and are generally preceded by similar characters. One could argue that for most exclamations there exists a question ending in the same characters before the terminating punctuation mark, so it makes intuitive sense to think of ! and ? as similar. Hence, these characters should be considered close to each other in terms of their sort order.

In order to determine an optimal sort order over a given alphabet, one would need to minimize the sum of distances between symbols. Chapin [10] reduces this problem to the Traveling Salesman Problem (TSP) and applies heuristics suitable to the TSP in order to find good lexicographic orderings. As distance measures he defines a histogram for each symbol of the alphabet and stores in it "counts of the characters immediately preceding each occurrence in the data of the character represented by that histogram." Different measures to calculate the distances between histograms were used. Similarly, Lemire et al. [32] tried to minimize the number of runs by reduction to the TSP.

45 45 "aeioubcdgfhrlsmnpqjktwvxyz" with lower-case characters being kept separate from upper-case character worked well on data from the Calgary corpus [1] while computed sort orders faired worse. In order to test this approach, the lexicographic order in the BWT sorting stage was altered using the alphabetic characters in the order "aeioubcdgfhrlsmnpqjktwvxyz" (for both upper-case and lower-case letters) as suggested by Chapin, while leaving the sort order for all other symbols unchanged. The improvement was minimal (13452 bytes), which may be explained by the fact that enwik8 contains XML and the hand-picked sort order "aeioubcdgfhrlsmnpqjktwvxyz" was intended to be used on text files. The text portions in enwik8 are also often interrupted by further markup such as [ and { and their corresponding ] and as well as other characters not commonly used in English text. A more specific reordering tailored to the structure of enwik8 may improve compression performance further. However, as Chapin showed, an optimal sort order is hard to find and does not necessarily result in significant gain during the compression process. Furthermore, a sort-order optimized for the use on enwik8 may result in worse compression when used on text files or files containing languages other than English Reflected Order Sorting In addition to using a different lexicographic sort order, Chapin and Tate [11] suggest inverting the sort order for alternating character positions in order to put similar strings closer together. Consider a matrix built during the BWT. Now sort the strings first in lexicographic order on column 0, then inverted lexicographic sort order on column 1 and continue to alternate. Chapin and Tate found that this process may result in more homogenous columns. This effect extends all the way to the last column, which is the output of the BWT. In their tests, reflection improved the compression of the BWC on all test files but appears to have an even smaller effect than an improved lexicographic sort

Reflected Order Sorting

In addition to using a different lexicographic sort order, Chapin and Tate [11] suggest inverting the sort order for alternating character positions in order to put similar strings closer together. Consider a matrix built during the BWT. Now sort the strings first in lexicographic order on column 0, then in inverted lexicographic sort order on column 1, and continue to alternate. Chapin and Tate found that this process may result in more homogeneous columns. This effect extends all the way to the last column, which is the output of the BWT. In their tests, reflection improved the compression of the BWC on all test files, but it appears to have an even smaller effect than an improved lexicographic sort order. When using a reflected sort order on enwik8, a small increase in file size was observed.

Bijective Variants

In order to reverse the Burrows-Wheeler Transform (BWT), a row index is needed, which indicates the position of the original string in the sorted matrix. Without this index, any of the rotations could have led to the output of the BWT and it is impossible to identify the original string. Kufleitner [30] discusses bijective variants of the BWT that eliminate the need for the row index. Some bijective variants appear to produce slightly improved compression results. This is in part due to the fact that the row index in a BWT needs to be output for every single block on which it is performed. For small blocks, this additional information needs to be provided relatively often. When using an unlimited block size, the additional output is a single integer followed by a space character, and the effect of this additional output should be small. Gil and Scott later provided a description of their algorithm [21]. Their tests performed on files of the Calgary corpus result in rather small improvements on most files.

5.2 Improving the Global Structure Transform stage

The Move-To-Front Transform (MTF) was suggested for use in conjunction with the Burrows-Wheeler Transform (BWT) by Burrows and Wheeler [9]. It attempts to convert the local frequency anomalies typically present in the output of the BWT into a global frequency anomaly, which can be exploited better by means of an entropy coder [2]. The MTF is, however, not the only transform suitable for processing the output of the BWT stage. Two alternatives are discussed below.

Move-One-From-Front Transform (M1FF)

The M1FF [17] is a variant of the Move-To-Front Transform (MTF). The only difference between the two lies in how the new position of a symbol in A is determined (commonly referred to as the list update problem). In the M1FF, a symbol is moved to the front of A only if it is already at position 1 or 0. Otherwise, it is moved to position 1. Consider the run "aaaaaaaaaabaaaaaaaaaa" and assume that currently the characters a and b are at locations 0 and 6 of A, respectively. A Move-To-Front Transform (MTF) would transform this to "0 0 0 0 0 0 0 0 0 0 6 1 0 0 0 0 0 0 0 0 0". Notice the presence of the value 1 in this output. The run of 20 instances of a was interrupted by a single b. When the character b was moved to the front of A, the character a was moved to position 1; as a result, it needs to be moved back in the next step. The idea of the Move-One-From-Front Transform (M1FF) is to avoid removing a character from the front of A if it is part of a long run that is interrupted by only one character. In the small example "aaaaaaaaaabaaaaaaaaaa", the Move-One-From-Front Transform (M1FF) output would be "0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0". In experiments on enwik8, this approach performed slightly better than the Move-To-Front Transform (MTF). The gain in compression is roughly 300 KB.
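The following sketch shows one way to implement the M1FF update rule on a byte alphabet; the linear-scan list is used for clarity and is our own simplification, not the implementation used in the experiments.

    // Move-One-From-Front: symbols already at position 0 or 1 are promoted to
    // the front; all other symbols only move to position 1, so a long run is
    // not broken up by a single intervening symbol.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    std::vector<uint8_t> m1ff_encode(const std::vector<uint8_t>& input) {
        std::vector<uint8_t> table(256);
        for (int i = 0; i < 256; ++i) table[i] = (uint8_t)i;   // initial list A
        std::vector<uint8_t> output;
        output.reserve(input.size());
        for (uint8_t symbol : input) {
            size_t pos = std::find(table.begin(), table.end(), symbol) - table.begin();
            output.push_back((uint8_t)pos);
            table.erase(table.begin() + pos);
            if (pos <= 1)
                table.insert(table.begin(), symbol);       // promote to the front
            else
                table.insert(table.begin() + 1, symbol);   // only move to position 1
        }
        return output;
    }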

Move-Fraction Transform (MF)

The MF [10] is another Global Structure Transform (GST), in which the new location for a symbol is determined by dividing its current position by a constant d. Instead of immediately moving an encountered symbol close to the front of A, the symbol is only allowed to move up after it has proven its "worthiness". Using this technique, it is less likely that frequent symbols are removed from the front of A by symbols that occur less frequently. However, if such a (so far) less frequent symbol is encountered many times in a short sequence of the input, the table will update fairly quickly (depending on d) and the symbol will be moved close to the front of A. In tests on the output from the BWT stage, this approach performed better than the MTF and M1FF and fared best with d = 2. It provides savings of bytes (an improvement of 2.42% over the MTF) with a final file size of bytes.
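A minimal sketch of the MF update rule is given below; the linear-scan table and the rounding of p / d toward zero are our own simplifications.

    // Move-Fraction transform: a symbol found at position p is moved to
    // position p / d rather than to the front of the list.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    std::vector<uint8_t> mf_encode(const std::vector<uint8_t>& input, double d) {
        std::vector<uint8_t> table(256);
        for (int i = 0; i < 256; ++i) table[i] = (uint8_t)i;   // initial list A
        std::vector<uint8_t> output;
        output.reserve(input.size());
        for (uint8_t symbol : input) {
            size_t pos = std::find(table.begin(), table.end(), symbol) - table.begin();
            output.push_back((uint8_t)pos);
            size_t new_pos = (size_t)(pos / d);      // move only part of the way up
            table.erase(table.begin() + pos);
            table.insert(table.begin() + new_pos, symbol);
        }
        return output;
    }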

5.3 Improving Run-Length Encoding (RLE)

RLE is the second-to-last step of the Burrows-Wheeler Compression (BWC) scheme. It receives the output of the preceding Global Structure Transform (GST) as its input. One would expect the input to the RLE stage to be predominantly NULL-characters. Other runs are rare, although they can occur. Instead of replacing all runs of characters, one could replace only runs of NULL-characters. When encountering such a run, one could output two NULL-characters followed by another character indicating the number of additional NULL-characters to follow. Since a lot of long NULL-character runs are expected from the MF stage, the output of this RLE0 variant would contain many pairs of NULL-characters, each followed by a large character. Using this approach, compression was again improved. Using a Burrows-Wheeler Transform (BWT) with unlimited block size in conjunction with a Move-Fraction Transform (MF) and a NULL-Run-Length Encoding (RLE0) step, the size of the compressed file was reduced to bytes.

If one considers how a long run of, say, 1000 NULL-characters is encoded by RLE0 (as pairs of NULL-characters, each followed by a count byte that is usually very large), another possible improvement becomes obvious. Instead of using a large character to indicate a long run, it might be beneficial to use a small character instead. Specifically, one could use 255 - l to indicate the run length, where l is the length of a run. For simplicity, this approach will be referred to as RLE0-255. Using RLE0-255, the example above is encoded with small count bytes instead, which appears to be more compressible. For short runs of NULL-characters such as "0 0", however, the output using RLE0 is "0 0 0", whereas RLE0-255 emits a large count byte instead. This might result in worse compression if the runs of NULL-characters tend to be short. Using RLE0-255 indeed performed significantly worse than RLE0, with a final size of bytes, which indicates that a significant number of NULL-character runs are indeed short.
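The sketch below illustrates the RLE0 idea; the chunking of very long runs, the handling of a lone NULL, and the exact definition of the inverted count in the RLE0-255 variant are assumptions of this illustration rather than details taken from the implementation used here.

    // RLE0: runs of NULL bytes become two NULLs plus one count byte giving the
    // number of additional NULLs (capped at 255 per group). Setting `invert`
    // emits 255 - count instead, i.e. the RLE0-255 variant.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    std::vector<uint8_t> rle0_encode(const std::vector<uint8_t>& in, bool invert) {
        std::vector<uint8_t> out;
        for (size_t i = 0; i < in.size();) {
            if (in[i] != 0) { out.push_back(in[i++]); continue; }
            size_t run = 0;                          // measure the NULL run
            while (i + run < in.size() && in[i + run] == 0) ++run;
            i += run;
            while (run >= 2) {                       // emit "0 0 <count>" groups
                uint8_t extra = (uint8_t)std::min<size_t>(run - 2, 255);
                out.push_back(0);
                out.push_back(0);
                out.push_back(invert ? (uint8_t)(255 - extra) : extra);
                run -= 2 + extra;
            }
            if (run == 1) out.push_back(0);          // a lone NULL is copied as-is
        }
        return out;
    }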

5.4 Improving the entropy coding stage

In Chapter 4.2, an adaptive order-0 arithmetic coder was used during the entropy coding stage. The Burrows-Wheeler Transform (BWT), Move-Fraction Transform (MF) and NULL-Run-Length Encoding (RLE0) already remove a lot of redundancy and leave little additional context, so that an order-0 arithmetic coder can compress the resulting file well. Some context is, however, still present in the input to the arithmetic coder. Ideally, one would see many long runs of NULL-characters in the input to the RLE0 step, which would result in its output containing many instances of NULL-character pairs followed by a single large character (often indicating a run length of 255). These pairs of NULL-characters and the often very large character that is expected to follow them should be encoded more efficiently using Prediction by Partial Matching (PPM) with an order of 2. The use of Nelson's PPM implementation [39] described in Chapter 4.3 results in further improvements. Using the Burrows-Wheeler Transform (BWT) with unlimited block size followed by a Move-Fraction Transform (MF) with d = 2 and NULL-Run-Length Encoding (RLE0) with an order-2 PPM step compressed enwik8 to just bytes. Surprisingly, better compression performance can be observed when using RLE0-255 instead of RLE0. With it, enwik8 is compressed to bytes.

5.5 Eliminating NULL-Run-Length Encoding (RLE0) and further adjustments to the Move-Fraction Transform (MF)

The use of Prediction by Partial Matching (PPM) should prove useful when compressing the output of the Move-Fraction Transform (MF) even when no NULL-Run-Length Encoding (RLE0) is used before it. RLE0 shortens long runs of NULL-characters, but it also introduces additional characters for the lengths of the runs, even when runs are only two characters long. Using PPM, it should be possible to strongly predict these long runs instead. Indeed, using this approach compression was improved slightly, resulting in a final size of just bytes. Furthermore, a change of the constant in the Move-Fraction Transform (MF) to d = 1.3 provided additional compression, resulting in a final compressed size of bytes for enwik8. Compression outcomes with different values for d are listed in Table 5.1.

5.6 Repositioning individual steps and other post-BWT stages

Abel [2] suggested that an RLE step be placed in front of the Move-To-Front Transform (MTF) stage to lessen the pressure that long runs put on it. Using this approach in combination with a Move-Fraction Transform (MF) had a negative impact on the compression of enwik8. Several other post-BWT stages, such as the Distance Coding algorithm by Binder [15] and the Inversion Frequencies algorithm by Arnavut and Magliveras [5], have been proposed. Both are replacements of the MTF stage and are based on distances between occurrences of the same symbol. The compression rates of these approaches were tested by Abel [2] and found to be similar to those realized with the MTF on the Calgary Corpus and Canterbury Corpus. The Global Structure Transform (GST) stage could also switch between known approaches, such as the MF, MTF and M1FF. One such technique was analyzed by Chapin [10] and achieves slightly better compression.

5.7 Results

The results of the different approaches examined in this chapter are summarized in Table 5.1. The first column shows how the different stages are applied. Column two shows the resulting file size in bytes.

The last column lists compression ratios expressed in bits per character. In approach 1, the output of the Burrows-Wheeler Transform (BWT) was piped to a Move-To-Front Transform (MTF), which was followed by a Run-Length Encoding (RLE) stage. An order-0 arithmetic coding step was applied last. In approach 2, the MTF was replaced with a Move-Fraction Transform (MF) with d = 2, resulting in better compression. Approach 3 shows that using Prediction by Partial Matching (PPM) with a highest order of 2 provides roughly the same amount of improvement as the use of a Move-Fraction Transform (MF) with d = 2 as used in approach 2. Approach 4 combines the MF with d = 2 and NULL-Run-Length Encoding (RLE0) with an order-0 arithmetic coder. In approach 5, the order-0 arithmetic coding step of approach 4 was replaced with PPM with a highest order of 2, which results in further compression gains. In approaches 6 through 11, the RLE0 step of approach 5 was omitted. While a constant of d = 2 during the Move-Fraction Transform (MF) was found to perform best earlier in this chapter, a constant of d = 1.3 improves performance when the NULL-Run-Length Encoding (RLE0) step is omitted. Approaches 6 through 11 show the compression outcomes with different values for d. It is neither obvious beforehand which constant should be chosen during the MF, nor is it clear why a smaller constant improves compression when RLE0 is omitted. Together, these improvements result in an additional 9.6% compression gain over BWT MTF RLE ARI(0) with little effort. The results show that the use of an RLE0 step is no longer beneficial when PPM with a highest order of 2 is used in the final stage of the BWC scheme. We will refer to this improved approach as Improved Burrows-Wheeler Compression (IBWC) throughout the rest of this thesis.

The use of an unlimited block size during the BWT stage results in long runs of NULL-characters after the MF stage when used on homogeneous files [40] such as enwik8, other XML files, log files or English text.

It should be noted that the performance of PPM with a highest order of 2 as the final stage in this compression scheme depends on the existence of long runs of NULL-characters. When compressing smaller files, long runs may not be present and a different approach may provide better results.

Table 5.1: Performance of Burrows-Wheeler Compression (BWC) with different settings for the individual stages
    Approach | file size | bpc
    0: uncompressed | |
    1: BWT MTF RLE ARI(0) | |
    2: BWT MF(2) RLE ARI(0) | |
    3: BWT MTF RLE PPM(2) | |
    4: BWT MF(2) RLE0 ARI(0) | |
    5: BWT MF(2) RLE0 PPM(2) | |
    6: BWT MF(2.5) PPM(2) | |
    7: BWT MF(2) PPM(2) | |
    8: BWT MF(1.5) PPM(2) | |
    9: BWT MF(1.4) PPM(2) | |
    10: BWT MF(1.2) PPM(2) | |
    11: BWT MF(1.3) PPM(2) | |

6 Preprocessing

Chapter 4 presented the results of compressing enwik8 by means of Burrows-Wheeler Compression (BWC) and Prediction by Partial Matching (PPM). Chapter 5 showed how better compression outcomes can be realized by using an Improved Burrows-Wheeler Compression (IBWC) scheme, which utilizes a Burrows-Wheeler Transform (BWT) with unlimited block size followed by a Move-Fraction Transform (MF) and Prediction by Partial Matching (PPM) of order 2. This chapter explores several techniques that can be used to preprocess enwik8 in ways that may make it more amenable to compression using IBWC. Reasons for using precompression steps include:

1. Providing a compressor with a heavily skewed character distribution
2. Providing some compression in order to speed up a slower compressor
3. Introducing artificial context to improve compression
4. Reducing memory requirements in the compression stage
5. A combination of 1-4

One example of preprocessing is the Run-Length Encoding (RLE) step used in bzip2, as outlined earlier. It is used in an attempt to protect the BWT step that follows it from performing slowly when the input contains long runs of characters, while at the same time compressing the input slightly. Generally, however, its use should be avoided [14] [2]. The MF and BWT in the IBWC scheme can be thought of as preprocessing steps as well. Neither of them compresses data, but both attempt to make the data more compressible for an entropy coder.

Several preprocessing techniques have been developed over the years. These can be used in conjunction with most compressors, although some have been developed specifically with Burrows-Wheeler Compression (BWC), and bzip2 in particular, in mind.

The following sections discuss a few of them.

The first preprocessor attempts to separate the structural information present in enwik8 from its content, while maintaining different containers for different types of content. Using this approach we hope to eliminate any adverse effect that the frequent context switches present in enwik8 might have on compression performance. While this approach works well when used in conjunction with bzip2 or gzip, the Improved Burrows-Wheeler Compression (IBWC) scheme did not profit from such a preprocessing step. Furthermore, three dictionary-based preprocessors (the Star Transform (ST) [28], the Shortened-Context Length-Preserving Transform (SCLPT) [19] and the Word Replacing Transform (WRT) [50]) were tested in conjunction with bzip2, gzip and IBWC. The Star Transform (ST) replaces many alphabetic characters in its input with a single character in order to provide the Burrows-Wheeler Transform (BWT) stage with a heavily skewed character distribution, which should result in longer, more frequent runs. This approach worked well with bzip2 and gzip, but failed to improve the performance of our IBWC scheme. Similar results were observed for the SCLPT and WRT. While both allowed bzip2 and gzip to perform better, worse results were observed when using them in conjunction with IBWC.

6.1 Compression by separating structure from content (Splitting)

The file enwik8 contains chunks of XML (each containing short text or numbers) interspersed with long non-XML portions in the <text></text> nodes. A sliding-window compressor such as the commonly used Linux utility gzip would populate its sliding window with XML initially and become good at compressing XML until a large non-XML portion is encountered. At this point the sliding window would contain mostly XML, and compression performance for the following non-XML would suffer.

Similarly, encountering XML after having filled its sliding window with non-XML would cause gzip to perform worse at compressing XML until enough XML has been pushed into the sliding window. With every context switch between XML and non-XML, gzip would need to relearn the current context in order to compress well.

Approaches that can take advantage of the existence of structured information, such as the XML compressor XMill [33] [34], are examples of compressors that use knowledge about certain characteristics of the input in order to achieve improved compression. These human-tuned approaches are not used for general-purpose compression. This does not make such special-purpose approaches less valuable, however. Often some knowledge about the input is available and should be used to facilitate better compression outcomes. For instance, it may be desirable to compress server logs or database backups on a daily basis. In those cases, patterns that occur often in these files should be taken advantage of.

As discussed in Chapter 5, the Burrows-Wheeler Transform (BWT) performs best on large, homogeneous files. Separating structure from content should provide the BWT with more homogeneous input and should improve compression outcomes. Hence, it makes intuitive sense to separate structure from content. We saw in Chapter 2 that several character combinations in the <text></text> nodes serve a structural purpose. Text is often surrounded by "&lt;" and "&gt;" (the escaped angle brackets), as in "&lt;/ref&gt;". Furthermore, text is often enclosed by "[[" and "]]" as well as "{{" and "}}". These brackets serve a structural purpose. Triple-equal and double-equal signs, used to surround headers, and text surrounded by two instances of the string "&quot;" are fairly common as well. Removing this distant structural information from the textual contents may allow for further improvements to compression. As part of this approach the file enwik8 was split into twelve files, as shown in Figure 6.1.

enwik8.alpha, enwik8.ltgt, enwik8.text, enwik8.doublecorner, enwik8.numeric, enwik8.timestamp, enwik8.doublecurly, enwik8.quot, enwik8.tripleequal, enwik8.doubleequal, enwik8.redirect, enwik8.xml
Figure 6.1: Files produced by separating structure from content (Splitting)

A small lex program was written to output the XML markup only to enwik8.xml. The contents of the <title></title>, <username></username>, and <comment></comment> nodes were placed in enwik8.alpha, and the numeric entries contained in the <id></id> nodes were placed in enwik8.numeric. Timestamps all follow the same syntax and are given in the format "YYYY-MM-DDThh:mm:ssZ", where "YYYY" is a placeholder for the four digits of the year, "MM" for the two digits of the month and "DD" for the two digits of the day. Furthermore, the digits for hours, minutes and seconds are denoted by "hh", "mm" and "ss", respectively. A typical timestamp, such as " T14:25:16Z", is hence 20 characters in length. However, the characters -, :, T and Z have structural meaning and can be removed without loss of information. A shortened timestamp can therefore be expressed with just 14 characters. Additionally, all timestamps have the exact same length, which makes the use of a separator in enwik8.timestamp unnecessary. While these shortened timestamps preserve most of the natural character of the input, expressing them as a distance measured in seconds from the smallest timestamp rather than as an absolute value is likely to provide further improvements.
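A minimal sketch of this timestamp shortening is shown below; the timestamp in the usage comment is an arbitrary illustrative value, not one taken from enwik8.

    // Strip the structural characters '-', ':', 'T' and 'Z' from
    // "YYYY-MM-DDThh:mm:ssZ", leaving a fixed-length string of 14 digits
    // that needs no separator in enwik8.timestamp.
    #include <iostream>
    #include <string>

    std::string shorten_timestamp(const std::string& ts) {
        std::string digits;
        for (char c : ts)
            if (c >= '0' && c <= '9') digits += c;   // keep only the digits
        return digits;                                // always 14 characters long
    }

    int main() {
        std::cout << shorten_timestamp("2002-02-25T14:25:16Z") << "\n";  // "20020225142516"
    }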

Text preceded by "#REDIRECT [[" and followed by "]]" is replaced with a single character not present in enwik8, and the content contained within the corner brackets is placed in enwik8.redirect. Remaining text enclosed by "[[" and "]]" is also replaced with a single character and its content is written to enwik8.doublecorner. The same approach is used for double-curly brackets and triple-equal signs as well as double-equal signs. Text of the form "<" followed by a short sequence of text followed by ">" is also replaced with a single character, and the contents in between are written to enwik8.ltgt. An analogous approach is used for strings surrounded by two instances of the literal "&quot;". The main idea behind this approach is to replace tokens such as "&quot;Four Percent Hurdle&quot;_" with a more generic string "x_", where x is a placeholder for the actual token. When encountering an instance of the literal "&quot;", one would expect another instance of this literal to follow closely. Neither PPM nor the BWT can take advantage of these structural constructs if they are too far away from one another. Folding an entire quoted literal into a single character should allow the BWT to take advantage of this structural information. Specifically, this means that the character _ is now preceded by a quotation instead of by a variety of different characters. While the contents of a string surrounded by two instances of "&quot;" are just moved to another file (deferring the issue of compressing them), both instances of the token "&quot;" are removed, which results in some compression.

Figure 6.2 shows the contents of enwik8.xml after stripping out the content. What is left is clean XML. Placeholders in the individual XML tags allow for the reversal of the splitting process. The resulting files were compressed separately, which allows for the possibility of using different algorithms for each file. Table 6.1 lists the individual files that were created during the splitting process. The file size is given in column two, while columns three and four list the sizes of these files compressed with gzip and bzip2. The compression outcomes of using IBWC and PPM are listed in columns five and six, respectively. In Chapter 4, Prediction by Partial Matching (PPM) was found to perform best with a highest order of 6, which was used in this test as well.

<page>
  <title>a</title>
  <id>n</id>
  <revision>
    <id>n</id>
    <timestamp>t</timestamp>
    <contributor>
      <username>a</username>
      <id>n</id>
    </contributor>
    <minor />
    <comment>a</comment>
    <text xml:space="preserve">t</text>
  </revision>
</page>
Figure 6.2: Example page entry after removing all non-XML content from Figure 2.1

Table 6.1: Compressor performance on enwik8 split into 12 files. Best results are displayed in bold font.
    file | size (bytes) | gzip | bzip2 | IBWC | PPM-6
    enwik8.alpha | | | | |
    enwik8.doublecorner | | | | |
    enwik8.doublecurly | | | | |
    enwik8.doubleequal | | | | |
    enwik8.ltgt | | | | |
    enwik8.numeric | | | | |
    enwik8.quot | | | | |
    enwik8.redirect | | | | |
    enwik8.text | | | | |
    enwik8.timestamp | | | | |
    enwik8.tripleequal | | | | |
    enwik8.xml | | | | |
    enwik8 | | | | |

Table 6.1 shows that the sum of the individual parts is now smaller than enwik8 itself, due to the removal of some redundant information in the splitting process. For instance, timestamps and "#REDIRECT [[" statements were shortened. Compression overall was improved for gzip and bzip2. This is in part due to the removal of some structural information, but also a result of the more homogeneous nature of the individual files. Furthermore, bzip2 still performs well on the smaller files, which it can treat as one block or only a few blocks.

Interestingly, compression suffered a little when using IBWC with this splitting approach. Small gains were observed when using PPM with a highest order of 6. While this highest order was found to give the best compression on enwik8 as a whole, different highest orders could be used for the individual files produced by the splitting process. In another test that used different highest orders for different files during the PPM stage, it was found that on most files a highest order of 6 performed best or was very close to the best highest order. Picking the best highest order for each file improves compression only a little ( bytes).

6.2 Rabin-Karp Compression (RKC) as a Precompression Step

Ziv and Lempel [54] [55] described an algorithm that replaces a repeated string with a reference to an earlier occurrence of said string. Today, this approach is used in the Linux utility gzip. It uses a fixed-size sliding window in which a history of recently encountered symbols is stored, and it replaces subsequent repetitions of strings with references to earlier occurrences (if such occurrences are present in the sliding window). Due to the limited size of the sliding window, strings that are repeated but are far apart from one another may be overlooked. Hence, a technique which finds long repetitions of strings that potentially occur at a great distance from one another in the input could be used as a precompression step for gzip.

Given a string s of length |s| and a pattern p of length |p|, a naive algorithm to find all instances of p in s would need to consider all starting positions i for which s_i = p_0 and then compare up to |p| characters to determine whether a match has been found at location i. The worst-case time complexity of this approach is O(|p| |s|). Since the goal is to identify long string repetitions, this approach may not be feasible. In 1987, Rabin and Karp [27] proposed a string searching algorithm that utilizes fingerprints to find a given pattern in text.

While exhibiting the same worst-case running time as the naive approach, its running time is linear on most inputs of interest. Bentley and McIlroy [7] proposed to use Rabin and Karp's algorithm for data compression. In order to accomplish this task, a block size b > 0 (generally quite large) is introduced. For every non-overlapping block of b consecutive characters a fingerprint is stored. For an input string s, the fingerprints that are stored correspond to the strings s_0...s_{b-1}, s_b...s_{2b-1}, s_{2b}...s_{3b-1} and so forth. Hence, in a file of length n this method stores approximately n/b fingerprints. Given a block size b, an alphabet of size a, as well as a string of length b beginning at location i, a fingerprint h_i is defined as

    h_i = \sum_{j=0}^{b-1} s_{i+j} \, a^{b-1-j}   (6.1)

For large block sizes and large alphabets, h_i quickly becomes too large to be stored in a 32-bit integer. Modulo operations with a large prime number p, chosen such that the intermediate products still fit into a 32-bit integer, are used to prevent overflow. While Equation (6.1) allows for the computation of a hash value for a string of length b at any location, it takes O(b) time to compute. However, consider the computation of h_{i+1} given h_i:

    h_{i+1} = a \left( h_i - s_i \, a^{b-1} \right) + s_{i+b}   (6.2)

Equation (6.2) shows that computing the hash value for the string beginning at location i+1 is trivial if the hash value of the string beginning at location i is known. If one precomputes a^{b-1} (again using modulo operations to prevent overflow), updating hash values becomes a constant-time operation. With a rolling hash function as shown in Equation (6.2), hash values for any position in the input can be computed efficiently.
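The following sketch shows the rolling fingerprint of Equations (6.1) and (6.2) in code; the modulus and base are illustrative choices of ours (64-bit arithmetic is used, so the 32-bit overflow concern discussed above does not arise here), not the values used in the experiments.

    #include <cstdint>
    #include <string>

    const uint64_t P = 1000000007ULL;   // large prime modulus (assumed value)
    const uint64_t A = 256;             // alphabet size used as the base

    // Fingerprint of the b characters starting at position i (Equation 6.1),
    // computed in Horner form.
    uint64_t fingerprint(const std::string& s, size_t i, size_t b) {
        uint64_t h = 0;
        for (size_t j = 0; j < b; ++j)
            h = (h * A + (unsigned char)s[i + j]) % P;
        return h;
    }

    // Slide the window one position to the right (Equation 6.2): remove s[i],
    // append s[i+b]. `a_pow` must hold A^(b-1) mod P, precomputed once.
    uint64_t roll(uint64_t h, const std::string& s, size_t i, size_t b, uint64_t a_pow) {
        h = (h + P - ((unsigned char)s[i] * a_pow) % P) % P;  // subtract s[i]*A^(b-1)
        return (h * A + (unsigned char)s[i + b]) % P;         // shift and add s[i+b]
    }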

For concreteness, let us assume that part of the input has already been consumed and that position i = 253 is currently being investigated. Furthermore, assume a block size of b = 100. A set f of fingerprints is stored for the locations 0, 100 and 200, which are multiples of b. Assume that h_253 is not contained in f. Since two identical strings must have the same hash value, it can be concluded that the string of length 100 starting at location 253 has not previously been encountered at positions that are multiples of 100. There is no need for any character comparisons at this point, and the next position in the input can be investigated after having compared just two integers. Now assume that h_254 is contained in f, specifically that h_254 = h_100. Since different strings may hash to identical values, a character-by-character comparison is now necessary to ensure that a match has indeed been found. If it is determined that the strings of length b at locations 254 and 100 are identical, the second occurrence can be replaced with a reference to the first. Bentley and McIlroy further propose to greedily extend the matched string backward and forward as far as possible. In case multiple matches are found, the longest match should be used as the basis for the replacement.

Table 6.2 shows the results of using Rabin-Karp Compression (RKC) as a precompressor to gzip on enwik8. Column one shows the block size b, and column two gives the running time in seconds. The number of replacements is given in column three. Columns four and five give the size of the compressed output after RKC and after RKC followed by gzip, respectively. The last column lists the improvement the use of RKC delivers in combination with gzip over gzip alone. Replacements of strings that were found within 2048 characters were avoided altogether, since it was assumed that those presumably short repetitions would be handled more efficiently by gzip.

It can be observed that the running time increases as the block size decreases. This is not surprising, as with a decreased block size an increased number of hash collisions can be expected. Hash collisions occur when two different strings hash to the same value. With smaller block sizes this is increasingly often the case, and strings need to be compared on a character-by-character basis more often.

Table 6.2: Performance of Rabin-Karp Compression (RKC) followed by gzip when avoiding replacements found within 2048 characters
    b | running time | replacements | size(RK) | size(RK gzip) | improvement (%)

Compression performance was fairly poor. Bentley and McIlroy point out that if one were to concatenate a long piece of text with itself, Rabin-Karp Compression (RKC) would work well, as it would find that long repetition, which would go unnoticed by gzip. Unfortunately, enwik8 appears not to contain long repetitions that warrant the use of this approach. Furthermore, compression performance strongly depends on the choice of the block size. Small block sizes tend to cause this algorithm to replace short, close repetitions that gzip would handle more efficiently, while large block sizes tend to provide very little added compression. Since the optimal value of b is not known in advance, a large value needs to be chosen in order to avoid detrimental effects on gzip's performance. Nonetheless, the use of Rabin-Karp Compression (RKC) with a large block size as a precompressor to gzip can catch long, distant repetitions and may be used successfully on large collections of files. Bentley and McIlroy give the example of a collection of books, each beginning with an identical legal statement.

6.3 Star Transform (ST)

The ST [28] is designed to provide a compressor with a heavily skewed character distribution. Consider the following set of four-letter words: {help, hint, hunt, hard}. Furthermore, assume that our input contains no other four-letter words.

Using the character * as a special symbol, this set of words can be spelled as {***p, *i**, *u**, **r*}. The encoding of a word w using the special symbol * is called its signature (w_s). The length of a word is equal to the length of its signature. Given a dictionary containing the mappings of signatures to regular words, it is possible to reverse this transform. As a result of the transformation, the frequency of a single symbol increases substantially, which should prove beneficial to Burrows-Wheeler Compression (BWC), since the Burrows-Wheeler Transform (BWT) stage should produce output with longer runs of this symbol. The procedure is outlined by Franceschini et al. [19] as follows:

1. Construct dictionary D
2. Partition D into disjoint dictionaries D_i such that each dictionary D_i contains only words of length i
3. Sort each dictionary D_i by word frequency count
4. Apply the mapping to generate the encodings for all words in D_i for all i

Replacing words with their signatures results in the symbol * being the predominant character in the output of this transform. If a word is not present in the dictionary, it is simply output to the transformed text unaltered. Since the reversal process needs to be informed about the mappings between signatures and words, a dictionary needs to be made available to it. This storage overhead may or may not be offset by the improved compression. In order to minimize the impact of the dictionary, it should be compressed as well by sending it along to the next step of the compression scheme. When using a static dictionary, this approach becomes language-specific. One could argue that one of the most commonly used languages in the world today is English and that, therefore, an English dictionary should be used in order to improve compression of most files.
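A minimal sketch of the word-by-word encoding step is shown below; it assumes a precomputed word-to-signature dictionary and a whitespace-separated input, both simplifications of ours rather than details of the reference implementation.

    // Star-encode a text: dictionary words are replaced by their signatures,
    // all other words are output unaltered.
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <unordered_map>

    std::string star_encode(const std::string& text,
                            const std::unordered_map<std::string, std::string>& dict) {
        std::istringstream in(text);
        std::string word, out;
        while (in >> word) {
            auto it = dict.find(word);
            out += (it != dict.end() ? it->second : word);
            out += ' ';
        }
        return out;
    }

    int main() {
        // the four-letter example from the text: {help, hint, hunt, hard}
        std::unordered_map<std::string, std::string> dict = {
            {"help", "***p"}, {"hint", "*i**"}, {"hunt", "*u**"}, {"hard", "**r*"}};
        std::cout << star_encode("hard hunt for help", dict) << "\n";  // "**r* *u** for ***p"
    }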

If the dictionary does not contain many matches with the text that is to be compressed, one could simply output a flag indicating to the star decoder that no star encoding was performed. In order to be able to use this approach on a larger variety of files, it is possible to create an impromptu dictionary based on the actual input. Such a dictionary may, however, be very large, and its use may not be justified by the gain in compression. An impromptu dictionary could be improved by eliminating long words that do not occur very often, or by using some other metric to remove words that add comparatively few *-characters to the output of the star transform. For example, let w be a word, w_s the signature of w, and w_f the number of occurrences of w in the input. Additionally, let the number of *-characters present in w_s be denoted by |w_s|_*. One could attempt to increase the number of *-characters added to the output while keeping the dictionary size manageable by including only words for which the product |w_s|_* · w_f is above a certain threshold.

As mentioned, the ST is used to create an abundance of *-characters in its output. When used in conjunction with bzip2, Franceschini et al. found that the performance increase was rather small. This was explained by the fact that bzip2 uses a Run-Length Encoding (RLE) step before applying the Burrows-Wheeler Transform (BWT). During this Run-Length Encoding (RLE) step, some of the stars that were introduced during the ST are removed. Since these *-characters are supposed to provide the BWT with a skewed character distribution in the hope of improving compression, a replacement of this character is undesirable.

Nelson's reference implementation of the Star Transform (ST) [41] was modified to use the character with ASCII code 31 instead of the character *, as it is not present in the files enwik8 and enwik9. The use of the ST improves the performance of gzip by 6.9%, while its use in conjunction with bzip2 resulted in a slight file size increase. When including an RLE step before applying bzip2, a small improvement of 2.2% was observed.

Unfortunately, the use of the ST resulted in a slightly larger compressed size when it was used as a precompression step to Improved Burrows-Wheeler Compression (IBWC), with a final size of bytes. The inclusion of an RLE step immediately after the ST resulted in a compressed file with a size of bytes.

6.4 Shortened-Context Length-Preserving Transform (SCLPT)

The SCLPT [19] is considered to be an improvement over the Length-Preserving Transform (LPT) [19], which in turn improves upon the ST. In the SCLPT, the character * is kept as the starting character of an encoded word in order to allow the Burrows-Wheeler stage to strongly predict the space character typically preceding it. If the character * is only used at the beginning of words, rotations starting with a * should all end with a space character. Three alphabetic characters are used at the end of an encoded word as an encoding of a dictionary entry. The last character cycles through [A-Z], the second-to-last character through [a-z] and the third-to-last character through [z-a], thus allowing a fixed number of words to be encoded for each word length. For words longer than four characters an additional character is inserted between the star and the three-character dictionary encoding. This character is taken from the string "abcdefghijklmnopqrstuvwxyz" and is the fifth character of the suffix with the same length as the word that is to be encoded. Hence, the most frequent word of length 10 would be encoded as "*uzaa". The strong local context provided by the padding character as well as by the *-character at the beginning of encoded words should prove beneficial to the Burrows-Wheeler Transform (BWT). Furthermore, this approach compresses the input slightly, which has the added benefit of speeding up the BWT stage. Based on the tests conducted by Franceschini et al. [19], the SCLPT performs slightly better than the LPT and better than the Star Transform (ST).

Using the SCLPT on enwik8 resulted in an improvement of 5.9% when used with gzip. More improvement was seen when bzip2 was used as the compressor; the resulting file was 7.1% smaller.

The use of the SCLPT as a precompression step to Improved Burrows-Wheeler Compression (IBWC) again resulted in a larger compressed file size of bytes, while using order-6 PPM resulted in an increase of the compressed file size to bytes. For the SCLPT, the code provided in the appendix was used.

6.5 Word Replacing Transform (WRT)

The WRT [50] is another transform which can be applied to the input in order to make it more compressible by means of known compression algorithms. It works on XML files but can also be applied to text files. The WRT makes use of special containers for dates, times (such as 11:30pm), numbers commonly used to express years (ranging from 1900 to 2155), URLs, addresses, and special XHTML-encoded strings such as "ü". The user can influence parameters such as the maximum dictionary size, minimum word length and minimum word frequency, which are used during the dictionary creation process. Additionally, the WRT contains several other mechanisms intended to improve compression. One of these techniques is called capital conversion and is based on the fact that words such as "compression" and "Compression" are essentially identical, but are not recognized by PPM as such. The idea behind capital conversion is to replace an upper-case character at the beginning of a word with its lower-case counterpart and to denote that change with an additional flag placed in front of the converted character.

The WRT with default parameters improved compression significantly when used in conjunction with gzip, resulting in a compressed size of just bytes (an improvement of 26.3% over gzip alone). When used in conjunction with bzip2, compression improved by 10.5%, with a final size of bytes. However, compression performance suffered significantly when the WRT was used in conjunction with Improved Burrows-Wheeler Compression (IBWC), with a final size of over 28 MB. Even when using the parameters outlined in [36], compression stood at just over 27 MB.
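The capital conversion idea described above can be sketched as follows; the flag byte chosen here is a hypothetical marker of ours, not the value actually used by the WRT.

    // Lower the initial capital of a word and mark the change with a flag byte,
    // so that "Compression" and "compression" map to the same dictionary entry.
    #include <cctype>
    #include <string>

    const char CAP_FLAG = 0x01;   // assumed marker byte for this sketch

    std::string capital_convert(const std::string& word) {
        if (!word.empty() && std::isupper((unsigned char)word[0]))
            return std::string(1, CAP_FLAG)
                   + (char)std::tolower((unsigned char)word[0])
                   + word.substr(1);
        return word;
    }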

Some improvement is observed with Prediction by Partial Matching (PPM). Using the WRT followed by an order-6 PPM compressor, enwik8 is compressed to bytes.

6.6 Results

In this chapter several precompression steps were tested. Splitting enwik8 into several files was intended to provide the compression program with more uniform input while also exploiting regularities such as "[[" being followed by "]]". In another attempt, Rabin-Karp Compression (RKC) was used in order to remove long string repetitions before applying another compression program. While the Star Transform (ST) was intended to provide the compressor with a skewed character distribution, the main goal of the Shortened-Context Length-Preserving Transform (SCLPT) and the Word Replacing Transform (WRT) was to precompress the input. All of these steps improved compression when they were used in conjunction with either gzip or bzip2. However, no improvements were observed when the Improved Burrows-Wheeler Compression (IBWC) scheme was used as the main compressor.

The SCLPT and WRT improve compression when used with gzip or bzip2 by providing some compression of their own. In fact, the WRT can compress enwik8 to 44.5 MB, mostly by replacing long words with shorter code words, which helps both compressors. This means that enwik8 is roughly 2.25 times as large as the output of the WRT, which has implications for the performance of both gzip and bzip2. Gzip uses a fixed-size sliding window in which it stores the last n encountered symbols. If a string in the unprocessed input has previously been seen within this window, it is replaced with an instruction of the form <x, y>, where x is the offset and y the length of the match. This instruction can be interpreted as "move back x characters and from there copy y characters." The value of x can be as large as n and requires log_2(n) bits to encode. An increase in the size of the sliding window would therefore require x to be encoded using more bits.

With the input already precompressed by the WRT, it is presented to gzip in a denser fashion, which means that gzip can now store more information in the same sliding window and is therefore able to find more distant matches than before. Bzip2 profits from such a precompression step for similar reasons: the information it receives from the WRT is denser, so a single block in bzip2 can now hold more than twice the information it could before. Consider that compressing enwik8 with bzip2 uses 113 blocks of size 900 kB, while compressing the output of the WRT requires only 50 blocks of the same size. The effective block size has thus been increased to more than two million characters.

A BWT with unlimited block size does not profit from this effect, since it already treats enwik8 as one block. The only way it could profit from a precompression step would be if the precompressor were to replace some of the context present in enwik8 with stronger, artificial context. It would appear that the artificial context introduced by the SCLPT and WRT is weaker than the actual context present in enwik8. While this weaker context hurts compression performance when the transforms are used with IBWC, for bzip2 the loss is outweighed by the increased effective block size.

7 Experimental Results on enwik9 using Improved Burrows-Wheeler Compression (IBWC)

The IBWC scheme was discussed in Chapter 5. It achieves its best compression of enwik8 with a Burrows-Wheeler Transform (BWT) with unlimited block size followed by an MF stage with d = 1.3, while omitting the RLE and RLE0 stages entirely. For the entropy coder, the PPM implementation by Nelson [39] was used with an order of 2. Since enwik9 is too large to be treated as one block during the Burrows-Wheeler Transform (BWT) stage, due to memory requirements that cannot be met on a 32-bit machine, we use several approaches to circumvent this issue. In approach 1, we divide enwik9 into ten blocks of equal size and compress those independently from one another. Approach 2 treats enwik9 as three blocks of equal size, which are compressed independently. In approach 3, we use splitting as discussed in Chapter 6.1 as a preprocessing step.

The procedure for approaches 1 and 2 is outlined as follows, with the value for approach 2 given in parentheses:

1. Divide enwik9 into ten (three) blocks and perform the following steps on each segment
2. Use a BWT with unlimited block size
3. Run an MF on the output with a parameter of d = 1.3
4. Apply order-2 PPM

Tables 7.1 and 7.2 show the results of approaches 1 and 2. The running time increased from 6055 to 6171 seconds when the number of blocks was reduced. This corresponds to an increase of 1.9%.

Table 7.1: Compression performance of Improved Burrows-Wheeler Compression (IBWC) on enwik9 treated as ten blocks.
    block | file size | compressed size | bpc

Table 7.2: Compression performance of Improved Burrows-Wheeler Compression (IBWC) on enwik9 treated as three blocks.
    block | file size | compressed size | bpc

The compression performance, however, improved by 5.8%. Ideally, the largest possible block size should be chosen when using the BWT. This is often not possible for large files, but some of the preprocessing techniques discussed in Chapter 6 might be used in order to shrink enwik9 to a more manageable size. All of the preprocessing steps that were investigated resulted in worse compression outcomes. The splitting approach, however, resulted in the smallest increase in file size. When splitting enwik9, it can be observed that the largest of the 12 resulting files (enwik9.text) is smaller than 700 MB and can thus be treated as two blocks rather than three.

The procedure for approach 3 is as follows:

1. Split enwik9 into 12 files, divide enwik9.text into two segments, and perform the following steps on each of the 13 files
2. Use a BWT with unlimited block size
3. Run an MF on the output with a parameter of d = 1.3
4. Apply order-2 PPM

Table 7.3 shows the results of this approach on enwik9. Column one lists the files produced by splitting; enwik9.text was further divided into two segments of equal size. The file size (in bytes) is given in column two. Compression outcomes (in bytes) are shown for IBWC, gzip and bzip2 in columns three, four and five, respectively. The last column lists the time (in seconds) spent during the BWT stage of the IBWC scheme. For gzip and bzip2 the file enwik9.text was not divided into two blocks, as there is no need to do so. One can see that this approach performs approximately 25% better than bzip2 and 41% better than gzip on the split files. Furthermore, as shown by [36], bzip2 compresses the untouched enwik9 to bytes. Preprocessing by splitting improves its performance by 4.5%.

Table 7.3: Compression performance of splitting with Improved Burrows-Wheeler Compression (IBWC) on enwik9. Best performance is given in bold font.
    produced file | file size | IBWC | gzip | bzip2 | t_BWT
    enwik9.alpha | | | | |
    enwik9.doublecorner | | | | |
    enwik9.doublecurly | | | | |
    enwik9.doubleequal | | | | |
    enwik9.ltgt | | | | |
    enwik9.numeric | | | | |
    enwik9.quot | | | | |
    enwik9.redirect | | | | |
    enwik9.text (segment 1) | | | | |
    enwik9.text (segment 2) | | | | |
    enwik9.timestamp | | | | |
    enwik9.tripleequal | | | | |
    enwik9.xml | | | | |
    total | | | | |

Splitting enwik9 in this fashion made one shortcoming of our implementation of the Burrows-Wheeler Transform (BWT) very obvious. The resulting file enwik9.xml, with a file size of 71 MB, took very long to compress. In fact, most of the time spent in the Burrows-Wheeler Transform (BWT) stages was invested in enwik9.xml. Given its worst-case running time of O(n^2 log n) and the long repetitions found in enwik9.xml in the form of repeated empty XML entries, it is no surprise that this approach resulted in extended running times.

8 Experimental Results on additional benchmark files

The previous chapters of this thesis focused on the compression of enwik8 and enwik9. In this chapter, the Improved Burrows-Wheeler Compression (IBWC) scheme is tested on the two files c_elegans and gut97. The file c_elegans contains the genome of the model organism Caenorhabditis elegans [37] and is made up almost entirely of the characters a, c, g and t. It is made available for download at WormBase [53]. The file gut97 contains 97 of the 100 most downloaded e-books, as of May 20, 2013, from Project Gutenberg [43]. These 97 books were chosen because they were available as plain text files; they are concatenated with each other in order of popularity.

Table 8.1 shows the compression outcomes of the various compressors tested throughout this thesis. Gzip and bzip2 were used with default parameters. Prediction by Partial Matching (PPM) was run with a highest order of six, which is the optimal highest order for both files. BWC is the compressor developed in Chapter 4.2, which utilizes a Burrows-Wheeler Transform (BWT) with unlimited block size followed by a Move-To-Front Transform (MTF) and Run-Length Encoding (RLE) stage before applying order-0 adaptive arithmetic coding. IBWC is the approach outlined in Chapter 5 and makes use of the same BWT; this step is followed by a Move-Fraction Transform (MF), and as a last step PPM with a highest order of two is applied. All results are given in bytes, with the best result in bold font.

Table 8.1: Compression results on the files gut97 and c_elegans
    file | file size | gzip | bzip2 | PPM-6 | BWC | IBWC
    gut97 | | | | | |
    c_elegans | | | | | |

Of the compressors tested on the file gut97, the Improved Burrows-Wheeler Compression (IBWC) scheme performs best. However, this is not the case for the file c_elegans, which is best compressed with PPM-6. Since a BWT is expected to generate long runs of characters, which can be transformed into long runs of NULL-characters by means of an MTF, we would expect some compression when combining these two steps with an RLE step. This approach is used in the compressor BWC listed in column six. However, after applying RLE we observe file expansion instead, which indicates that many runs are exactly two characters long and explains the relatively poor performance of BWC. This effect, which is not generally observed when compressing English text, occurs in c_elegans due to the seemingly more random nature of the input.

9 Summary of experimental results

The compression results of several approaches tested throughout this thesis on the file enwik8 are summarized in Table 9.1. Results for gzip and bzip2 are provided for convenience. For details on the individual approaches, the relevant chapters are listed where applicable.

Table 9.1: Summary of approaches tested on enwik8
    Chapter | Approach | compressed size (bytes) | bpc
    | 1: gzip | |
    | 2: Word-based Huffman Coding + gzip | |
    | 3: bzip2 | |
    | 4: BWC with unlimited block size | |
    | 5: PPM-6 | |
    | 6: Splitting + PPM-opt | |
    | 7: ST + RLE + IBWC | |
    | 8: WRT + PPM-6 | |
    | 9: SCLPT + IBWC | |
    | 10: Splitting + IBWC | |
    | 11: IBWC | |

Approach 2 used static Word-Based Huffman Coding (WBH); its output was processed further with gzip. Approach 4 utilized a Burrows-Wheeler Transform (BWT) with unlimited block size and an adaptive order-0 arithmetic coding step.

Approach 5 used Prediction by Partial Matching (PPM) with a highest order of 6, while approach 6 used PPM with an optimal highest order for each of the files obtained from the splitting process. Further approaches make use of the Star Transform (ST), Shortened-Context Length-Preserving Transform (SCLPT) and Word Replacing Transform (WRT) in conjunction with previously discussed approaches. Improved Burrows-Wheeler Compression (IBWC) was used in approaches 9 through 11.

Enwik9 was compressed in several different ways in Chapter 7. The results are summarized in Table 9.2.

Table 9.2: Summary of approaches tested on enwik9
    Chapter | Approach | compressed size (bytes) | bpc
    | 1: gzip | |
    | 2: Splitting + gzip | |
    | 3: bzip2 | |
    | 4: Splitting + bzip2 | |
    | 5: IBWC with 10 blocks | |
    | 6: IBWC with 3 blocks | |
    | 7: Splitting + IBWC | |

Approach 2 made use of the splitting technique developed in Chapter 6.1. It produced 12 files containing more homogeneous data and compressed those individually with gzip. Approach 4 used the same splitting step but utilized bzip2 as the main compressor. Approach 5 divided enwik9 into 10 blocks and compressed those individually using the Improved Burrows-Wheeler Compression (IBWC) scheme developed in Chapter 5, while approach 6 did the same with three blocks. Approach 7 used splitting first and then compressed the resulting files individually with IBWC. The file enwik9.text, which was obtained from the splitting process, was treated as two blocks.

10 Conclusions

The results of the experiments conducted in this thesis confirm that when compressing large, homogeneous files using Burrows-Wheeler Compression (BWC), the largest possible block size should generally be chosen during the Burrows-Wheeler Transform (BWT) stage. Furthermore, Run-Length Encoding (RLE) and even NULL-Run-Length Encoding (RLE0) may be omitted when Prediction by Partial Matching (PPM) is used instead of arithmetic coding as the last step in BWC.

In this thesis an Improved Burrows-Wheeler Compression (IBWC) scheme is developed. It utilizes a multi-threaded BWT, the Move-Fraction Transform (MF) as well as PPM. This approach combines reasonable running times with moderate memory requirements and achieves better compression than PPM in standard form alone. In addition, decompression is faster than compression, which is generally not the case when using PPM. BWT-based compressors are particularly powerful when used on files with frequent, long string repetitions.

Further improvements to BWC are possible, but it is not obvious how BWC can be improved significantly. In the IBWC scheme, PPM with a highest order of 2 was used. When predictions at that level failed, an order-1 context was considered before falling back to an order-0 model. Instead of this 2-1-0 scheme, a different escape chain, for example a 2-0 scheme that skips the order-1 model, could provide better results (a sketch of these escape chains is given at the end of this chapter).

Several preprocessing steps were tested throughout this thesis using ad-hoc dictionary creation based on the contents of the input. While those preprocessors improved compression of enwik8 when used with gzip and bzip2, they resulted in worse compression outcomes when used in conjunction with our IBWC scheme. Preprocessing often results in already compressed input to the BWT stage, which allows bzip2 to use a larger effective block size. IBWC cannot benefit from this effect since it already treats

its input as one block. It would, therefore, be desirable to develop precompression schemes that focus on replacing context in their input with stronger artificial context in order to facilitate better compression. It is not immediately clear how this can be accomplished.
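To illustrate the escape-chain alternatives mentioned above, the following sketch shows the difference between a 2-1-0 chain and a 2-0 chain: the model tries the longest context first and, on escape, either steps down one order at a time or jumps directly to the order-0 model. This is an illustration only, not the PPM implementation used in this thesis; escape probabilities, exclusions and the arithmetic coder itself are omitted.

// A sketch of the context-fallback (escape) order in a PPM-style model; an
// illustration of the 2-1-0 versus 2-0 escape chains only, not the PPM
// implementation used in this thesis.
#include <iostream>
#include <map>
#include <string>
#include <vector>

// counts[k] maps a context (the preceding k characters) to the frequencies of
// the symbols seen after that context.
typedef std::map<std::string, std::map<char, int> > ContextCounts;

// Returns the first order in 'orders' at which 'symbol' has been seen after
// 'history'; -1 means that even the order-0 model offers no prediction.
int predictingOrder(const std::vector<ContextCounts>& counts,
                    const std::string& history, char symbol,
                    const std::vector<int>& orders){
    for (size_t i = 0; i < orders.size(); ++i){
        int k = orders[i];
        if ((size_t)k > history.size()) continue;      // not enough history yet
        std::string ctx = history.substr(history.size() - k, k);
        ContextCounts::const_iterator it = counts[k].find(ctx);
        if (it != counts[k].end() && it->second.count(symbol))
            return k;                                  // predicted at this order
        // otherwise an escape symbol would be coded and the next order tried
    }
    return -1;
}

int main(){
    std::vector<ContextCounts> counts(3);              // models of order 0, 1 and 2
    counts[2]["th"]['e'] = 5;                          // toy statistics
    counts[1]["h"]['e']  = 7;
    counts[0][""]['e']   = 9;

    int a[] = {2, 1, 0};                               // 2-1-0: step down one order at a time
    int b[] = {2, 0};                                  // 2-0: escape directly to order 0
    std::vector<int> chain210(a, a + 3), chain20(b, b + 2);

    std::cout << predictingOrder(counts, "xh", 'e', chain210) << "\n"; // 1 (order-1 hit)
    std::cout << predictingOrder(counts, "xh", 'e', chain20)  << "\n"; // 0 (order 1 skipped)
    return 0;
}

Skipping the order-1 model avoids coding an extra escape at that level, at the price of ignoring whatever predictive power the order-1 statistics carry; which trade-off wins depends on the data.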

References

[1] The Canterbury Corpus. Retrieved April 1, 2013 from
[2] Jürgen Abel. Post BWT stages of the Burrows-Wheeler compression algorithm. Software: Practice and Experience, 40(9): ,
[3] Donald Adjeroh, Timothy Bell, and Amar Mukherjee. The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching. Springer, 1st edition, July
[4] Arne Andersson and Stefan Nilsson. A new efficient radix sort. IEEE Symposium on Foundations of Computer Science, pages ,
[5] Ziya Arnavut and Spyros S. Magliveras. Block sorting and compression. In Proceedings of the IEEE Data Compression Conference, pages ,
[6] Bernhard Balkenhol and Yuri M. Shtarkov. One attempt of a compression algorithm using the BWT. Preprint, SFB343: Discrete Structures in Mathematics, Faculty of Mathematics, Univ. of Bielefeld, Germany.
[7] Jon Bentley and Douglas McIlroy. Data compression using long common strings. In Proc. IEEE Data Compression Conference, pages . IEEE Computer Society,
[8] Eric Bodden, Malte Clasen, and Joachim Kneis. Arithmetic coding revealed - a guided tour from theory to praxis. Technical report, SABLE-TR, Sable Research Group, School of Computer Science, McGill University,
[9] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. SRC Research Report, 124,
[10] Brenton Chapin. Higher compression from the Burrows-Wheeler Transform with new algorithms for the list update problem. PhD thesis, University of North Texas,
[11] Brenton Chapin and Stephen R. Tate. Higher compression from the Burrows-Wheeler Transform by modified sorting. In Data Compression Conference, page 532. IEEE Computer Society,
[12] John G. Cleary and Ian H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32: ,
[13] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.
[14] Sebastian Deorowicz. Improvements to Burrows-Wheeler compression algorithm. Software: Practice and Experience, 30(13): ,
[15] Sebastian Deorowicz. Second step algorithms in the Burrows-Wheeler compression algorithm. Software: Practice and Experience, 32(2):99-111,
[16] Peter Fenwick. Block sorting text compression - Final report. Technical report, University of Auckland, Department of Computer Science,
[17] Peter Fenwick. Burrows-Wheeler Compression: Principles and Reflections. Theoretical Computer Science, 387(3): , November
[18] Paolo Ferragina, Igor Nitto, and Rossano Venturini. On optimally partitioning a text to improve its compression. Algorithmica, 61(1):51-74,
[19] R. Franceschini, H. Kruse, N. Zhang, R. Iqbal, and A. Mukherjee. Lossless, reversible transformations that improve text compression ratios,
[20] J. Gailly and M. Adler. A freely available compression utility. Retrieved September 10, 2012 from
[21] Joseph Yossi Gil and David Allen Scott. A bijective string sorting transform. Computing Research Repository (CoRR), abs/ ,
[22] Nigel Horspool and Gordon Cormack. Constructing word-based text compression algorithms. In Proceedings of the IEEE Data Compression Conference, pages . IEEE Computer Society Press,
[23] David A. Huffman. A method for the construction of minimum-redundancy codes. In Proceedings of the Institute of Radio Engineers, volume 40, pages , September
[24] Marcus Hutter. The Hutter Prize. Retrieved August 28, 2012 from
[25] R. Yugo Kartono Isal, Alistair Moffat, and Alwin C. H. Ngai. Enhanced word-based block-sorting text compression,
[26] Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Simple linear work suffix array construction. International Colloquium on Automata, Languages and Programming (ICALP), 2719: ,
[27] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2): ,
[28] Holger Kruse and Amar Mukherjee. Preprocessing text to improve compression ratios. In Proceedings of the IEEE Data Compression Conference, page 556. IEEE Computer Society,
[29] Holger Kruse and Amar Mukherjee. Improving text compression ratios with the Burrows-Wheeler Transform. In Proceedings of the IEEE Data Compression Conference, DCC 99, page 536, Washington, DC, USA. IEEE Computer Society.
[30] Manfred Kufleitner. On bijective variants of the Burrows-Wheeler Transform. Computing Research Repository (CoRR), abs/ ,
[31] Glen G. Langdon. Arithmetic coding. IBM Journal of Research and Development, 23: ,
[32] Daniel Lemire, Owen Kaser, and Eduardo Gutarra. Reordering rows for better compression: Beyond the lexicographic order. ACM Transactions on Database Systems, 37(3):20,
[33] Hartmut Liefke and Dan Suciu. An extensible compressor for XML data. ACM Special Interest Group on Management of Data (SIGMOD), 29(1):57-62, March
[34] Hartmut Liefke and Dan Suciu. XMill: an efficient compressor for XML data. ACM Special Interest Group on Management of Data (SIGMOD), 29(2): , May
[35] Matt Mahoney. Data compression programs. Retrieved September 21, 2012 from
[36] Matt Mahoney. Large data compression benchmark. Retrieved September 10, 2012 from
[37] Maria Markaki and Nektarios Tavernarakis. Modeling human diseases in Caenorhabditis elegans. Biotechnology Journal, 5(12): ,
[38] A. Moffat. Word-based text compression. Software: Practice and Experience, 19(2): , February
[39] Mark Nelson. Arithmetic coding + statistical modeling = data compression. Retrieved September 10, 2012 from arithmetic-coding-statistical-modeling-data-compression/.
[40] Mark Nelson. Data compression with the Burrows-Wheeler Transform. Retrieved September 10, 2012 from
[41] Mark Nelson. Star-Encoding in C++. Retrieved September 10, 2012 from
[42] Mark Nelson. The Data Compression Book. Henry Holt and Co., Inc., New York, NY, USA,
[43] Project Gutenberg. Free ebooks - Project Gutenberg. Retrieved May 21, 2013 from
[44] R. Radescu. Transform methods used in lossless compression of text files. Romanian Journal of Information Science and Technology, 12(1): ,
[45] Kunihiko Sadakane. A fast algorithm for making suffix arrays and for Burrows-Wheeler transformation. In Proceedings of the IEEE Data Compression Conference, Snowbird, Utah, March 30 - April 1, pages . IEEE Computer Society Press,
[46] David Salomon. Data Compression: The Complete Reference. Springer-Verlag New York, Inc., Secaucus, NJ, USA,
[47] Julian Seward. A freely available, high quality data compressor. Retrieved September 10, 2012 from
[48] C. E. Shannon. Prediction and entropy of printed English. Bell Systems Technical Journal, pages 50-64,
[49] Przemyslaw Skibiński. A high-performance XML compressor. Retrieved September 10, 2012 from
[50] Przemyslaw Skibiński, Szymon Grabowski, and Sebastian Deorowicz. Revisiting dictionary-based compression: Research articles, December
[51] Wikipedia, the free encyclopedia. Retrieved March 02, 2013 from
[52] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Commun. ACM, 30(6): , June
[53] WormBase. The Biology and Genome of C. elegans. Retrieved May 21, 2013 from
[54] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3): ,
[55] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5): ,

84 Appendix: Code 84 A.1 An example implementation of BWT # include <iostream > # include <vector > # include <set > # include <algorithm > # include <fstream > # include <time.h> # include <string > # include <boost/ thread.hpp > # include <boost/ bind.hpp > using namespace std; /* A permutation is simply a copy of the static member wiki with a different starting location */ class Permutation{ public: // static members are available to comparator static vector < unsigned char > wiki; // original input // We can store different lexicographic orderings here static vector < vector < unsigned char > > orderings; int location; // starting location of the permutation in wiki bool operator < ( const Permutation & p2) const{ // used so std:: sort does not compare a permutation to itself - strict weak ordering if( location == p2. location) return false; int i = 0; int j = 0; int k = 0; unsigned char * pc1; unsigned char * pc2; if ( location > p2. location){ i = Permutation:: wiki. size() - location; j = Permutation:: wiki. size() - p2. location; k = p2. location; if ( p2. location > location){ i = Permutation:: wiki. size() - p2. location; j = Permutation:: wiki. size() - location; k = location; pc1 = & Permutation:: wiki[ location]; pc2 = & Permutation:: wiki[ p2. location];

85 85 int c = 0; while ( i--) { if ( *pc1 < *pc2 ) return true; else if ( * pc1 ++ > * pc2 ++ ) return false; if ( location > p2. location) pc1 = & Permutation:: wiki[0]; if ( location < p2. location) pc2 = & Permutation:: wiki[0]; while ( j--) { if ( *pc1 < *pc2 ) return true; else if ( * pc1 ++ > * pc2 ++ ) return false; if ( location > p2. location) pc2 = & Permutation:: wiki[0]; if ( location < p2. location) pc1 = & Permutation:: wiki[0]; while ( k--) { if ( *pc1 < *pc2 ) return true; else if ( * pc1 ++ > * pc2 ++ ) return false; return location < p2. location; // This code can be used for alternative lexicographic orderings /* int locthis, locthat; for ( int i = 0; i < Permutation:: wiki. size(); ++i){ // The strings are in the correct buckets. We don t need to compare the first two characters locthis = location + i + 2; locthat = p2. location + i + 2; // correct running off the end if ( locthis >= Permutation:: wiki. size()) locthis -= Permutation:: wiki.size(); if ( locthat >= Permutation:: wiki. size()) locthat -= Permutation:: wiki.size(); if ( Permutation:: orderings [0][ wiki[ locthis]] < Permutation:: orderings [0][ wiki[ locthat]]){ return true;

86 86 if ( Permutation:: orderings [0][ wiki[ locthis]] > Permutation:: orderings [0][ wiki[ locthat]]){ return false; */ ; vector < unsigned char > Permutation:: wiki; vector < vector < unsigned char > > Permutation:: orderings; // our own sorting function for ease of use namespace{ void mysort( vector < Permutation > & dp){ sort(dp.begin(), dp.end()); // class used to count bucket size class FrequencyEntry{ public: unsigned char c1, c2; int freq; bool operator < ( const FrequencyEntry & e2) const{ if ( freq > e2. freq){ return true; return false; ; int main( int argc, char * argv[]){ time_t starttime, stoptime; starttime = time( NULL); vector < vector < int > > freq; vector < vector < vector < Permutation > > > p; Permutation:: orderings. resize(1); Permutation:: orderings [0]. resize (256); for ( int i = 0; i < 256; ++i){ Permutation:: orderings [0][ i] = i; Permutation:: orderings [0][ A ] = 65; Permutation:: orderings [0][ B ] = 70; Permutation:: orderings [0][ C ] = 71; Permutation:: orderings [0][ D ] = 72; Permutation:: orderings [0][ E ] = 66; Permutation:: orderings [0][ F ] = 74; Permutation:: orderings [0][ G ] = 73;

87 87 Permutation:: orderings [0][ H ] = 75; Permutation:: orderings [0][ I ] = 67; Permutation:: orderings [0][ J ] = 83; Permutation:: orderings [0][ K ] = 84; Permutation:: orderings [0][ L ] = 77; Permutation:: orderings [0][ M ] = 79; Permutation:: orderings [0][ N ] = 80; Permutation:: orderings [0][ O ] = 68; Permutation:: orderings [0][ P ] = 81; Permutation:: orderings [0][ Q ] = 82; Permutation:: orderings [0][ R ] = 76; Permutation:: orderings [0][ S ] = 78; Permutation:: orderings [0][ T ] = 85; Permutation:: orderings [0][ U ] = 69; Permutation:: orderings [0][ V ] = 87; Permutation:: orderings [0][ W ] = 86; Permutation:: orderings [0][ X ] = 88; Permutation:: orderings [0][ Y ] = 89; Permutation:: orderings [0][ Z ] = 90; Permutation:: orderings [0][ a ] = Permutation:: orderings [0][ A ] +32; Permutation:: orderings [0][ b ] = Permutation:: orderings [0][ B ] +32; Permutation:: orderings [0][ c ] = Permutation:: orderings [0][ C ] +32; Permutation:: orderings [0][ d ] = Permutation:: orderings [0][ D ] +32; Permutation:: orderings [0][ e ] = Permutation:: orderings [0][ E ] +32; Permutation:: orderings [0][ f ] = Permutation:: orderings [0][ F ] +32; Permutation:: orderings [0][ g ] = Permutation:: orderings [0][ G ] +32; Permutation:: orderings [0][ h ] = Permutation:: orderings [0][ H ] +32; Permutation:: orderings [0][ i ] = Permutation:: orderings [0][ I ] +32; Permutation:: orderings [0][ j ] = Permutation:: orderings [0][ J ] +32; Permutation:: orderings [0][ k ] = Permutation:: orderings [0][ K ] +32; Permutation:: orderings [0][ l ] = Permutation:: orderings [0][ L ] +32; Permutation:: orderings [0][ m ] = Permutation:: orderings [0][ M ] +32; Permutation:: orderings [0][ n ] = Permutation:: orderings [0][ N ] +32; Permutation:: orderings [0][ o ] = Permutation:: orderings [0][ O ] +32; Permutation:: orderings [0][ p ] = Permutation:: orderings [0][ P ] +32; Permutation:: orderings [0][ q ] = Permutation:: orderings [0][ Q ] +32; Permutation:: orderings [0][ r ] = Permutation:: orderings [0][ R ] +32; Permutation:: orderings [0][ s ] = Permutation:: orderings [0][ S ] +32; Permutation:: orderings [0][ t ] = Permutation:: orderings [0][ T ] +32; Permutation:: orderings [0][ u ] = Permutation:: orderings [0][ U ] +32; Permutation:: orderings [0][ v ] = Permutation:: orderings [0][ V ] +32; Permutation:: orderings [0][ w ] = Permutation:: orderings [0][ W ] +32; Permutation:: orderings [0][ x ] = Permutation:: orderings [0][ X ] +32; Permutation:: orderings [0][ y ] = Permutation:: orderings [0][ Y ] +32; Permutation:: orderings [0][ z ] = Permutation:: orderings [0][ Z ] +32; Permutation::orderings [0][? ] = 34; Permutation::orderings [0][ " ] = 63; Permutation::orderings [0][. ] = 35; Permutation::orderings [0][ # ] = 46; Permutation::orderings [0][, ] = 36; Permutation:: orderings [0][ $ ] = 44; Permutation::orderings [0][ { ] = 92; Permutation::orderings [0][ \\ ] = 124; Permutation::orderings [0][ ] = 94;

88 88 Permutation:: orderings [0][ ^ ] = 125; Permutation permut; unsigned char j,k; freq. resize (256); p.resize (256); for ( int i = 0; i < 256; ++i){ p[i]. resize (256); freq[i]. resize (256); char letter; unsigned char lastletter = ; FrequencyEntry fe; vector < FrequencyEntry > feq; while (!cin.eof()){ cin. get( letter); if (!cin.fail()){ ++ freq[ lastletter][( unsigned char) letter]; lastletter = ( unsigned char) letter; Permutation:: wiki. push_back(( unsigned char) letter); cerr << " Done reading input\n"; cerr << " File size: " << Permutation:: wiki. size() << " characters\n"; for ( int i = 0; i < 256; ++i){ for ( int j = 0; j < 256; ++j){ fe. c1 = i; fe. c2 = j; fe. freq = freq[i][ j]; feq. push_back( fe); stable_sort(feq.begin(), feq.end()); for ( int i = 0; i < Permutation:: wiki. size(); ++i){ permut. location = i; if (i % == 0){ cerr << i << \r ; j = Permutation:: wiki[i]; if (i + 1 >= Permutation:: wiki. size()){ k = Permutation:: wiki[0]; else{ k = Permutation:: wiki[i + 1];

89 89 p[j][ k]. push_back( permut); int numthreads = 4; int currentbucketcount = 0; cerr << endl; for ( int i = numthreads - 1; i < feq. size(); i += numthreads){ boost:: thread_group tg; for ( int t = numthreads - 1; t >= 0; --t){ if (p[( int) feq[i-t]. c1][( int) feq[i-t]. c2]. size() > 1){ tg. create_thread( boost:: bind( mysort, boost:: ref(p[( int) feq[i-t]. c1][(int)feq[i-t].c2]))); cerr << \r ; cerr << " Sorted buckets: " << currentbucketcount << " / " << 256*256; tg.join_all(); currentbucketcount +=4; // find the original string in the matrix. It needs to be output first vector < Permutation >:: iterator iter; int count = 0; int originallocation; int correctbucket0 = ( int) Permutation:: wiki[0]; int correctbucket1 = ( int) Permutation:: wiki[1]; for ( iter = p[ correctbucket0][ correctbucket1]. begin(); iter!= p[ correctbucket0][ correctbucket1]. end(); ++ iter){ if ( iter -> location == 0){ originallocation = count; break; ++ count; for ( int i1 = 0; i1 < Permutation:: wiki[0]; ++i1){ for ( int i2 = 0; i2 < 256; ++i2){ originallocation += p[ i1][ i2]. size(); for ( int i2 = 0; i2 < Permutation:: wiki[1]; ++i2){ originallocation += p[ Permutation:: wiki[0]][ i2]. size(); // done getting original string location cerr << "\ noriginal string is at location " << originallocation << endl; cout << originallocation << " "; // output the BWT for ( int i = 0; i < 256; ++i){ for ( int j = 0; j < 256; ++j){

90 90 for (iter = p[i][j].begin(); iter!= p[i][j].end(); ++iter){ if ( iter -> location == 0){ cout << Permutation:: wiki[ Permutation:: wiki. size() - 1]; else{ cout << Permutation:: wiki[ iter -> location - 1]; stoptime = time( NULL); cerr << "\ nrunning time: " << ( double) stoptime - ( double) starttime << endl << endl; A.2 An example implementation of UNBWT # include <iostream > # include <string > # include <set > # include <vector > # include <algorithm > # include <map > # include <deque > using namespace std; void readinput( istream& is, int& originalstringlocation, vector < unsigned char >& lastcolumn, vector <int > & charfreq, vector < vector < int > > & lastcollocationlookup); int main( int argc, char * argv[]){ time_t starttime, stoptime; starttime = time( NULL); int originalstringlocation; vector < unsigned char > lastcolumn; vector < unsigned char > firstcolumn; vector <int > charfreq; charfreq. resize (256); vector < vector < int > > lastcollocationlookup; lastcollocationlookup. resize (256); readinput(cin, originalstringlocation, lastcolumn, charfreq, lastcollocationlookup); firstcolumn.reserve(lastcolumn.size()); cerr << " Doing counting sort to determine first column\n"; vector <int > cumulativecharfreq; cumulativecharfreq. resize (256);

91 91 int cumulativecount = 0; for ( int i = 0; i < charfreq. size(); ++i){ for ( int j = 0; j < charfreq[i]; ++j){ firstcolumn. push_back(( unsigned char)i); cumulativecount += charfreq[i]; cumulativecharfreq[i] = cumulativecount; int currentlocation = originalstringlocation; int nth; int lastloc; cerr << " Beginning output" << endl; /* cerr << " Frequencies: " << endl; cerr << "\ tfreq\ tcumfreq\n"; for ( int i = 0; i < charfreq. size(); ++i ){ cerr << i << ":\ t" << charfreq[i] << "\ t" << cumulativecharfreq[i] << endl; */ for ( int i = 0; i < lastcolumn. size(); ++i){ if (i % == 0){ cerr << i << "\r"; cout << firstcolumn[ currentlocation]; // the character in firstcolumn at location is the how manyeth character of its kind? nth = currentlocation - cumulativecharfreq[ firstcolumn[ currentlocation]] + charfreq[ firstcolumn[ currentlocation ]]; // now find the nth occurrence of said character in the last column currentlocation = lastcollocationlookup[ firstcolumn[ currentlocation ]][nth]; stoptime = time( NULL); cerr << "\ nrunning time: " << ( double)( stoptime - starttime) << endl << endl; void readinput( istream& is, int& originalstringlocation, vector < unsigned char >& lastcolumn, vector <int > & charfreq, vector < vector < int > > & lastcollocationlookup){

92 92 cerr << " Reading input and creating lookup tables\n"; char letter; unsigned char unsignedletter; is >> originalstringlocation; is. get( letter); // discard the delimiting space int position = 0; while (!is.eof()){ if ( position % == 0){ cerr << position << "\r"; is.get(letter); if (!is.fail()){ unsignedletter = ( unsigned char) letter; lastcolumn. push_back( unsignedletter); ++ charfreq[ unsignedletter]; lastcollocationlookup[ unsignedletter]. push_back( position); ++ position; A.3 An example implementation of SCLPT /* * * general outline of SCLPT * ( shortened context length preserving transform) * * ENCODING * 1) generate the dictionary * a) in the input file, find all words * ( maximal sequences of alphabetic characters) * and store words of length i in container[i] for all i > 4. * b) sort the words in each of those subdictionaries by * frequency count, which we keep track of in 1a * c) Since the dictionary needs to be passed along, we want * it to be small. Hence, we only store the actual words * separated by space in this dictionary. * For that to work, we pick one character (*), that is * not present in the input. Each word is given a star code * that begins with "*" and ends in three letters of the * alphabet that can be used to find it in the dictionary * Example: * Assuming the most frequent 10- letter word in our * input is " henceforth". With this transform, * " henceforth" would become "* rstuvwzaa", where the star * denotes the beginning of a code word, the suffix " zaa" * is used to find the entry in our dictionary, and the * characters " rstuvw" expand the string to 10 characters

93 93 * (the length of "henceforth"). * d) The sequence rstuvw is for padding, but really just the * r in it is needed. So we can shorten the code word. * The word " henceforth" would be encoded as "* rzaa", * where "r" helps determine the length and " zaa" helps * look up the word. * e) In order to store the dictionary, it suffices to output * the words we used in order sorted by frequency. * The decoder can then determine the lookup codes for each * word and decipher the message * f) In order to control the dictionary size, one could use * some sort of metric to limit what goes in * * 2) Read input again, and for every word encountered, look up * the code word. Output the code word and continue. If we * don t have a code - word for a word, output it unaltered. * * DECODING * 1) read the dictionary and generate code - words for each * dictionary word * 2) Read the input and for words beginning with * find the * translation in the dictionary. If a word does not start * with * output it unaltered. */ # include <iostream > # include <fstream > # include <vector > # include <string > # include <set > # include <map > # include <stdlib.h> using namespace std; /* * a simple Word class to allow for sorting by frequency */ class Word{ public: string wd; int freq; bool operator < ( const Word & w2) const{ if ( freq > w2. freq) return true; return false; ; /* * determines if a character is a word character * the bool includestar is used when decoding */ bool iswordchar( unsigned char c, bool includestar){ if ( A <= c && c <= Z ) return true;

94 94 if ( a <= c && c <= z ) return true; if ( includestar && c == \31 ) return true; return false; bool isnonwordchar( unsigned char c, bool includestar){ return (! iswordchar(c, includestar)); /* * grabs a maximal sequence of word - characters from & is * again, the star - character can be included */ istream& readword( istream &is, string & word, bool includestar){ char c; word = ""; while(!is.eof()){ is.get(c); if (!is.fail()){ if( iswordchar(( unsigned char)c, includestar)){ word += ( unsigned char)c; else{ is. putback(c); break; return is; istream& readnonword( istream &is, string & nonword, bool includestar){ char c; nonword = ""; while(!is.eof()){ is.get(c); if (!is.fail()){ if( isnonwordchar(( unsigned char)c, includestar)){ nonword += ( unsigned char)c; else{ is. putback(c); break; return is; /* * gets the SCLPT code - word for a word based on its position in the * frequency table for that word length. */ string getsclptcode( string & word, int & position, string & paddingstring){ string ret = "\31"; int paddinglength = paddingstring. length();

95 95 int lengthindicator = word. length() - 4; if ( lengthindicator > 0){ ret += paddingstring[ lengthindicator - 1]; char one = z ; char two = a ; char three = A ; int offone = ( position /26)/26; int offtwo = ( position /26)%26; int offthree = position %26; // if ( offone >= 26) offone += 6; // if ( offtwo >= 26) offtwo -= 58; // if ( offthree >= 26) offthree += 6; one -= offone; two += offtwo; three += offthree; if ( word. length() >= 4){ ret = ret + one + two + three; else if ( word. length() == 3){ ret = ret + two + three; else if ( word. length() == 2){ ret = ret + three; return ret; /* * returns how many items we can hold in the dictionary * given the length of a word */ int getmaxindex( string w){ if (w. length() == 2) { return 26-1; else if (w. length() == 3){ return 26 * 26-1; return 26 * 26 * 26-1; /* * encodes the input * in to *out, optionally outputting the * dictionary to * dict ( if not null) */ void encode( char *in, char *out, char * dict, int minwordlength, int maxwordlength, int mincount, string paddingstring){ ifstream infile;

96 96 infile. open( in); ofstream outfile; outfile. open( out); ofstream dictfile; if( dict!= NULL){ dictfile. open( dict); dictfile << minwordlength << " " << maxwordlength << endl; else{ outfile << minwordlength << " " << maxwordlength << endl; bool nextisword = false; char c; string token; int wordcount = 0; map <string, int > wordfrequencies; vector < multiset <Word > > dictionaries; map <string, int > wordindex; dictionaries. resize( maxwordlength +1); map <string, int >:: iterator miter; // read words and non - word from file. They strictly alternate if we define them as // maximal sequences of alphabetic characters ( words) and // maximal sequences of non - alphabetic characters (non - words). while(!infile.eof()){ if (nextisword){ readword( infile, token, false); // read the word // if the word satisfies minimum length and max length we increment its frequency if ( token. length() <= maxwordlength && token. length() >= minwordlength){ miter = wordfrequencies. find( token); if ( miter == wordfrequencies. end()){ wordfrequencies[ token] = 0; else{ wordfrequencies[ token]++; nextisword = false; // some feedback for the user while they are waiting ++ wordcount; if ( wordcount % == 0) cerr << wordcount << "\r"; else{ readnonword( infile, token, false);// just read non - words at this point. Discard - we don t need them right now nextisword = true;

97 97 // we have the frequencies for each word. // Now we would like to create dictionaries for each word length with the individual words sorted by frequency for ( miter = wordfrequencies. begin(); miter!= wordfrequencies. end(); ++miter){ // if we have a dictionary for the word, we will crate a word object and insert it into a multiset // of words for that dictionary for easy sorting by frequency. if ( miter ->first. length() < dictionaries. size()){ if ( miter -> second >= mincount){ Word w; w. wd = miter ->first; w. freq = miter -> second; dictionaries[ miter ->first. length()]. insert(w); // now set up a little easy way to quickly find the code for a word multiset <Word >:: iterator siter; for ( int i = 0; i < dictionaries. size(); ++i){ // each dictionary begins at index 0 int index = 0; for ( siter = dictionaries[i]. begin(); siter!= dictionaries[i]. end() ; ++siter){ // for short words, we use shorter codes. This also implies that the number of short words we // can hold in our dictionary is smaller int maxindex = getmaxindex( siter ->wd); // do we even have room for this word? If so, add to wordindex so we can quickly find it again if ( index <= maxindex){ wordindex[ siter ->wd] = index; string s = siter ->wd; // We only really need the actual word. The decoder should be able to figure out the rest // since the length is obvious and the words are output in order of frequency. if (dict == NULL){ outfile << siter ->wd;// << " " << getsclptcode(s, index, paddingstring) << " " << siter ->freq << endl; outfile << endl; else{ dictfile << siter ->wd;// << " " << getsclptcode(s, index, paddingstring) << " " << siter ->freq << endl; dictfile << endl; ++ index;

98 98 if (dict == NULL){ outfile << \31 << endl; // end of dictionary else{ dictfile << \31 << endl; // end of dictionary infile.close(); cerr << "\ ndone with first pass over data. Beginning second pass to encode words.\ n"; infile. open( in); // begin our second pass over the data. Now we will replace the words with their codes. while(!infile.eof()){ if (nextisword){ readword( infile, token, false); miter = wordindex. find( token); int maxindex = getmaxindex( token); if ( miter!= wordindex. end() && token. length() <= maxwordlength && token. length() >= minwordlength){ if ( wordindex[ token] <= maxindex){ outfile << getsclptcode( token, wordindex[ token], paddingstring ); // output the code of the word else{ outfile << token; // words that are too short will be ignored nextisword = false; // something for the user to look at if ( wordcount % == 0) cerr << wordcount << " \r"; -- wordcount; else{ // non - words are ignored readnonword( infile, token, false); outfile << token; nextisword = true; cerr << endl; void decode ( char *in, char *out, char * dict, string paddingstring){ ifstream infile; ofstream outfile; ifstream dictfile;

99 99 infile. open( in); outfile. open( out); if (dict!= NULL){ dictfile. open( dict); int minwordlength; int maxwordlength; char newline; if (dict!= NULL){ dictfile >> minwordlength; dictfile >> maxwordlength; dictfile. get( newline); else{ infile >> minwordlength; infile >> maxwordlength; infile.get(newline); string word; bool nextisword = false; char c; string token; int wordcount = 0; int index; vector < vector < string > > dictionaries; map <string, int > wordindex; map < string, string > codelookup; dictionaries. resize( maxwordlength +1); cerr << " Reading dictionary\n"; if (dict == NULL){ do{ readword( infile, word, true); // get the word infile.get(newline); if(word[0]!= \31 ){ // get the word length and insert it into the appropriate dictionary if ( word. length() < dictionaries. size()){ // inserting into the appropriate vector. Remember that the encoder provided the words of length i // sorted by frequency. Hence, we can just push_back into the dictionary to obtain a sorted dictionary // for words of length i dictionaries[ word. length()]. push_back( word); wordindex[ word] = dictionaries[ word. length()]. size() - 1; while(!infile.eof() && word[0]!= \31 );

100 100 else{ do{ readword( dictfile, word, true); // get the word dictfile. get( newline); if(word[0]!= \31 ){ // get the word length and insert it into the appropriate dictionary if ( word. length() < dictionaries. size()){ // inserting into the appropriate vector. Remember that the encoder provided the words of length i // sorted by frequency. Hence, we can just push_back into the dictionary to obtain a sorted dictionary // for words of length i dictionaries[ word. length()]. push_back( word); wordindex[ word] = dictionaries[ word. length()]. size() - 1; while(!dictfile.eof() && word[0]!= \31 ); // The dictionary has been read at this point. for ( int i = 0; i < dictionaries. size(); ++i){ for ( int j = 0; j < dictionaries[i]. size(); ++ j){ string code = getsclptcode( dictionaries[i][ j], j, paddingstring); codelookup[ code] = dictionaries[i][ j]; cerr << " Dictionary has been read.\ n"; // continue to process the input while(!infile.eof()){ if (nextisword){ readword( infile, token, true); if (token[0] == \31 ){ outfile << codelookup[ token]; else{ outfile << token; nextisword = false; // something for the user to look at if ( wordcount % == 0) cerr << wordcount << " \r"; ++ wordcount; else{ // non - words are ignored readnonword( infile, token, true); outfile << token; nextisword = true;

101 101 cerr << endl << " Done\n";; int main ( int argc, char * argv[] ){ // default values int minwordlength = 4; int maxwordlength = 52; int mincount = 4; string paddingstring = " ZYXWVUTSRQPONMLKJIHGFEDCBAzyxwvutsrqponmlkjihgfedcba"; // string paddingstring = " abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"; if ( argc!= 4 && argc!=5 && argc!= 7 && argc!=8){ cerr << " Usage: # transform e infile outfile dictfile\n"; cerr << " Usage: # transform e minwordlength maxworlength mincount infile outfile\n"; cerr << " Usage: # transform e minwordlength maxworlength mincount infile outfile dictfile\n"; cerr << " Usage: # transform d infile outfile\n"; cerr << " Usage: # transform d infile outfile dictfile\n"; exit(1); if (argv[1][0] == e ){ cerr << " Encoding "; if ( argc == 7 argc == 8){ minwordlength = atoi( argv[2]); maxwordlength = atoi( argv[3]); mincount = atoi( argv[4]); if (argc == 7){ cerr << " with minwordlength = " << minwordlength << ", maxwordlength = " << maxwordlength << ", mincount = " << mincount << " from " << argv[5] << " to " << argv[6] << endl; encode( argv[5], argv[6], NULL, minwordlength, maxwordlength, mincount, paddingstring); else if (argc == 8){ cerr << " with minwordlength = " << minwordlength << ", maxwordlength = " << maxwordlength << ", mincount = " << mincount << " from " << argv[5] << " to " << argv[6] << " to dictionary " << argv[7] << endl; encode( argv[5], argv[6], argv[7], minwordlength, maxwordlength, mincount, paddingstring); else if (argc == 4){ cerr << " with minwordlength = " << minwordlength << ", maxwordlength = " << maxwordlength << ", mincount = " << mincount

102 102 << " from " << argv[2] << " to " << argv[3] << endl; encode( argv[2], argv[3], NULL, minwordlength, maxwordlength, mincount, paddingstring); else{ cerr << " with minwordlength = " << minwordlength << ", maxwordlength = " << maxwordlength << ", mincount = " << mincount << " from " << argv[2] << " to " << argv[3] << " to dictionary " << argv[4] << endl; encode( argv[2], argv[3], argv[4], minwordlength, maxwordlength, mincount, paddingstring); else if(argv[1][0] == d ){ if (argc == 4){ cerr << " Decoding from " << argv[2] << " to " << argv[3] << endl; decode( argv[2], argv[3], NULL, paddingstring); else{ cerr << " Decoding from " << argv[2] << " to " << argv[3] << " with dictionary " << argv[4] << endl; decode( argv[2], argv[3], argv[4], paddingstring); A.4 A lexer for splitting % option noyywrap %{ # include <iostream > # include <fstream > # include <cstring > using namespace std; # define YY_DECL extern "C" int yylex() ofstream fout_xml; ofstream fout_text; ofstream fout_alpha; ofstream fout_numeric; ofstream fout_timestamp; ofstream fout_redirect; ofstream fout_doublecorner; ofstream fout_doublecurly; ofstream fout_doubleequal; ofstream fout_tripleequal; ofstream fout_ltgt; ofstream fout_quot; % %x numerictag %x alphatag %x timestamptag %x nonxml redirecttoken "#REDIRECT [["[^]]{1,100"]]" cornertoken "[["[^]]{1,50"]]"

103 103 curlytoken "{{"[^]{1,50"" dequaltoken " =="[^=]{1,50" ==" tequaltoken " ==="[^=]{1,50" ===" dquotetoken " "[^ ]{1,50" " tquotetoken " "[^ ]{1,50" " ltgttoken "<"[^&]{1,50">" quottoken "&quot ;"[^&]{1,50"& quot;" delimtoken { cornertoken { curlytoken %% "<text"([^ >]*)" >" { BEGIN( nonxml); fout_xml << yytext; fout_xml << \00 ; fout_text << \00 ; "<title >" { BEGIN( alphatag); fout_xml << yytext << \01 ; fout_alpha << \00 ; "<username >" { BEGIN( alphatag); fout_xml << yytext << \01 ; fout_alpha << \00 ; "<comment >" { BEGIN( alphatag); fout_xml << yytext << \01 ; fout_alpha << \00 ; "<id >" { BEGIN( numerictag); fout_xml << yytext << \02 ; fout_numeric << \00 ; "<timestamp >" { BEGIN( timestamptag); fout_xml << yytext << \03 ; [ \t\n] { fout_xml << yytext;. { fout_xml << yytext; <numerictag >[ \t\n] { fout_numeric << yytext; <numerictag >[^ <]* { BEGIN( INITIAL); fout_numeric << yytext; <timestamptag >[ \t\n] { fout_timestamp << yytext; <timestamptag >[0-9]* { fout_timestamp << yytext; <timestamptag >[-:TZ] {; <timestamptag >[ <] { BEGIN( INITIAL); unput( yytext[ yyleng -1]); <alphatag >{ redirecttoken {1 { fout_alpha << \01 ; fout_redirect << \00 ; for ( int i = 12; i < yyleng - 2; ++i) fout_redirect << yytext[i ]; <alphatag >[ \t\n] { fout_alpha << yytext; <alphatag >[^ <] { fout_alpha << yytext; <alphatag >[<] {BEGIN(INITIAL);unput(yytext[yyleng -1]); <nonxml >{ redirecttoken {1 { fout_text << \01 ; fout_redirect << \00 ; for ( int i = 12; i < yyleng - 2; ++i) fout_redirect << yytext[i]; <nonxml >{ cornertoken {1 { fout_text << \02 ; fout_doublecorner << \00 ; for ( int i = 2; i < yyleng - 2; ++i) fout_doublecorner << yytext[i ]; <nonxml >{ curlytoken {1 { fout_text << \03 ; fout_doublecurly << \00 ; for ( int i = 2; i < yyleng - 2; ++i) fout_doublecurly << yytext[i]; <nonxml >{ dequaltoken {1 { fout_text << \04 ; fout_doubleequal << \00 ; for ( int i = 2; i < yyleng - 2; ++i) fout_doubleequal << yytext[i ]; <nonxml >{ tequaltoken {1 { fout_text << \05 ; fout_tripleequal << \00 ; for ( int i = 3; i < yyleng - 3; ++i) fout_tripleequal << yytext[i ]; <nonxml >{ ltgttoken {1 { fout_text << \06 ; fout_ltgt << \00 ; for ( int i = 4; i < yyleng - 4; ++i) fout_ltgt << yytext[i];

104 <nonxml >{ quottoken {1 { fout_text << \07 ; fout_quot << \00 ; for ( int i = 6; i < yyleng - 6; ++i) fout_quot << yytext[i]; 104 <nonxml >[ \n\t] { fout_text << yytext; <nonxml >. { fout_text << yytext; <nonxml >"<"([ ]*)"/text"([ ]*)">" {BEGIN(INITIAL); fout_xml << yytext; %% int main( int argc, char* argv[]){ cerr << " Using the following files:\ n"; FILE *myfile = fopen(argv[1], "rb"); string base = argv[1]; string name = ""; name = base + ". xml"; fout_xml.open(name.c_str()); cerr << "\ nxml: " << name; name = base + ". alpha"; fout_alpha.open(name.c_str()); cerr << "\ nalpha: " << name; name = base + ". numeric"; fout_numeric. open( name. c_str()); cerr << "\ nnumeric: " << name; name = base + ". timestamp"; fout_timestamp. open( name. c_str()); cerr << "\ ntimestamp: " << name; name = base + ". redirect"; fout_redirect. open( name. c_str()); cerr << "\ nredirect: " << name; name = base + ". doublecorner"; fout_doublecorner. open( name. c_str()); cerr << "\ ndoublecorner: " << name; name = base + ". doublecurly"; fout_doublecurly. open( name. c_str()); cerr << "\ ndoublecurly: " << name; name = base + ". doubleequal"; fout_doubleequal. open( name. c_str()); cerr << "\ ndoubleequal: " << name; name = base + ". tripleequal"; fout_tripleequal. open( name. c_str()); cerr << "\ ntripleequal: " << name; name = base + ". ltgt";

105 105 fout_ltgt.open(name.c_str()); cerr << "\ nltgt: " << name; name = base + ". quot"; fout_quot.open(name.c_str()); cerr << "\ nquot: " << name; name = base + ". text"; fout_text.open(name.c_str()); cerr << "\ nremaining core text: " << name << endl << endl; if (!myfile){ cout << "I can t open " << argv[1] << "!\ n"; return -1; yyin = myfile; yylex(); return 0;



More information

Lossless compression II

Lossless compression II Lossless II D 44 R 52 B 81 C 84 D 86 R 82 A 85 A 87 A 83 R 88 A 8A B 89 A 8B Symbol Probability Range a 0.2 [0.0, 0.2) e 0.3 [0.2, 0.5) i 0.1 [0.5, 0.6) o 0.2 [0.6, 0.8) u 0.1 [0.8, 0.9)! 0.1 [0.9, 1.0)

More information

IMAGE COMPRESSION TECHNIQUES

IMAGE COMPRESSION TECHNIQUES International Journal of Information Technology and Knowledge Management July-December 2010, Volume 2, No. 2, pp. 265-269 Uchale Bhagwat Shankar The use of digital images has increased at a rapid pace

More information

A Comparative Study of Entropy Encoding Techniques for Lossless Text Data Compression

A Comparative Study of Entropy Encoding Techniques for Lossless Text Data Compression A Comparative Study of Entropy Encoding Techniques for Lossless Text Data Compression P. RATNA TEJASWI 1 P. DEEPTHI 2 V.PALLAVI 3 D. GOLDIE VAL DIVYA 4 Abstract: Data compression is the art of reducing

More information

Keywords Data compression, Lossless data compression technique, Huffman Coding, Arithmetic coding etc.

Keywords Data compression, Lossless data compression technique, Huffman Coding, Arithmetic coding etc. Volume 6, Issue 2, February 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Comparative

More information

A Fast Block sorting Algorithm for lossless Data Compression

A Fast Block sorting Algorithm for lossless Data Compression A Fast Block sorting Algorithm for lossless Data Compression DI Michael Schindler Vienna University of Technology Karlsplatz 13/1861, A 1040 Wien, Austria, Europe michael@eiunix.tuwien.ac.at if.at is transformed

More information

FACULTY OF ENGINEERING LAB SHEET INFORMATION THEORY AND ERROR CODING ETM 2126 ETN2126 TRIMESTER 2 (2011/2012)

FACULTY OF ENGINEERING LAB SHEET INFORMATION THEORY AND ERROR CODING ETM 2126 ETN2126 TRIMESTER 2 (2011/2012) FACULTY OF ENGINEERING LAB SHEET INFORMATION THEORY AND ERROR CODING ETM 2126 ETN2126 TRIMESTER 2 (2011/2012) Experiment 1: IT1 Huffman Coding Note: Students are advised to read through this lab sheet

More information

Volume 2, Issue 9, September 2014 ISSN

Volume 2, Issue 9, September 2014 ISSN Fingerprint Verification of the Digital Images by Using the Discrete Cosine Transformation, Run length Encoding, Fourier transformation and Correlation. Palvee Sharma 1, Dr. Rajeev Mahajan 2 1M.Tech Student

More information

SIGNAL COMPRESSION Lecture Lempel-Ziv Coding

SIGNAL COMPRESSION Lecture Lempel-Ziv Coding SIGNAL COMPRESSION Lecture 5 11.9.2007 Lempel-Ziv Coding Dictionary methods Ziv-Lempel 77 The gzip variant of Ziv-Lempel 77 Ziv-Lempel 78 The LZW variant of Ziv-Lempel 78 Asymptotic optimality of Ziv-Lempel

More information

Move-to-front algorithm

Move-to-front algorithm Up to now, we have looked at codes for a set of symbols in an alphabet. We have also looked at the specific case that the alphabet is a set of integers. We will now study a few compression techniques in

More information

Experimenting with Burrows-Wheeler Compression

Experimenting with Burrows-Wheeler Compression Experimenting with Burrows-Wheeler Compression Juha Kärkkäinen University of Helsinki (Work done mostly as Visiting Scientist at Google Zürich) 3rd Workshop on Compression, Text and Algorithms Melbourne,

More information

Chapter 1. Digital Data Representation and Communication. Part 2

Chapter 1. Digital Data Representation and Communication. Part 2 Chapter 1. Digital Data Representation and Communication Part 2 Compression Digital media files are usually very large, and they need to be made smaller compressed Without compression Won t have storage

More information

Modified SPIHT Image Coder For Wireless Communication

Modified SPIHT Image Coder For Wireless Communication Modified SPIHT Image Coder For Wireless Communication M. B. I. REAZ, M. AKTER, F. MOHD-YASIN Faculty of Engineering Multimedia University 63100 Cyberjaya, Selangor Malaysia Abstract: - The Set Partitioning

More information

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay

Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Digital Communication Prof. Bikash Kumar Dey Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 26 Source Coding (Part 1) Hello everyone, we will start a new module today

More information

Data Compression Techniques

Data Compression Techniques Data Compression Techniques Part 2: Text Compression Lecture 6: Dictionary Compression Juha Kärkkäinen 15.11.2017 1 / 17 Dictionary Compression The compression techniques we have seen so far replace individual

More information

THE RELATIVE EFFICIENCY OF DATA COMPRESSION BY LZW AND LZSS

THE RELATIVE EFFICIENCY OF DATA COMPRESSION BY LZW AND LZSS THE RELATIVE EFFICIENCY OF DATA COMPRESSION BY LZW AND LZSS Yair Wiseman 1* * 1 Computer Science Department, Bar-Ilan University, Ramat-Gan 52900, Israel Email: wiseman@cs.huji.ac.il, http://www.cs.biu.ac.il/~wiseman

More information

Lossless Text Compression using Dictionaries

Lossless Text Compression using Dictionaries Lossless Text Compression using Dictionaries Umesh S. Bhadade G.H. Raisoni Institute of Engineering & Management Gat No. 57, Shirsoli Road Jalgaon (MS) India - 425001 ABSTRACT Compression is used just

More information

Evolutionary Lossless Compression with GP-ZIP

Evolutionary Lossless Compression with GP-ZIP Evolutionary Lossless Compression with GP-ZIP Ahmad Kattan and Riccardo Poli Abstract In this paper we propose a new approach for applying Genetic Programming to lossless data compression based on combining

More information

COMPSCI 650 Applied Information Theory Feb 2, Lecture 5. Recall the example of Huffman Coding on a binary string from last class:

COMPSCI 650 Applied Information Theory Feb 2, Lecture 5. Recall the example of Huffman Coding on a binary string from last class: COMPSCI 650 Applied Information Theory Feb, 016 Lecture 5 Instructor: Arya Mazumdar Scribe: Larkin Flodin, John Lalor 1 Huffman Coding 1.1 Last Class s Example Recall the example of Huffman Coding on a

More information

CS 206 Introduction to Computer Science II

CS 206 Introduction to Computer Science II CS 206 Introduction to Computer Science II 04 / 25 / 2018 Instructor: Michael Eckmann Today s Topics Questions? Comments? Balanced Binary Search trees AVL trees / Compression Uses binary trees Balanced

More information

COSC431 IR. Compression. Richard A. O'Keefe

COSC431 IR. Compression. Richard A. O'Keefe COSC431 IR Compression Richard A. O'Keefe Shannon/Barnard Entropy = sum p(c).log 2 (p(c)), taken over characters c Measured in bits, is a limit on how many bits per character an encoding would need. Shannon

More information

A Novel Image Compression Technique using Simple Arithmetic Addition

A Novel Image Compression Technique using Simple Arithmetic Addition Proc. of Int. Conf. on Recent Trends in Information, Telecommunication and Computing, ITC A Novel Image Compression Technique using Simple Arithmetic Addition Nadeem Akhtar, Gufran Siddiqui and Salman

More information

Compact Encoding of the Web Graph Exploiting Various Power Laws

Compact Encoding of the Web Graph Exploiting Various Power Laws Compact Encoding of the Web Graph Exploiting Various Power Laws Statistical Reason Behind Link Database Yasuhito Asano, Tsuyoshi Ito 2, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 Department

More information

Greedy Algorithms II

Greedy Algorithms II Greedy Algorithms II Greedy algorithms tend to be difficult to teach since different observations lead to correct greedy algorithms in different situations. Pedagogically, it s somewhat difficult to clearly

More information

On Generalizations and Improvements to the Shannon-Fano Code

On Generalizations and Improvements to the Shannon-Fano Code Acta Technica Jaurinensis Vol. 10, No.1, pp. 1-12, 2017 DOI: 10.14513/actatechjaur.v10.n1.405 Available online at acta.sze.hu On Generalizations and Improvements to the Shannon-Fano Code D. Várkonyi 1,

More information

Programming Lecture 3

Programming Lecture 3 Programming Lecture 3 Expressions (Chapter 3) Primitive types Aside: Context Free Grammars Constants, variables Identifiers Variable declarations Arithmetic expressions Operator precedence Assignment statements

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

Binary Encoded Attribute-Pairing Technique for Database Compression

Binary Encoded Attribute-Pairing Technique for Database Compression Binary Encoded Attribute-Pairing Technique for Database Compression Akanksha Baid and Swetha Krishnan Computer Sciences Department University of Wisconsin, Madison baid,swetha@cs.wisc.edu Abstract Data

More information

Image Compression With Haar Discrete Wavelet Transform

Image Compression With Haar Discrete Wavelet Transform Image Compression With Haar Discrete Wavelet Transform Cory Cox ME 535: Computational Techniques in Mech. Eng. Figure 1 : An example of the 2D discrete wavelet transform that is used in JPEG2000. Source:

More information

Research Article Does an Arithmetic Coding Followed by Run-length Coding Enhance the Compression Ratio?

Research Article Does an Arithmetic Coding Followed by Run-length Coding Enhance the Compression Ratio? Research Journal of Applied Sciences, Engineering and Technology 10(7): 736-741, 2015 DOI:10.19026/rjaset.10.2425 ISSN: 2040-7459; e-issn: 2040-7467 2015 Maxwell Scientific Publication Corp. Submitted:

More information

Image compression. Stefano Ferrari. Università degli Studi di Milano Methods for Image Processing. academic year

Image compression. Stefano Ferrari. Università degli Studi di Milano Methods for Image Processing. academic year Image compression Stefano Ferrari Università degli Studi di Milano stefano.ferrari@unimi.it Methods for Image Processing academic year 2017 2018 Data and information The representation of images in a raw

More information

CSE 421 Greedy: Huffman Codes

CSE 421 Greedy: Huffman Codes CSE 421 Greedy: Huffman Codes Yin Tat Lee 1 Compression Example 100k file, 6 letter alphabet: File Size: ASCII, 8 bits/char: 800kbits 2 3 > 6; 3 bits/char: 300kbits a 45% b 13% c 12% d 16% e 9% f 5% Why?

More information

7: Image Compression

7: Image Compression 7: Image Compression Mark Handley Image Compression GIF (Graphics Interchange Format) PNG (Portable Network Graphics) MNG (Multiple-image Network Graphics) JPEG (Join Picture Expert Group) 1 GIF (Graphics

More information

Video Compression An Introduction

Video Compression An Introduction Video Compression An Introduction The increasing demand to incorporate video data into telecommunications services, the corporate environment, the entertainment industry, and even at home has made digital

More information

Dictionary techniques

Dictionary techniques Dictionary techniques The final concept that we will mention in this chapter is about dictionary techniques. Many modern compression algorithms rely on the modified versions of various dictionary techniques.

More information

Category: Informational May DEFLATE Compressed Data Format Specification version 1.3

Category: Informational May DEFLATE Compressed Data Format Specification version 1.3 Network Working Group P. Deutsch Request for Comments: 1951 Aladdin Enterprises Category: Informational May 1996 DEFLATE Compressed Data Format Specification version 1.3 Status of This Memo This memo provides

More information

C LDPC Coding Proposal for LBC. This contribution provides an LDPC coding proposal for LBC

C LDPC Coding Proposal for LBC. This contribution provides an LDPC coding proposal for LBC C3-27315-3 Title: Abstract: Source: Contact: LDPC Coding Proposal for LBC This contribution provides an LDPC coding proposal for LBC Alcatel-Lucent, Huawei, LG Electronics, QUALCOMM Incorporated, RITT,

More information

Data Compression Techniques

Data Compression Techniques Data Compression Techniques Part 1: Entropy Coding Lecture 1: Introduction and Huffman Coding Juha Kärkkäinen 31.10.2017 1 / 21 Introduction Data compression deals with encoding information in as few bits

More information

A Context-Tree Branch-Weighting Algorithm

A Context-Tree Branch-Weighting Algorithm A Context-Tree Branch-Weighting Algorithm aul A.J. Volf and Frans M.J. Willems Eindhoven University of Technology Information and Communication Theory Group Abstract The context-tree weighting algorithm

More information

LIPT-Derived Transform Methods Used in Lossless Compression of Text Files

LIPT-Derived Transform Methods Used in Lossless Compression of Text Files ROMANIAN JOURNAL OF INFORMATION SCIENCE AND TECHNOLOGY Volume 14, Number 2, 2011, 149 158 LIPT-Derived Transform Methods Used in Lossless Compression of Text Files Radu RĂDESCU Politehnica University of

More information

Compressing and Decoding Term Statistics Time Series

Compressing and Decoding Term Statistics Time Series Compressing and Decoding Term Statistics Time Series Jinfeng Rao 1,XingNiu 1,andJimmyLin 2(B) 1 University of Maryland, College Park, USA {jinfeng,xingniu}@cs.umd.edu 2 University of Waterloo, Waterloo,

More information

OPTIMIZATION OF LZW (LEMPEL-ZIV-WELCH) ALGORITHM TO REDUCE TIME COMPLEXITY FOR DICTIONARY CREATION IN ENCODING AND DECODING

OPTIMIZATION OF LZW (LEMPEL-ZIV-WELCH) ALGORITHM TO REDUCE TIME COMPLEXITY FOR DICTIONARY CREATION IN ENCODING AND DECODING Asian Journal Of Computer Science And Information Technology 2: 5 (2012) 114 118. Contents lists available at www.innovativejournal.in Asian Journal of Computer Science and Information Technology Journal

More information

Chapter 2: Number Systems

Chapter 2: Number Systems Chapter 2: Number Systems Logic circuits are used to generate and transmit 1s and 0s to compute and convey information. This two-valued number system is called binary. As presented earlier, there are many

More information

Indexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton

Indexing. CS6200: Information Retrieval. Index Construction. Slides by: Jesse Anderton Indexing Index Construction CS6200: Information Retrieval Slides by: Jesse Anderton Motivation: Scale Corpus Terms Docs Entries A term incidence matrix with V terms and D documents has O(V x D) entries.

More information