Markov Models for Clusters in Concordance Compression - Extended Abstract*


A. Bookstein, Center for Information and Language Studies, University of Chicago, Chicago, IL, a-bookstein@uchicago.edu
S. T. Klein, Dept. of Math. & CS, Bar Ilan University, Ramat-Gan, Israel, tomi@bimacs.cs.biu.ac.il
T. Raita, Comp. Sci. Dept., University of Turku, Turku, Finland, raita@euroni.cs.utu.fi

Abstract

An earlier paper developed a procedure for compressing concordances, assuming that all elements occurred independently. In this paper, the earlier models are extended to take the possibility of clustering into account. We suggest several models adapted to the concordances of large full-text information retrieval systems, which are generally subject to clustering.

1 Introduction and Background

The development of optical disk technology has made it possible to distribute large, full-text databases widely. But large as the capacity of CD-ROMs may be, it still does not match our ambitions for storing data [5]. For example, it is often overlooked that to be able to access and manipulate text, auxiliary data structures must also be created and stored, and these often occupy as much space as the original text itself. Thus, to distribute a functional, full-text information retrieval system, consideration must be given to how these data structures can be stored efficiently.

Most large information retrieval systems are based on inverted files. In this approach, query processing does not directly involve the original text files (in which keywords might be located using some pattern matching technique), but rather the auxiliary dictionary and concordance files. The dictionary is a list of all the different words appearing in the text and is usually ordered alphabetically.

* Two of the authors (A.B. and S.T.K.) wish to acknowledge that the material in this paper is based upon research supported by the U.S. National Science Foundation under award number IRI, and by a grant from the United States - Israel Binational Science Foundation (BSF), Jerusalem, Israel. T.R. acknowledges support by the Academy of Finland under grant No. 4964/30/.

For each entry in the dictionary, there is a pointer into the concordance, which lists each occurrence of the word. Every occurrence of a word in the database can be uniquely characterized by a sequence of numbers that gives its exact position in the text. Typically, such a sequence consists of the document number d, the paragraph number p (in the document), the sentence number s (in the paragraph) and the word number w (in the sentence). The quadruple (d, p, s, w) is the coordinate of the occurrence. The concordance contains, for every word of the dictionary, the lexicographically ordered list of all its coordinates in the text.

The concordance is generally of the same order of magnitude as the text itself, its exact size depending on the omission or inclusion of the most frequent words, the so-called stop-words. Compressing it not only saves space, but also saves processing time by reducing the number of I/O operations needed to fetch parts of the concordance into main memory (see [5], [2], [8]).

We note that in a static information retrieval system, compression and decompression are not symmetrical tasks. Compression is done only once, while building the system, whereas decompression is needed during the processing of every query and directly affects the response time. One may thus use extensive and costly preprocessing for compression, provided reasonably fast decompression methods are possible. Moreover, in an information retrieval system, while we compress full files (text, concordance, etc.), we decompress only (possibly many) short pieces on demand; these may be accessed at random by means of pointers to their exact locations. This limits the value of adaptive methods based on tables that systematically change from the beginning to the end of the file.

Oddly, compared to full text, concordance compression has received relatively little attention. Some ad hoc methods have been used [3], [7], as well as simple models assuming a structured entity in which term occurrences were distributed independently [1], [8]. In this paper we generalize the independence models to incorporate a tendency of the terms to cluster.
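To make the dictionary-plus-concordance structure described above concrete, here is a minimal Python sketch (our own illustration; the name build_concordance and the nested input layout are assumptions, not part of the original paper) that maps each word of a toy collection to its lexicographically ordered list of (d, p, s, w) coordinates:

    from collections import defaultdict

    def build_concordance(documents):
        # documents: list of documents; a document is a list of paragraphs,
        # a paragraph is a list of sentences, a sentence is a list of words.
        concordance = defaultdict(list)
        for d, doc in enumerate(documents, start=1):
            for p, par in enumerate(doc, start=1):
                for s, sent in enumerate(par, start=1):
                    for w, word in enumerate(sent, start=1):
                        # (d, p, s, w) is the coordinate of this occurrence;
                        # the scanning order yields sorted coordinate lists.
                        concordance[word].append((d, p, s, w))
        return dict(concordance)

    docs = [[[["the", "cat"], ["the", "dog"]]], [[["a", "cat"]]]]
    print(build_concordance(docs)["cat"])   # [(1, 1, 1, 2), (2, 1, 1, 2)]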

2 Modeling the Concordance

For our model of a textual database, we assume that the text is divided into documents and the documents are made up of words. We thus use only a two-level hierarchy to identify the location of a word, which makes the exposition easier. The methods can, however, be readily adapted to more complex concordance structures, like the 4-level hierarchy mentioned above. In our present model, the conceptual concordance consists, for each word, of a series of (d, w) pairs, d standing for a document number, and w for the index, or offset, of a word within the given document. It is sometimes convenient to translate this model to an equivalent one, in which we indicate 1) the index of the next document containing the word, 2) the number of times the word occurs in the document, followed by 3) the list of word indices of the various occurrences:

    word_i : (d, m; w_1, ..., w_m), ...
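To illustrate this equivalent representation, the following minimal sketch (the name to_records is our own assumption) groups a word's sorted (d, w) pairs into one (d, m; w_1, ..., w_m) record per document containing the word:

    def to_records(pairs):
        records = []          # one entry per document containing the word
        for d, w in pairs:    # pairs are lexicographically sorted (d, w)
            if records and records[-1][0] == d:
                records[-1][1].append(w)
            else:
                records.append((d, [w]))
        # emit (d, m, [w_1, ..., w_m]) triples
        return [(d, len(ws), ws) for d, ws in records]

    print(to_records([(3, 5), (3, 9), (7, 1)]))  # [(3, 2, [5, 9]), (7, 1, [1])]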

Our task is to model each of the components of the coordinate, and to use standard compression methods to compress each entity. Below we assume that we know (from the dictionary) the total number of times a word occurs in the database, the number of different documents in which it occurs (N), and (from a separate table) the number of words in each document. We also know the value of D, the number of documents in the collection.

The model of [1] assumed that all documents are approximately of the same size, that there is no between-document clustering, and that within a single document, words are independently distributed. On the basis of these assumptions, three probability distributions were derived for each term, giving the location of the next document containing the term, the number of times the term occurs in this document, and the locations of the word within the document. Finally, Huffman, arithmetic, or Shannon-Fano codes were generated for these probability distributions, and the codeword corresponding to the actually appearing value d, m or w was used in the encoding.

We now turn to the more realistic case in which the independence assumptions do not hold. For example, if the documents of an information retrieval system are grouped by author or some other criterion, the d-values of many terms will tend to appear in clusters because of the specific style of the author, or because adjacent documents might treat similar subjects. Similarly, term occurrences will cluster within documents, reflecting content variations over a document. In the following section, we model the bit-generation process in a cluster-prone environment by means of a Markov chain, and suggest how to exploit it for the compression of concordances. In Section 4 we report on some experimental results obtained by applying these methods to the concordance of the Hebrew Bible.

3 Models of Clustering

In our earlier examination of concordances, we first looked at the pattern of term occurrences over documents, and then at the distribution of terms within a given document. Formally, the analyses were identical. Since this remains true when clustering is considered, we shall here examine only the first case explicitly: that is, we have D documents, of which N are known to have at least one occurrence of the term being considered; our goal is to analyze the patterns of the gaps separating documents that contain the term. Formally, we shall represent each term by a bitmap, with a position for each document. If the term is present in a document, that document's bit position will be set to one; otherwise it will be zero. Below it is convenient to talk in terms of bitmaps rather than in terms of the concordance of which they are a component.

Should the tendency for clustering be pronounced, codes based on assumptions of term independence will produce poor compression. In this section we offer several models of clustering that can potentially be used to improve compression.

3.1 Full Markov Model

We investigate a family of n-state Markov chains as models of how our bitmap was generated: as we traverse the bitmap from beginning to end, at each position we are in a state, and that state determines the probability of being in any given state at the next location. The nature of the transition determines whether the next bit is a one or a zero.

The simplest model is the independence model. This has two states, which, to be consistent with what follows, we denote by C (within cluster) and B (between clusters). In the independence model we assume the probabilities of the transitions C → C and B → C are equal. In this model, as in all the Markov models we consider, we assume that we generate a term only when making a transition into a cluster, even if the cluster consists of a single document.

A simple, true Markov model is based on the same two states C and B, but in the true two-state model, the probability of a transition to C (that is, the probability of turning on a bit) differs depending on the state we are in. The state C indicates that we are in a cluster, and thus are more likely to generate occurrences of the designated term. We enter B as soon as we leave the cluster (that is, generate a zero-bit).

A limitation of the two-state model is that it doesn't recognize the possibility of spurious zeroes within a cluster: that is, we would like to incorporate the possibility that we may be in a cluster and generate a single zero-bit without leaving the cluster. (For simplicity, we are neglecting the symmetric problem of spurious ones outside of a cluster; this can be dealt with in an identical manner.) We accommodate this possibility by introducing transitional states. For illustration, we describe in detail a 3-state model. The states are: 1. Cluster State (C); 2. Transitional State (X); and 3. Between-Cluster State (B). The state X of the three-state model is introduced to permit us to be in a cluster and yet not generate any terms; it is a "maybe" state. However, if two consecutive documents do not contain the term, we have left the cluster and are in state B, between clusters. We can generalize this model by introducing a larger number of transitional states X1, X2, etc. Within the three-state model, a term is generated whenever we enter state C. The transition probabilities are:

    C → C : γ_C;   X → C : γ_X;   B → C : γ_B;
    C → X : 1 - γ_C;   X → B : 1 - γ_X;   B → B : 1 - γ_B.
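The generative process just described is easy to simulate. The sketch below (our own illustration; generate_bitmap and the sample parameter values are assumptions) emits a one-bit exactly on transitions into C and a zero-bit on the transitions C → X, X → B, and B → B:

    import random

    def generate_bitmap(D, gamma, seed=1):
        # gamma maps each state to its probability of moving to C
        # (i.e., of emitting a one-bit).
        rng = random.Random(seed)
        zero_next = {"C": "X", "X": "B", "B": "B"}
        state, bits = "B", []          # generation starts in state B
        for _ in range(D):
            if rng.random() < gamma[state]:
                state = "C"            # entering C generates a term
                bits.append(1)
            else:
                state = zero_next[state]
                bits.append(0)
        return bits

    bits = generate_bitmap(20, {"C": 0.6, "X": 0.5, "B": 0.05})
    print("".join(map(str, bits)))     # one-bits tend to appear in clusters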

We can encode each bit in a bitmap individually, using arithmetic coding based on the probability of a one-bit at each stage; alternatively, it may be more convenient to encode the gaps between one-bits. To do this, we need the distribution of gap sizes, which is easily derived from the above model. Once we are in a document that contains an occurrence of the designated term, the probability p_n that the next n documents do not have an occurrence is:

    p_0 = γ_C;   p_1 = (1 - γ_C) γ_X;   and for n ≥ 2:   p_n = (1 - γ_C)(1 - γ_X)(1 - γ_B)^(n-2) γ_B.

The γ's are our model's basic parameters. But we will also consider a suggestive reparameterization: γ_C ≡ γ, γ_X ≡ θ_X γ_C, and γ_B ≡ θ_B γ_X, which defines the parameters γ, θ_X, and θ_B. These parameters satisfy simpler constraints (0 ≤ θ_X, θ_B ≤ 1) independently of one another and of γ, which will ease problems of estimation. We also expect that regularities in the data will be more easily expressed in terms of the θ's than in terms of the γ's. It is also useful to define θ ≡ θ_X θ_B, so that γ_B = θ γ_C. θ is a single value reflecting the strength of clustering: it indicates the relative likelihood of a term appearing if we are inside a cluster as compared to outside one. More applications of Markov models to data compression can be found in [4] and [6].

3.2 Parameter Estimation

We can estimate the parameters directly. We assume that before generating the bitmap, we are in state B. The sequence of ones and zeroes making up the bitmap completely determines the state at any bitmap position. Thus it is easy to tabulate the number, and hence the probability, of each type of transition. For illustration, consider the following bitmap:

    0 0 1 0 1 1 0 0 ...

Since we begin in state B, the initial zero indicates that the first transition is back to state B. Continuing in this manner, we find the sequence of states corresponding to the above bitmap is given as follows (with the initial B preceding the colon):

    B : B B C X C C X B ...

In this sequence, we are in state B four times, for which we have three transitions. Of these three transitions, one is to state C, so on the basis of the information given, we would estimate γ_B ≈ 1/3; similarly γ_C ≈ 1/3, and γ_X ≈ 1/2. Thus each parameter is easily evaluated, and these parameters can be used as the basis for compressing the bitmap. Further, the standard deviation of this estimate for state S is given by σ_S = sqrt(γ_S (1 - γ_S) / N_S), if we observe N_S transitions from state S in the pertinent bitmap. The standard deviations make it possible to compute confidence intervals and to test hypotheses.

For a large bitmap, the γ-values can be stored with the bitmap to permit decompression. But several mechanisms can be tried to reduce the cost of storing these parameters. For example, we can save the space for storing these parameters by evaluating them adaptively, beginning with reasonable initial values.
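The tabulation procedure can be stated compactly in code. The following sketch (estimate_gammas is our own name) replays a bitmap from the initial state B, counts the transitions out of each state, and reproduces the estimates of the worked example above:

    def estimate_gammas(bits):
        # gamma_S is estimated as (transitions from S into C) / (transitions from S)
        zero_next = {"C": "X", "X": "B", "B": "B"}
        to_c = {"C": 0, "X": 0, "B": 0}
        total = {"C": 0, "X": 0, "B": 0}
        state = "B"                    # we assume we start in state B
        for b in bits:
            total[state] += 1
            if b == 1:
                to_c[state] += 1
                state = "C"            # a one-bit is a transition into C
            else:
                state = zero_next[state]
        return {s: to_c[s] / total[s] for s in total if total[s] > 0}

    # The bitmap 00101100 of the worked example yields
    # gamma_C = 1/3, gamma_X = 1/2, gamma_B = 1/3:
    print(estimate_gammas([0, 0, 1, 0, 1, 1, 0, 0]))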

We can also lower storage costs if we find that the θ's are related in a regular manner over the terms. For example, since state X is intermediate between states C and B, we might find that γ_X is reasonably approximated by the average of γ_C and γ_B, that is, that θ_X = (1 + θ)/2. It may also be possible, without serious deterioration in performance, to divide the parameters into a small number of categories. Thus, we might divide our terms into four clustering classes: one class would represent no clustering (θ = 1); the other three classes would represent varying clustering strengths, with the values of the clustering parameter θ for these classes determined empirically. This simplification allows the clustering strength to be represented at a cost of just two bits per term. We will explore the value of such simplifications.

3.3 Model Testing

The model allows us to make a number of predictions about the properties of the bitmaps, and these can be used to test the model.

Test 1: One such test predicts the density of one-bits, N/D; this can be compared to the actual values of N and D, which are available to us without extra cost. The relationship is derived as follows. We assume we first have a span of documents without the designated term; the size of this span is E. This is followed by N spans, each made up of a document containing the term followed by an inter-term gap of G documents, and by a terminal span of approximate size E. To derive our estimate, we assume that each internal span has a length equal to its expected value, and that the lengths of the terminal spans are of comparable size (say E ≈ aG, for a ≈ 1). As N gets large, the contribution of E becomes negligible, allowing the following large-N relationship between D, N and the γ's:

    N/D = γ_B / (γ_B + (1 - γ_C)(1 - γ_X + γ_B)).   (1)

This relation, which can also be derived more formally from Markov chain theory, provides our first test of the model. The general theory of Markov chains allows us to derive the steady-state probabilities π_C, π_X and π_B of being in states C, X, and B:

    π_C = γ_B / (γ_B + (1 - γ_C)(1 - γ_X + γ_B)),   (2)
    π_X = (1 - γ_C) γ_B / (γ_B + (1 - γ_C)(1 - γ_X + γ_B)),   (3)
    π_B = (1 - γ_C)(1 - γ_X) / (γ_B + (1 - γ_C)(1 - γ_X + γ_B)).   (4)

Since π_C D = N, equations (1) and (2) are very similar.
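For reference, the closed forms (2)-(4) are easily checked numerically; the sketch below (steady_state is our own name, and the sample parameter values are arbitrary) computes the three probabilities and confirms that they sum to one:

    def steady_state(gc, gx, gb):
        # Closed forms of eqs. (2)-(4); pi_C also predicts the
        # one-bit density N/D of eq. (1).
        denom = gb + (1 - gc) * (1 - gx + gb)
        return gb / denom, (1 - gc) * gb / denom, (1 - gc) * (1 - gx) / denom

    pi_c, pi_x, pi_b = steady_state(0.6, 0.5, 0.05)
    print(pi_c, pi_x, pi_b, pi_c + pi_x + pi_b)   # the probabilities sum to 1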

Test 2: As part of the development of Test 1, we computed the average number of zeroes between two consecutive one-bits. A similar argument yields the useful relation that the average size, SZ, of an actual run of zeroes (that is, assuming at least one zero in the run) is given by

    SZ = γ_X + (1 - γ_X)(1 + γ_B)/γ_B.

If there are NZ zeroes in the bitmap, we expect approximately NZ/SZ runs of zeroes. Further, since the number of runs of ones is within one of the number of runs of zeroes, NZ/SZ is also an estimate of the number of runs of one-bits. Each of these relations can be used as a test of the model.

3.4 Performance Estimates

If we are in state S ∈ {B, X, C}, we expect to encode the next bitmap element in H(γ_S) bits, where H(x) = -(x log(x) + (1 - x) log(1 - x)). Using eqs. (2), (3), and (4), we estimate that the D bits of the bitmap can be reduced to

    B_M = (π_C H(γ_C) + π_B H(γ_B) + π_X H(γ_X)) D

bits. This estimate of performance can be used as a rough test of the Markov model. But even if the model is valid, we are left with the question of whether the savings of the full model justify its additional complexity relative to the independence model. The relative performance of the two models is easily computed. Under the assumption of the independence model, the appearance of a one-bit is determined by a single parameter, p, and the size of the bitmap is reduced to B_I = H(p) D bits. But the probability p of generating a one-bit depends on which state S we are in (governed by the known probabilities π_S), and on the probability of a transition to C from S. If the Markov model is valid, we could combine these to find:

    p = π_B γ_B + π_X γ_X + π_C γ_C = π_C.

The equation p = π_C could have been asserted directly: the probability of a one-bit is just the probability of going to state C from any of the states; but the long-term probability of going to C is the same as the probability of being in C, as the last equation asserts. The ratio B_M/B_I gives the relative advantage of using the full Markov model.
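Putting the pieces together, the ratio B_M/B_I can be computed directly from the γ's; a small sketch (compression_ratio is our own name, and the sample parameters are arbitrary):

    from math import log2

    def H(x):
        # binary entropy in bits, with H(0) = H(1) = 0
        return 0.0 if x in (0.0, 1.0) else -(x * log2(x) + (1 - x) * log2(1 - x))

    def compression_ratio(gc, gx, gb):
        # B_M / B_I: expected bits per bitmap position under the Markov
        # model versus the independence model with p = pi_C.
        denom = gb + (1 - gc) * (1 - gx + gb)
        pi_c = gb / denom
        pi_x = (1 - gc) * gb / denom
        pi_b = (1 - gc) * (1 - gx) / denom
        b_m = pi_c * H(gc) + pi_x * H(gx) + pi_b * H(gb)
        b_i = H(pi_c)
        return b_m / b_i

    print(compression_ratio(0.6, 0.5, 0.05))  # about 0.67: clustering helps here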

4 Partial Experiment

We decided for technical reasons to continue testing our methods on the concordance of the Hebrew Bible, as we did in our earlier paper. The words of the Bible are partitioned into 929 chapters, which serve as the documents of our model; the coordinates are of the form (d, m; w_1, ..., w_m). The first step in applying the Markov model is to evaluate the values of the γ's and θ. This was done by scanning the bitmaps generated from the concordance, simulating the B, X and C states, and counting the transitions.

There is of course no sense in applying the Markov model to the full set of terms; most of these appear in very few documents, and 57% of the terms appear in one document only. Using the Markov model, one would need 9.35 bits on average to encode the d-fields of terms that appear in fewer than 60 documents, whereas one could simply use a fixed encoding of 10 bits to designate in which of the 929 documents the term appears. We therefore restricted our test to the 426 terms appearing in at least 60 documents. Though these are only 1% of the terms, they account for 48% of the coordinates.

Table 1: Distribution of γ values (for each range, the number of terms with γ_S in that range, S ∈ {C, X, B}, for N ≥ 1 and for N ≥ 60)

Table 1 shows the distribution of the values of the estimated parameters γ_C, γ_X, and γ_B, both for the full set of coordinates and for the most frequent words. The table shows, for each range, the number of terms having their value of γ_S in the given range, for S ∈ {C, X, B}. We see a tendency to have more terms with γ_C in the higher ranges than with γ_X or γ_B. The large majority of terms have very low γ values (less than 0.02), because these terms are so rare that hardly any cluster is found. By comparing the two γ_B columns, one sees that no term with N < 60 has a γ_B value larger than 0.1.

The average values, γ̄_S, of the γ_S in the given sets are displayed in Table 2. The last line of Table 2 gives the ratio B_M/B_I, which estimates the relative advantage of using the Markov model. The fact that these values are so close to 1 might at first sight indicate that the expected gain is very small. It should, however, be noted that this is probably due to the fact that our test concordance is much too small. Most of the terms, even most of those appearing in more than 60 documents, are still relatively rare; at the other extreme, those few terms appearing in many documents are probably not content-bearing words, and therefore will not exhibit any clustering tendency. But even if the B_M/B_I ratio is not very impressive, there is still evidence that clustering occurs, as suggested by the fact that γ̄_C > γ̄_X > γ̄_B for both sets considered; if terms were distributed randomly, the various γ-values would tend to be equal.

Table 2: Parameter values (γ̄_C, γ̄_X, γ̄_B and the ratio B_M/B_I, for N ≥ 1 and for N ≥ 60)

The compression results are listed in Table 3, giving the average number of bits needed to encode the document field of a coordinate. The first line corresponds to the static independent model of [1], that is, N and D are fixed for each term. The second line gives the results of the dynamic independent model of [1], where N and D are updated after each encoded d-field. The last line corresponds to the Markov model.

Table 3: Comparison of bit-generation models (average bits per document field for the static independent, dynamic independent, and Markov models, for N ≥ 1 and for N ≥ 60)

The first column gives the averages on the full set. We see that using the Markov model would actually incur a loss. The second column gives the averages for the 426 terms appearing in at least 60 documents. These terms are not necessarily clustered, but because of their relatively high density, the Markov model shows a moderate improvement here over the earlier models. The denser the bitmaps, the better the Markov model performs. For instance, for the 79 words appearing in at least 200 documents, only 2.24 bits are needed on average to encode the value of the document field.

5 Final Remarks

As noted already in our earlier paper, the Bible is not really a good example of a database on which to apply our new methods. The Bible is too small (about 1.5 MB), so that hardly any concordance is needed. On real-life concordances, of the size of hundreds of MB, the compression savings are usually much more substantial (about 40% by method POM of [3] on large databases, versus 16% by POM here), so we expect our method to perform much better there as well. The fact, however, that restricting our attention to the most frequent words still covers a large part of the concordance holds also for large systems. For instance, the 100 most frequent words of the Trésor de la Langue Française (TLF) account for 57% of the text. We are currently experimenting with the other tests and models, and shall apply these to the database of the TLF.

6 References

1. Bookstein A., Klein S.T., Raita T., Model based concordance compression, Proc. Data Compression Conference DCC-92, Snowbird, Utah (1992).
2. Bookstein A., Klein S.T., Ziff D.A., A systematic approach to compressing a full text retrieval system, Information Processing & Management 28 (1992).
3. Choueka Y., Fraenkel A.S., Klein S.T., Compression of concordances in full-text retrieval systems, Proc. 11th ACM-SIGIR Conference, Grenoble (1988).
4. Cormack G.V., Horspool R.N., Data compression using dynamic Markov modeling, The Computer Journal 30 (1987).
5. Klein S.T., Bookstein A., Deerwester S., Storing text retrieval systems on CD-ROM: compression and encryption considerations, ACM Trans. on Information Systems 7 (1989).
6. Llewellyn J.A., Data compression for a source with Markov characteristics, The Computer Journal 30 (1987).
7. Wisniewski J.L., Compression of index term dictionary in an inverted-file oriented database: some efficient algorithms, Information Processing & Management 22 (1986).
8. Witten I.H., Bell T.C., Nevill C.G., Models for compression in full-text retrieval systems, Proc. Data Compression Conference DCC-91, Snowbird, Utah (1991).
