Adjacency Matrix Based Full-Text Indexing Models


Journal of Software, Vol.13, No.10, 2002

ZHOU Shui-geng (1), HU Yun-fa (2), GUAN Ji-hong (3)
(1) Department of Computer Science and Engineering, Fudan University, Shanghai, China
(2) Department of Computer and Information Technology, Fudan University, Shanghai, China
(3) School of Computer Science, Wuhan University, Wuhan, China
E-mail: sgzhou@fudan.edu.cn
Received January 18, 2002; accepted July 1, 2002

Abstract: With the rapid growth of online text information and user accesses, query-processing efficiency has become the major bottleneck of information retrieval (IR) systems. This paper proposes two new full-text indexing models to improve the query-processing efficiency of IR systems. By using a directed graph to represent a text string, the adjacency matrix of a text string is introduced. Two approaches are proposed to implement the adjacency matrix of a text string, which lead to two new full-text indexing models: the adjacency matrix based inverted file and the adjacency matrix based PAT array. Query algorithms for the new models are developed, and performance comparisons between the new models and the traditional models are carried out. Experiments over real-world text collections are conducted to validate the effectiveness and efficiency of the new models. The new models can improve query-processing efficiency considerably at the cost of extra storage overhead that is much smaller than the size of the original text database, so they are suitable for applications of large-scale text databases.

Key words: information retrieval; full-text indexing; inverted file; PAT array; adjacency matrix; model

The rapid development and wide application of the World Wide Web (WWW) have led to an explosive increase in online text information. More and more people surf the Web for information. Such a situation poses new challenges to researchers and scientists in the information retrieval (IR) field.
The challenge is how to organize and manage massive text information on the Web effectively and efficiently, and how to find what users need on the Web accurately and completely [1]. Currently, the popular solution is the Web search engine, which is essentially an IR system with a text-indexing model as its core technique [2]. As the amount of text information increases exponentially and the number of user queries grows rapidly, query-processing efficiency becomes the major bottleneck of Web search engines. Take AltaVista [3], one of the largest Web search engine systems in operation, for example. In 1998, the overall AltaVista system was running on 20 multi-processor machines, all of them with more than 130 GB of RAM and over 500 GB of disk space. Query processing alone consumes more than 70% of these resources [1]. To a great extent, the efficiency of information retrieval systems depends on the underlying text-indexing models. That is why efficient text-indexing techniques have been a hot research topic in the IR field [4].

Up to date, several text-indexing models have been developed, among which the widely accepted ones are inverted files [5,6], signature files [1,7], and the PAT (Patricia tree) array [1,8,9]. Owing to their advantages of fast response and easy implementation, the inverted file and its variations have been used in most text database systems and search engines [2].

Generally, text-indexing models can be classified into two categories: character-based indexing and word (or keyword)-based indexing. The first step toward word-based indexing is to extract meaningful words from the text string, which is not an easy task and may require deep understanding of the context, especially for Chinese text. Conversely, character-based indexing methods index all characters appearing in the text database, so character-based indexing is also referred to as full-text indexing. The main advantage of full-text indexing IR systems is that they avoid the complicated and expensive process of semantic indexing. From the end user's point of view, full-text searching of online documents is appealing because a valid query is just any word or sentence of the document.

This paper intends to develop new full-text indexing models of higher efficiency. By viewing a text database as a directed graph and combining the concept of the adjacency matrix of a directed graph with the structures of the traditional inverted file and PAT array, we propose two new full-text indexing models, which we call the adjacency matrix based inverted file and the adjacency matrix based PAT array, respectively. The new models can considerably improve the query-processing efficiency of IR systems at the cost of extra space overhead that is much smaller than the size of the original text database; thus they are suitable for applications of large-scale text databases and can be used as an effective way to update IR systems in operation.

Supported by the National Natural Science Foundation of China under Grant No. ( ); the Natural Science Foundation of Hubei Province of China under Grant No.2001ABB050.

ZHOU Shui-geng is an associate professor at the Department of Computer Science and Engineering, Fudan University. His research interests are in databases, data warehousing, data mining and information retrieval. HU Yun-fa is a professor and doctoral supervisor at the Department of Computer and Information Technology, Fudan University. His current research areas are databases, knowledge bases and digital library systems. GUAN Ji-hong is an associate professor at the School of Computer Science, Wuhan University. Her current research areas are spatial databases, spatial data mining and geographic information systems.

1 Preliminaries

Let Σ be an alphabet: a finite, ordered set of characters.
For an arbitrary character $l$ in Σ, there is an associated natural number $i$, which indicates the position of $l$ in Σ, and $l$ is also denoted by $l_i$. In the case of Chinese text databases, Σ corresponds to a standard Chinese character set, such as GB, where the region-position code assigned to each character in GB may be used as the character's position identifier in Σ (an alternative may be the Unicode system). For English text collections, Σ is a set of letters, digits, punctuation marks and other symbols that may occur in English text, and the ASCII code of each character in Σ is taken naturally as its position identifier in Σ.

A text string, or simply a string, over Σ is a finite sequence of characters from Σ. Specifically, a string that has no characters at all is called the empty string and is denoted by ε. To avoid confusion, we generally use $u, v, w, x, y, z$ to denote strings. The length of a string is its length as a sequence. We denote the length of a string $w$ by $|w|$. Alternatively, a string $w$ can be considered as a function $w: \{1, \ldots, |w|\} \to \Sigma$; the value of $w(j)$, where $1 \le j \le |w|$, is the symbol in the $j$th position of $w$. To distinguish identical symbols at different positions in a string, we refer to them as different occurrences of the symbol. That is, the symbol $l \in \Sigma$ occurs in the $j$th position of the string $w$ if $w(j) = l$.

Two strings over the same alphabet can be combined to form a third by the operation of concatenation. The concatenation of strings $x$ and $y$, written $xy$, is the string $x$ followed by the string $y$; formally, $w = xy$ if and only if $|w| = |x| + |y|$, $w(j) = x(j)$ for $j = 1, \ldots, |x|$, and $w(|x| + j) = y(j)$ for $j = 1, \ldots, |y|$. A string $v$ is a substring of a string $w$ if and only if there are strings $x$ and $y$ such that $w = xvy$. Both $x$ and $y$ could be ε, so every string is a substring of itself. If $w = xv$ for some $x$, then $v$ is a suffix of $w$; if $w = vy$ for some $y$, then $v$ is a prefix of $w$. If $x$ is a substring of string $w$, then we denote by $P(w, x)$ the set of positions at which $x$ occurs in $w$, i.e., the occurrences of $x$ in $w$.
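As a small illustration of this notation, the following Python sketch computes $P(w, x)$ with 1-based positions. The function name `occurrences` and the sample string are illustrative assumptions and do not come from the paper.

```python
def occurrences(w: str, x: str) -> set[int]:
    """Return P(w, x): the 1-based positions at which x occurs in w.

    Overlapping occurrences are all reported, which matches the
    definition "w = u x v for some strings u and v".
    """
    if not x:
        raise ValueError("x must be a non-empty string")
    return {j + 1 for j in range(len(w) - len(x) + 1) if w.startswith(x, j)}


w = "abcabca"
print(sorted(occurrences(w, "a")))   # positions of the single character 'a'
print(sorted(occurrences(w, "ca")))  # positions of the adjacent pair 'ca'
```

Positions are reported as a set because the later definitions treat occurrences as sets and combine them with unions and intersections.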
For a string $w$ of finite length, let us pad it artificially at its right end with an infinite number of nulls (or any special character that is not included in Σ). Then a semi-infinite string (abbreviated to sistring) of string $w$ is the sequence of characters starting at a certain position within $w$ and continuing to the right. Obviously, two sistrings starting at different positions are always different. To simplify the description, and to guarantee that no sistring is a prefix of another, it is enough to end the string $w$ with a unique end-of-string symbol that does not appear in Σ. Thus, sistrings can be unambiguously identified by their starting positions. That is to say, $P(w, x)$ is a singleton if $x$ is a sistring of string $w$. The result of a lexicographic comparison between two sistrings is based on the text of the sistrings, not on their positions.

Definition 1. A text database TB is a collection of text documents, each of which is a string over Σ. Neglecting the boundary between any two adjacent documents, a text database can be seen as a long string whose length is the sum of the lengths of all documents in the text database. We denote by $|TB|$ the length of text database TB. From the sistring point of view, text database TB corresponds to a sequence of sistrings. In what follows, we treat a text database as a string when discussing the inverted file, and view it as a sequence of sistrings when dealing with the PAT array.

Definition 2. Suppose $V$ is the set of all unique characters appearing in text database TB, $V = \{t_i \mid i = 1, \ldots, |V|\} \subseteq \Sigma$. Each character $t_i$ in $V$ occurs at different positions in the string of TB. Let $p_i$ be the set of positions at which $t_i$ occurs in TB, denoted $p_i = \{p_{i1}, p_{i2}, \ldots, p_{i|p_i|}\}$, where $|p_i|$ is the number of occurrences of $t_i$ in TB and $p_{ij}$ is the $j$th position where $t_i$ occurs in TB (counting from the starting point of the TB string). The full-text indexing inverted file can be written formally as follows:

$\{\langle t_i, p_i \rangle \mid i = 1, \ldots, |V|\}$. (1)

Practically, indexed terms and the corresponding occurrences are stored separately, i.e., (1) is split into two parts:

$\{\langle t_i, pt_i \rangle \mid i = 1, \ldots, |V|\}$, (2)

$\{p_i \mid i = 1, \ldots, |V|\}$. (3)

Above, $pt_i$ is a pointer to $p_i$. In (2), all indexed terms are sorted lexicographically and stored sequentially with the corresponding pointers in the index file, while the occurrences in (3) are stored sequentially in the posting file.

Definition 3. Suppose $V$ is the set of all unique characters appearing in text database TB, $V = \{t_i \mid i = 1, \ldots, |V|\} \subseteq \Sigma$. Let TB be denoted by a string $w = c_1 c_2 \cdots c_n \$$ ($n = |TB|$), where $c_i$ is the $i$th character in $w$, $c_i \in V$, and \$ is the assumed unique end-of-string symbol that does not appear in Σ.
String $w$ corresponds to a sequence of sistrings:

$Sis = \langle sis_1, sis_2, \ldots, sis_n \rangle$, (4)

where $sis_i = c_i c_{i+1} \cdots c_n \$$ indicates the $i$th sistring of $w$. Sorting the sistrings in (4) lexicographically, we get a new sequence of sistrings:

$PSis = \langle psis_1, psis_2, \ldots, psis_n \rangle$. (5)

Obviously, for each sistring $psis_i$ in $PSis$, there is a sistring $sis_j$ in $Sis$ that is equal to $psis_i$, while $i$ and $j$ are not necessarily the same. (5) is the full-text indexing PAT array of text database TB. In practice, the positions of the sistrings, instead of the sistrings themselves, are used in (5).

Definition 4. For a text database TB with a set $V$ of all unique characters appearing in TB, there exists a directed graph $TBG = \langle V_g, E_g \rangle$, where $V_g$ is the set of vertices, each of which corresponds to a unique character appearing in TB, i.e., $V_g = V$; $E_g$ is the set of directed edges, each of which corresponds to a pair of adjacent characters appearing in TB and points from the first character to the second one. We call TBG the directed graph of text database TB.

Definition 5. As defined in graph theory, a weighted directed graph is a directed graph in which each edge has an associated value. In the context of this paper, the value associated with each directed edge is the position of the character corresponding to the edge's source vertex. Some adjacent-character pairs occur at several positions in the text database; that is, in the directed graph of the text database there may be multiple directed edges with the same starting and ending vertices. We compact all directed edges with the same starting and ending vertices into one directed edge and unite all values associated with these edges into a set of values. We call the resulting graph the weighted directed graph of the text database. Formally, text database TB corresponds to a weighted directed graph $WTBG = \langle V_w, E_w, L_w \rangle$, where $V_w = V$ is the set of vertices, $E_w$ is the set of directed edges, and $L_w$ is the set of values associated with all directed edges in $E_w$. Denote by $L_w(l_i, l_j)$ the set of values associated with directed edge $l_i \to l_j$; then we have

$L_w(l_i, l_j) = P(TB, l_i l_j)$, (6)

$L_w = \{L_w(l_i, l_j) \mid l_i \to l_j \in E_w\}$. (7)

Example 1. Given a Chinese text string $w$. There are eleven unique characters in $w$, of which nine are Chinese characters and the remaining two are punctuation marks. These unique characters constitute the vertex set $V_w = \{l_1, l_2, \ldots, l_{11}\}$. Fourteen unique adjacent-character pairs constitute the directed edge set $E_w$. The values associated with these directed edges are as follows:

$L_w(l_1, l_4) = \{1, 7\}$, $L_w(l_2, l_4) = \{13, 19\}$, $L_w(l_3, l_4) = \{25, 31\}$,
$L_w(l_{10}, l_1) = \{6\}$, $L_w(l_{10}, l_2) = \{12, 18\}$, $L_w(l_{10}, l_3) = \{24, 30\}$,
$L_w(l_4, l_5) = \{2, 8, 14, 20, 26, 32\}$, $L_w(l_5, l_8) = \{3, 15, 27\}$, $L_w(l_5, l_6) = \{9, 21, 33\}$,
$L_w(l_9, l_{10}) = \{5, 17, 29\}$, $L_w(l_7, l_{10}) = \{11, 23\}$, $L_w(l_7, l_{11}) = \{35\}$,
$L_w(l_8, l_9) = \{4, 16, 28\}$, $L_w(l_6, l_7) = \{10, 22, 34\}$.

Figure 1 illustrates the weighted directed graph of string $w$.

Fig.1 Weighted directed graph of string w

2 Adjacency Matrix Based Inverted File and PAT Array

Definition 6. For a text database, there is a corresponding weighted directed graph, which is associated with an adjacency matrix. Thus, a text database has an adjacency matrix in which each matrix element corresponds to the set of values associated with the corresponding directed edge. Formally, let $V \subseteq \Sigma$ be the set of all unique characters appearing in text database TB, and denote by $D$ the adjacency matrix of TB; we have

$D = [d_{ij}]$, (8.1)

$d_{ij} = d(l_i, l_j) = L_w(l_i, l_j) = P(TB, l_i l_j)$. (8.2)

We term matrix $D$ the adjacency matrix based full-text indexing model of text database TB.

Example 2. Given the string $w$ in Example 1, according to Definition 6 and the weighted directed graph illustrated in Fig.1, it is easy to build the adjacency matrix indexing model of $w$. Let $V_w = \{l_1, l_2, \ldots, l_{11}\}$.
The text indexing model $D$ is a matrix whose elements are as follows:

$d_{1,4} = L_w(l_1, l_4) = \{1, 7\}$, $d_{2,4} = L_w(l_2, l_4) = \{13, 19\}$, $d_{3,4} = L_w(l_3, l_4) = \{25, 31\}$,
$d_{4,5} = L_w(l_4, l_5) = \{2, 8, 14, 20, 26, 32\}$, $d_{5,6} = L_w(l_5, l_6) = \{9, 21, 33\}$, $d_{5,8} = L_w(l_5, l_8) = \{3, 15, 27\}$,
$d_{6,7} = L_w(l_6, l_7) = \{10, 22, 34\}$, $d_{7,10} = L_w(l_7, l_{10}) = \{11, 23\}$, $d_{7,11} = L_w(l_7, l_{11}) = \{35\}$,
$d_{8,9} = L_w(l_8, l_9) = \{4, 16, 28\}$, $d_{9,10} = L_w(l_9, l_{10}) = \{5, 17, 29\}$,
$d_{10,1} = L_w(l_{10}, l_1) = \{6\}$, $d_{10,2} = L_w(l_{10}, l_2) = \{12, 18\}$, $d_{10,3} = L_w(l_{10}, l_3) = \{24, 30\}$.

The other matrix elements are the empty set Φ. Figure 2 illustrates the adjacency matrix of string $w$.

Fig.2 Adjacency matrix of string w

Practically, there are two ways to implement the adjacency matrix based full-text indexing model:

1) Seeing matrix element $d_{ij}$ in $D$ as the set of positions at which the adjacent-character pair $l_i l_j$ occurs in TB, and sorting all elements of $d_{ij}$ in positional order; then $d_{ij}$ is equivalent to the inverted list of indexed term $l_i l_j$.
2) Treating matrix element $d_{ij}$ as the set of positions of the sistrings whose prefix is the adjacent-character pair $l_i l_j$, and sorting all elements of $d_{ij}$ in the lexical order of the corresponding sistrings; then $d_{ij}$ is similar to the PAT array of the sistrings with prefix $l_i l_j$.

To distinguish these two implementations, we use $D_1$ and $D_2$ to denote the implemented adjacency matrices of approach 1) and approach 2) respectively, and term $D_1$ the adjacency matrix based full-text indexing inverted file and $D_2$ the adjacency matrix based full-text indexing PAT array.

Formally, $D_1$ is a kind of reorganization of the traditional inverted file, i.e., the character-based inverted file is transformed into an adjacent-character-pair based inverted file and all inverted lists are organized in the form of an adjacency matrix. Conversely, the traditional inverted file is a kind of compression of $D_1$, that is, the adjacent-character-pair based inverted file is compacted into the character-based inverted file. Analogously, $D_2$ is a kind of decomposition of the traditional PAT array, i.e., a long PAT array is split into a collection of short PAT arrays such that each short PAT array corresponds to a set of sistrings sharing a prefix in the form of an adjacent-character pair; on the contrary, the traditional PAT array is a kind of aggregation of $D_2$.

Based on the definitions above, it is straightforward to give the algorithms for the transformations between $D_1$ and the traditional inverted file, and between $D_2$ and the traditional PAT array, as follows.

Algorithm 1. Build $D_1$ from the traditional inverted file.
Input: indexing lists of any two indexing terms in the text database: $\{l_i: p_{i1}, \ldots, p_{in}\}$, $\{l_j: p_{j1}, \ldots, p_{jm}\}$.
Output: $D_1 = [d_{ij}]$.
Process: $d_{ij} = L_w(l_i, l_j) = \{k \mid \exists (p_{ik_1}, p_{jk_2}): p_{ik_1} = k \text{ and } p_{jk_2} = k + 1, \text{ where } 1 \le k_1 \le n \text{ and } 1 \le k_2 \le m\}$.

Algorithm 2. Build the traditional inverted file from $D_1$.
Input: $D_1 = [d_{ij}]$.
Output: $\{l_i: p_i\}$.
Process: $p_i = \bigcup_{j=1}^{|V|} d_{ij}$. Here, $\bigcup$ indicates the set union operator.

Algorithm 3. Build $D_2$ from the traditional PAT array.
Input: PAT array $PSis = \{psis_1, psis_2, \ldots, psis_n\}$.
Output: $D_2 = [d_{ij}]$, $d_{ij} = L_w(l_i, l_j)$.
Process:
1) Find the first sistring $psis_k$ matching $l_i l_j *$ in $PSis$ by binary search.
2) Starting from $psis_k$, search continuously in $PSis$ for the other sistrings matching $l_i l_j *$ in the right and left directions until no such sistring can be found.
3) Assume $psis_{k_1}$ and $psis_{k_2}$ are the two sistrings that match $l_i l_j *$ and are located at the furthest left and right positions in $PSis$ respectively; then $d_{ij} = L_w(l_i, l_j) = \{psis_{k_1}, psis_{k_1+1}, \ldots, psis_k, \ldots, psis_{k_2-1}, psis_{k_2}\}$ ($1 \le k_1 \le k \le k_2 \le n$).

Algorithm 4. Build the traditional PAT array from $D_2$.
Input: $D_2 = [d_{ij}]$.
Output: PAT array $PSis = \{psis_1, psis_2, \ldots, psis_n\}$.
Process: $PAT = \bigcup_{i=1}^{|V|} \bigcup_{j=1}^{|V|} d_{ij}$. Similarly, $\bigcup$ is the set union operator.

3 Query Processing

3.1 Query processing based on $D_1$

Note that in this paper a query is an arbitrary character string and the result is the set of documents containing the query string. Query processing based on $D_1$ is essentially a kind of set operation. The following theorem concerns $D_1$ based query processing.

Theorem 1. Let $l_1 l_2 \cdots l_n$ be a query string and $q(l_1 l_2 \cdots l_n)$ the query result; then
1) When $n = 2k$, $q(l_1 l_2 \cdots l_n) = d^{0}_{1,2} \cap d^{2}_{3,4} \cap \cdots \cap d^{2k-4}_{2k-3,2k-2} \cap d^{2k-2}_{2k-1,2k}$;
2) When $n = 2k+1$, $q(l_1 l_2 \cdots l_n) = d^{0}_{1,2} \cap d^{2}_{3,4} \cap \cdots \cap d^{2k-4}_{2k-3,2k-2} \cap d^{2k-2}_{2k-1,2k} \cap d^{2k-1}_{2k,2k+1}$.
Here, $d^{m}_{ij} = d(l_i, l_j) - m = \{x - m \mid x \in d(l_i, l_j)\}$.

Proof. According to Definition 6, it is easy to prove Theorem 1. We omit the details here.

In order to improve query-processing efficiency, we must reduce the number of disk accesses or set intersections. Some available techniques are as follows.

1) Reducing the number of set intersections. This is equivalent to reducing the number of matrix elements involved in set intersection. We can use the optimizing technique proposed in Ref.[4] to cut down the matrix elements involved in query processing; it utilizes a directed graph method to optimize the search path at the cost of a little more space overhead.
2) Reducing the number of elements in the sets involved in the intersection operation. By using the properties of the query string, it is possible to cut down the number of elements in the sets involved in the intersection operation. Here we consider a special case: substrings occurring repeatedly in the query string. Suppose $l_i l_{i+1}$ ($1 \le i \le n-1$) occurs twice in the query string $l_1 l_2 \cdots l_n$ at an interval of $k$; then the valid elements of $d(l_i, l_{i+1})$ for the intersection operation are $d(l_i, l_{i+1}) \cap (d(l_i, l_{i+1}) + k)$. Actually, there exist many such cases, which we cannot enumerate completely here.

3) Improving the efficiency of set intersection. If the text database is very large, the sets involved in query processing will be large too. Note that intersecting large sets is also time consuming. Given two sets $A$ and $B$ that include $m$ and $n$ elements respectively (let $m \le n$), a more efficient way to carry out the intersection between $A$ and $B$ is a) sorting the elements of set $A$ and set $B$ in advance; b) comparing the elements in the larger set with the elements in the smaller one. In this way, $m$ and $n$ are the lower bound and upper bound, respectively, of the element comparisons needed for $A \cap B$.

3.2 Query processing based on $D_2$

Algorithm 5. Query processing based on $D_2$.
Input: Full-text indexing model $D_2$ and query string $l_1 l_2 \cdots l_n$.
Output: $q(l_1 l_2 \cdots l_n)$.
Process:
1) Based on $D_2$ and $l_1 l_2$, obtain the PAT array $d(l_1, l_2) = \{psis_1, psis_2, \ldots, psis_m\}$.
2) Find the first sistring $psis_k$ matching $l_1 l_2 \cdots l_n *$ in $d(l_1, l_2)$ by binary search.
3) Starting from $psis_k$, search continuously in $d(l_1, l_2)$ for the other sistrings matching $l_1 l_2 \cdots l_n *$ in the right and left directions until no such sistring can be found.
4) Assume $psis_{k_1}$ and $psis_{k_2}$ are the sistrings that match $l_1 l_2 \cdots l_n *$ and are located at the furthest left and right positions in $d(l_1, l_2)$ respectively; then $q(l_1 l_2 \cdots l_n) = \{psis_{k_1}, psis_{k_1+1}, \ldots, psis_k, \ldots, psis_{k_2-1}, psis_{k_2}\}$ ($1 \le k_1 \le k \le k_2 \le m$).

4 Comparisons with Traditional Indexing Models

Suppose the size of text database TB is $N$ ($N = |TB|$), the number of unique characters in the text database is $m$ ($m = |V|$), and the length of the query is $l_q$. Furthermore, for simplicity of analysis, we assume that all characters occur in the text database with equal probability. Thus the average size of the elements in adjacency matrix $D_2$ is $E \approx N/(m \cdot m)$.

4.1 Traditional inverted file vs. $D_1$

1) Time cost for index building: $O(N)$.
2) Space overhead
a) $D_1$: $O(N + m \cdot m)$.
b) Inverted file: $O(N + m)$.
3) Search time cost
a) $D_1$: At most $(l_q + 1)/2$ disk accesses and $(l_q - 1)/2$ set intersections.
b) Inverted file: $l_q$ disk accesses and $(l_q - 1)$ set intersections.
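The access counts in item 3) can be illustrated with a minimal Python sketch. It is an in-memory toy under stated assumptions, not the authors' implementation: the function names are hypothetical, lists are turned into Python sets instead of being read from an on-disk posting file, and the result is the set of start positions rather than the set of documents. The letters a to k stand in for the characters $l_1$ to $l_{11}$ of Example 1, arranged so that the adjacent pairs reproduce the matrix elements listed in Example 2.

```python
from collections import defaultdict


def build_char_index(text):
    """Traditional full-text inverted file: character -> sorted 1-based positions."""
    index = defaultdict(list)
    for pos, ch in enumerate(text, start=1):
        index[ch].append(pos)
    return index


def build_d1(text):
    """D1: adjacent pair (l_i, l_j) -> sorted 1-based positions of l_i."""
    d1 = defaultdict(list)
    for pos in range(1, len(text)):
        d1[text[pos - 1], text[pos]].append(pos)
    return d1


def query_char_index(index, query):
    """One shifted list per query character; returns (start positions, lists read)."""
    result = None
    for offset, ch in enumerate(query):
        shifted = {p - offset for p in index.get(ch, [])}
        result = shifted if result is None else result & shifted
    return sorted(result), len(query)


def query_d1(d1, query):
    """Theorem 1: pairs at even offsets, plus the final pair when n is odd."""
    n = len(query)
    if n < 2:
        raise ValueError("D1 indexes character pairs; the query needs 2+ characters")
    offsets = list(range(0, n - 1, 2))
    if n % 2 == 1:
        offsets.append(n - 2)  # pair (l_2k, l_2k+1), shifted by 2k - 1
    result = None
    for m in offsets:
        shifted = {x - m for x in d1.get((query[m], query[m + 1]), [])}
        result = shifted if result is None else result & shifted
    return sorted(result), len(offsets)


# Stand-in for the string of Example 1 (a..k replace l_1..l_11).
text = "adehijadefgjbdehijbdefgjcdehijcdefgk"
d1 = build_d1(text)
index = build_char_index(text)
print(d1["a", "d"])                      # [1, 7], element d_{1,4} of Example 2
print(query_d1(d1, "dehij"))             # ([2, 14, 26], 3)
print(query_char_index(index, "dehij"))  # ([2, 14, 26], 5)
```

For the five-character query, the pair-based index reads three lists, which is $(l_q + 1)/2$, while the character-based index reads five, in line with the counts above. A disk-based implementation would intersect sorted lists by merging rather than building sets.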
Furthermore, the average size of the matrix elements in $D_1$ is about $1/m$ of the size of an inverted list in the inverted file, which results in higher query-processing efficiency. Note that 1) disk access is the major factor that influences query efficiency; 2) for a large-scale text database, we have $N \gg m$ because $m$ is limited while $N$ can grow as the text database expands. That is to say, the extra space overhead of $D_1$ is much less than the size of the original text database. In the case of English text collections, $m$ is not greater than 256, while in the Chinese text environment, $m$ is generally 6000 or more. Thus, the adjacency matrix based inverted file model can achieve a benefit of more than 50% in query efficiency at the cost of extra space overhead that is much smaller than the size of the original text database.

4.2 Traditional PAT array vs. $D_2$

1) Time cost for index building: $O(N \log N)$.
2) Space overhead
a) $D_2$: $O(N + m \cdot m)$.
b) PAT array: $O(N)$.
3) Search time cost
a) $D_2$: At most $4\log_2(E)$ disk accesses and $2\log_2(E) - 1$ comparisons.
b) PAT array: At most $4\log_2(N)$ disk accesses and $2\log_2(N) - 1$ comparisons.

Let us consider the worst case. For large-scale text databases, we have $N \gg m$, which means that the extra storage overhead of $D_2$ is negligible compared with the size of the original text database. At the same time, since $\log_2(E) = \log_2(N) - 2\log_2(m)$, $D_2$ requires $8\log_2(m)$ fewer disk accesses and $4\log_2(m)$ fewer comparisons than the traditional PAT array. Specifically, suppose the size of the text database is $N = 640$ MB; then at most 117 disk accesses are required for the traditional PAT array. In the case of English text collections, let $m = 256$; then 64 disk accesses are cut down by $D_2$, i.e., the improvement ratio of query efficiency is 54%. For the Chinese text environment, let $m = 6763$ (taking GB for example); then 101 disk accesses are cut down by $D_2$, that is, a query efficiency improvement ratio of 86%. Certainly, these results are approximate estimates, which may deviate somewhat from the realistic values. However, because each matrix element in $D_2$ is only part of the entire PAT array of the text database, $D_2$ can still outperform the traditional PAT array.

Finally, we compare $D_1$ and $D_2$. Note that there is not much difference in space overhead between $D_1$ and $D_2$. However, it is worthwhile to analyze the difference in their search time cost. Again, we consider disk access the major bottleneck of query efficiency. Obviously, the query efficiency of $D_1$ is related to the length of the query string, while the query efficiency of $D_2$ depends on the size of the matrix element involved in query processing. When the disk accesses of the two models are equal, we have

$(l_q + 1)/2 \approx 4\log_2(E)$. (9)

That is,

$(l_q + 1)/2 \approx 4\log_2(N/(m \cdot m))$. (10)

From (10), we obtain

$l_q \approx 8\log_2(N/(m \cdot m)) - 1$. (11)

Equation (11) sets an approximate criterion for when to handle a query using $D_1$ and when using $D_2$.
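The disk-access estimates of this section and the criterion in Eq. (11) can be checked numerically with a short Python sketch. The function names are hypothetical, and reading the database size as $640 \times 2^{20}$ characters is an assumption, chosen because it reproduces the figure of 117 disk accesses quoted above.

```python
import math


def pat_disk_accesses(n):
    """Worst-case disk accesses for the traditional PAT array: 4 * log2(N)."""
    return 4 * math.log2(n)


def d2_saving(m):
    """Accesses saved by D2: 4*log2(N) - 4*log2(N / m^2) = 8 * log2(m)."""
    return 8 * math.log2(m)


def crossover_length(n, m):
    """Query length at which D1 and D2 need equally many accesses, Eq. (11)."""
    return 8 * math.log2(n / (m * m)) - 1


N = 640 * 2**20  # assumption: the database size is read as 640 * 2^20 characters

for label, m in (("English", 256), ("Chinese", 6763)):
    total = pat_disk_accesses(N)
    saved = d2_saving(m)
    print(
        label,
        math.floor(total),                    # PAT array accesses
        math.floor(saved),                    # accesses saved by D2
        f"{saved / total:.1%}",               # improvement ratio
        round(crossover_length(N, m), 1),     # l_q threshold from Eq. (11)
    )
```

Under this reading of $N$, the sketch gives 117 accesses for the PAT array, savings of 64 and 101 accesses, and thresholds of roughly 105 and 30 characters, so queries shorter than the threshold favor $D_1$ and longer ones favor $D_2$.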
We carry out further estimation in the following two specific cases:

1) In the case of an English text database, let $N = 640$M and $m = 256$; we have $l_q \approx 105$. That is to say, when the query is shorter than 105 characters, it is more efficient to handle it using $D_1$ than using $D_2$. On the contrary, if the query is longer than 105 characters, $D_2$ is favorable.
2) In the Chinese text environment, let $N = 640$M and $m = 6763$; we obtain $l_q \approx 30$, i.e., when the query is shorter than 30 characters, it is more efficient to handle it using $D_1$ than using $D_2$. Otherwise, it is favorable to use $D_2$.

For a given text database, if the query is short, it is more efficient to process it using $D_1$ than using $D_2$. Otherwise, it is favorable to use $D_2$. In reality, it is possible to accommodate the two query modes simultaneously in the same text database. In implementation, we can establish the indexing matrix according to $D_2$, then choose query mode $D_1$ or $D_2$ according to the query length.

5 Experimental Results

Experiments are carried out over five real-world Chinese text collections (listed in Table 1) to validate the feasibility and efficiency of the new models. These text collections contain mainly Chinese classics, contemporary Chinese novels and documents from several Chinese BBS sites. The experiments are conducted on a PC with a PIII

500 MHz CPU and 512 MB RAM. The goal of the experiments is to measure the improvement ratio of query processing efficiency of the new indexing models.

We define the query processing efficiency improvement ratio as follows.

$r = (t_{old} - t_{new}) / t_{old} \times 100\%$. (12)

Here, $t_{old}$ represents the time consumed in handling one or more queries with the traditional indexing models, i.e., the inverted file and PAT array, while $t_{new}$ indicates the time used for handling the same queries with the new indexing models, i.e., the adjacency matrix based inverted file and the adjacency matrix based PAT array.

To make the experimental results more reasonable and reliable, we manually chose 1000 different queries for testing. The query processing efficiency is evaluated by the average time cost of handling the 1000 different queries. Each query is a Chinese text string of 2 to 25 Chinese characters, and its query result is required to be non-empty. Table 2 lists the distribution of the 1000 queries over query length. The number of short queries is larger than that of long queries, which conforms to the realistic situation of user query delivery.

Table 1 Test text collections
Text database: TC-1, TC-2, TC-3, TC-4, TC-5
Size (MB):

Table 2 Query number distribution over query length of the 1000 processed queries
$l_q$ (characters):
N:

The experimental results are presented in Fig.3, which show that the new models can considerably improve on the query processing efficiency of the traditional inverted file and PAT array.

Fig.3 Experimental results of query processing efficiency improvement ratio

6 Conclusions

In this paper, we proposed, investigated and implemented two new full-text indexing models that can improve the query efficiency of IR systems considerably at the cost of extra storage overhead that is much smaller than the size of the original text database. Experiments over real-world Chinese text collections were carried out, which validate the effectiveness and efficiency of the new models.
Further research will focus on exploring efficient building and updating algorithms as well as query optimization approaches for the new indexing models over large-scale text databases.

References:
[1] Baeza-Yates, R., Ribeiro-Neto, B. Modern Information Retrieval. Reading, MA: Addison Wesley.
[2] Sullivan, D. Search Engine Watch.
[3] AltaVista.
[4] Zhou, Shui-geng. Key techniques of Chinese text databases [Ph.D. Thesis]. Shanghai: Fudan University, 2000 (in Chinese).
[5] Tomasic, A., Garcia-Molina, H., Shoens, K. Incremental updates of inverted lists for text document retrieval. In: Snodgrass, R.T., Winslett, M., eds. Proceedings of the SIGMOD'94. New York: ACM Press.
[6] Ribeiro-Neto, B.A., Silva de Moura, E., Neubert, M.S., Ziviani, N. Efficient distributed algorithms to build inverted files. In: Hearst, M., Tong, R., eds. Proceedings of the SIGIR'99. New York: ACM Press.
[7] Faloutsos, C. Signature-Based text retrieval methods: a survey. Data Engineering Bulletin, 1990,13(1).
[8] Manber, U., Myers, E. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 1993,22(5).
[9] Chavez, E., Navarro, G., et al. Searching in metric spaces. ACM Computing Surveys, 2001,33(3).
