A Tree-based Inverted File for Fast Ranked-Document Retrieval


Wann-Yun Shieh, Department of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu, Taiwan 300, R.O.C.
Tien-Fu Chen, Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan 621, R.O.C.
Chung-Ping Chung, Department of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu, Taiwan 300, R.O.C.

Abstract

Inverted files are widely used to index documents in large-scale information retrieval systems. An inverted file consists of posting lists, which can be stored in either document-identifier ascending order or document-weight descending order. For an identifier-ascending-order posting list, retrieving ranked documents necessitates traversal of all postings, whereas for a weight-descending-order posting list, performing Boolean queries involves very complex processing. In this paper, we transform a posting list into a tree-based structure, called the n-key-heap posting tree, to speed up ranked-document retrieval for Boolean queries. In this structure, the orders of document identifiers and document weights are preserved simultaneously. To preserve the identifier order, the edge pointers are designed to maintain numerical order in the posting tree. To preserve the weight order, greater-weight postings are stored in higher tree nodes by the heap property. We model these criteria as a tree-construction problem and propose an efficient algorithm to construct an optimal posting tree with minimal access time.

Keywords: information retrieval, inverted file, Boolean query, ranked document, posting tree

1. Introduction

An indexing structure used by many information retrieval (IR) systems is the inverted file [1].
In an inverted file, for each distinct word (also known as a term) t in the text collection, there is a corresponding list (called the posting list) of the form $\langle t; f_t; (P_1, W_{t,1}), \ldots, (P_{f_t}, W_{t,f_t}) \rangle$, where the frequency $f_t$ indicates the total number of documents in which t appears, the identifier $P_i$ (also known as a posting) indicates a document that contains t, and the weight $W_{t,i}$ indicates the weight of t in document $P_i$. When a user sends a request containing some query terms to an IR system, the system searches for these query terms in the inverted file to see which documents satisfy the request, and returns ranked document identifiers to the user. Zobel et al. [2] showed that in terms of querying time, space used, and functionality, inverted files perform better than other indexing structures.

1.1 Current methods and problems

Postings can be arranged in a posting list in either identifier-sorted order or weight-sorted order. Both orderings, however, require complex processing to retrieve ranked documents for Boolean queries. For an identifier-sorted posting list, retrieving ranked documents requires reading all related posting lists from storage, no matter how many terms or how many ranked documents a user queries. For a weight-sorted posting list, the drawback is the extra processing cost of comparing two posting lists that are not in identifier order [3]. These problems become more serious as the amount of information on the Internet grows explosively. If an IR system expands its collection, the lengths of most posting lists in the inverted file will increase. A user may then wait longer to retrieve ranked documents with either the identifier-sorted or the weight-sorted posting list. To the best of our knowledge, few studies have proposed posting structures suited to reducing this processing cost when retrieving ranked documents for Boolean queries.
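To make the cost of the identifier-sorted layout concrete, here is a minimal Python sketch that ranks the R highest-weight postings of an identifier-sorted list. The names `Posting` and `top_r_from_id_sorted` and the sample data are illustrative assumptions, not taken from the paper. Because weights appear in arbitrary order along the list, every posting has to be inspected regardless of R.

```python
import heapq
from typing import List, Tuple

# (document identifier, weight); a hypothetical alias for this sketch.
Posting = Tuple[int, float]


def top_r_from_id_sorted(postings: List[Posting], r: int) -> List[Posting]:
    """Return the r highest-weight postings of an identifier-sorted list.

    The list is ordered by identifier, so the whole list is scanned once
    and a bounded heap keeps the r best postings seen so far.
    """
    return heapq.nlargest(r, postings, key=lambda p: p[1])


# Illustrative data only, in identifier-ascending order.
example = [(3, 0.10), (8, 0.40), (12, 0.05), (20, 0.30), (31, 0.15)]
top2 = top_r_from_id_sorted(example, 2)
```

In this sketch the scan touches all five postings even though only two are requested, which is the behavior the posting tree is meant to avoid.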

1.2 Research goal

We propose a tree-based structure, called the n-key-heap posting tree, to preserve the orders of document identifiers and document weights simultaneously for fast ranked-document retrieval. In an n-key-heap posting tree, the root node contains the n most important (that is, highest within-document weight) postings, and the n+1 children of the root node recursively contain the n+1 segments of the posting list created by splitting at these n postings. To preserve the identifier order, the postings in each node are arranged in identifier-ascending order, and the identifier order among tree nodes is maintained by edge pointers. To preserve the weight order, greater-weight postings are stored in higher tree nodes by the heap property. These criteria can be modeled as a tree-construction problem, in which the objective is to minimize the average access time in retrieving ranked documents. Based on this model, we propose an efficient algorithm to construct such an optimal n-key-heap posting tree. Simulation results show that the disk access time and posting-list processing time for retrieving ranked documents can be effectively reduced by the proposed structure.

This paper is organized as follows. In Section 2, we define the structure of the n-key-heap posting tree and develop the posting-tree construction algorithm. We also present the scheme for retrieving ranked documents from a posting tree. In Section 3, we show simulation results in terms of disk transfer time and posting processing time. Finally, we give conclusions in Section 4.

2. N-key-heap posting tree

The challenge in designing a tree structure for a posting list is to preserve the orders of document identifiers and document weights simultaneously. We address this problem with the following definitions.
2.1 Definition of a posting tree

Definition 1: posting tree
Given a posting list L, its posting tree T is a rooted tree having the following properties:

Property 1: Every node x contains the following elements:
a. n (identifier, weight) pairs: (x.identifier[i], x.weight[i]) $\in$ L, $1 \le i \le n$, which are stored in identifier-ascending order; i.e., x.identifier[1] < ... < x.identifier[n].
b. n+1 pointers: x.c[0], ..., x.c[n], which point to x's children.

Property 2: Identifiers in the subtree rooted at x.c[i] must be greater than identifier x.identifier[i] but less than identifier x.identifier[i+1]: if $k_i$ is any identifier stored in the subtree rooted at x.c[i], then $k_0$ < x.identifier[1] < $k_1$ < x.identifier[2] < ... < x.identifier[n] < $k_n$.

By Property 1, when n identifiers are selected from L and inserted into a root node x, the remaining identifiers in L are split into at most n+1 segments. By Property 2, each segment recursively forms a posting tree and is pointed at by the corresponding x.c[i]. Take a posting list $L_1$ as an example:

$L_1$: <t; 10; (6, ), (15, 0.19), (55, 0.18), (169, 0.07), (191, 0.14), (238, 0.08), (240, 0.04), (242, ), (251, 0.05), (310, 0.13)>.

Figure 1(a) shows an example posting tree for $L_1$, with n=4. With the pointers x.c[i], all identifiers can be accessed in ascending order by performing a DFS (depth-first search) along x.c[i] in the posting tree.

2.2 Definition of the n-key heap property

Definition 2: n-key heap property
A tree T, in which every node x contains n keys x.key[1], ..., x.key[n], satisfies the n-key heap property if the n keys of every node are all less than any key of its parent.

Here the term key can represent any specific characteristic. If document weights are used as the keys in Definition 2, then the nodes of a posting tree satisfying the n-key heap property form a weight-descending hierarchy. That is, higher-weight identifiers are stored in higher tree nodes.
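Definitions 1 and 2 can be illustrated with a small Python sketch. The class `Node`, the functions `inorder` and `has_heap_property`, and the sample tree are hypothetical names and data chosen for illustration; the paper does not prescribe an implementation. The sketch assumes every node holds at least one posting and that `children` always has one more entry than `postings`, with `None` for an empty segment.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

Posting = Tuple[int, float]  # (identifier, weight)


@dataclass
class Node:
    # Postings in identifier-ascending order (Property 1a).
    postings: List[Posting]
    # children[i] holds identifiers between postings[i-1] and postings[i]
    # (Property 1b and Property 2); None marks an empty segment.
    children: List[Optional["Node"]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.children:
            self.children = [None] * (len(self.postings) + 1)


def inorder(x: Optional[Node]) -> Iterator[Posting]:
    """Depth-first walk along the child pointers; yields ascending identifiers."""
    if x is None:
        return
    for i, posting in enumerate(x.postings):
        yield from inorder(x.children[i])
        yield posting
    yield from inorder(x.children[len(x.postings)])


def has_heap_property(x: Optional[Node], parent_min: float = float("inf")) -> bool:
    """Check that every weight in a node is below every weight of its parent."""
    if x is None:
        return True
    if any(w >= parent_min for _, w in x.postings):
        return False
    node_min = min(w for _, w in x.postings)
    return all(has_heap_property(c, node_min) for c in x.children)


# Illustrative tree with n = 2 (made-up identifiers and weights).
root = Node(
    [(20, 0.9), (50, 0.8)],
    [Node([(5, 0.3), (10, 0.5)]), Node([(30, 0.4)]), None],
)
ids_in_order = [ident for ident, _ in inorder(root)]
```

Here the in-order walk recovers the identifier order of the original list, while the heap check confirms that weights only decrease from parent to child.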

Figure 1. For posting list $L_1$: (a) an example posting tree, (b) the n-key-heap posting tree. (Legend: each node entry is an identifier with its weight; edges are the child pointers x.c[i], x.c[i+1].)

This weight-descending hierarchy helps the system visit more important identifiers early if the nodes are retrieved in a top-down manner. Figure 1(b) shows a posting tree for $L_1$ that satisfies the n-key heap property. The relation between tree nodes at different levels is formulated in Lemma 1.

Lemma 1: Assume T is a posting tree satisfying the n-key heap property. If x and y are two document identifiers in T, and the tree node containing x is an ancestor of the tree node containing y, then weight(x) > weight(y).

Proof: The claim follows from the n-key heap property.

2.3 Constructing a minimal-access-time posting tree

The time to access a document identifier in a posting tree is proportional to the depth of the node containing it. Reducing the average node depth in a posting tree therefore shortens the identifier access time. A posting tree satisfying the n-key heap property can hence be constructed in accordance with document weights in such a way that the average node depth for retrieving ranked identifiers is minimized. We formulate this construction problem as an optimization problem [4]. Without loss of generality, we assume that any two identifiers in a posting tree have different weights, and that all weights are normalized to 1. For convenience, we call a posting tree with the n-key heap property an n-key-heap posting tree in the following.

Definition 3: N-key-heap posting tree construction problem
Let a posting list L contain m postings $(p_1, w_1), (p_2, w_2), \ldots, (p_m, w_m)$, where $p_i$ is the document identifier and $w_i$ is the weight of $p_i$. The weighted node-depth of a posting tree T is defined as $\sum_{i=1}^{m} w_i D_T(p_i)$, where $D_T(p_i)$ denotes the node depth of $p_i$ in T, and $(p_i, w_i) \in L$. The problem is to find an optimal n-key-heap posting tree T whose weighted node-depth is minimal.
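The weighted node-depth of Definition 3 can be evaluated directly on a small tree. The following Python sketch is only an illustration: `Node`, `weighted_node_depth`, and the four sample postings are assumptions, and the root is taken to have depth 1, as in the retrieval discussion of the paper. It compares a tree that keeps the two heaviest postings in the root with one that does not.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    postings: List[Tuple[int, float]]  # (identifier, weight), identifier-ascending
    children: List[Optional["Node"]] = field(default_factory=list)


def weighted_node_depth(x: Optional[Node], depth: int = 1) -> float:
    """Sum of w_i * D_T(p_i) over all postings in the subtree rooted at x."""
    if x is None:
        return 0.0
    own = sum(w * depth for _, w in x.postings)
    return own + sum(weighted_node_depth(c, depth + 1) for c in x.children)


# Made-up list: (10, 0.1), (20, 0.4), (30, 0.2), (40, 0.3), with n = 2.
# Tree A keeps the two heaviest postings in the root.
tree_a = Node([(20, 0.4), (40, 0.3)], [Node([(10, 0.1)]), Node([(30, 0.2)]), None])
# Tree B is also a valid posting tree but violates the heap property.
tree_b = Node([(10, 0.1), (30, 0.2)], [None, Node([(20, 0.4)]), Node([(40, 0.3)])])

depth_a = weighted_node_depth(tree_a)  # about 0.7 * 1 + 0.3 * 2
depth_b = weighted_node_depth(tree_b)  # about 0.3 * 1 + 0.7 * 2
```

In this example tree A has the smaller weighted node-depth (roughly 1.3 against 1.7), which matches the intuition that heavy postings should sit near the root.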
Figure 2 gives an algorithm to construct such a posting tree. The algorithm has two phases. In the first phase (lines 1-2), a greedy selection puts the n highest-weight postings in the root node. In the second phase (lines 3-6), the children of the root node are constructed recursively in the same manner. The time complexity of the algorithm is O(m log m), where m is the total number of document identifiers in the given posting list. Lemma 2 shows that the problem of constructing an optimal n-key-heap posting tree has the optimal-substructure property: an optimal solution to the problem contains within it optimal solutions to subproblems [7].

Building_posting_tree(L, n)
Input: a posting list L = {$(p_1, w_1), (p_2, w_2), \ldots, (p_m, w_m)$}, and an integer n.
Output: an n-key-heap posting tree T whose weighted node-depth is minimal.
Begin
1 Retrieve the n highest-weight postings, $(p_{i_1}, w_{i_1}), \ldots, (p_{i_n}, w_{i_n})$, from L;
2 Let x be the root node of T. Put {$(p_{i_1}, w_{i_1}), \ldots, (p_{i_n}, w_{i_n})$} into x;
3 x.c[0] := Building_posting_tree({$(p_1, w_1), \ldots, (p_{i_1-1}, w_{i_1-1})$}, n);
4 for k := 1 to n-1 do
5     x.c[k] := Building_posting_tree({$(p_{i_k+1}, w_{i_k+1}), \ldots, (p_{i_{k+1}-1}, w_{i_{k+1}-1})$}, n);
6 x.c[n] := Building_posting_tree({$(p_{i_n+1}, w_{i_n+1}), \ldots, (p_m, w_m)$}, n);
7 return T;
End

Figure 2. Constructing an n-key-heap posting tree with the minimal weighted node-depth.

Lemma 2: Let T be an n-key-heap posting tree whose average weighted node-depth is minimal. Then, for any subtree Z in T, the average weighted node-depth of the posting tree $T' = T - Z$ is also minimal.

Proof: Let WD(T) be the average weighted node-depth of the posting tree T. Since $T' = T - Z$, we have $WD(T') + \sum_{i \in Z} w_i D_T(i) = WD(T)$. If the average weighted node-depth of $T'$ is not minimal, then there exists another posting tree $T''$ such that $WD(T'') < WD(T')$. This implies $WD(T'') + \sum_{i \in Z} w_i D_T(i) < WD(T)$, contradicting the optimality of T. Thus, the average weighted node-depth of $T'$ is minimal.

By Lemmas 1 and 2, Theorem 1 follows.

Theorem 1: The algorithm in Figure 2 produces an n-key-heap posting tree whose average weighted node-depth is minimal.

Proof: Immediate from Lemmas 1 and 2.

2.4 Retrieving ranked document identifiers from a posting tree

For a query that contains one term and requests R ranked documents, we search the related posting tree starting from the root node at depth 1, then the nodes at depth 2, and so on. The search stops when the R highest-weight identifiers have been returned. This top-down search avoids traversing all postings, and is well suited to retrieving ranked documents from long posting lists.
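The construction of Figure 2 and the one-term, top-down retrieval just described can be sketched together in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `Node`, `build_posting_tree`, `top_r`, and the sample list are hypothetical, an empty segment is represented by `None` (the base case is left implicit in Figure 2), and the retrieval uses a best-first frontier ordered by weight rather than a strict level-by-level scan, so that the output is exactly the R highest-weight postings in descending order.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Posting = Tuple[int, float]  # (identifier, weight)


@dataclass
class Node:
    postings: List[Posting]
    children: List[Optional["Node"]] = field(default_factory=list)


def build_posting_tree(L: List[Posting], n: int) -> Optional[Node]:
    """Greedy construction following Figure 2; L must be identifier-sorted."""
    if not L:
        return None  # assumed base case for an empty segment
    # Lines 1-2: positions of the n highest-weight postings, in identifier order.
    top = sorted(heapq.nlargest(n, range(len(L)), key=lambda i: L[i][1]))
    x = Node([L[i] for i in top])
    # Lines 3-6: recurse on the segments between the selected postings.
    start = 0
    for i in top:
        x.children.append(build_posting_tree(L[start:i], n))
        start = i + 1
    x.children.append(build_posting_tree(L[start:], n))
    return x


def top_r(root: Optional[Node], r: int) -> List[Posting]:
    """Return the r highest-weight postings, visiting nodes top-down on demand."""
    tie = itertools.count()  # tie-breaker so heap entries never compare payloads
    heap: list = []

    def push_node(x: Optional[Node]) -> None:
        if x is not None:
            heapq.heappush(heap, (-max(w for _, w in x.postings), next(tie), x))

    push_node(root)
    out: List[Posting] = []
    while heap and len(out) < r:
        _, _, item = heapq.heappop(heap)
        if isinstance(item, Node):
            # Open the node: its postings and children become candidates.
            for p in item.postings:
                heapq.heappush(heap, (-p[1], next(tie), p))
            for c in item.children:
                push_node(c)
        else:
            out.append(item)
    return out


# Illustrative identifier-sorted list (made-up data), built with n = 2.
L = [(1, 0.02), (4, 0.20), (7, 0.11), (9, 0.05), (12, 0.25),
     (15, 0.08), (18, 0.16), (21, 0.03), (25, 0.10)]
tree = build_posting_tree(L, 2)
best3 = top_r(tree, 3)
```

With this data the root holds identifiers 4 and 12, and `top_r(tree, 3)` only opens the root and one of its children before returning. By Lemma 1 an unopened node cannot contain a posting heavier than its ancestors, which is what makes the early stop safe in this sketch.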
For a query that contains two terms joined by a Boolean operator and requests R ranked documents, we propose a range-checking approach that performs the Boolean operation on as few nodes of the related posting trees as possible. Take the posting tree in Figure 1(b) as an example. When we fetch the root node first, we obtain a set of identifier ranges split from the original posting list $L_1$, as shown in Figure 3(a). If we perform an AND operation on these ranges against those of another posting tree, shown in Figure 3(b), we obtain the set of intersection ranges in Figure 3(c). By recursively performing the same operation on these intersection ranges against the ranges of other nodes at depth k (k>1) in the two posting trees, the ranges are either narrowed down to the identifiers satisfying the Boolean operation or discarded once they clearly cannot satisfy it. With this range-checking process, the R highest-weight identifiers are returned in top-down sequence and need no further sorting. For queries containing k terms with m Boolean operators, the range-checking approach extends naturally to k-way retrieval.
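One step of the range-checking idea, the AND of the ranges exposed by two fetched nodes, can be sketched in Python as follows. This is an illustrative reading of Figure 3 rather than the authors' procedure: `node_ranges`, `intersect_ranges`, the `child_present` flags, the identifier universe [1, 100], and the sample nodes are all assumptions. Each node identifier becomes a one-element range, and each gap between identifiers becomes a range only if the corresponding child subtree exists.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # closed interval [lo, hi] of document identifiers


def node_ranges(identifiers: List[int], child_present: List[bool],
                lo: int, hi: int) -> List[Range]:
    """Split [lo, hi] at a node's identifiers.

    child_present[k] says whether child pointer c[k] is non-empty; gaps with
    no child contain no identifiers and are dropped.
    """
    ranges: List[Range] = []
    start = lo
    for k, ident in enumerate(identifiers):
        if child_present[k] and start <= ident - 1:
            ranges.append((start, ident - 1))
        ranges.append((ident, ident))
        start = ident + 1
    if child_present[len(identifiers)] and start <= hi:
        ranges.append((start, hi))
    return ranges


def intersect_ranges(a: List[Range], b: List[Range]) -> List[Range]:
    """AND two sorted lists of disjoint ranges with a linear merge."""
    i = j = 0
    out: List[Range] = []
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Made-up root nodes of two posting trees over identifiers 1..100.
ranges_a = node_ranges([20, 50], [True, False, True], 1, 100)
ranges_b = node_ranges([30, 50], [False, True, True], 1, 100)
common = intersect_ranges(ranges_a, ranges_b)
```

In this example only the point range for identifier 50 and the range 51 to 100 survive, so identifier 50 satisfies the AND immediately and only the rightmost child of each root would need to be fetched next.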

Figure 3. Performing an AND operation on two sets of ranges: (a) the node of depth 1, and its identifier ranges in the posting tree of $L_1$, (b) a node, and its identifier ranges in another posting tree, (c) intersection ranges.

3. Simulation and performance evaluation

Simulation is used to generate performance data. The factors examined in the performance evaluation are the disk access time and the posting processing time for retrieving ranked documents.

3.1 Simulation environment

We use part of WT10g, about 460,000 documents, as our test collection. (WT10g is a widely distributed collection and is included in the TREC Web Test Collections [5].) To simulate query behavior, we implement a query-term generator that selects terms for synthesizing a set of queries. The occurrence of query terms follows a Zipf-like distribution [6]. An IR system is implemented on a Linux platform to simulate retrieval services for the proposed retrieval algorithms.

3.2 Simulation results

Table 1 compares the average disk access time (DT) and posting-list processing time (PT) of the 10-key-heap posting tree and the identifier-sorted posting list, for 100,000 one-term queries. (We do not compare with the weight-sorted posting list because it is not suited to Boolean query processing [3].) In the second column of Table 1, the average disk access time of the identifier-sorted posting list is fixed, regardless of the number of identifiers requested. This is because the entire linear posting list has to be retrieved from disk for sorting. In contrast, the disk access time of the posting tree is proportional only to the number of requested identifiers, because these identifiers can be retrieved selectively in top-down sequence. In addition, the average posting processing time of the posting tree is smaller than that of the identifier-sorted posting list because less sorting is required.
Table 2 compares the same metrics for the two structures, but for 100,000 two-term queries. For the posting tree, we apply the range-checking approach to retrieve ranked identifiers for each two-term query. Simulation results show that the posting tree with the range-checking approach outperforms the identifier-sorted posting list in terms of both average disk access time and posting-list processing time. This is because, with range checking, non-intersecting ranges of the two posting trees' identifiers can be discarded as early as possible, and unnecessary nodes do not need to be retrieved from storage. We also perform similar experiments on three-term and four-term queries to evaluate the advantages of the posting tree. The results all show that the posting tree outperforms the identifier-sorted posting list for fast ranked-document retrieval.

Table 1. Comparison of retrieval performance for 100,000 one-term queries
Columns: amount of ranked identifiers requested; identifier-sorted posting list DT (ms) and PT (ms); posting tree DT (ms) and PT (ms).

Table 2. Comparison of retrieval performance for 100,000 two-term queries
Columns: amount of ranked identifiers requested; identifier-sorted posting list DT (ms) and PT (ms); posting tree DT (ms) and PT (ms).

4. Conclusion

We propose an n-key-heap posting tree to speed up ranked-document retrieval for Boolean queries. This structure simultaneously preserves the orders of document identifiers and document weights, by edge pointers and by the heap property, respectively. A greedy algorithm is proposed to construct an optimal n-key-heap posting tree, whose weighted node-depth is minimal. The optimal posting tree has the minimum access time for retrieving ranked postings. We also propose a range-checking approach to speed up the retrieval process. Storage space is another issue for the posting tree, and can be reduced through compression. Many studies have addressed identifier and weight compression [3]. However, a compression scheme that is fully suited to the posting tree needs further investigation.

5. References

[1] E. Riloff, L. Hollaar, Text Database and Information Retrieval, ACM Computing Surveys, Vol. 28, No. 1, 1996, pp.
[2] J. Zobel, A. Moffat, K. Ramamohanarao, Inverted Files Versus Signature Files for Text Indexing, ACM Transactions on Database Systems, Vol. 23, No. 4, 1998, pp.
[3] I. H. Witten, A. Moffat, T. C. Bell, Managing Gigabytes - Compressing and Indexing Documents and Images, 2nd Ed., Morgan Kaufmann Publishers, Inc.
[4] C. H. Papadimitriou, K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Kenneth Steiglitz, Princeton University.
[5] TREC Web Test Collections.
[6] L. Breslau, P. Cao, L. Fan, G. Phillips, S. Shenker, Web Caching and Zipf-like Distributions: Evidence and Implications, IEEE INFOCOM, Vol. 1, 1999, pp.
[7] T. H. Cormen, C. E. Leiserson, R. L. Rivest, Introduction to Algorithms, MIT Press, Cambridge, MA, 1990.
