Implementations of Partial Document Ranking Using Inverted Files
Wai Yee Peter Wong and Dik Lun Lee


Implementations of Partial Document Ranking Using Inverted Files

Wai Yee Peter Wong and Dik Lun Lee
Department of Computer and Information Science, Ohio State University, 2036 Neil Ave, Columbus, Ohio 43210, U.S.A.

May 1992

Abstract - Most commercial text retrieval systems employ inverted files to improve retrieval speed. This paper concerns the implementation of document ranking based on inverted files. Three heuristic methods for implementing the tf × idf weighting strategy, where tf stands for term frequency and idf stands for inverse document frequency, are studied. The basic idea of the heuristic methods is to process the query terms in an order such that as many top documents as possible can be identified without processing all of the query terms. The first heuristic was proposed by Smeaton and van Rijsbergen (Smeaton & Rijsbergen, 1981), and it serves as the basis for comparison with the other two heuristic methods proposed in this paper. These three heuristics are evaluated and compared by experimental runs based on the number of disk accesses required for partial document ranking, in which the returned documents contain some, but not necessarily all, of the requested number of top documents. The results show that the proposed heuristic methods perform better than that proposed by Smeaton and van Rijsbergen in terms of retrieval accuracy, which is used to indicate the percentage of top documents obtained after a given number of disk accesses. For total document ranking, in which all of the requested number of top documents are guaranteed to be returned, no optimization techniques studied so far can lead to a substantial performance gain. To realize the advantage of the proposed heuristics, two methods for estimating the retrieval accuracy are studied. Their accuracies and processing costs are compared. All the experimental runs are based on four test collections made available with the SMART system.

1 Introduction

Boolean search strategies are used in most commercial text retrieval systems, but their drawbacks are well known (Cooper, 1988; Radecki, 1988; Salton, Fox & Wu, 1983; Salton & McGill, 1983). For example, formulation of useful boolean queries is not easy to learn, and users often have difficulty in controlling the output size. In order to improve retrieval effectiveness, vector processing systems employing similarity measures have been suggested and studied extensively (Buckley & Lewit, 1985; Croft & Savino, 1988; Salton, 1989; Salton & McGill, 1983). In a vector processing system, both query and document terms can be weighted to distinguish terms which are more important for retrieval purposes from those which are less important. Let V be the size of the vocabulary for the document collection. A document D_i is represented as a V-dimensional vector <w_{i,1}, w_{i,2}, ..., w_{i,V}>, where w_{i,j} represents the weight of term j in document i. Likewise, a query Q can be represented as a V-dimensional vector <q_1, q_2, ..., q_V>, where q_i specifies the weight of term i in the query. The similarity between a query and the documents can be computed in order to rank the retrieved documents in decreasing order of the query-document similarity. Many similarity measures have been studied (Salton, 1989). For instance, the cosine coefficient measures the similarity between the query and document vectors based on the cosine of the angle between them in the V-dimensional space:

    \text{similarity}(D_i, Q) = \cos\theta = \frac{\sum_{j=1}^{V} w_{i,j}\, q_j}{\sqrt{\sum_{j=1}^{V} w_{i,j}^2 \; \sum_{j=1}^{V} q_j^2}}

One of the well-known weighting strategies uses so-called tf × idf weights, in which w_{i,j} = tf_{D_i}(t_j) × idf(t_j), for 1 ≤ j ≤ V. The term tf_{D_i}(t) represents the term frequency of term t in document D_i (i.e., the number of occurrences of term t in document D_i). The function idf(t) is called the inverse document frequency of term t and is set to log_2(N / df(t)), where N is the number of documents in the collection and df(t) is the document frequency of term t (i.e., the number of documents in which term t is contained) (Croft & Savino, 1988; Salton, 1986). Thus, a term has a high weight in a document if it occurs frequently in the document but infrequently in the rest of the collection.

The vector processing system allows a query to be expressed as a natural language text describing the user's information need. Thus, the description can be treated as a short document so that the q_i's can be expressed in tf × idf weights as well. Since most query terms are likely to appear only once in a short passage, the tf's can be assumed to be 1 in the query and the weights of the query terms are sometimes represented by the idf's of the terms. To further reduce the computational cost, q_i can often be simplified into a binary value, with "1" signifying the presence of term i in the query and "0" its absence (Stanfill & Kahle, 1986). Conceptually, the tf × idf ranking strategy is very simple; however, it has been shown to give good retrieval effectiveness.

Implementations of document ranking have been studied extensively (Croft & Savino, 1988; Lucarella, 1988; Mohan & Willett, 1985; Murtagh, 1982; Perry & Willett, 1983; Salton, 1968; Shasha & Wang, 1990; Stanfill & Kahle, 1986; Stanfill, Thau & Waltz, 1989; Weiss, 1981; Wong & Lee, 1990); much of this work is based on inverted files (Buckley & Lewit, 1985; Stanfill, Thau & Waltz, 1989). An inverted file consists of two components, namely the index file and the postings file.
Each item in the index file corresponds to a document term in the collection, and it is associated with a postings list in the postings file, which is usually stored on disk. Each posting records the document which contains the term and some other information, such as the corresponding term frequency, depending on the retrieval environment. An inverted file is shown in Figure 1, in which the document frequency of each term is stored in the index file and can be used to compute the idf value. A query will be presented as a list of terms with associated weights. The postings lists corresponding to the query terms are retrieved and, from the postings lists, the document scores can be computed as shown in the pseudo-code below.

Figure 1: Inverted file supporting vector processing. (The figure shows a query Q = term 1, term 2, term 3, ...; the index file, with one entry per index term holding its document frequency df; and the associated postings lists of <D_i, tf_i> postings stored in the postings file.)

    initialize all document scores to zero
    for all q in Q do
        retrieve the postings list for term q
        for each posting <D_i, tf_{i,q}> in the postings list do
            compute score(D_i) = score(D_i) + tf_{i,q} * idf_q
        end {for}
    end {for}
    sort document scores and return top documents

The algorithm is rather straightforward since it essentially sums up, for each document, the weights of the terms specified in the query, taking advantage of the availability of an inverted index. It should be noted that this algorithm is simplified by assuming binary query weights and no normalization, but it can easily be extended to remove these simplifications. Many optimization techniques for inverted file systems have been developed to reduce the I/O cost (Buckley & Lewit, 1985; Perry & Willett, 1983; Smeaton & Rijsbergen, 1981). All of these methods process query terms one by one and accumulate partial scores for the documents, rather than compute the final score of a document completely before proceeding to the next document. These methods are motivated by the fact that the number of query terms can be very large when a thesaurus and relevance feedback are used to expand the original query (Stanfill & Kahle, 1986). For instance, a query of 10 terms could be expanded to more than 100 terms when synonyms and related terms (i.e., narrower terms and broader terms) are included. Thus, the corresponding processing cost becomes significant and needs to be reduced. The basic idea behind the optimization techniques is to process the query terms in an order such that the requested number of top documents [1] can be identified without processing all of the query terms. Most of the previous methods aim at total document ranking (or total ranking), in which all of the requested number of top documents are guaranteed to be returned. In this paper, we focus on partial document ranking (or partial ranking) (Buckley & Lewit, 1985; Perry & Willett, 1983; Weiss, 1981), in which the retrieved documents contain some, but not necessarily all, of the requested number of top documents. The importance of partial ranking is twofold. First, when total ranking is implemented, none of the heuristic methods studied so far can yield any significant performance gain (Buckley & Lewit, 1985; Perry & Willett, 1983). Second, document retrieval itself is inherently imprecise. That is, even when total ranking is implemented, it is known that not every top document returned is actually relevant from the user's perspective. Effectiveness should not suffer greatly if the system can return a major portion of the top documents (Buckley & Lewit, 1985).

[1] The top documents with respect to the query are those which have the highest scores according to some similarity measure. However, they are not guaranteed to be relevant from the user's perspective.
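As a concrete illustration of the ranking loop above, here is a minimal, self-contained Python sketch of term-at-a-time score accumulation over a toy in-memory inverted file with tf × idf weights and binary query weights. The data and helper names are invented for illustration; this is a sketch, not the paper's implementation.

import math
from collections import defaultdict

# Toy inverted file: term -> postings list of (document id, term frequency).
postings = {
    "ranking":  [("D1", 4), ("D2", 1)],
    "inverted": [("D1", 6), ("D3", 2)],
    "file":     [("D1", 3), ("D2", 2), ("D3", 5)],
}
N = 1000  # assumed collection size

def idf(term):
    # Inverse document frequency: log2(N / df(term)).
    return math.log2(N / len(postings[term]))

def rank(query_terms, top=2):
    scores = defaultdict(float)              # all document scores start at zero
    for q in query_terms:                    # process query terms one by one
        for doc, tf in postings.get(q, []):
            scores[doc] += tf * idf(q)       # binary query weights assumed
    # Sort document scores and return the top documents.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]

print(rank(["ranking", "inverted", "file"]))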

Other techniques, such as relevance feedback, can be used to further improve the degree of relevance. Partial ranking is therefore a profitable approach if it can yield substantial savings in processing costs.

The organization of this paper is as follows. In Section 2, three heuristic methods, called the L method, W method and SW method, are described, and the notion of retrieval accuracy is introduced. Their performance in implementing the tf × idf weighting strategy is evaluated and compared based upon experimental runs on the four test collections made available with the SMART system. In Section 3, two heuristic methods, called the document movement method (or simply Dm method) and the linear regression method (or simply Lr method), are proposed to predict the number of top documents obtained at different points of the retrieval process for the W method and SW method. The accuracies and processing costs of the Dm method and Lr method are compared. The last section summarizes the merits of the proposed methods.

2 Heuristic methods

As shown by previous studies (Buckley & Lewit, 1985; Perry & Willett, 1983) and further supported by our study, total ranking cannot be achieved until almost all query terms are processed. The goal of partial ranking is to maximize the number of top documents obtained after processing only a subset of the query terms. In other words, the goal of a partial ranking technique is to be able to foresee the final document ranking (usually an approximate one) when only a subset of the query terms has been processed. Clearly, the chance of obtaining top documents quickly depends on how fast document scores are incremented as query terms are processed. This in turn is affected by the processing order of the query terms. In Fig. 1, the idf's of term 1, term 2 and term 3 are 4.6, 4.0 and 6.6, respectively. If the query terms are processed in the order presented in the query, the score of D1 after processing each query term will be 18.4, 42.4, and 62.2. If the query terms are ordered by increasing document frequencies, the document score will be 19.8, 43.8, and 62.2. With query terms ordered by decreasing term frequencies, the document score will be 24, 42.4, and 62.2. The rationale is that the faster the scores of the top documents are incremented, the earlier those documents can be identified. In this particular example, either ordering by document frequencies or by term frequencies is better than the original order. In this section, we study three methods which process query terms in different orders based upon different criteria. The first method, called the L method, was proposed by Smeaton and van Rijsbergen (Smeaton & Rijsbergen, 1981); the second and third methods, called the W and SW methods respectively, are proposed by the authors (Lee & Wong, 1991; Wong & Lee, 1991).

2.1 The L method

The L method, which has been used in the upperbound search algorithm (Buckley & Lewit, 1985; Croft & Savino, 1988; Fukunaga & Narendra, 1975; Mohan & Willett, 1985; Perry & Willett, 1983; Smeaton & Rijsbergen, 1981; Weiss, 1981), processes query terms in descending order of their idf values. Since document frequencies correspond to the lengths of the postings lists, query terms are, in other words, processed in order of increasing postings-list length. This method requires the lengths of the postings lists to be kept in the index file, which can be accessed separately from the postings lists. For simplicity, Q is represented by (q_1, q_2, ..., q_k), in which df(q_i) ≤ df(q_j) for i < j. The upperbound search algorithm works as follows.
After a postings list is processed, the partial document scores are sorted in descending order and the current top T documents are obtained. The upperbound of the (T+1)st document is then computed with the assumption that it contains all the remaining unsearched query terms. The retrieval process stops if the upperbound of the (T+1)st document is smaller than the current score of the Tth document.
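A minimal Python sketch of the L-method ordering together with this upperbound stopping test follows. The toy postings, the assumption that an unsearched term contributes at most its idf to the bound, and all helper names are illustrative; this is not the paper's code.

import math

postings = {
    "a": [("D1", 4), ("D2", 2)],
    "b": [("D1", 6), ("D3", 1), ("D4", 2)],
    "c": [("D2", 3), ("D3", 5), ("D4", 1), ("D5", 2)],
}
N = 1000

def idf(term):
    return math.log2(N / len(postings[term]))

def l_method(query, T=2):
    # L method: process terms in descending idf, i.e. shortest postings lists first.
    order = sorted(query, key=idf, reverse=True)
    scores = {}
    for i, term in enumerate(order):
        for doc, tf in postings[term]:
            scores[doc] = scores.get(doc, 0.0) + tf * idf(term)
        ranked = sorted(scores.values(), reverse=True)
        if len(ranked) > T:
            # Upperbound of the (T+1)st document: assume it also contains every
            # remaining unsearched query term (each contributing its idf here,
            # a simplifying illustration of the bound).
            upperbound = ranked[T] + sum(idf(t) for t in order[i + 1:])
            if upperbound < ranked[T - 1]:
                break  # the top T documents can no longer change
    return scores

print(l_method(["a", "b", "c"]))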

Table 1: Characteristics of the document collections CACM, CISI, CRAN and MED. The rows of the table give, for each collection, the number of documents, the number of distinct terms, the average number of terms per document, the maximum document frequency (df), the average df, the standard deviation of df, the maximum tf, and the maximum idf.

If the search can be stopped early, the amount of disk accesses and CPU processing cost can be reduced substantially, since short postings lists are processed first and the unprocessed lists are long. In order to evaluate the usefulness of the upperbound search algorithm with the above sorting method, we take checkpoints during the retrieval process. To facilitate the discussion, three definitions are given:

1. Q_f: The final set of top documents obtained when all the query terms in query Q are processed. The number of top documents returned is determined by the system or by the user.

2. Q_i: The set of top documents obtained after processing i% of the total disk accesses in response to query Q.

3. Ra_i: (|Q_i ∩ Q_f| / |Q_f|) × 100%.

The values of the Ra's indicate the percentages of top documents obtained at different points of the retrieval. Moreover, a sequence of Ra's can show how fast the top documents in Q_f are revealed. For instance, if Ra_10 = .5, Ra_20 = .7 and Ra_30 = .85, then the top documents are revealed quickly. If Ra_10 = .1, Ra_20 = .15 and Ra_30 = .2, then the top documents are revealed rather slowly. For partial ranking, the faster the rate at which Ra increases the better, since the retrieval process can then stop earlier for the same retrieval accuracy.

We study the retrieval accuracy of the L method by an experiment carried out on four test collections, which are made available with the SMART system. Some characteristics of the test collections are given in Table 1. The experiment is performed in the following manner. For each collection, 15 queries are generated for each of the three different query sizes, namely 30, 50 and 100 terms, for a total of 45 queries. For each query size, three groups of queries are generated with 5 queries each. In the first group, each query contains only short postings lists (i.e., only query terms with high idf values). [2] In the second group, each query contains one-half short postings lists and one-half long postings lists. In the third group, each query contains only long postings lists. Query terms in each query are randomly selected from the vocabulary of the respective test collection. With queries of different sizes and corresponding postings lists of different lengths, our experiments do not try to favor any particular operational environment. Ten equally spaced checkpoints are taken and the top documents at each checkpoint are recorded. This means that if a total of P disk pages is required for processing all the query terms, checkpoints are taken after ⌈P/10⌉, ⌈2P/10⌉, ⌈3P/10⌉, ..., and ⌈10P/10⌉ disk pages have been processed. Finally, for each collection, we obtain at each checkpoint the retrieval accuracies for all the queries and compute the average of the retrieval accuracies for queries of the same size.

[2] In this paper, a postings list is short if it contains fewer than 50 document postings, and long otherwise.
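As a small illustration of the retrieval-accuracy measure defined above, the following Python sketch computes Ra at a few checkpoints; the toy document sets are invented for illustration.

def retrieval_accuracy(Q_i, Q_f):
    # Ra_i = |Q_i intersect Q_f| / |Q_f| * 100%
    return 100.0 * len(set(Q_i) & set(Q_f)) / len(Q_f)

Q_f = ["D7", "D2", "D9", "D4", "D1"]          # final top documents
checkpoints = {                                # top documents after 10%, 20%, 30%
    10: ["D7", "D3", "D9", "D8", "D5"],
    20: ["D7", "D2", "D9", "D8", "D5"],
    30: ["D7", "D2", "D9", "D4", "D5"],
}
for i, Q_i in checkpoints.items():
    print(i, retrieval_accuracy(Q_i, Q_f))    # 40.0, 60.0, 80.0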

In our study, each document posting requires 6 bytes: 4 bytes for a document identifier and 2 bytes for a term frequency. In most cases, each disk access retrieves 20 document postings, corresponding to 120 bytes. The reason for this rather small page size is to account for the small document databases used in the experiment; if the page size is too large (e.g., 1K), most postings lists will fit in one page, which is not realistic for large databases. The results are shown in Figs. 2-5 for CACM, CISI, CRAN and MED, respectively. For each collection, the curves for the L method are shown in three diagrams, for 30-term, 50-term and 100-term queries. Most of the curves lie roughly along the 45-degree line. In general, almost all of the query terms must be processed in order to obtain all of the documents in Q_f. This indicates that the stopping criterion specified by the upperbound search algorithm is hard to meet and that no significant performance improvement can be achieved by the L method if total ranking is required. Our result is consistent with the study by Buckley and Lewit (Buckley & Lewit, 1985).

In the tf × idf weighting strategy, document weights are determined by both the tf and idf values. For typical document collections, the range of idf values is rather small, since the values are compressed by the log function. Thus, idf is the secondary component in determining the weight of a term, compared to tf, whose range is larger in general (see Table 1). However, the L method determines the processing order of query terms based on idf values alone, without taking tf values into consideration. This explains why the retrieval accuracy of the L method increases rather slowly. In fact, the upperbound search algorithm is too conservative to yield any significant performance gain. First, it is highly unlikely for the (T+1)st document to contain all the remaining unsearched terms. Second, the upperbound method considers the chance of the (T+1)st document being promoted to a top document while assuming that the weights of the other documents, including the current top documents, do not change. This assumption is unrealistic and makes the stopping criterion too pessimistic.

Two methods have been proposed to improve retrieval efficiency by relaxing the stopping criterion (Buckley & Lewit, 1985; Perry & Willett, 1983; Weiss, 1981). The first one only guarantees a subset of the top documents to be returned (i.e., only partial ranking is performed). The upperbound of the (T+1)st document is compared with the current score of the Sth document, where S < T. With this relaxation, the chance of meeting the stopping criterion is higher because of the larger difference in scores (Buckley & Lewit, 1985). The second one applies probability to the upperbound computation (Perry & Willett, 1983; Weiss, 1981). The process may stop early if the probability of the upperbound of the (T+1)st document exceeding the current score of the Tth document is small. But we find that the probabilistic stopping criterion is still unlikely to be met, because the score difference between the Tth and (T+1)st documents is usually very small. Figure 6 shows the document scores of the top 30 documents of one run on MED. There is no significant gap between two consecutive scores. Even if a gap exists, it may not fall between the Tth and (T+1)st documents to trigger early stopping. Thus, the (T+1)st document has a high probability of becoming a top document until almost all the query terms have been processed.
Moreover, the saving due to this method is still questionable, since the cost of computing probabilities with a large number of terms could be very significant. Thus, the upperbound computation not only fails to produce any performance gain but also induces computational overhead. These problems motivate our study of partial ranking. To overcome the shortcomings of the L method, we propose and investigate two search algorithms in the next two subsections. The idea of both methods is based upon greedy algorithms, and their objective is to improve the Ra's, especially at the initial stage of the retrieval process, i.e., to obtain a large portion of the top documents quickly without a large amount of I/O operations.

Figure 2: Average retrieval accuracies of the L method, the W method, and the SW method (120 bytes/page and 240 bytes/page) for CACM, plotted against the number of disk pages retrieved. (a) 30 query terms per query. (b) 50 query terms per query. (c) 100 query terms per query. Fifteen queries are tested for each query size.

Figure 3: Average retrieval accuracies of the L method, the W method, and the SW method (120 bytes/page and 240 bytes/page) for CISI, plotted against the number of disk pages retrieved. (a) 30 query terms per query. (b) 50 query terms per query. (c) 100 query terms per query. Fifteen queries are tested for each query size.

Figure 4: Average retrieval accuracies of the L method, the W method, and the SW method (120 bytes/page and 240 bytes/page) for CRAN, plotted against the number of disk pages retrieved. (a) 30 query terms per query. (b) 50 query terms per query. (c) 100 query terms per query. Fifteen queries are tested for each query size.

Figure 5: Average retrieval accuracies of the L method, the W method, and the SW method (120 bytes/page and 240 bytes/page) for MED, plotted against the number of disk pages retrieved. (a) 30 query terms per query. (b) 50 query terms per query. (c) 100 query terms per query. Fifteen queries are tested for each query size.

Figure 6: Top 30 document scores (document id and score) of one run on MED.

2.2 The W method

The process of document ranking is to obtain and accumulate the term weights of the documents from the postings lists, and to sort the final document scores in descending order. The W method takes two parameters into consideration, namely the maximum tf in a postings list (denoted by tf_max) and the length of that postings list. The query terms are then processed in descending order of their tf_max × idf values. In this case, the maximum tf value of each postings list is stored in the index file. Since two bytes are sufficient to store a term frequency, the additional storage overhead is not significant. A typical entry in the inverted file is as follows:

    term t_i : df, tf_max  -->  <d_j, tf_x> <d_k, tf_y> ...

Since this method processes postings lists which have a high potential of generating large increments to the document scores, partial scores of the top documents are accumulated faster than with the L method, without using a large amount of I/O. Consequently, the top documents in Q_f will be revealed earlier in the retrieval process. An experiment similar to that used for the L method was carried out to find the retrieval accuracy of the W method. The results are plotted in Figs. 2-5. Like the L method, most of the query terms need to be processed in order to obtain all of the documents in Q_f if total ranking is required, so the W method is still unable to obtain a significant performance gain. However, the retrieval accuracies of the W method are consistently higher than those of the L method in all the test collections, except for the 100-term queries in CACM. This means that if partial ranking is allowed, the W method can lead to a substantial improvement in terms of the number of disk accesses, for a retrieval accuracy of less than 100%. For instance, for the 100-term queries in CISI, after processing 30% of the disk accesses, the average Ra for the L method is less than 10%, while it is about 70% for the W method. As seen in Figs. 2-5, the increments of the Ra's in the W method are fast during the early disk accesses; however, the increments slow down as Ra approaches 100%. That is, the later disk accesses increase the retrieval accuracy very slowly, and the cost-effectiveness of this portion of the I/O operations is low. The major advantage of the W method is the ability to obtain a large number of top documents with a small number of disk accesses, especially at the initial stage of the retrieval process, with only a small additional storage overhead in the index file. However, this sorting method is still not the fastest one to obtain top documents for a given amount of disk accesses. In the following subsection, a more complex method is studied.
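Before turning to that method, here is a minimal Python sketch of the W-method ordering just described. The index-file statistics and helper names are assumptions for illustration, not the paper's code.

import math

N = 1000
# Index file entries: term -> (df, tf_max); the postings themselves stay on disk.
index_file = {
    "t1": (40, 4),
    "t2": (600, 9),
    "t3": (12, 3),
}

def idf(term):
    df, _ = index_file[term]
    return math.log2(N / df)

def w_method_order(query_terms):
    # W method: process postings lists in descending order of tf_max * idf,
    # so lists that can add large increments to document scores come first.
    return sorted(query_terms,
                  key=lambda t: index_file[t][1] * idf(t),
                  reverse=True)

print(w_method_order(["t1", "t2", "t3"]))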

2.3 The SW method

The SW method involves a more complex ordering than the first two methods. In the W method, postings lists are processed in decreasing order of their tf_max × idf values. When a postings list with a high tf_max × idf is retrieved, many disk pages with low term weights are retrieved and processed at the same time, yet they contribute little to finding the top documents. Thus, processing the query Q list by list is not the best way to achieve a high retrieval accuracy. To further exploit the idea of greediness, the SW search method is investigated. In the SW method, the postings of each postings list are first sorted by decreasing tf value. In other words, for a given term, documents with high tf values are put at the beginning of the list, and those with low tf's are put at the end. Moreover, a postings list is no longer viewed as a single item, but rather as a sequence of individual disk pages. This organization allows disk pages of high term weights to be processed before those of low term weights. For instance, the first page of term t_i may be processed first, then the first page of term t_j, then the first page of term t_k, then the second page of term t_i, and so on. The maximum tf of each page is stored in the index file so that disk pages with high tf values can be identified from the index file. A typical entry in the inverted file is shown below:

    term t_i : df, tf_{1,max}, tf_{2,max}, ...  -->  <d_j, tf_x> <d_k, tf_y> ...

The processing and storage overheads of the SW method are higher than those of the first two methods. However, since most postings lists are short according to Zipf's law (Zipf, 1949) and therefore occupy only a small number of disk pages, the overheads of keeping the tf values and maintaining the order of the disk pages are insignificant. This is especially true for environments where updates are done in batch and are infrequent compared to retrieval. Disk pages are processed in an order defined by three parameters, namely the maximum tf of the page, the length of the postings list, and the number of document identifiers in the page:

    tf_max × idf × f(I),

where tf_max is the maximum tf contained in the disk page and I is the number of document identifiers in the page. The function f(I) is included in the formula to account for the number of document identifiers in a disk page, which affects the contribution of the page to finding the top documents, especially when the weights of the top documents are determined by many terms. For example, a page having one identifier with a weight of 31 may not have a higher potential to accumulate weights than a full page of identifiers whose maximum weight is 29. First of all, the difference between the maximum weights of these two pages is small. Moreover, the former only increments the weight of one document, while the latter increments the weights of many documents. In this case, it is reasonable to process the disk page with the maximum weight of 29 first. The function f(I) is thus introduced to account for the degree of fullness of disk pages in determining the order of processing. A number of functions have been investigated; in this study, we use f(I) = I^e, where 0 < e < 1. The effect of I is restricted by e so that the maximum weights of the disk pages are still the primary determining factor. Since this method requires the document identifiers of each postings list to be sorted by decreasing tf value, it is referred to as the SW method. We study the SW method with two different page sizes, 120 bytes and 240 bytes, corresponding to 20 postings and 40 postings per disk page. For the page size of 120 bytes, e is set to .5. In other words, we take the square root of I in the formula.
As the page size is increased to 240 bytes (i.e., the maximum value of I is increased to 40), with the maximum tf in each postings list unchanged, we accordingly reduce e to .4 to restrict the significance of I in the processing order. The optimal value of e for a given page size requires further investigation.
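The page-level ordering can be sketched in Python as follows. This is a toy illustration under the assumptions above; the page contents, e = .5 and all helper names are invented for illustration, not taken from the paper.

import math

N, e = 1000, 0.5

# Each postings list is split into disk pages; postings within a list are
# sorted by decreasing tf, and the index file keeps the maximum tf per page.
# Each entry below: (term, df, page_no, max_tf_in_page, num_ids_in_page).
pages = [
    ("t1", 40,  1, 9, 20), ("t1", 40,  2, 3, 12),
    ("t2", 600, 1, 7, 20), ("t2", 600, 2, 5, 20),
    ("t3", 12,  1, 4,  6),
]

def priority(page):
    term, df, _, max_tf, num_ids = page
    idf = math.log2(N / df)
    # SW method: order pages by tf_max * idf * f(I), with f(I) = I ** e.
    return max_tf * idf * (num_ids ** e)

for page in sorted(pages, key=priority, reverse=True):
    print(page[:3], round(priority(page), 2))

In a full implementation the pages would be kept in a priority queue keyed on this value and fetched from disk one at a time, which is what allows the first page of one term to be processed before the second page of another.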

Once again, an experiment similar to that used for the L method was carried out to test the retrieval accuracy of the SW method. The results are plotted for the page sizes of 120 and 240 bytes in Figs. 2-5. We find that the SW method is better than the W method and the L method in terms of retrieval accuracy in all the test collections for all query sizes, demonstrating the superior performance of this sorting method. In Fig. 4, we find that if 80% of the top documents in CRAN are required, about 90% of the disk accesses are needed for the L method, 60% for the W method, and fewer still for the SW method. In Figs. 2-5, the average retrieval accuracies of the SW method are consistently better than those of the W method by at least about 5%. Judging by the retrieval accuracies at 10-30% of disk accesses for the SW method, there is still plenty of room for improvement, which is left for future investigation. In the following section, two estimation methods for the W method and SW method are proposed to estimate the Ra's at different points of the retrieval process. Note that the retrieval accuracies in the previous experiments are obtained assuming that Q_f is known in advance, which is not the case in reality.

3 Estimations of retrieval accuracies for the W method and SW method

In order to realize the advantage of the W method and SW method, the system must be able to estimate the Ra's at different points of the retrieval process so that it can stop the retrieval process when a desired retrieval accuracy has been reached. The terminating condition could be a threshold determined by the system or specified by the user. If the system fails to estimate the retrieval accuracy, we may have searched more postings lists than necessary, incurring unnecessary disk accesses, or stopped too soon, resulting in an accuracy lower than the user's specification. Estimation of retrieval accuracies can be done in a number of ways. For instance, the values of the Ra_i's can be determined by carefully calibrating the retrieval system. This method is simple and produces good estimations for a stable database, but may not work well when the parameters of the database change frequently. Alternatively, an analytical model can be developed based upon the distribution functions of the tf values, the df values, and the search strategy. An accurate analytical model should be able to give better predictions of the Ra_i's under different system parameters. Unfortunately, such a model is difficult to derive. Instead, we study in this section two heuristic methods to estimate the Ra's at the checkpoints. It should be noted that they are not used to further improve the capability of obtaining better retrieval accuracy; rather, they are used to estimate the retrieval accuracy during a retrieval process. The first method is called the document movement method (or the Dm method). It uses a heuristic to estimate whether a document is likely to be a top document based upon document movements during the retrieval process. The second is called the linear regression method (or the Lr method). In the Lr method, an indicator is first developed, and then a regression line is constructed to relate the retrieval accuracy and the indicator by using linear regression techniques. We use these two estimation methods to estimate the retrieval accuracies of both the W method and the SW method. Their accuracies and processing costs are compared. Since there are many similarities between the W method and the SW method, the explanations are mainly based on the W method, with the features corresponding to the SW method following in parentheses.
3.1 Estimations by the document movement method

As the document weights are sorted during the ranking process, the document identifiers with their weights change their positions in the document-weight array, called doc_wt hereafter. To study the behavior of document movements, the top T positions of the array are called the Candidate Region, and the documents in this region are called candidate documents.

Figure 7: The movement of documents in a ranking process. (The columns of the figure show the ranking after each of the four query terms is processed, ending with the final top five documents.)

Fig. 7 illustrates the document movements of the 10 documents with the highest scores. In this example, there are four query terms, and it is assumed that only the top 5 documents are returned to the user (i.e., T = 5). The arrows indicate the movements of the 5 documents which are eventually in Q_f. Documents 1261, 298 and 56 stay in the Candidate Region during the whole retrieval process. Document 356 stays in the Candidate Region after term 1 is processed, moves out of the Region after term 2 is retrieved, and eventually gets back into the Candidate Region. Document 115 was outside the top 5 after term 1 is processed, but it gets into the Candidate Region after term 2 is processed. Even though the document movements seem complex and random, with the use of the W method and SW method some documents remain in the high ranks consistently, while others move up and/or down in the array.

To study the temporal behavior inside the Candidate Region, Fig. 8 records the document movements for one query on the MED collection, with the use of the W method. During the whole retrieval process, a total of 47 documents entered the Candidate Region. To simplify the explanation, the documents are assigned pseudo identifiers from 1 to 47 and sorted so that top documents are shown on the right of the figure. In this experiment, the top documents are collected and examined after every five query terms have been processed. As can be observed from Fig. 8, among the top documents obtained at the point when the first 5 postings lists have been processed, only 10 remain after the next five query terms have been processed (documents 1 to 10, shown near the lower left corner of the diagram, have been eliminated from the Candidate Region and never appeared in it again). We find that some documents stay in the Candidate Region for a long period of time and some are expelled very soon. To show this behavior pictorially, a continuous vertical line is drawn upwards if a document remains in the Candidate Region at consecutive checkpoints; the vertical line is terminated if the document leaves the Region. The continuous lines at the right part of the graph indicate that many documents stay in the Candidate Region in a stable manner. For instance, if we had stopped after 10 query terms had been processed, 18 out of the final top documents would have been identified, and documents 26 and 27 would be retrieved instead of 28 and 29.

Based on the somewhat regular behavior of the document movements, heuristics can be developed. The following two observations establish the guidelines for our heuristic. First, those documents which have relatively high scores at the beginning of the retrieval and get into the top t ranks of the Candidate Region, where t < T, tend to stay in the Candidate Region for the rest of the processing (the top t ranks of the Candidate Region are called the Stable Region, represented by SR). Therefore, there is a high probability that documents in SR will be retained in Q_f. Second, those documents which are eventually in Q_f tend to stay in the Candidate Region for a long period of time, while those which are not in Q_f tend to move in and out of the Candidate Region more frequently.

Figure 8: The temporal behavior in the Candidate Region for the MED collection. (The horizontal axis gives the document pseudo identifiers; the vertical axis gives the number of postings lists searched.)

Thus, the duration for which a document stays in the Candidate Region can be used to predict the documents in Q_f. Since postings lists (disk pages) are sorted by decreasing tf_max × idf for the W method (tf_max × idf × f(I) for the SW method), the influence of the remaining postings lists (disk pages) on the final ranking will decrease as more lists (pages) are processed. Thus, the duration requirement for potential documents in Q_f should decrease as the search proceeds. To support this feature, we add to each document a counter which is increased by one whenever the document moves into the Candidate Region. The document-weight array doc_wt is sorted after each disk page of postings is processed. A candidate document is considered a top document if it stays in the Candidate Region for a certain period of time. There are 10 checkpoints taken over the whole retrieval process. At each checkpoint, the number of top documents is estimated based upon the following requirements. In our study, we set SR to be 20% of the top T documents, with T equal to 20; the choice of the SR value is experimental. Documents in this sub-region are counted as top documents independently of how long they have stayed in it. For the first 4 checkpoints (i.e., after 10%, 20%, 30% and 40% of disk accesses), we require that a candidate document, to be counted as a top document, stay in the Candidate Region 75% of the time. This means that for 100 disk accesses, after 30% of the disk accesses are processed, the counted documents must have stayed in the Candidate Region at least 22 times. This requirement is decreased as more lists are processed, to reflect the decreasing influence of smaller term weights on the final ranking towards the end of the retrieval process. The requirement is decreased by 5% for every 10% of disk accesses. This means that for 100 disk accesses, at 50% of the disk accesses, the counted documents must have stayed in the Candidate Region at least 35 times. A sketch of this stability-based counting is given below.

The discrepancies for the four test collections based on this estimation method are shown in Figs. 9 and 10. In each diagram, the horizontal axis corresponds to the percentage of the total disk accesses; the vertical axis corresponds to the absolute discrepancy between the actual Ra and the estimated Ra. We use the height of the box to represent the maximum discrepancy at a particular checkpoint, and the horizontal line inside the box to represent the average discrepancy. For each collection, 45 queries are processed.

In Fig. 11, the estimation discrepancies of the W method and SW method based on the Dm method are shown. The results will be discussed after the next method is described.

3.2 Estimations by the linear regression method

The major reason that the W method (SW method) performs better than the L method is that it processes postings lists (disk pages) with heavy weights earlier than those with light weights. Let the total weight of all the postings lists (disk pages) corresponding to a query be w. At a particular checkpoint, if the weight processed thus far is w', it is reasonable to expect that the larger the weight ratio r = w'/w, the more likely it is that the top documents have already been obtained. Initially, w' = 0, so r = 0 and the corresponding Ra = 0%. When all postings lists (disk pages) have been processed, w' = w, so r = 1 and the corresponding Ra = 100%. The weight ratio r can be expected to be more or less proportional to Ra. However, the weight ratio r alone cannot provide accurate estimations. The problem can be seen from the different distributions of the partial scores among the candidate documents when a number of postings lists (disk pages) remain unprocessed. For a given value of w', if most of the weight is concentrated on a few documents, and they are much heavier than the remaining postings lists (disk pages), then they have a high chance of being in Q_f. On the other hand, if the weight is evenly distributed among the documents, and they are not much heavier than the remaining postings lists (disk pages), then they have a low chance of being in Q_f. We find that the differences between the partial scores of the candidate documents and the maximum weights of the unprocessed postings lists (disk pages) are also an important indicator for the Ra's. If h candidate documents have weights higher than the maximum weight of the next heaviest postings list (disk page) (i.e., they are heavier than all the remaining postings lists (disk pages)) and h is very close to T, then it is likely that a large portion of the candidate documents will be in Q_f. However, if h is small, then it is likely that very few candidate documents will be in Q_f. Thus, the value of h is proportional to the value of Ra, and s = h/T is called the safe ratio of candidate documents in the Candidate Region.

The weight ratio r and the safe ratio s are two indicators that can estimate the number of top documents obtained at a particular checkpoint. We expect that if r + s is small at a checkpoint, Ra is small; if it is large, Ra is large. However, as with the Dm method, the query size can affect Ra at the same time. If there is only a small number of postings lists to be processed, the chance for a document to further gain weight is small and the sum of r and s reflects the value of Ra to a greater extent. The situation becomes complex when a large number of postings lists remain to be processed, because there are high chances for a document to gain weight from different postings lists (disk pages). Thus, the size of the query must be taken into consideration as well. To formulate the indicator, a few more definitions are given below (a small sketch of these quantities follows the list):

1. w = Σ_{j=1}^{P} max_wt[j], where P is the total number of disk pages required to process all the query terms and max_wt[j] is the maximum weight of the jth page. It should be noted that the maximum term weight of a page, rather than the sum of the term weights in a page, is used in our computation.

2. w_i = Σ_{j=1}^{⌈iP/100⌉} max_wt[j]. The value depends on which sorting method is used, namely the W method or the SW method.

3. r_i = w_i / w, the weight ratio.
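The Dm estimation can be sketched as follows in Python. This is an illustrative reconstruction under the assumptions stated above (T = 20, SR = 20%, the 75%-then-decreasing stability requirement); the helper names and data layout are invented, and it is not the authors' code.

import math

T = 20                     # number of top documents requested
SR_SIZE = max(1, T // 5)   # Stable Region: top 20% of the Candidate Region

def estimate_top_docs(doc_wt, in_region_count, pages_done, total_pages):
    """Estimate which candidate documents will be in Q_f at a checkpoint.

    doc_wt          -- list of (doc_id, partial score), sorted by descending score
    in_region_count -- doc_id -> number of page-sorts the document has spent
                       in the Candidate Region so far
    """
    pct_done = 100.0 * pages_done / total_pages
    # 75% stability required up to the 40% checkpoint, then 5% less per 10%.
    required_fraction = 0.75 - 0.05 * max(0, math.ceil((pct_done - 40) / 10))
    required_stays = required_fraction * pages_done

    estimated = []
    for rank, (doc, _) in enumerate(doc_wt[:T]):
        if rank < SR_SIZE:                        # Stable Region: always counted
            estimated.append(doc)
        elif in_region_count.get(doc, 0) >= required_stays:
            estimated.append(doc)                 # stayed long enough to count
    return estimated

# Example: at the 50% checkpoint of a 100-page retrieval (toy values).
doc_wt = [(f"D{i}", 100 - i) for i in range(30)]
counts = {f"D{i}": 50 - i for i in range(30)}
print(estimate_top_docs(doc_wt, counts, pages_done=50, total_pages=100))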
4. s_i = h_i / T, the safe ratio, where h_i is the number of candidate documents whose current weights are larger than the maximum weight of the remaining postings lists (disk pages), after i% of the disk pages have been processed.
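A small Python sketch of how these quantities could be computed at a checkpoint (illustrative only; max_wt, doc_wt and the variable names are assumptions for the sketch, not the paper's code):

def weight_ratio(max_wt, pages_done):
    # r_i = w_i / w: processed maximum page weights over the total.
    w = sum(max_wt)
    w_i = sum(max_wt[:pages_done])
    return w_i / w

def safe_ratio(doc_wt, max_wt, pages_done, T):
    # s_i = h_i / T: fraction of the top-T candidates whose partial score
    # already exceeds the maximum weight of every unprocessed page.
    remaining_max = max(max_wt[pages_done:], default=0.0)
    top_scores = sorted(doc_wt.values(), reverse=True)[:T]
    h_i = sum(1 for score in top_scores if score > remaining_max)
    return h_i / T

# Toy example: 10 pages, 40% processed, T = 5 candidates inspected.
max_wt = [30.0, 27.0, 25.0, 22.0, 20.0, 18.0, 15.0, 12.0, 8.0, 5.0]
doc_wt = {"D1": 55.0, "D2": 31.0, "D3": 19.0, "D4": 12.0, "D5": 7.0}
print(weight_ratio(max_wt, 4), safe_ratio(doc_wt, max_wt, 4, T=5))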

Figure 9: The estimation discrepancies of the W method by the Dm method (light gray boxes) and the Lr method (white boxes). Forty-five queries are processed.

Figure 10: The estimation discrepancies of the SW method by the Dm method (dark gray boxes) and the Lr method (gray boxes). Forty-five queries are processed.

Figure 11: The estimation discrepancies of the W method (light gray boxes) and SW method (dark gray boxes) by the Dm method. Forty-five queries are processed.

Figure 12: The estimation discrepancies of the W method (white boxes) and SW method (gray boxes) by the Lr method. Forty-five queries are processed.

After i% of the disk accesses have been processed and the array doc_wt has been sorted, the indicator Ind_i for the W method (SW method) is defined as follows, for each checkpoint i:

    Ind_i = (r_i + s_i) − comb_factor(k, i).

The value of comb_factor(k, i) is the amount subtracted from r + s to account for the effect of the combination of term weights from different postings lists on the final ranking. If the query size k is large, a large amount should be subtracted; on the contrary, if the query size is small, a small amount should be subtracted. However, this amount of subtraction is reduced towards the end of the retrieval process, to reflect the decreasing influence of smaller term weights on the final ranking. Therefore, we use comb_factor(k, i) = c·k·(1 − i/100), where c is a constant. Since the maximum value of r + s is 2 and the values of r + s are generally below .5 at the first few checkpoints, we set c = 1/500 to control the amount of subtraction; for instance, the maximum subtraction is about .2 at the first checkpoint for 100-term queries. At the point when all query terms have been processed, nothing is subtracted, r + s = 2 and Ra = 100%. However, it seems reasonable to bound the subtraction; we require that Ind_i be non-negative. In order to compute the above indicator, in addition to the maximum tf stored in the index file, the sum of the maximum weights of the disk pages in a list, denoted by sum_max_wt, is also stored in the index file and is used to compute w above.

To examine the relationship between the Ind's and the Ra's, the following experiment was conducted. For each collection, 45 queries are run. We take checkpoints at every 10% of disk accesses as before, and the Ind's are computed at the same time. In Table 2, the (Ind, Ra) pairs for two queries in CACM by the W method are shown. These (Ind, Ra) pairs can be viewed as points in a two-dimensional plane, with the x-axis for Ind and the y-axis for Ra. In Figs. 13 and 14, the points corresponding to the W method and SW method are shown in two diagrams for each collection. Since there are 45 queries for each collection, and 11 points for each query (including the 0% and 100% points), 495 points in total are plotted in each diagram. Two of the point symbols in Fig. 13 correspond to sample queries 1 and 2 in Table 2. Based upon the distribution of the points for each collection, we find that, in general, the larger the Ind, the larger the Ra. Statistically, we can establish a linear relationship between the Ind's and the Ra's; regression lines can be drawn for each collection by using the linear regression technique. The regression lines and the corresponding correlation coefficients for CACM, CISI, CRAN and MED are shown in Figs. 13 and 14, respectively. Since the correlation coefficients are about .8, .9, .89, and .89 for the W method (.95, .95, .97 and .95 for the SW method) for CACM, CISI, CRAN and MED, we can conclude that there is a strong correlation between Ind and Ra, especially for the SW method. Once the regression line is drawn, we can estimate the value of Ra whenever Ind is computed. The discrepancies of this method are shown in Figs. 9 and 10. The estimation discrepancies of the W method and SW method based on the Lr method are shown in Fig. 12.
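The following Python sketch illustrates the indicator computation and the use of a fitted regression line. It is a toy illustration under the assumptions above; the calibration data, the constant c = 1/500 and the helper names are assumptions, not the paper's code.

def indicator(r_i, s_i, k, i, c=1.0 / 500):
    # Ind_i = (r_i + s_i) - comb_factor(k, i), bounded below by zero.
    comb_factor = c * k * (1 - i / 100.0)
    return max(0.0, r_i + s_i - comb_factor)

def fit_regression(points):
    # Ordinary least-squares line Ra = a + b * Ind over calibration points.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / \
        sum((x - mx) ** 2 for x, _ in points)
    return my - b * mx, b

# Calibration (Ind, Ra) pairs gathered from training queries (toy values).
calibration = [(0.1, 10), (0.3, 22), (0.6, 38), (0.9, 55),
               (1.2, 68), (1.5, 81), (1.8, 93), (2.0, 100)]
a, b = fit_regression(calibration)

# At run time: estimate Ra from the current weight ratio and safe ratio.
ind = indicator(r_i=0.45, s_i=0.30, k=100, i=30)
print(round(a + b * ind, 1))

The regression line only needs to be fitted once per collection from calibration runs; at query time the estimate is a single evaluation of a + b·Ind.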
3.3 The comparison between these two estimation methods

A good estimator should have the following two properties: its estimations are close to the actual values, and it is easy to implement. Based upon these criteria, the two estimation methods are compared. Let us first consider the estimations for the W method. Comparing the heights of the light gray boxes with those of the white boxes in Fig. 9, except for 3 cases in CACM after the 60% point and one case in CRAN at the 50% point, the Lr method performs better than the Dm method in terms of maximum discrepancies. As far as the average discrepancies are concerned, the Lr method also performs better than the Dm method in most cases, except for 4 cases in CACM and one case in CRAN. The average discrepancies of either estimation method are mostly in the 10-20% range. In general, both the maximum and average discrepancies decrease as the search proceeds.

Figure 13: The regression lines for the test collections based on the W method. The horizontal axis corresponds to the indicator; the vertical axis corresponds to the retrieval accuracy. The regression line and the correlation coefficient (r) are given on top of each diagram (r = .81 for CACM, .9 for CISI, .89 for CRAN and .89 for MED).

Figure 14: The regression lines for the test collections based on the SW method (CACM: y = 3.8 + 49.17x, r = .95; CISI: y = 3.8 + 48.37x, r = .95; CRAN: y = 4.4 + 48.73x, r = .97; MED: y = 6.3 + 48.13x, r = .95). The horizontal axis corresponds to the indicator; the vertical axis corresponds to the retrieval accuracy.

Table 2: The (Ind, Ra) pairs at the ten checkpoints for two sample queries in CACM by the W method.

Now, consider the estimations for the SW method. Comparing the heights of the dark gray boxes with those of the gray boxes in Fig. 10, the Lr method consistently performs better than the Dm method at all points, in terms of both maximum and average discrepancies. The average discrepancies of the Dm method are mostly in the 10-20% range, while those of the Lr method are mostly in the 5-10% range. Based upon these comparisons, the estimation ability of the Lr method is better than that of the Dm method.

To find out which sorting algorithm is better estimated by the estimation methods, the maximum and average discrepancies of the W method and SW method for each estimation heuristic are compared in Figs. 11 and 12, respectively. For the Dm method, except for 1 case in CACM, 1 case in CISI and 1 case in MED, the maximum discrepancies for the SW method are smaller. Except for 4 cases in CACM, the average discrepancies for the SW method are smaller as well. As far as the Lr method is concerned, the estimations for the SW method are better than those for the W method in almost all cases, in terms of both maximum and average discrepancies. Moreover, the discrepancies of the SW method are much smaller than those of the W method. Based upon these results, the estimation methods can approximate the SW method with higher accuracy. For practical purposes, both estimation methods achieve satisfactory results on average.

Let us now turn to the efficiency of the methods. The algorithm of the Dm method seems easy to implement. It involves counting the durations of documents in the Candidate Region and sorting the document-weight array after each disk page is processed. Based upon the description above, the Lr method seems more complex, because it involves the construction of an indicator and the use of linear regression techniques to generate a regression line for each document base. However, in operating environments, the Lr method performs better than the Dm method. The process of generating a regression line is done once for each document collection (or after the document collection has been extensively updated) and thus it is not costly. To process a query, the document-weight array is sorted only 10 times, at the 10 checkpoints, and the estimation of the Ra's can be done rather inexpensively by mapping the value of Ind onto the regression line. The Dm method, however, has to sort the document-weight array after every disk page is processed. If, say, 300 disk pages are retrieved, the array is sorted 300 times, compared to the 10 times required by the Lr method. One immediate suggestion for the Dm method is to sort doc_wt less frequently. However, the discrepancies of the estimations could get worse when the behavior of document movements in the Candidate Region is not


More information

Learning Robust Locality Preserving Projection via p-order Minimization

Learning Robust Locality Preserving Projection via p-order Minimization Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence Learning Robust Locality Preserving Projection via -Order Minimization Hua Wang, Feiing Nie, Heng Huang Deartment of Electrical

More information

An empirical analysis of loopy belief propagation in three topologies: grids, small-world networks and random graphs

An empirical analysis of loopy belief propagation in three topologies: grids, small-world networks and random graphs An emirical analysis of looy belief roagation in three toologies: grids, small-world networks and random grahs R. Santana, A. Mendiburu and J. A. Lozano Intelligent Systems Grou Deartment of Comuter Science

More information

Figure 8.1: Home age taken from the examle health education site (htt:// Setember 14, 2001). 201

Figure 8.1: Home age taken from the examle health education site (htt://  Setember 14, 2001). 201 200 Chater 8 Alying the Web Interface Profiles: Examle Web Site Assessment 8.1 Introduction This chater describes the use of the rofiles develoed in Chater 6 to assess and imrove the quality of an examle

More information

IEEE Coyright Notice Personal use of this material is ermitted. However, ermission to rerint/reublish this material for advertising or romotional uroses or for creating new collective works for resale

More information

Truth Trees. Truth Tree Fundamentals

Truth Trees. Truth Tree Fundamentals Truth Trees 1 True Tree Fundamentals 2 Testing Grous of Statements for Consistency 3 Testing Arguments in Proositional Logic 4 Proving Invalidity in Predicate Logic Answers to Selected Exercises Truth

More information

Improve Precategorized Collection Retrieval by Using Supervised Term Weighting Schemes Λ

Improve Precategorized Collection Retrieval by Using Supervised Term Weighting Schemes Λ Imrove Precategorized Collection Retrieval by Using Suervised Term Weighting Schemes Λ Ying Zhao and George Karyis University of Minnesota, Deartment of Comuter Science Minneaolis, MN 55455 Abstract The

More information

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree To aear in IEEE TKDE Title: Efficient Skyline and To-k Retrieval in Subsaces Keywords: Skyline, To-k, Subsace, B-tree Contact Author: Yufei Tao (taoyf@cse.cuhk.edu.hk) Deartment of Comuter Science and

More information

Grouping of Patches in Progressive Radiosity

Grouping of Patches in Progressive Radiosity Grouing of Patches in Progressive Radiosity Arjan J.F. Kok * Abstract The radiosity method can be imroved by (adatively) grouing small neighboring atches into grous. Comutations normally done for searate

More information

SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation

SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation To aear in IEEE VLSI Test Symosium, 1997 SITFIRE: Scalable arallel Algorithms for Test Set artitioned Fault Simulation Dili Krishnaswamy y Elizabeth M. Rudnick y Janak H. atel y rithviraj Banerjee z y

More information

Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22

Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22 Collective Communication: Theory, Practice, and Exerience FLAME Working Note # Ernie Chan Marcel Heimlich Avi Purkayastha Robert van de Geijn Setember, 6 Abstract We discuss the design and high-erformance

More information

J. Parallel Distrib. Comput.

J. Parallel Distrib. Comput. J. Parallel Distrib. Comut. 71 (2011) 288 301 Contents lists available at ScienceDirect J. Parallel Distrib. Comut. journal homeage: www.elsevier.com/locate/jdc Quality of security adatation in arallel

More information

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Min Hu, Saad Ali and Mubarak Shah Comuter Vision Lab, University of Central Florida {mhu,sali,shah}@eecs.ucf.edu Abstract Learning tyical

More information

A Novel Iris Segmentation Method for Hand-Held Capture Device

A Novel Iris Segmentation Method for Hand-Held Capture Device A Novel Iris Segmentation Method for Hand-Held Cature Device XiaoFu He and PengFei Shi Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200030, China {xfhe,

More information

Sensitivity of multi-product two-stage economic lotsizing models and their dependency on change-over and product cost ratio s

Sensitivity of multi-product two-stage economic lotsizing models and their dependency on change-over and product cost ratio s Sensitivity two stage EOQ model 1 Sensitivity of multi-roduct two-stage economic lotsizing models and their deendency on change-over and roduct cost ratio s Frank Van den broecke, El-Houssaine Aghezzaf,

More information

SEARCH ENGINE MANAGEMENT

SEARCH ENGINE MANAGEMENT e-issn 2455 1392 Volume 2 Issue 5, May 2016. 254 259 Scientific Journal Imact Factor : 3.468 htt://www.ijcter.com SEARCH ENGINE MANAGEMENT Abhinav Sinha Kalinga Institute of Industrial Technology, Bhubaneswar,

More information

Convex Hulls. Helen Cameron. Helen Cameron Convex Hulls 1/101

Convex Hulls. Helen Cameron. Helen Cameron Convex Hulls 1/101 Convex Hulls Helen Cameron Helen Cameron Convex Hulls 1/101 What Is a Convex Hull? Starting Point: Points in 2D y x Helen Cameron Convex Hulls 3/101 Convex Hull: Informally Imagine that the x, y-lane is

More information

Equality-Based Translation Validator for LLVM

Equality-Based Translation Validator for LLVM Equality-Based Translation Validator for LLVM Michael Ste, Ross Tate, and Sorin Lerner University of California, San Diego {mste,rtate,lerner@cs.ucsd.edu Abstract. We udated our Peggy tool, reviously resented

More information

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model IMS Network Deloyment Cost Otimization Based on Flow-Based Traffic Model Jie Xiao, Changcheng Huang and James Yan Deartment of Systems and Comuter Engineering, Carleton University, Ottawa, Canada {jiexiao,

More information

Randomized algorithms: Two examples and Yao s Minimax Principle

Randomized algorithms: Two examples and Yao s Minimax Principle Randomized algorithms: Two examles and Yao s Minimax Princile Maximum Satisfiability Consider the roblem Maximum Satisfiability (MAX-SAT). Bring your knowledge u-to-date on the Satisfiability roblem. Maximum

More information

Pivot Selection for Dimension Reduction Using Annealing by Increasing Resampling *

Pivot Selection for Dimension Reduction Using Annealing by Increasing Resampling * ivot Selection for Dimension Reduction Using Annealing by Increasing Resamling * Yasunobu Imamura 1, Naoya Higuchi 1, Tetsuji Kuboyama 2, Kouichi Hirata 1 and Takeshi Shinohara 1 1 Kyushu Institute of

More information

10. Parallel Methods for Data Sorting

10. Parallel Methods for Data Sorting 10. Parallel Methods for Data Sorting 10. Parallel Methods for Data Sorting... 1 10.1. Parallelizing Princiles... 10.. Scaling Parallel Comutations... 10.3. Bubble Sort...3 10.3.1. Sequential Algorithm...3

More information

GEOMETRIC CONSTRAINT SOLVING IN < 2 AND < 3. Department of Computer Sciences, Purdue University. and PAMELA J. VERMEER

GEOMETRIC CONSTRAINT SOLVING IN < 2 AND < 3. Department of Computer Sciences, Purdue University. and PAMELA J. VERMEER GEOMETRIC CONSTRAINT SOLVING IN < AND < 3 CHRISTOPH M. HOFFMANN Deartment of Comuter Sciences, Purdue University West Lafayette, Indiana 47907-1398, USA and PAMELA J. VERMEER Deartment of Comuter Sciences,

More information

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScrit Objects Shiyi Wei and Barbara G. Ryder Deartment of Comuter Science, Virginia Tech, Blacksburg, VA, USA. {wei,ryder}@cs.vt.edu

More information

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal

An Efficient VLSI Architecture for Adaptive Rank Order Filter for Image Noise Removal International Journal of Information and Electronics Engineering, Vol. 1, No. 1, July 011 An Efficient VLSI Architecture for Adative Rank Order Filter for Image Noise Removal M. C Hanumantharaju, M. Ravishankar,

More information

A DEA-bases Approach for Multi-objective Design of Attribute Acceptance Sampling Plans

A DEA-bases Approach for Multi-objective Design of Attribute Acceptance Sampling Plans Available online at htt://ijdea.srbiau.ac.ir Int. J. Data Enveloment Analysis (ISSN 2345-458X) Vol.5, No.2, Year 2017 Article ID IJDEA-00422, 12 ages Research Article International Journal of Data Enveloment

More information

A Measurement Study of Internet Bottlenecks

A Measurement Study of Internet Bottlenecks A Measurement Study of Internet Bottlenecks Ningning Hu, Li (Erran) Li y, Zhuoqing Morley Mao z, Peter Steenkiste and Jia Wang x Carnegie Mellon University, Email: fhnn, rsg@cs.cmu.edu y Bell Laboratories,

More information

Stereo Disparity Estimation in Moment Space

Stereo Disparity Estimation in Moment Space Stereo Disarity Estimation in oment Sace Angeline Pang Faculty of Information Technology, ultimedia University, 63 Cyberjaya, alaysia. angeline.ang@mmu.edu.my R. ukundan Deartment of Comuter Science, University

More information

Experiments on Patent Retrieval at NTCIR-4 Workshop

Experiments on Patent Retrieval at NTCIR-4 Workshop Working Notes of NTCIR-4, Tokyo, 2-4 June 2004 Exeriments on Patent Retrieval at NTCIR-4 Worksho Hironori Takeuchi Λ Naohiko Uramoto Λy Koichi Takeda Λ Λ Tokyo Research Laboratory, IBM Research y National

More information

Patterned Wafer Segmentation

Patterned Wafer Segmentation atterned Wafer Segmentation ierrick Bourgeat ab, Fabrice Meriaudeau b, Kenneth W. Tobin a, atrick Gorria b a Oak Ridge National Laboratory,.O.Box 2008, Oak Ridge, TN 37831-6011, USA b Le2i Laboratory Univ.of

More information

TOPP Probing of Network Links with Large Independent Latencies

TOPP Probing of Network Links with Large Independent Latencies TOPP Probing of Network Links with Large Indeendent Latencies M. Hosseinour, M. J. Tunnicliffe Faculty of Comuting, Information ystems and Mathematics, Kingston University, Kingston-on-Thames, urrey, KT1

More information

A 2D Random Walk Mobility Model for Location Management Studies in Wireless Networks Abstract: I. Introduction

A 2D Random Walk Mobility Model for Location Management Studies in Wireless Networks Abstract: I. Introduction A D Random Walk Mobility Model for Location Management Studies in Wireless Networks Kuo Hsing Chiang, RMIT University, Melbourne, Australia Nirmala Shenoy, Information Technology Deartment, RIT, Rochester,

More information

Face Recognition Using Legendre Moments

Face Recognition Using Legendre Moments Face Recognition Using Legendre Moments Dr.S.Annadurai 1 A.Saradha Professor & Head of CSE & IT Research scholar in CSE Government College of Technology, Government College of Technology, Coimbatore, Tamilnadu,

More information

Semi-Supervised Learning Based Object Detection in Aerial Imagery

Semi-Supervised Learning Based Object Detection in Aerial Imagery Semi-Suervised Learning Based Obect Detection in Aerial Imagery Jian Yao Zhongfei (Mark) Zhang Deartment of Comuter Science, State University of ew York at Binghamton, Y 13905, USA yao@binghamton.edu Zhongfei@cs.binghamton.edu

More information

An Indexing Framework for Structured P2P Systems

An Indexing Framework for Structured P2P Systems An Indexing Framework for Structured P2P Systems Adina Crainiceanu Prakash Linga Ashwin Machanavajjhala Johannes Gehrke Carl Lagoze Jayavel Shanmugasundaram Deartment of Comuter Science, Cornell University

More information

Improving Trust Estimates in Planning Domains with Rare Failure Events

Improving Trust Estimates in Planning Domains with Rare Failure Events Imroving Trust Estimates in Planning Domains with Rare Failure Events Colin M. Potts and Kurt D. Krebsbach Det. of Mathematics and Comuter Science Lawrence University Aleton, Wisconsin 54911 USA {colin.m.otts,

More information

Leak Detection Modeling and Simulation for Oil Pipeline with Artificial Intelligence Method

Leak Detection Modeling and Simulation for Oil Pipeline with Artificial Intelligence Method ITB J. Eng. Sci. Vol. 39 B, No. 1, 007, 1-19 1 Leak Detection Modeling and Simulation for Oil Pieline with Artificial Intelligence Method Pudjo Sukarno 1, Kuntjoro Adji Sidarto, Amoranto Trisnobudi 3,

More information

Building Better Nurse Scheduling Algorithms

Building Better Nurse Scheduling Algorithms Building Better Nurse Scheduling Algorithms Annals of Oerations Research, 128, 159-177, 2004. Dr Uwe Aickelin Dr Paul White School of Comuter Science University of the West of England University of Nottingham

More information

I ACCEPT NO RESPONSIBILITY FOR ERRORS ON THIS SHEET. I assume that E = (V ).

I ACCEPT NO RESPONSIBILITY FOR ERRORS ON THIS SHEET. I assume that E = (V ). 1 I ACCEPT NO RESPONSIBILITY FOR ERRORS ON THIS SHEET. I assume that E = (V ). Data structures Sorting Binary heas are imlemented using a hea-ordered balanced binary tree. Binomial heas use a collection

More information

Process and Measurement System Capability Analysis

Process and Measurement System Capability Analysis Process and Measurement System aability Analysis Process caability is the uniformity of the rocess. Variability is a measure of the uniformity of outut. Assume that a rocess involves a quality characteristic

More information

Interactive Image Segmentation

Interactive Image Segmentation Interactive Image Segmentation Fahim Mannan (260 266 294) Abstract This reort resents the roject work done based on Boykov and Jolly s interactive grah cuts based N-D image segmentation algorithm([1]).

More information

3D Surface Simplification Based on Extended Shape Operator

3D Surface Simplification Based on Extended Shape Operator 3D Surface Simlification Based on Extended Shae Oerator JUI-LIG TSEG, YU-HSUA LI Deartment of Comuter Science and Information Engineering, Deartment and Institute of Electrical Engineering Minghsin University

More information

Use of Multivariate Statistical Analysis in the Modelling of Chromatographic Processes

Use of Multivariate Statistical Analysis in the Modelling of Chromatographic Processes Use of Multivariate Statistical Analysis in the Modelling of Chromatograhic Processes Simon Edwards-Parton 1, Nigel itchener-hooker 1, Nina hornhill 2, Daniel Bracewell 1, John Lidell 3 Abstract his aer

More information

Multi-robot SLAM with Unknown Initial Correspondence: The Robot Rendezvous Case

Multi-robot SLAM with Unknown Initial Correspondence: The Robot Rendezvous Case Multi-robot SLAM with Unknown Initial Corresondence: The Robot Rendezvous Case Xun S. Zhou and Stergios I. Roumeliotis Deartment of Comuter Science & Engineering, University of Minnesota, Minneaolis, MN

More information

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH Jin Lu, José M. F. Moura, and Urs Niesen Deartment of Electrical and Comuter Engineering Carnegie Mellon University, Pittsburgh, PA 15213 jinlu, moura@ece.cmu.edu

More information

arxiv: v1 [cs.mm] 18 Jan 2016

arxiv: v1 [cs.mm] 18 Jan 2016 Lossless Intra Coding in with 3-ta Filters Saeed R. Alvar a, Fatih Kamisli a a Deartment of Electrical and Electronics Engineering, Middle East Technical University, Turkey arxiv:1601.04473v1 [cs.mm] 18

More information

AN early generation of unstructured P2P systems is

AN early generation of unstructured P2P systems is 1078 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 11, NOVEMBER 2005 Dynamic Layer Management in Suereer Architectures Li Xiao, Member, IEEE, Zhenyun Zhuang, and Yunhao Liu, Member,

More information

Hardware-Accelerated Formal Verification

Hardware-Accelerated Formal Verification Hardare-Accelerated Formal Verification Hiroaki Yoshida, Satoshi Morishita 3 Masahiro Fujita,. VLSI Design and Education Center (VDEC), University of Tokyo. CREST, Jaan Science and Technology Agency 3.

More information

Introduction to Parallel Algorithms

Introduction to Parallel Algorithms CS 1762 Fall, 2011 1 Introduction to Parallel Algorithms Introduction to Parallel Algorithms ECE 1762 Algorithms and Data Structures Fall Semester, 2011 1 Preliminaries Since the early 1990s, there has

More information

Texture Mapping with Vector Graphics: A Nested Mipmapping Solution

Texture Mapping with Vector Graphics: A Nested Mipmapping Solution Texture Maing with Vector Grahics: A Nested Mimaing Solution Wei Zhang Yonggao Yang Song Xing Det. of Comuter Science Det. of Comuter Science Det. of Information Systems Prairie View A&M University Prairie

More information

Autonomic Physical Database Design - From Indexing to Multidimensional Clustering

Autonomic Physical Database Design - From Indexing to Multidimensional Clustering Autonomic Physical Database Design - From Indexing to Multidimensional Clustering Stehan Baumann, Kai-Uwe Sattler Databases and Information Systems Grou Technische Universität Ilmenau, Ilmenau, Germany

More information

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets An imroved algorithm for Hausdorff Voronoi diagram for non-crossing sets Frank Dehne, Anil Maheshwari and Ryan Taylor May 26, 2006 Abstract We resent an imroved algorithm for building a Hausdorff Voronoi

More information

APPLICATION OF PARTICLE FILTERS TO MAP-MATCHING ALGORITHM

APPLICATION OF PARTICLE FILTERS TO MAP-MATCHING ALGORITHM APPLICATION OF PARTICLE FILTERS TO MAP-MATCHING ALGORITHM Pavel Davidson 1, Jussi Collin 2, and Jarmo Taala 3 Deartment of Comuter Systems, Tamere University of Technology, Finland e-mail: avel.davidson@tut.fi

More information

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level,

[9] J. J. Dongarra, R. Hempel, A. J. G. Hey, and D. W. Walker, \A Proposal for a User-Level, [9] J. J. Dongarra, R. Hemel, A. J. G. Hey, and D. W. Walker, \A Proosal for a User-Level, Message Passing Interface in a Distributed-Memory Environment," Tech. Re. TM-3, Oak Ridge National Laboratory,

More information

MATHEMATICAL MODELING OF COMPLEX MULTI-COMPONENT MOVEMENTS AND OPTICAL METHOD OF MEASUREMENT

MATHEMATICAL MODELING OF COMPLEX MULTI-COMPONENT MOVEMENTS AND OPTICAL METHOD OF MEASUREMENT MATHEMATICAL MODELING OF COMPLE MULTI-COMPONENT MOVEMENTS AND OPTICAL METHOD OF MEASUREMENT V.N. Nesterov JSC Samara Electromechanical Plant, Samara, Russia Abstract. The rovisions of the concet of a multi-comonent

More information

CMSC 425: Lecture 16 Motion Planning: Basic Concepts

CMSC 425: Lecture 16 Motion Planning: Basic Concepts : Lecture 16 Motion lanning: Basic Concets eading: Today s material comes from various sources, including AI Game rogramming Wisdom 2 by S. abin and lanning Algorithms by S. M. LaValle (Chats. 4 and 5).

More information

Theoretical Analysis of Graphcut Textures

Theoretical Analysis of Graphcut Textures Theoretical Analysis o Grahcut Textures Xuejie Qin Yee-Hong Yang {xu yang}@cs.ualberta.ca Deartment o omuting Science University o Alberta Abstract Since the aer was ublished in SIGGRAPH 2003 the grahcut

More information

Skip List Based Authenticated Data Structure in DAS Paradigm

Skip List Based Authenticated Data Structure in DAS Paradigm 009 Eighth International Conference on Grid and Cooerative Comuting Ski List Based Authenticated Data Structure in DAS Paradigm Jieing Wang,, Xiaoyong Du,. Key Laboratory of Data Engineering and Knowledge

More information

521493S Computer Graphics Exercise 3 (Chapters 6-8)

521493S Computer Graphics Exercise 3 (Chapters 6-8) 521493S Comuter Grahics Exercise 3 (Chaters 6-8) 1 Most grahics systems and APIs use the simle lighting and reflection models that we introduced for olygon rendering Describe the ways in which each of

More information

Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Spanning Trees 1

Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Spanning Trees 1 Multicast in Wormhole-Switched Torus Networks using Edge-Disjoint Sanning Trees 1 Honge Wang y and Douglas M. Blough z y Myricom Inc., 325 N. Santa Anita Ave., Arcadia, CA 916, z School of Electrical and

More information

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY

AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY Andrew Lam 1, Steven J.E. Wilton 1, Phili Leong 2, Wayne Luk 3 1 Elec. and Com. Engineering 2 Comuter Science

More information

Experimental Comparison of Shortest Path Approaches for Timetable Information

Experimental Comparison of Shortest Path Approaches for Timetable Information Exerimental Comarison of Shortest Path roaches for Timetable Information Evangelia Pyrga Frank Schulz Dorothea Wagner Christos Zaroliagis bstract We consider two aroaches that model timetable information

More information

TD C. Space filling designs in R

TD C. Space filling designs in R TD C Sace filling designs in R 8/05/0 Tyical engineering ractice : One-At-a-Time (OAT design X P P3 P X Main remarks : OAT brings some information, but otentially wrong Eloration is oor : on monotonicity?

More information

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics

Lecture 2: Fixed-Radius Near Neighbors and Geometric Basics structure arises in many alications of geometry. The dual structure, called a Delaunay triangulation also has many interesting roerties. Figure 3: Voronoi diagram and Delaunay triangulation. Search: Geometric

More information

Face Recognition Based on Wavelet Transform and Adaptive Local Binary Pattern

Face Recognition Based on Wavelet Transform and Adaptive Local Binary Pattern Face Recognition Based on Wavelet Transform and Adative Local Binary Pattern Abdallah Mohamed 1,2, and Roman Yamolskiy 1 1 Comuter Engineering and Comuter Science, University of Louisville, Louisville,

More information

Continuous Visible k Nearest Neighbor Query on Moving Objects

Continuous Visible k Nearest Neighbor Query on Moving Objects Continuous Visible k Nearest Neighbor Query on Moving Objects Yaniu Wang a, Rui Zhang b, Chuanfei Xu a, Jianzhong Qi b, Yu Gu a, Ge Yu a, a Deartment of Comuter Software and Theory, Northeastern University,

More information

MULTIPLE SENSOR TRACKING IN A SENSE & AVOID CONTEXT

MULTIPLE SENSOR TRACKING IN A SENSE & AVOID CONTEXT 7 TH INTERNATIONAL CONGRESS OF THE AERONAUTICAL SCIENCES MULTILE SENSOR TRACKING IN A SENSE & AVOID CONTET M. Rousseau*, L. Ratton*, T. Fournet* *THALES, Defense Mission System Elancourt Keywords: Sense&Avoid,

More information

in Distributed Systems Department of Computer Science, Keio University into four forms according to asynchrony and real-time properties.

in Distributed Systems Department of Computer Science, Keio University into four forms according to asynchrony and real-time properties. Asynchrony and Real-Time in Distributed Systems Mario Tokoro? and Ichiro Satoh?? Deartment of Comuter Science, Keio University 3-14-1, Hiyoshi, Kohoku-ku, Yokohama, 223, Jaan Tel: +81-45-56-115 Fax: +81-45-56-1151

More information

AUTOMATIC 3D SURFACE RECONSTRUCTION BY COMBINING STEREOVISION WITH THE SLIT-SCANNER APPROACH

AUTOMATIC 3D SURFACE RECONSTRUCTION BY COMBINING STEREOVISION WITH THE SLIT-SCANNER APPROACH AUTOMATIC 3D SURFACE RECONSTRUCTION BY COMBINING STEREOVISION WITH THE SLIT-SCANNER APPROACH A. Prokos 1, G. Karras 1, E. Petsa 2 1 Deartment of Surveying, National Technical University of Athens (NTUA),

More information

Detection of Occluded Face Image using Mean Based Weight Matrix and Support Vector Machine

Detection of Occluded Face Image using Mean Based Weight Matrix and Support Vector Machine Journal of Comuter Science 8 (7): 1184-1190, 2012 ISSN 1549-3636 2012 Science Publications Detection of Occluded Face Image using Mean Based Weight Matrix and Suort Vector Machine 1 G. Nirmala Priya and

More information

Robust Motion Estimation for Video Sequences Based on Phase-Only Correlation

Robust Motion Estimation for Video Sequences Based on Phase-Only Correlation Robust Motion Estimation for Video Sequences Based on Phase-Only Correlation Loy Hui Chien and Takafumi Aoki Graduate School of Information Sciences Tohoku University Aoba-yama 5, Sendai, 98-8579, Jaan

More information

A Texture Based Matching Approach for Automated Assembly of Puzzles

A Texture Based Matching Approach for Automated Assembly of Puzzles A Texture Based Matching Aroach for Automated Assembly of Puzzles Mahmut amil Saırolu 1, Aytül Erçil Sabancı University 1 msagiroglu@su.sabanciuniv.edu, aytulercil@sabanciuniv.edu Abstract The uzzle assembly

More information

Submission. Verifying Properties Using Sequential ATPG

Submission. Verifying Properties Using Sequential ATPG Verifying Proerties Using Sequential ATPG Jacob A. Abraham and Vivekananda M. Vedula Comuter Engineering Research Center The University of Texas at Austin Austin, TX 78712 jaa, vivek @cerc.utexas.edu Daniel

More information

12) United States Patent 10) Patent No.: US 6,321,328 B1

12) United States Patent 10) Patent No.: US 6,321,328 B1 USOO6321328B1 12) United States Patent 10) Patent No.: 9 9 Kar et al. (45) Date of Patent: Nov. 20, 2001 (54) PROCESSOR HAVING DATA FOR 5,961,615 10/1999 Zaid... 710/54 SPECULATIVE LOADS 6,006,317 * 12/1999

More information