A NOVEL METHOD FOR THE EVALUATION OF BOOLEAN QUERY EFFECTIVENESS ACROSS A WIDE OPERATIONAL RANGE


Eero Sormunen
Department of Information Studies, University of Tampere
P.O. Box 607, FIN 330 Tampere, Finland
E-mail: eero.sormunen@uta.fi

ABSTRACT
Traditional methods for the system-oriented evaluation of Boolean IR systems suffer from validity and reliability problems. Laboratory-based research neglects the searcher and studies suboptimal queries. Research on operational systems fails to make a distinction between searcher performance and system performance. Nor is this approach capable of measuring performance at standard points of operation (e.g. across recall levels R0.1-R1.0). A new laboratory-based evaluation method for Boolean IR systems is proposed. It is based on a controlled formulation of inclusive query plans, on an automatic conversion of query plans into elementary queries, and on combining elementary queries into optimal queries at standard points of operation. Major results of a large case experiment are reported. The validity, reliability, and efficiency of the method are considered in the light of empirical and analytical test data.

Keywords: evaluation (general), structured queries, testing methodology, test collections.

1. INTRODUCTION
The mainstream of evaluative IR research has followed the Cranfield paradigm. The major focus has been on the best match IR models, see e.g. [2, 23]. The low interest in studying the Boolean IR model can be seen in the low volume of research output (see e.g. [8] and other TREC reports), and also in the slow development of system-oriented evaluation methods for the Boolean IR model. Research on operational systems has focused on Boolean IR systems, but its contribution to the development of methods has been very slight [3, 28]. Research within the Cranfield paradigm has shared a very critical attitude towards the Boolean IR model [7].
The studies of Salton [2] and Turtle [30] are examples of attempts to show empirically the overall superiority of the best match IR models over the Boolean IR model. The results of some recent comparisons have suggested that studying the overall superiority of one model over the other may be a naive approach [, 20]. Boolean queries seem to perform better in some situations, and best match queries in others. It may be more reasonable to focus on studying the performance of different IR models under changing operational constraints. New methods are needed to draw a more detailed picture of query effectiveness in different IR models.

1.1 Methodological Problems in Boolean IR Experiments
The Boolean IR model has three features that cause methodological problems for experimental research [3]:
1. The formulation of Boolean queries requires a trained person to translate the user request into a query.
2. The searcher has very little control over the size of the output produced by a particular query.
3. The Boolean IR model does not support ranking of documents in order of decreasing probability of relevance.

The necessity to use a human expert in query formulation is a potential source of validity and reliability problems. It is very difficult to separate the effects of a technical IR system from those of a human searcher. For instance, in the well-known STAIRS study, the searchers had a predefined goal to locate at least 75 per cent of all relevant documents. It turned out that fewer than 20 per cent of the relevant documents were found. On the other hand, the average precision of the test queries was as high as 79 per cent [3]. The searchers were obviously formulating high-precision queries although they were asked to work towards high recall.

The latter two features of the Boolean IR model (little control over the output size, no ranking) cause problems in measuring performance at the standard points of operation (SPO, e.g.
at fixed recall levels or document cut-off values). Typically, only one query (from an arbitrary operational level) is formulated per search request. Performance is measured using single recall/precision values. Recall and precision are averaged separately over all requests. As Lancaster has shown [5], the distribution of recall and precision values for a large set of requests is very wide. It is very difficult to see how the averaged recall and precision values should or could be interpreted, since averaging mixes queries from different operational levels.

The coordination level method developed for the Cranfield 2 project is a traditional approach that omits the trained searcher from query formulation, ranks the output, and measures the wide range performance of a Boolean system [5]. Unfortunately, replacing the cognitive effort of a searcher by a mechanical query term selection procedure leads to a

fundamental validity problem. Queries exploit the Boolean IR model in a suboptimal way.

Facet A [Information retrieval]:
(information retrieval OR online systems OR online(w)search?)
AND
Facet B [Search process]:
(tactic? OR heuristic? OR trial(w)error OR expert systems OR artificial intelligence OR attitudes/de OR behavior?/de,id,ti OR cognitive/de)

Figure 1. An example of a high recall oriented query used by Harter [0] to illustrate the facet-based query planning approach.

1.2 Harter's Idea: the Most Rational Path
Harter [0] introduced an idea for an evaluation method based on the notion of elementary queries (EQ). Harter used a single search topic to illustrate how the method could be applied. He designed a high recall oriented query plan (see Fig 1). Harter applied the building block search strategy, which is quite commonly used by professional searchers [6, 9, 2, 6]. The major steps of the building blocks strategy are:
1) Identify major facets and their logical relationships with one another.
2) Identify query terms that represent each facet: words, phrases, etc.
3) Combine the query terms of a facet by disjunction (OR operation).
4) Combine the facets by conjunction or negation (AND or ANDNOT operation) [9].

The notion of facet is important in query planning. A facet is a concept that is identified from, and defines one exclusive aspect of, a search topic. In step 2, a typical goal is to discover all plausible query terms appropriate for representing the selected facet.

Next, Harter retrieved all documents matching the conjunction of facets A and B, each represented by the disjunction of all selected query terms, and assessed the relevance of the resulting 37 documents. In addition, all conjunctions of two query terms (called elementary queries) from the query plan representing facets A and B in Fig. 1 were composed and executed. A sample of the 24 elementary queries and a summary of their retrieval results are presented in Table 1.
Harter [0] demonstrated the procedure of constructing optimal queries (called the most rational path). An estimate for maximum precision across the whole relative recall range was determined by applying a simple incremental algorithm:
1. To create the initial optimal query, choose the EQ that achieves the highest precision.
2. Create in turn the disjunction of each of the remaining EQs with the current optimal query. Select the disjunction with the EQ that maximizes precision. The disjunction of the current optimal query and the selected EQ creates a new optimal query.
3. Repeat step 2 until all elementary queries have been exhausted.

Table 1. Retrieval results for the 24 elementary queries in the case search by Harter (1990).

EQ #   Elementary query                           # of Docs   # of Rel Docs   Precision
s1     information retrieval AND tactic?          8
s2     information retrieval AND heuristic?       7
s3     information retrieval AND trial(w)error
...
s22    online(w)search? AND attitudes/de          9
s23    online(w)search? AND behavior?/de,id,ti    8
s24    online(w)search? AND cognitive/de          0
s25    s1-s24/or                                  37

Precision and recall values for the 24 elementary queries and the respective curve for the optimal queries are presented in Fig 2. Harter never reported full-scale evaluation results based on the idea of the most rational path except this single example. Nor did he develop operational guidelines for a fluent use of the method in practice.

[Plot: precision (y-axis) against recall (x-axis); series: most rational path (s3; s3 or s8; s3 or s8 or s24; ...) and elementary queries]
Figure 2. Recall and precision of the 24 elementary queries and the most rational path in the case search presented by Harter [0].

Footnote 1: Harter actually talked about elementary postings sets. This is very confusing since it applies set-based terminology to address queries as logical statements.
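The three-step incremental algorithm above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `most_rational_path`, the toy document identifiers, and the alphabetical tie-breaking rule are hypothetical and do not come from Harter's description, which was never formalized.

```python
from typing import Dict, List, Set, Tuple


def most_rational_path(
    eq_results: Dict[str, Set[str]], relevant: Set[str]
) -> List[Tuple[str, float, float]]:
    """Greedy path: after each step, report (EQ added, recall, precision)."""
    retrieved: Set[str] = set()      # result set of the current optimal query
    remaining = dict(eq_results)     # EQs not yet merged into the optimal query
    path: List[Tuple[str, float, float]] = []

    def precision_if_added(name: str) -> float:
        # Precision of the disjunction (current optimal query OR this EQ).
        union = retrieved | remaining[name]
        return len(union & relevant) / len(union) if union else 0.0

    while remaining:
        # Steps 1 and 2: pick the EQ whose disjunction maximizes precision.
        # Ties are broken alphabetically (an assumption of this sketch).
        best = max(sorted(remaining), key=precision_if_added)
        retrieved |= remaining.pop(best)
        hits = len(retrieved & relevant)
        recall = hits / len(relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        path.append((best, recall, precision))
    return path  # Step 3: the loop ends when all EQs are exhausted.


# Hypothetical toy data: three EQ result sets and a set of relevant documents.
toy_eqs = {
    "s1": {"d1", "d2", "d3", "d4"},
    "s2": {"d1", "d5"},
    "s3": {"d2", "d6", "d7", "d8"},
}
toy_relevant = {"d1", "d2", "d5"}
toy_path = most_rational_path(toy_eqs, toy_relevant)
```

With this toy data the path starts from the highest-precision EQ (`s2`) and precision can only be traded for recall as further EQs are added, which is the shape of the curve in Fig 2.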

1.3 Research Goals
The main goal of the study was to create an evaluation method for measuring the performance of Boolean queries across a wide operational range by elaborating the ideas introduced by Harter [0]. The method is presented and argued using the framework suggested by Newell, M={domain, procedure, justification} [9]:
1. The domain of the method specifies the appropriate application area for the method.
2. The procedure of the method consists of the ordered set of operations required in the proper use of the method. In particular, two major operations unique to the procedure need to be elaborated:
   a) Query formulation. How is the set of elementary queries composed from a search topic?
   b) Query optimization. What algorithm should be used for combining the elementary queries to find the optimal query for different operational levels?
3. The justification of the method. The appropriateness, validity, reliability and efficiency of the method within the specified domain must be justified.

The structure of this paper is as follows. First, some basic concepts and the procedure of the method are introduced. Second, a case experiment is briefly reported to illustrate the domain and the use of the proposed method in a concrete experimental setting. Third, the other justification issues of the method (validity, reliability and efficiency) are discussed. Several empirical tests were carried out to assess the potential validity and reliability problems in applying the method.

2. OUTLINE FOR THE METHOD
The aim of this section is to introduce a sound theoretical framework for the procedure of the method and to formulate operational guidelines for exercising it.

2.1 Query Structures and Query Tuning Spaces
IR models address the issue of comparing a query as a representation of a request for information with representations of texts.
The Boolean IR model supports rich query structures, a (simple) binary representation of texts, and an exact match technique for comparing queries and text representations [2]. A Boolean query consists of query terms and operators. Query terms are usually words, phrases, or other character strings typical of natural language texts. The Boolean query structures are based on three logic connectives, conjunction (∧), disjunction (∨) and negation (¬), and on the use of parentheses. A query expresses the combination of terms that retrieved documents have to contain. If we want to generate all possible Boolean queries for a particular request, we have to identify all query terms that might be useful, and to generate all logically reasonable query structures.

Facet, as defined in Section 1.2, is a very useful notion in representing relationships between Boolean query structures and the search topic. Terms within a facet are naturally combined by disjunctions. Facets themselves present the exclusive aspects of desired documents, and are naturally combined by Boolean conjunction or negation [9]. Expert searchers tend to formulate query plans applying the notion of facet [9, 6]. Resulting query plans are usually in a standard form, the conjunctive normal form (CNF) (for a formal definition, see []).

The structure of a Boolean query can be easily characterized in CNF queries: Query exhaustivity (Exh) is the number of facets that are exploited. Query extent (QE) characterizes the broadness of a query, and can be measured, e.g., as the average number of query terms per facet. For instance, in the query plan designed by Harter, Exh=2 and QE=5.5 (see Fig. 1). The changes made in query exhaustivity and extent to achieve appropriate retrieval goals are called here query tuning. The range within which query exhaustivity and query extent can change sets the boundaries for query tuning.
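The two structural measures can be checked against Harter's query plan with a small sketch. The dictionary layout and the function names `exhaustivity` and `extent` are assumptions made for illustration; the terms themselves are those shown in Fig. 1.

```python
from typing import Dict, List


def exhaustivity(plan: Dict[str, List[str]]) -> int:
    """Query exhaustivity (Exh): the number of facets exploited."""
    return len(plan)


def extent(plan: Dict[str, List[str]]) -> float:
    """Query extent (QE): average number of query terms per facet."""
    return sum(len(terms) for terms in plan.values()) / len(plan)


# Harter's query plan from Fig. 1, one entry per facet.
harter_plan = {
    "A [Information retrieval]": [
        "information retrieval", "online systems", "online(w)search?",
    ],
    "B [Search process]": [
        "tactic?", "heuristic?", "trial(w)error", "expert systems",
        "artificial intelligence", "attitudes/de", "behavior?/de,id,ti",
        "cognitive/de",
    ],
}

harter_exh = exhaustivity(harter_plan)  # 2 facets
harter_qe = extent(harter_plan)         # (3 + 8) / 2 = 5.5
```

Facet A contributes three terms and facet B eight, which reproduces the values Exh=2 and QE=5.5 quoted in the text.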
The set of all elementary queries and their feasible combinations composed at all available exhaustivity and extent levels form the query tuning space. In the example by Harter (Fig 1), seven different disjunctions of query terms can be generated from facet A (= 2^3 - 1) and 255 from facet B (= 2^8 - 1). The total number of possible EQ combinations is then 7 x 255 = 1,785 at Exh=2. In addition, 7 and 255 EQ combinations can be formed at Exh=1 from facets A and B, respectively. Thus, the total number of EQ combinations creating the query tuning space across exhaustivity levels 1 and 2 for the sample query plan is 1,785 + 7 + 255 = 2,047.

2.2 The Procedure of the Method
The procedure of the proposed method consists of eight operations at three stages:

STAGE I. INCLUSIVE QUERY PLANNING
1. Design inclusive query plans. Experienced searchers formulate inclusive query plans for each given search topic. This yields a comprehensive representation of the query tuning space available for a search topic.
2. Execute extensive queries. The goal of extensive queries is to gain reliable recall base estimates.
3. Determine the order of facets. The facet order of inclusive query plans is determined by ranking the facets according to their measured recall power, i.e. their capability to retrieve relevant documents.

STAGE II. QUERY OPTIMISATION
4. Generate the set of elementary queries (EQ). Inclusive query plans in the conjunctive normal form (CNF) at different exhaustivity levels are transformed into the disjunctive normal form (DNF), where the elementary conjunctions create the set of elementary queries. All elementary queries are executed to find the set of relevant and non-relevant documents associated with each EQ.
5. Select standard points of operation (SPO). Both fixed recall levels R0.1, ..., R1.0 and fixed document cut-off values, e.g. DCV2, DCV5, ..., DCV500, may be used as SPOs.
6. Optimize queries. An optimisation algorithm is used to compose the combinations of EQs performing optimally at each selected SPO.
STAGE III. EVALUATION OF RESULTS
7. Measure precision at each SPO. Precision can be used as a performance measure. Precision is averaged over all search topics at each SPO.

8. Analyse the characteristics of optimal queries. The optimal queries are analysed to explain the changes in the performance of an IR system.

The above steps describe the ordered set of operations constituting the procedure of the proposed method. Inclusive query planning (steps 1-3) and the search for the optimal set of elementary queries (steps 4-6) are the focus of this study.

2.3 Inclusive Query Planning
The techniques of query planning are routinely taught to novice searchers [9, 6]. A common feature of different query planning techniques is that they emphasize the analysis and identification of searchable facets, and the representation of each facet as an exhaustive disjunction of query terms. The goal of inclusive query planning is similar, but the thoroughness of the identification task is stressed even more. In inclusive query planning, the goal is to identify
1. all searchable facets of a search topic, and
2. all plausible query terms for each facet.

A major doubt in using human experts to design queries is probably associated with the reliability of experimental designs. For instance, the average inter-searcher overlap in the selection of query terms (measured character-by-character) is usually around 30 per cent [25]. Fortunately, the situation is not so bad when facets are considered. For instance, in a study by Iivonen [2], the average concept-consistency rose to 88 per cent, and experienced searchers were even more consistent. This indicates that expert searchers are able to identify the facets of a topic consistently although the overlap of queries at string level may be low.

The identification of all plausible query terms for each identified facet is another task requiring searching expertise. Basically, the comprehensiveness of facet representations is mostly a question of how much effort is used to identify potential query terms.
The query designer is freed from the need to make the compromised query term selections typical of practical search situations. The optimization operation will automatically reject ill-behaving query terms. The process can be improved by appropriate tools (dictionaries, thesauri, browsing tools for database indexes, etc.).

The final step is to decide the order of facets in the query plan. In the case of a laboratory test collection, full relevance data (or at least its justified estimate) is available. The facets of an inclusive query plan can be ranked in descending order of recall. The disjunction of all query terms identified for a facet is used to measure recall values.

2.4 Search for the Optimal Set of EQs
The size of the query tuning space increases exponentially as a function of the number of EQs. We are obviously facing the risk of combinatorial explosion since we do not know the upper limit of query exhaustivity and, especially, query extent in inclusive query plans. Solving the optimization problem by blind search algorithms could lead to unmanageably long running times. The search for the optimal set of EQs is an NP-hard problem. Harter [0] introduced a simple heuristic algorithm but he did not define it formally.

Query optimization resembles a traditional integer programming case called the Knapsack Problem. The problem is to fill a container with a set of items so that the value of the cargo is maximized and the weight limit for the cargo is not exceeded [4]. The special case where each item is selected once only (like EQs) is called the 0-1 Knapsack Problem. Efficient approximation algorithms have been developed to find a feasible lower bound for the optimum [7].
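To make the size of the problem concrete, the sketch below generates elementary queries from a CNF query plan (step 4 of the procedure) and counts the query tuning space for Harter's plan. The function names are hypothetical, and the sketch assumes that exhaustivity level k uses the first k facets of an already ordered plan; the counting function instead sums over every non-empty subset of facets, which is how the 2,047 figure for Harter's example arises.

```python
from itertools import combinations, product
from math import prod
from typing import List


def elementary_queries(facets: List[List[str]], exh: int) -> List[str]:
    """CNF to DNF: one conjunction per combination of one term from each
    of the first `exh` facets (facet order is assumed to be given)."""
    return [" AND ".join(terms) for terms in product(*facets[:exh])]


def tuning_space_size(facets: List[List[str]]) -> int:
    """Count all CNF queries: for every non-empty subset of facets, each
    facet contributes (2^k - 1) non-empty disjunctions of its k terms."""
    total = 0
    for size in range(1, len(facets) + 1):
        for subset in combinations(facets, size):
            total += prod(2 ** len(terms) - 1 for terms in subset)
    return total


# Harter's facets A and B as listed in Fig. 1.
harter_facets = [
    ["information retrieval", "online systems", "online(w)search?"],
    ["tactic?", "heuristic?", "trial(w)error", "expert systems",
     "artificial intelligence", "attitudes/de", "behavior?/de,id,ti",
     "cognitive/de"],
]

eqs_exh2 = elementary_queries(harter_facets, 2)  # 3 x 8 = 24 conjunctions
space = tuning_space_size(harter_facets)         # 7 + 255 + 1,785 = 2,047
```

Adding a single term to facet B would roughly double the count, which illustrates why exhaustive search over the tuning space quickly becomes impractical.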
The problem of finding the optimal query from the query tuning space can be formally defined by applying the definitions of the 0-1 Knapsack Problem as follows. Select a set of EQs so as to

\[
\text{maximise } z = \sum_{i=1}^{n} r_i x_i
\quad \text{subject to} \quad
\sum_{i=1}^{n} n_i x_i \le DCV_j
\]

where
\[
x_i = \begin{cases} 1, & \text{if } eq_i \text{ is selected} \\ 0, & \text{otherwise} \end{cases}
\]
r_i = no. of relevant documents retrieved by eq_i
n_i = no. of documents retrieved by eq_i
DCV_j = selected document cut-off value

The above definition of the optimization problem is in its maximization version. The number of relevant documents is maximized while the total number of retrieved documents is restricted by the given DCV_j. In the minimization version of the problem, the goal is to minimize the total number of documents while requiring that the number of relevant documents exceeds some minimum value (a fixed recall level).

Unfortunately, standard algorithms designed for physical objects would not work properly with EQs. Different EQs tend to overlap and retrieve at least some joint documents. This means that, in a disjunction of elementary queries, the profit r_i and the weight n_i of the elementary query eq_i have dynamically changing effective values that depend on the EQs selected earlier. The effect of overlap in a combination of several query sets is hard to predict.

A simple heuristic procedure for an incremental construction of the optimal queries was designed applying the notion of efficiency list [7]. The maximization version of the algorithm contains seven steps:
1. Remove all elementary queries eq_i a) retrieving more documents than the upper limit for the number of documents (i.e. n_i > residual document cut-off value DCV', starting from DCV' = DCV_j) or b) retrieving no relevant documents (r_i = 0).
2. Stop, if no elementary queries eq_i are available.
3. Calculate the efficiency list using precision values r_i/n_i for the remaining m elementary queries and sort the elementary queries in order of descending efficiency.
In the case of equal values, use the number of relevant documents (r_i) retrieved as the second sorting criterion.
4. Move eq_1 at the top of the efficiency list to the optimal query.

5. Remove all documents retrieved by eq_1 from the result sets of the remaining elementary queries eq_2, ..., eq_m.
6. Calculate the new value for free space DCV'.
7. Continue from step one.

The basic algorithm favors narrowly formulated EQs retrieving a few relevant documents with high precision at the expense of broader queries retrieving many relevant documents with medium precision. The problem can be reduced by running the optimization in an alternative mode differing only in step four of the first iteration round: the eq_i retrieving the largest set of relevant documents is selected from the efficiency list instead of eq_1. The alternative mode is called the largest first optimization and the basic mode the precision first optimization.

3. A CASE EXPERIMENT
The goal of the case experiment was to elucidate the potential uses of the proposed method, to clarify the types of research questions that can be effectively solved by the method, and to explicate the operational pragmatics of the method.

3.1 Research Questions
The case experiment focused on the mechanism of falling effectiveness of Boolean queries in free-text searching of large full-text databases. The work was inspired by the debate concerning the results of the STAIRS study [3, 22]. The goal was to draw a more detailed picture of system performance and optimal query structures in search situations typical of large databases. Assuming an ideally performing searcher, the main question was: What is the difference in maximum performance of Boolean queries between a small database and two types of large databases? The large & dense database contained a larger volume of documents than the small database but the density of relevant documents (generality) was the same. In the large & sparse database, both the volume of documents was higher and the density of relevant documents was lower than in the small database.
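Returning briefly to the seven-step maximization heuristic of Section 2.4, the following sketch shows one way its steps and the two optimization modes could be realized. It is not the program used in the case experiment: the function name `optimise_for_dcv`, the toy data and the alphabetical ordering of exact ties are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def optimise_for_dcv(
    eq_results: Dict[str, Set[str]],
    relevant: Set[str],
    dcv: int,
    largest_first: bool = False,
) -> List[str]:
    """Return the names of the EQs selected for the optimal query at `dcv`."""
    remaining = {name: set(docs) for name, docs in eq_results.items()}
    selected: List[str] = []
    free = dcv            # residual cut-off value DCV'
    first_round = True
    while True:
        # Step 1: drop EQs that no longer fit or retrieve nothing relevant.
        candidates = {
            name: docs for name, docs in remaining.items()
            if 0 < len(docs) <= free and docs & relevant
        }
        # Step 2: stop when no EQ is available.
        if not candidates:
            return selected

        def rel(name: str) -> int:
            return len(candidates[name] & relevant)

        # Step 3: efficiency list sorted by precision, then by r_i.
        efficiency = sorted(
            sorted(candidates),
            key=lambda n: (rel(n) / len(candidates[n]), rel(n)),
            reverse=True,
        )
        # Step 4: take the top EQ; in largest-first mode the first round
        # takes the EQ with the most relevant documents instead.
        if largest_first and first_round:
            top = max(efficiency, key=rel)
        else:
            top = efficiency[0]
        first_round = False
        chosen = candidates[top]
        selected.append(top)
        # Step 5: remove the chosen documents from the other result sets,
        # so r_i and n_i become effective (overlap-free) values.
        remaining = {n: d - chosen for n, d in candidates.items() if n != top}
        # Step 6: update the free space; step 7: loop back to step 1.
        free -= len(chosen)


# Hypothetical toy data.
toy_eqs = {
    "e1": {"d1", "d2"},
    "e2": {"d1", "d3", "d4", "d5"},
    "e3": {"d6", "d7", "d8"},
    "e4": {"d9"},
}
toy_relevant = {"d1", "d2", "d3", "d4", "d6"}
precision_first = optimise_for_dcv(toy_eqs, toy_relevant, dcv=5)
largest_first = optimise_for_dcv(toy_eqs, toy_relevant, dcv=5, largest_first=True)
```

In the toy data the precision first mode starts from the narrow, fully precise `e1`, while the largest first mode starts from the broader `e2`, which is the difference between the two modes described in Section 2.4.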
Twelve hypotheses were formulated concerning effectiveness, exhaustivity and proportional query extent of queries in large databases. For details, see [26].

3.2 Data and Methods

3.2.1 Optimization Algorithm
The optimization algorithm described in Section 2.4 was programmed in C for Unix. Both a maximization version exploiting a standard set of document cut-off values (DCV2, DCV5, ..., DCV500) and a minimization version exploiting fixed recall levels (R0.1-R1.0) were implemented. At each SPO, the iteration round (called optimization lap) was executed ten times, starting each round by selecting a different top EQ from the efficiency list: five laps in the largest first mode, and five in the precision first mode. The alternative results achieved by the algorithm at a particular SPO in different optimization laps were sorted to find the most optimal queries for further analysis.

3.2.2 Test Collection
The Finnish Full-Text Test Collection developed at the University of Tampere was used in the case experiment [4]. The test database contains about 54,000 newspaper articles from three Finnish newspapers. A set of 35 search topics is available, including verbal topic descriptions and relevance assessments. The test database is implemented for the TRIP retrieval system (see footnote 2). The test database played the role of the large & dense database. The other databases, the small database and the large & sparse database, were created through sampling from EQ result sets. The large & sparse database was created by deleting about 80 % of the relevant documents, and the small database by deleting about 80 % of all documents of the EQ result sets. Thus, the EQ result sets for the small database contained the same relevant documents as those for the large & sparse database. Query optimization was done separately on these three EQ data sets.

3.2.3 Inclusive Query Plans
The initial versions of inclusive query plans were designed by an experienced search analyst working for three months on the project.
Query planning was an interactive process based on thorough test queries and on the use of vocabulary sources. Later parallel experiments (probabilistic queries) revealed that the initial query plans failed to retrieve some relevant documents. These documents were analyzed, and some new query terms were added to represent the facets comprehensively. The final inclusive query plans were capable of retrieving 270 (99,3 %) out of the 278 known relevant documents at exhaustivity level one.

In total, inclusive query plans contained 34 facets. The average exhaustivity of query plans was 3.8, ranging from 2 to 5. The total number of query terms identified was 2,330 (67 per query plan and 8 per facet). The number of terms ranged from 23 to 69 per query plan, and from to 74 per facet. The wide variation in the number of query terms per facet characterizes the difference between specific concepts (e.g. named persons or organizations) and general concepts (e.g. domains or processes).

3.2.4 Data Collection and Analysis
Precision, query exhaustivity and query extent data were collected for the optimal queries at SPOs. The sensitivity of results to changes in search topic characteristics, such as the size of the recall base and the number of facets identified, was analyzed. The searchable expressions referring to query plan facets were also identified in all relevant documents of a sample of 8 test topics to find explanations for the observed performance differences. Statistical tests were applied to all major results.

3.3 Sample Results
Figures 3-5 summarize the comparisons between the small, large & dense, and large & sparse databases: average precision, exhaustivity and proportional extent (see footnote 3) of optimal queries at recall levels R0.1-R1.0.

Footnote 2: TRIP by TietoEnator, Inc.
Footnote 3: Proportional query extent (PQE) was measured only for high recall and high precision searching for reasons of research economy. PQE is the share of query terms actually used of the available terms in inclusive query plans (average over facets).

[Plot: precision (y-axis) against recall (x-axis); series: Small db, L&d db, L&s db]
Figure 3. Average precision at fixed recall levels in optimal queries for small, large&dense and large&sparse databases.

[Plot: exhaustivity (y-axis) against recall (x-axis); series: Small db, L&d db, L&s db]
Figure 4. Exhaustivity of high recall queries optimised for small, large&dense and large&sparse databases.

[Plot: proportional query extent (y-axis) against recall (x-axis); series: Small db, L&d db, L&s db]
Figure 5. Proportional query extent (PQE) of optimal queries in the small, large&dense, and large&sparse databases.

The case experiment revealed interesting performance characteristics of Boolean queries in large databases. The average precision across R0.1-R1.0 was about 3 % lower in the large & dense database (database size effect), and about 40 % lower in the large & sparse database (database size + density effect) than in the small database (see Fig 3). The average exhaustivity of optimal queries was higher in the large databases than in the small one, but the level of precision could not be maintained. Proportional query extent was highest in the large & dense database, suggesting that more query terms are needed per facet when a larger number of documents have to be retrieved.

[Plot: number of topics (y-axis) against query exhaustivity (x-axis); series: large recall base, small recall base]
Figure 6. The number of search topics where full recall can be achieved as a function of query exhaustivity in the small and large recall bases (8 topics in total).

A very interesting deviation was identified in the precision and exhaustivity curves at the highest recall levels. In the large & dense database, the precision and exhaustivity of optimal queries fell dramatically between R0.9 and R1.0. The results of the facet analysis of all relevant documents in a sample of 8 test topics clarified the role of the recall base size in falling effectiveness at R1.0. The more documents that need to be retrieved to achieve full recall, the more often relevant documents occur in which some query plan facets are expressed only implicitly. The results are presented in Fig 6.
For Exh=1, full recall was possible in all but one test topic for both recall bases. At higher exhaustivity levels, the number of test topics where full recall is possible fell much faster in the large recall base.

The above results are just examples from the case study findings to illustrate the potential uses of the proposed method. High precision searching was also studied by applying DCVs as standard points of operation. It turned out, for instance, that the database size alone does not induce efficiency problems at low DCVs. On the contrary, the highest precision was achieved in the large & dense database. It was also shown that earlier results indicating the superiority of proximity operators over the AND operator in high precision searching are invalid. Queries optimized separately for both operators show similar average performance. For details, see [26].

4. JUSTIFICATION OF THE METHOD
Evaluation methods should themselves be evaluated in regard to appropriateness, validity, reliability, and efficiency [24, 29]. The appropriateness of the method was verified in the case study by showing that new results could be gained. Validity, reliability, and efficiency are more complex issues to evaluate. The main concerns were directed at the unique operations: inclusive query planning and query optimization.

4.1 Facet Selection Test
Three subjects having good knowledge of text retrieval and indexing were asked to make a facet identification test using a sample of 4 test topics. The results showed that the exhaustivity of the inclusive query plans used in the case experiment was not biased downwards (enough exhaustivity tuning space). The test also verified earlier results that the consistency in the selection of query facets is high between search experts.

4.2 Facet Representation Test
The facet analysis of all relevant documents in the sample of 8 search topics showed that the original query designer had

missed or neglected about one third of the available expressions in the relevant documents. However, the effect of missed query terms was regarded as marginal since their occurrences in documents mostly overlapped with other expressions already covered by the query plan. The effect was shown to be much smaller than the effect of implicit expressions. In the interactive query optimization test (see next section), precision was observed to drop less than 4 %.

4.3 Interactive Query Optimization Test
The idea of the interactive query optimization test was to replace the automatic optimization operation by an expert searcher, and to compare the achieved performance levels as well as query structures. A special WWW-based tool, the IR Game [27], designed for rapid analysis of query results, was used in this test. When interfaced to a laboratory test collection, the tool offers immediate performance feedback at the level of individual queries in the form of recall-precision curves, and a visualization of actual query results. The searcher is able to study the effects of query changes in a convenient and effortless way.

An experienced searcher was recruited to run the interactive query optimization test. A group of three control searchers was used to test the overall capability of the test searcher. The test searcher worked for a period of 1.5 months trying to find optimal queries for the sample of 8 test topics for which the full data of facet analysis was available. In practice, the test searcher did not face any time constraints. The results showed that the algorithm performed better than or as well as the test searcher in 98 % out of the 98 test cases. This can be regarded as an advantageous result for a first version of a heuristic algorithm.

4.4 Efficiency of the Method
The investment in inclusive query planning was justified to be reasonable in the context of a test collection.
It was also shown that the running time of the optimization algorithm grows as O(n log n), and that it is manageable for all EQ sets of finite size.

5. CONCLUSIONS AND DISCUSSION

The main goal of this study was to design, demonstrate, and evaluate a new method for measuring the performance of Boolean queries across a wide operational range. Three unique characteristics of the method help to comprehend its potential:

1. Performance can be measured at any selected point across the whole operational range, and different standard points of operation (SPO) may be applied.
2. Queries under consideration estimate optimal performance at each SPO, and query structures are free to change within the defined query tuning space in search of the optimum.
3. The expertise of professional searchers can be brought into a system-oriented evaluation framework in a controlled way.

The domain of the method can be characterized by illustrating the kinds of research variables that can be appropriately studied by applying it. Query precision, exhaustivity, and extent are used as dependent variables, and the standard points of operation as the control variable. Independent variables may relate to:

1. documents (e.g. type, length, degree of relevance)
2. databases (e.g. size, density)
3. database indexes (e.g. type of indexing, linguistic normalization of words)
4. search topics (e.g. complexity, broadness, type)
5. matching operations (e.g. different operators).

The proposed method offers clear advantages over traditional evaluation methods. Because it is more accurate (performance is averaged at defined SPOs), it helps to acquire new information about the phenomena observed and to challenge present findings. The method is also economical in experiments where a complex query tuning space is studied: the query tuning space contains all potential candidates for optimal queries, but data are collected only on those queries that turn out to be optimal at a particular SPO.
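The idea of reading off optimal performance at standard points of operation can be illustrated with a minimal Python sketch. It is not the heuristic optimization algorithm of the method: it simply enumerates a small, already given set of candidate queries and, for each fixed recall level, keeps the one with the highest precision. The names Candidate, recall_precision, and best_at_spo, as well as the toy document ids, are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate query and the set of document ids it retrieves (hypothetical)."""
    name: str
    retrieved: frozenset


def recall_precision(retrieved, relevant):
    """Return (recall, precision) of a retrieved set against a recall base."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


def best_at_spo(candidates, relevant, recall_levels):
    """For each recall level, pick the candidate with the highest precision
    among those whose recall reaches that level (None if no candidate does).

    Assumption: brute-force enumeration over a small candidate list, used
    here in place of the heuristic optimization described in the paper.
    """
    scored = [(c, *recall_precision(c.retrieved, relevant)) for c in candidates]
    result = {}
    for level in recall_levels:
        eligible = [(c, r, p) for c, r, p in scored if r >= level - 1e-9]
        result[level] = max(eligible, key=lambda t: t[2], default=None)
    return result


if __name__ == "__main__":
    # Toy recall base and candidate queries (illustrative data only).
    relevant = frozenset({1, 2, 3, 4})
    candidates = [
        Candidate("narrow", frozenset({1, 2})),
        Candidate("medium", frozenset({1, 2, 3, 9})),
        Candidate("broad", frozenset({1, 2, 3, 4, 9, 10, 11, 12})),
    ]
    for level, best in best_at_spo(candidates, relevant, [0.25, 0.5, 0.75, 1.0]).items():
        if best is None:
            print(f"R{level}: no candidate reaches this recall level")
        else:
            query, recall, precision = best
            print(f"R{level}: {query.name} (recall={recall:.2f}, precision={precision:.2f})")
```

Only the winning query at each recall level is reported, which mirrors the point that data need to be collected only for queries that turn out to be optimal at a particular SPO. Averaging the winning precision values over topics at each level would then give one averaged performance figure per SPO.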
The proposed method yielded two major innovations: inclusive query planning and query optimization. The former is more universal, since it can be used in both Boolean and best-match experiments, see [14]. The query optimization operation in the proposed form is restricted to the Boolean IR model, since it presumes that the query results are distinct sets. The inclusive query planning idea is easier to exploit, since its outcome, the representation of the available query tuning space, can also be exploited in experiments on best-match IR systems.

Traditional test collections were provided with complete relevance data. Inclusive query plans are a similar data set that can be used in measuring the ultimate performance limits of different matching algorithms. Inclusive query plans also help in categorizing test topics according to their properties, e.g. complex vs. simple (exhaustivity tuning dimension) and broad vs. narrow (extent tuning dimension). This opens a way to create experimental settings that are more sensitive to situational factors, an issue that has been raised in Boolean/best-match comparisons [11, 20].

6. ACKNOWLEDGMENTS

I am grateful to my supervisor Kalervo Järvelin, and to the FIRE group: Heikki Keskustalo, Jaana Kekäläinen, and others.

7. REFERENCES

[1] Arnold, B.H. (1962). Logic and Boolean algebra. Englewood Cliffs: Prentice-Hall.
[2] Belkin, N.J. & Croft, W.B. (1987). Retrieval Techniques. In: Williams, M.E., Annual Review of Information Science and Technology 22(1), 109-145. New York: Elsevier & ASIS.
[3] Blair, D.C. & Maron, M.E. (1985). An evaluation of retrieval effectiveness for a full-text document retrieval system. Comm. of the ACM 28(3).
[4] Chvátal, V. (1983). Linear Programming. New York: W.H. Freeman.
[5] Cleverdon, C.W. (1967). The Cranfield tests on index language devices. Aslib Proceedings 19(6).

[6] Fidel, R. (1991). Searchers' Selection of Search Keys. Journal of the American Society for Information Science 42(7), 490-500, 501-514, 515-527.
[7] Frants, V.I., Shapiro, J., et al. (1999). Boolean Search: Current State and Perspectives. Journal of the American Society for Information Science 50(1).
[8] Harman, D. (1993). The First Text Retrieval Conference (TREC-1). Gaithersburg: National Institute of Standards and Technology. (NIST Spec. Publ.)
[9] Harter, S.P. (1986). Online Information Retrieval. Orlando: Academic Press.
[10] Harter, S.P. (1990). Search Term Combinations and Retrieval Overlap: A Proposed Methodology and Case Study. Journal of the American Society for Information Science 41(2).
[11] Hersh, W.R. & Hickam, D.H. (1995). An Evaluation of Interactive Boolean and Natural Language Searching with Online Medical Textbook. Journal of the American Society for Information Science 48(7).
[12] Iivonen, M. (1995). Consistency in the selection of search concepts and search terms. Information Processing & Management 31(2).
[13] Ingwersen, P. & Willett, P. (1995). An Introduction to Algorithmic and Cognitive Approaches for Information Retrieval. Libri 45(1).
[14] Järvelin, K., Kristensen, J., et al. (1996). A Deductive Data Model for Query Expansion. In: Proceedings of the 19th International ACM SIGIR Conference, Zürich, Switzerland, August 18-22, 1996.
[15] Lancaster, F.W. (1968). Information Retrieval Systems: Characteristics, Testing, and Evaluation. New York: John Wiley.
[16] Lancaster, F.W. & Warner, A.J. (1993). Information Retrieval Today. Arlington: Information Resources Press.
[17] Martello, S. & Toth, P. (1990). Knapsack Problems: Algorithms and Computer Implementations. Guildford: John Wiley & Sons.
[18] McKinin, E.J., Sievert, M.E., et al. (1991). The Medline Full-Text Project. Journal of the American Society for Information Science 42(4).
[19] Newell, A. (1968). Heuristic programming: Ill-structured problems. In: Aronofsky, J. (Ed.), Progress in Operations Research, Vol. III. New York.
[20] Paris, L.A.H. & Tibbo, H.R. (1998). Freestyle vs. Boolean: A comparison of partial and exact match retrieval systems. Information Processing & Management 34(2/3).
[21] Salton, G. (1972). A new comparison between conventional indexing (MEDLARS) and automatic text processing (SMART). Journal of the American Society for Information Science 23(March-April).
[22] Salton, G. (1986). Another look at automatic text-retrieval systems. Communications of the ACM 29(7).
[23] Salton, G. & McGill, M.J. (1983). Introduction to Modern Information Retrieval. Singapore: McGraw-Hill.
[24] Saracevic, T. (1995). Evaluation of evaluation in information retrieval. In: Fox, E.A. et al. (Eds.), SIGIR '95 - Proceedings of the 18th Annual International ACM SIGIR Conference. Washington, July 9-13, 1995.
[25] Saracevic, T., Kantor, P., et al. (1988). A Study of Information Seeking and Retrieving. Journal of the American Society for Information Science 39(3), pp. 161-176, 177-196, and 197-216.
[26] Sormunen, E. (2000). A Method for Measuring Wide Range Performance of Boolean Queries in Full-Text Databases. Doctoral Thesis. Tampere: University of Tampere. Acta Electronica Universitatis Tamperensis.
[27] Sormunen, E., Laaksonen, J., et al. (1998). The IR Game - A Tool for Rapid Query Analysis in Cross-Language IR Experiments. PRICAI '98 Workshop on Cross Language Issues in Artificial Intelligence. Singapore, Nov 22-24, 1998.
[28] Sparck-Jones, K. (1981). Information retrieval experiment. London: Butterworths.
[29] Tague-Sutcliffe, J. (1992). The pragmatics of information retrieval experimentation, revisited. Information Processing & Management 28(4).
[30] Turtle, H. (1994). Natural Language vs. Boolean Query Evaluation: A Comparison of Retrieval Performance. In: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference. London: Springer-Verlag.
