CLARIT Compound Queries and Constraint-Controlled Feedback in TREC-5 Ad-Hoc Experiments


Natasa Milic-Frayling(1), Xiang Tong(2), Chengxiang Zhai(2), David A. Evans(1)

(1) CLARITECH Corporation, 531 Fifth Avenue, Pittsburgh, Pennsylvania
(2) Laboratory for Computational Linguistics, Carnegie Mellon University, Pittsburgh, Pennsylvania

1. Introduction

A fundamental problem for searching over large databases in ad-hoc mode is the formulation of an effective initial query that is both comprehensive and focused. The query needs to be comprehensive enough to retrieve, on its own or enhanced by various automatic feedback techniques, relevant documents that possibly address different aspects of the topic. At the same time, it has to be focused enough to ensure quality input into intermediate feedback processes or high precision in the final retrieval.

In TREC, the initial query formulation problem has been addressed in two types of ad-hoc tasks: the automatic ad-hoc task, which essentially relies on the original description of the TREC topic (a complete topic or some part of it), and the manual ad-hoc task, in which users create queries either solely based on their own understanding and knowledge of the topic or by consulting various information resources, including documents from the target database, as permitted in this year's TREC. We are particularly interested in the manual ad-hoc task, as it closely resembles real-world search situations, especially by allowing the user to use information from the target corpus. The two most natural ways to exploit the target database directly are (1) to allow the user to re-formulate the query through interactive searches and (2) to enhance the initial query automatically based on documents reviewed by the user. Some of the manual ad-hoc experiments in TREC-5 include both methods; others use only the latter.

The main objective of our TREC-5 ad-hoc experiments is to evaluate a method by which the user can influence, rather than perform, the selection of feedback documents for automatic query enhancement, assuming that the active interaction between the user and the system ends with the initial interactive search over the target data. In this respect the CLARIT(1) TREC-5 ad-hoc experiments represent a continuation and further refinement of the study of constraint-controlled feedback that we initiated in the TREC-4 ad-hoc experiments. However, this year we performed the ad-hoc experiments with the CLARIT commercial system, which, in contrast to the CLARIT experimental suite, does not yet incorporate system features that have been effective in improving retrieval performance in previous TRECs, such as partitioning documents into fixed-length overlapping windows and using automatic negative feedback in the form of the CLARIT distractor term space [1, 2]. The CLARIT commercial system, on the other hand, is equipped with a GUI that fully supports the user in the interactive generation of CLARIT compound queries, i.e., natural language (NL) queries supplemented by Boolean-type constraints, which we previously explored in a more rudimentary form in TREC-4 [2].

The second objective of our study is to determine how effective user-specified constraints are in facilitating the final ranking of documents in order to achieve a higher front-end precision. Our official submissions, CLTHES and CLCLUS, explore the use of constraints to control both the automatic feedback and the final ranking of documents. Finally, we include in our analysis a new experimental feature of the CLARIT system, namely manual query expansion using pre-computed concept clusters from the target database.

Generally, there are many difficulties associated with experimental designs that rely upon an authoritative document relevance judgment that is independent of the experiment process. In manual ad-hoc experiments this problem is particularly acute, since the characteristics of individual searchers become more pronounced and influential with increased levels of user interaction with the system.

(1) CLARIT is a registered trademark of CLARITECH Corporation.

In relation to our TREC-5 experiments we anticipate difficulties in differentiating the effects of:

- the discrepancy between the user's and a NIST expert's relevance criteria;
- the ability of the users to translate their relevance criteria into an operational definition expressed through Boolean-type constraints;
- the inherent limitation of constraints in characterizing document relevance.

We are also sensitive to the problem of over-fitting of the initial query, in particular the set of constraints, to the limited number of documents that the user reviews during the process of creating the initial query. In contrast to the relevance judgment problem, which is inherent to our experiment design, user bias introduced through the query construction process is a problem that needs to be dealt with in real-life applications. It exists in all situations in which the user has a limited view of the information space. Thus, an important issue that will be a focus of our future experiments is the representation of the global information space as an aid to document retrieval. Although still in an experimental form, the concept clustering technique used in the CLCLUS experiment represents our first step in that direction.

In Section 2 of this paper we present a detailed description and analysis of the experiments with feedback control. In Section 3 we discuss the official TREC-5 experiments, CLTHES and CLCLUS. We summarize our findings in Section 4. The Appendix contains information about system parameters and the experiments performed in the TREC-5 Very Large Collection (VLC) track.

2. Experiments with Feedback Control

2.1 CLARIT Compound Query

The general CLARIT approach to TREC retrieval tasks relies on a rich representation of both TREC topics and documents in TREC databases. CLARIT TREC queries are often composed of several layers of terminology that predominantly originate from the target data [2]. It is essentially the overlap between the statistically prominent features in the query term space and the document term space that determines the degree of document relevance to the topic. This retrieval strategy generally reduces the ambiguity of query concepts by providing adequate contextual description and consequently improves retrieval precision and recall. However, the key to the procedure is the relatively difficult task of identifying good sources of terminology in the target corpus and effective methods for extracting such terminology.

Normally, in ad-hoc experiments the top N documents or document windows retrieved in response to the initial query are assumed to be relevant to the query and are further processed by the CLARIT Thesaurus Extraction module to obtain the terminology characteristic of the selected documents [1, 2]. This terminology, when added to the initial query, helps the system identify documents with similar content. Although the thesaurus extraction technique is robust and provides effective query augmentation even when the precision of the initial retrieval is not very high, we wish to design a more reliable procedure for selecting potentially relevant documents for ad-hoc searching. For that purpose we introduce the notion of the CLARIT compound query, which consists of a natural language (NL) query and a set of Boolean-type constraints constructed by the user.
The constraints are intended to capture and enforce the user's relevance criteria during the selection of documents for automatic feedback in a partially interactive or non-interactive search environment, where the user has limited or no access to the target database. In such situations constraints can serve the general purpose of providing the user's input to intermediate automatic processes, such as automatic feedback, or of propagating the user's relevance criteria even further through the retrieval process and facilitating the final ranking of documents. Therefore, in our experiments that address feedback control, we allow the user to specify a compound query which, in addition to the NL query, contains the user's criteria for selecting documents expressed in the form of Boolean constraints. These constraints are applied as filters over documents retrieved in response to the NL query (see Figure 1). More precisely, the top N retrieved document windows are evaluated with respect to the user's selection criteria. Only those that satisfy the user's criteria are used for automatic feedback, as illustrated by the sketch below.
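To make the mechanism concrete, the following minimal sketch implements the loop of Figure 1 under stated assumptions: `retrieve` stands in for CLARIT retrieval over a query given as a list of terms, constraints are represented as a conjunction of alternative term sets, and simple term counting stands in for CLARIT thesaurus extraction. All names, default cutoffs, and the scoring heuristic are illustrative, not the CLARIT implementation.

```python
from collections import Counter

def satisfies(window_terms, constraint_groups):
    # Conjunction of disjunctions: every group must be matched
    # by at least one of its alternative terms.
    return all(any(t in window_terms for t in group) for group in constraint_groups)

def controlled_feedback(retrieve, nl_query, constraint_groups, top_n=40, n_terms=20):
    """Run the NL query, keep only the top-N retrieved windows that pass
    the user's constraints, extract prominent terms from the survivors,
    and re-run the augmented query.

    retrieve: callable mapping a list of query terms to a ranked list of
    window dicts, each with a "terms" set.  nl_query: list of query terms.
    """
    windows = retrieve(nl_query)[:top_n]
    feedback = [w for w in windows if satisfies(w["terms"], constraint_groups)]
    # Stand-in for thesaurus extraction: rank terms by the number of
    # surviving feedback windows in which they occur.
    counts = Counter(t for w in feedback for t in w["terms"])
    expansion = [t for t, _ in counts.most_common(n_terms)]
    return retrieve(nl_query + expansion)
```

The key design point is that the constraints touch only the intermediate step: the final ranking still comes from the (augmented) NL query, so a poorly chosen constraint degrades the feedback input rather than removing documents from the result outright.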

[Figure 1: Document filtering as a mechanism to control automatic feedback. Pipeline: Initial Query -> CLARIT Retrieval -> Doc LIST1 -> (Constraints filter) -> CLARIT Thesaurus Extraction (Feedback) -> Augmented Query -> CLARIT Retrieval -> Final Doc LIST.]

This approach to achieving more effective automatic feedback was first evaluated in our TREC-4 experiments, in the context of the different levels of control that the user may have over the feedback procedure [2]. In fact, in TREC-4 we committed ourselves to completing a sequence of experiments that explore feedback control in more depth. In the subsequent sections we present the results that we obtained by performing these experiments on TREC-5 data.

2.2 Feedback Control Experiment Design

In our feedback control experiments we explore two factors:

1. The type of the initial query, i.e., the source of terminology used by the user to formulate the initial query; and
2. The feedback control, i.e., the level of control that the user has over the system's expansion of the query.

Table 1: Set of CLARIT Feedback Control Experiments

                                             Initial Query Type
  Feedback Control                NL Query   Terminology from       Terminology from
                                             Non-Target Databases   the Target Database
  Fully Automatic                 A1         M1                     I1
  Document Filtering              A2         M2                     I2
  Manual Selection of Documents   A3         M3                     I3

Based on the source of terms for the query, we classify queries into three categories:

1. Queries created by automatic NL processing of the topic description. Terms in such queries are derived directly from CLARIT NLP of the topic description or an information request. In fact, in our TREC-5 exploration of feedback control we used queries that consist of a subset of the NLP-generated terms. This subset of terms was identified by the user to avoid terminology that is not directly related to the topic but serves only to describe the retrieval task.

2. Manually created queries supplemented with terms from a non-target corpus. The user creates the query using the CLARIT IR system (in particular, thesaurus-discovery operations) to find useful terms from information sources other than the target corpus.

3. Manually created queries supplemented with terms from the target corpus. The user creates the queries using the CLARIT IR system over the target corpus.

The second parameter in the experiments represents the different levels of control that the user has over the feedback procedure(2):

1. Fully automatic query augmentation. The top N document windows retrieved by the initial NL query are used by the system to extract the CLARIT Thesaurus. A specified portion of the thesaurus terms is automatically added to the initial query.

2. Filtering of feedback documents based on user-specified constraints. The user specifies Boolean-type constraints to be used to filter the document windows retrieved by the initial NL query. Only the top N document windows that meet the user's constraints are used in thesaurus extraction. The specification of constraints may or may not include information about the target corpus.

3. User selection of feedback documents. The user is allowed to review the documents that are retrieved in response to the initial NL query and to select the ones to be used for automatic query augmentation.

2.3 Feedback Control Experiments with TREC-5 Data

As mentioned before, we first used document filtering to facilitate automatic feedback in the TREC-4 ad-hoc experiments [2]. The results of those experiments showed a modest improvement in retrieval performance. Our objective here is to examine the effects of similar feedback control using TREC-5 data and topics. We extend the analysis to all three types of manual queries described in Section 2.2, since they represent interesting classes of IR problems and applications. Indeed, they represent an abstraction of typical situations such as searching with minimal user intervention, building user interest profiles without having example relevant documents for individual topics, or enhancing the basic interactive search by automatic query augmentation.

The main differences between the TREC-4 and TREC-5 experiments are in the implementation of the document filtering mechanism and the choice of the system used in the experiments: (1) TREC-5 constraints are designed dynamically, using interactive searches over the non-target or target data, rather than being specified separately from the NL query. (2) In addition to the logical AND and OR operators, the TREC-5 constraints use the NOT(3) operator. (3) In TREC-5 both the manual construction and the batch processing of queries are performed using the commercial CLARIT IR System rather than the CLARIT experimental system. This implies that instead of partitioning and matching documents on the level of fixed-length overlapping windows we used fixed-length disjoint document windows. Furthermore, we did not apply the CLARIT negative feedback technique to contrast the potentially useful terminology with the general, distracting terminology [2]. We also indexed the individual TREC databases separately and used the CLARIT database merging technique to obtain the final ranked list of documents, rather than creating one monolithic TREC-5 database.

The results of the TREC-5 feedback control experiments are summarized in Tables 2-4. For clarification, we give a brief description of each group of experiments, as follows:

Experiments with simple NL queries. Initial queries for the A1-A3 experiments were created by reviewing the output of the NL processing of the topic text and eliminating frequent and non-specific terms. In the document filtering experiment (A2) we applied the same set of constraints that was generated through interactive searches over the target database and used in both the experiment I2 and the official run CLTHES.
For completeness and consistency we will, in the future, test the NL queries with constraints generated only on the basis of the topic description.

(2) Originally we considered an additional type of user feedback: selection of both feedback documents and terminology to be added to the query. However, the simulation of such feedback by a NIST expert cannot be reliably implemented, since we cannot reliably measure the degree to which our term selection approximates that of a NIST expert.

(3) In fact, the NOT operator was used in very few topics. Thus, in that respect the TREC-4 and TREC-5 filtering mechanisms are comparable.

Experiments with queries constructed using terminology from non-target data. Initial queries for this set of experiments (M1-M3) were created through interactive searches over the AP89 database, which is not included in the TREC-5 data set. Constraints used in M2 were constructed either dynamically, with verification of the effects that they have on the search over the AP89 data, or simply from the user's knowledge of the topic domain.

Experiments with queries constructed using terminology from the target data. For these experiments (I1-I3) we formulated the compound query, i.e., both the initial NL queries and the associated constraints, by searching over the target data. The same NL queries and constraints are used in the official CLTHES run (see Section 3).

All the experiments with feedback based on user selection of documents (A3, M3, I3) use the NIST relevance judgments of documents to simulate search by an expert user. Additional query terminology is extracted from the relevant documents that appear among the top 1 documents retrieved in response to the initial NL query. In the future we plan to re-run these experiments with feedback based on document windows rather than full documents, to make them more consistent and comparable with the rest of the discussed experiments. We, in fact, expect higher recall and precision since, in our past experiments, thesaurus-based query expansion using terminology from system-discovered relevant portions of documents, rather than the text of complete documents, has proven to be more effective.

2.4 Experiment Results

The results of the three sets of experiments presented in Tables 2-4 and Figures 2-4 lead us to the conclusion that the controlled feedback technique, facilitated by user-specified constraints, helps bridge the performance gap between automatic feedback (A1, M1, I1), in which the system extracts additional terminology from the top N retrieved document windows, and automatic query augmentation with terminology from the truly relevant documents (A3, M3, I3). Indeed, filtering generally improves precision: an increase in average precision and R-precision is observed for all three types of queries.

We suspect that the consistently higher precision in the experiments with the compound query created interactively over the target data (I1-I2-I3) is due to the high precision of the initial query and the nature of the thesaurus-based query enhancement used in the feedback phase. Although an initial query such as the one created interactively over the AP89 database (and used in the experiments M1-M2-M3) might provide a better general representation of the topic, as could perhaps be inferred from the associated recall statistics, its initial precision over the target database may not be as high. Since the CLARIT Thesaurus extraction technique discovers terminology prominent in a given set of documents, its use in the feedback phase further emphasizes features of the initially retrieved documents. In that manner, higher initial precision yields higher overall retrieval performance.

Table 2: Feedback control experiments with simple NL manual queries

  Initial Query Type: NL Query   A1 (Fully Automatic)   A2 (Document Filtering)   A3 (User Selection)
  Recall (Max = 5,524)           2,859                  2,778 (-3%)(a)            3,070 (7%)(a) (11%)(b)
  Average Precision                                     (21%)(a)                  853 (59%)(a) (32%)(b)
  R-Precision                                           (15%)(a)                  913 (32%)(a) (15%)(b)
  Front-end Precision                                   (8%)(a)                   .8295 (66%)(a) (53%)(b)

  (a) Relative difference with respect to the fully automatic feedback experiment.
  (b) Relative difference with respect to the document filtering experiment.

Furthermore, although we expected that user constraints might inhibit recall, this is observed only in the experiments with the simple NL query combined with the CLTHES constraints (A2). Similarly, front-end precision is higher for all searches with constraints except for the run M2.

Table 3: Feedback control experiments with queries created using terminology from non-target databases

  Initial Query Type: Non-Target Terminology   M1 (Fully Automatic)   M2 (Document Filtering)   M3 (User Selection)
  Recall (Max = 5,524)                         3,279                  3,322 (1%)(a)             3,386 (3%)(a) (2%)(b)
  Average Precision                                                   (12%)(a)                  139 (59%)(a) (42%)(b)
  R-Precision                                                         (9%)(a)                   184 (32%)(a) (22%)(b)
  Front-end Precision                                                 (-4%)(a)                  .8964 (55%)(a) (61%)(b)

  (a) Relative difference with respect to the fully automatic feedback experiment.
  (b) Relative difference with respect to the document filtering experiment.

Table 4: Feedback control experiments with CLTHES manual queries

  Initial Query Type: Target Terminology   I1 (Fully Automatic)   I2 (Document Filtering)   I3 (User Selection)
  Recall (Max = 5,524)                     3,116                  3,144 (1%)(a)             3,210 (3%)(a) (2%)(b)
  Average Precision                                               (12%)(a)                  52 (35%)(a) (20%)(b)
  R-Precision                                                     (15%)(a)                  195 (25%)(a) (9%)(b)
  Front-end Precision                                             (10%)(a)                  .8954 (34%)(a) (22%)(b)

  (a) Relative difference with respect to the fully automatic feedback experiment.
  (b) Relative difference with respect to the document filtering experiment.

From the above experiments we can also make interesting observations regarding two important issues related to the manually created compound queries:

Over-fitting of the NL query. We suspect that the lower recall in experiment I1 (with the initial NL query generated over the target data), in comparison with the recall achieved in M1 (the experiment with the NL query constructed over a non-target database), is a consequence of the NL query over-fitting to the documents reviewed by the user during NL query construction. Having tried several search strategies and reviewed the retrieved documents, the user probably focused on the aspects of the topics represented in those documents. Consequently, the created queries reflect the user's view and understanding of the topic based on a limited number of documents viewed from the target database.

Over-fitting of the user constraints. Experiments A1-A2-A3 are very useful for assessing the degree to which the user constraints, independently of the NL query, are influenced by the user's interpretation of the topic. It is interesting to note that, when combined with the constraints generated through interactive search over the target database (as in I2), the automatically generated NL queries yield a slightly lower recall than with fully automatic feedback (A1). This is an indicator of over-fitting of the manually built constraints to the documents reviewed by the user during the manual building of queries. Indeed, it seems that some of the features in the automatic NL query that were responsible for retrieving certain types of relevant documents were suppressed by the use of constraints. However, the decrease in recall in A2 was not followed by a decrease in retrieval precision. The reason for that is the robustness of the thesaurus-based feedback: the loss of a relatively small percentage of relevant documents (3%) does not have a great impact on the types of terms extracted by the thesaurus technique. In fact, it seems that the user's constraints were successful in retaining a sufficient number of relevant documents and that the thesaurus-extracted terms further amplified the role of the query features responsible for retrieving relevant documents. This resulted in increased retrieval precision.
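For reference, the measures reported in Tables 2-4 follow the standard TREC definitions; the sketch below computes them from a single ranked list. The front-end cutoff of 10 documents is an assumption made for illustration (the cutoff used for front-end precision is not restated here), and all names are hypothetical.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return sum(1 for doc in ranking[:r] if doc in relevant) / r if r else 0.0

def front_end_precision(ranking, relevant, cutoff=10):
    """Precision over the top of the ranking (assumed cutoff of 10)."""
    top = ranking[:cutoff]
    return sum(1 for doc in top if doc in relevant) / len(top) if top else 0.0

ranking = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d3", "d5"}
print(average_precision(ranking, relevant))  # (1/1 + 2/3) / 3 = 0.5556
```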

[Figure 2: Precision/recall curves and statistics for the TREC-5 feedback control experiments A1-A3 (relevance feedback vs. controlled feedback vs. automatic feedback; average precision, precision @10, R-precision, and recall).]

[Figure 3: Precision/recall curves and statistics for the TREC-5 feedback control experiments M1-M3 (relevance feedback vs. controlled feedback vs. automatic feedback; average precision, precision @10, R-precision, and recall).]

[Figure 4: Precision/recall curves and statistics for the TREC-5 feedback control experiments I1-I3 (relevance feedback vs. controlled feedback vs. automatic feedback; average precision, precision @10, R-precision, and recall).]

3. CLARIT TREC-5 Experiments: CLTHES and CLCLUS

3.1 Experiment Design

For the official submission of the TREC-5 ad-hoc experiments we selected two experimental runs, CLTHES and CLCLUS, that include new techniques:

(1) Creating the initial manual queries using CLARIT term clusters
(2) Applying a second set of constraints to filter the results of the augmented query, with the goal of improving the front-end precision of the retrieved documents
(3) Merging the results from the constrained and unconstrained searches to obtain a final set of documents with both higher front-end precision and higher recall.

[Figure 5: Design of the CLTHES and CLCLUS experiments. Pipeline: Initial Query (+ Constraints) -> CLARIT Retrieval -> Doc LIST1 -> CLARIT Thesaurus Extraction (Feedback) -> Augmented Query -> CLARIT Retrieval -> Doc LIST2 -> (Constraints) -> Doc LIST3 -> Result Merging -> Final Doc LIST.]

Focusing more on exploiting constraints to improve the precision of the final retrieval, and on the related result-merging techniques, CLTHES and CLCLUS represent an extension of the feedback control experiments presented in Section 2. Their main objectives are:

(1) To verify whether user input in the selection of documents for final retrieval, facilitated through a second set of constraints, is beneficial for achieving better retrieval performance, in particular higher front-end precision
(2) To explore result-merging techniques that combine the benefit of the higher front-end precision observed with more tightly controlled searches with the higher recall expected from less constrained searches.

Construction of Queries

In the CLTHES and CLCLUS experiments, CLARIT manual queries were created through iterative searching and review of documents from the target databases, TREC Disk 2 and Disk 4, using the CLARIT Interactive system. The two experiments differ mainly in the source of terminology that the system provides to the user as an aid for creating the queries. In CLTHES the terminology is generated by the CLARIT Thesaurus extraction technique from user-specified documents. More precisely, the user starts building an NL query by initiating a search with key concepts from the topic text and extracting a CLARIT Thesaurus from the documents that the user finds relevant to the query. The user then reviews the thesaurus terms and selects those suitable for the query and/or the constraints. The NL queries and the Boolean constraints are created and tested simultaneously.

In the CLCLUS experiment we create for each TREC-5 database a complementary database of terminology clusters. Individual clusters consist of terminology that tends to co-occur in content-related documents, and they essentially provide an overview of the themes in the corpus. The clusters themselves are treated as documents by the CLARIT system and can therefore be explored using CLARIT search. In particular, the user can search over the terminology clusters to identify those that best correspond to a given term or set of terms. In CLCLUS the user selected the terms from individual clusters that seemed most appropriate for describing the topic and added them to the query.

We view CLARIT terminology clustering as an alternative to the thesaurus-based technique for constructing and automatically enhancing queries. The main advantage of using clusters to create a query is that the procedure does not involve document review and relevance assessment, since the user is presented with already digested information. Furthermore, terminology clusters have a higher potential for capturing the various aspects of a particular topic in the database than a thesaurus created from the top N initially retrieved or user-specified documents.

Constraints and Merging of Results

The CLARIT System supports searching on the content of various fields in the documents, either by NL querying alone or by supplementing the NL query with Boolean-type constraints. In the CLTHES and CLCLUS experiments we indexed the documents as having only two fields, the body of the text and the document title. The constraints formulated by the user were restricted to the body of the text. For example, for Topic 279 we have:

Topic 279: Earth magnetic pole shifting

Constraints 1:
  (DocHasTerm "earth")
  && ((DocHasTerm "magnetic pole") || (DocHasTerm "north pole"))
  && ((DocHasTerm "shifting") || (DocHasTerm "wandering") || (DocHasTerm "shift"))

Constraints 2:
  (DocHasTerm "earth")
  && ((DocHasTerm "magnetic pole") || (DocHasTerm "north pole"))

The first constraint, for example, is interpreted by the system as a requirement that the body of the text, in this case a document window, contains the term "earth" and at least one of the terms from each of the two term sets: {"magnetic pole", "north pole"} and {"shifting", "wandering", "shift"}.

The first set of constraints is applied to the result set of the initial query in order to select document windows for feedback. Only those document windows among the top 4 that satisfy the first set of constraints are considered for feedback. The top 5% of the terms in the thesaurus extracted from these document windows are added to the query. The second set of constraints is applied to the results of the augmented query, and the documents that satisfy the constraints are placed at the top of the final retrieval list. The remaining documents that do not satisfy the constraints are included at the bottom of the list, as in the re-ranking sketch below.
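As an illustration of how such compound-query constraints might be evaluated and then used for final re-ranking, consider the following sketch. It is not the CLARIT query language: the representation (a conjunction of clauses, where a clause is either a set of alternative terms or a negated term, matching the "!" NOT operator that appears later in Table 6) and all function names are assumptions made for this example.

```python
def satisfies(window_terms, clauses):
    """A clause is either a set of alternative terms (at least one must
    occur) or the pair ("not", term) for the Boolean NOT operator."""
    for clause in clauses:
        if isinstance(clause, tuple) and clause[0] == "not":
            if clause[1] in window_terms:
                return False
        elif not any(term in window_terms for term in clause):
            return False
    return True

# Constraints 1 for Topic 279 in this representation:
constraints_1 = [
    {"earth"},
    {"magnetic pole", "north pole"},
    {"shifting", "wandering", "shift"},
]

def rerank(ranked_docs, clauses):
    """Final ranking step: documents satisfying the second constraint set
    move to the top; the rest keep their relative order at the bottom."""
    top = [d for d in ranked_docs if satisfies(d["terms"], clauses)]
    rest = [d for d in ranked_docs if not satisfies(d["terms"], clauses)]
    return top + rest

docs = [{"id": "A", "terms": {"earth", "weather"}},
        {"id": "B", "terms": {"earth", "north pole", "shift"}}]
print([d["id"] for d in rerank(docs, constraints_1)])  # ['B', 'A']
```

In this representation the clause ("not", "act") would express the !(DocHasTerm "act") constraint used for Topic 273 below.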

3.2 Comparative Analysis

Results of CLTHES and CLCLUS Experiments

Statistically, there is no significant difference in performance between CLTHES and CLCLUS, as can be seen from Figure 6 and Table 5. However, it is interesting to note that for a number of queries the precision and recall differ significantly (Figure 7). It is our belief that the observed difference in performance is mostly due to the difference in the formulation of the initial NL query, although for some queries the constraints may have caused dramatic changes.

[Figure 6: Precision and recall statistics for the CLTHES and CLCLUS experiments (P/R curves and statistics: average precision, precision @10, R-precision, and recall, for CLTHES and CLCLUS evaluated over 50 and over 47 topics).]

Table 5: Performance statistics for CLTHES and CLCLUS

  Experiment            CLTHES (50)   CLTHES (47)   CLCLUS (50)   CLCLUS (47)
  Recall                3,147         3,144         3,163         3,160
  Average Precision
  R-Precision
  Front-end Precision

[Figure 7: Difference in retrieval performance for individual queries (per-topic differences in average precision, CLTHES AvePrec - CLCLUS AvePrec, and in recall, CLTHES Recall - CLCLUS Recall).]

Indeed, if we compare the average precision and recall obtained by the NL queries only, without constraints and automatic feedback, the difference in the recall and average precision statistics has similar characteristics as for the CLTHES and CLCLUS experiments (see Figure 8). In fact, for 34 out of 50 topics the relationship between the average precision of the CLTHES and CLCLUS initial queries does not change after the controlled feedback and document re-ranking have been applied. For example, the average precision for 19 topics is higher for a CLTHES NL query than for a CLCLUS NL query, and it remains so when the constraints and the automatic feedback are applied.

[Figure 8: Difference in average precision and recall between the CLTHES and CLCLUS initial NL queries (per-topic CLTHES - CLCLUS differences).]

On the other hand, the retrieval performance of some of the queries shows that the constraints may change the average precision quite significantly. For example, for Topic 273 we find that the experiment without constraints, i.e., using only unconstrained automatic feedback, yields an average precision of .569. This precision was reduced to .0632 when constraints were used to control the feedback and the selection of the final documents (see Table 6).

Table 6: Difference in precision between constrained and unconstrained search for Topic 273

  Topic 273: Volcanic and Seismic Activity Levels

                           No constr.   Constr.
  Retrieved:               1,000        1,000
  Relevant:
  Rel_ret:
  Precision at 5 docs:     .8
  Precision at 10 docs:    .8
  Precision at 15 docs:    .8667
  Precision at 20 docs:    .9
  Average precision:
  R-Precision:

  Constraint: (!(DocHasTerm "act"))(c) && (!(DocHasTerm "rule"))
              && ((DocHasTerm "vocanic")(d) || (DocHasTerm "seismic") || (DocHasTerm "earthquake"))

  (c) "!" indicates the Boolean NOT operator.
  (d) Misspelled in the experiment.

Effects of result merging

The final results for both CLTHES and CLCLUS are obtained by merging the results of the more constrained and the less constrained searches, giving preference to the results from the more constrained search in order to achieve higher front-end precision. Table 7 summarizes the performance statistics of the experiments related to CLTHES.

Table 7

  Experiment               Constraints1   Feedback   Constraints2   Recall   AvePrec   R-Prec   Front-end prec.
  E1 (Double-Constr.)      yes            yes        yes            2,
  I2 (Single-Constr.)      yes            yes        no             3,
  Merge E1 & I2 (CLTHES)                                            3,
  Merge E1 (1) & I2                                                 3,

The results of the more constrained search E1, which involves two sets of constraints and automatic feedback, are improved by merging with the results from the simple feedback control experiment I2, in which we relax the constraint on the final selection of documents: the recall and all precision statistics increase. However, when compared with I2 itself, the merged results are inferior, which is a consequence of the significantly worse performance of E1.

In our result-merging experiments we also began to explore the effects of the different ratios in which the participating lists are combined. For example, we used only the top 1 documents from the more constrained experiment and supplemented them with the remaining documents from the less constrained search. While in the case of the double-constrained experiment this resulted in a list that is better only than the results of the more constrained search, other experiments indicate that the merged list can attain performance statistics higher than either of the merged components. As an example, we present in Table 8 the same merging procedures for the simple controlled experiments I2 and E2, which involve only the constraints for the selection of feedback documents.

Table 8

  Experiment            Constraints1   Feedback   Constraints2   Recall   AvePrec   R-Prec   Front-end prec.
  E2 (No-Feedback)      yes            no         no             2,
  I2 (Single-Constr.)   yes            yes        no             3,
  Merge E2 & I2                                                  2,
  Merge E2 (1) & I2                                              3,

Comparison with TREC-5 Participating Systems

In comparison with the other systems that participated in the ad-hoc manual category, CLARIT achieved recall and average precision above the median for most of the queries (see Figures 9-10 and Tables 9-10). More precisely, the recall of CLTHES was above or equal to the median for 41 of 50 (82%) queries. Similarly, for CLCLUS, the recall was above or equal to the median for 37 out of 50 (74%) queries. The system achieved the best recall for 11 queries in the CLTHES experiment and for 12 queries in the CLCLUS experiment. Comparison of the CLARIT average precision for individual queries shows that 38 out of 50 (76%) CLTHES queries and 33 out of 50 (66%) CLCLUS queries achieved precision above or equal to the median. In CLTHES an average precision within 1% of the best average precision was achieved for 4 queries, and in CLCLUS for 5 queries. Figures 9-10 show the comparison of the average precision and recall statistics for individual queries.

[Figure 9: CLTHES and CLCLUS average precision statistics (per-topic average precision against the best, median, and minimum TREC-5 values, sorted by median, and the difference from the median).]

Table 9: Comparison with the median average precision

  AvePrec Statistics   > Med   = Med   < Med   Best (1% diff)
  CLTHES                               12      4
  CLCLUS                               17      5

[Figure 10: CLTHES and CLCLUS recall statistics (per-topic recall against the best, median, and minimum TREC-5 values, sorted by median, and the difference from the median).]

Table 10: Comparison with the median recall

  Recall Statistics   > Med   = Med   < Med   Best
  CLTHES                              9       11
  CLCLUS                              13      12

4. Conclusions

The CLARIT feedback control experiments and the two official TREC-5 runs, CLTHES and CLCLUS, provide useful insights into the interaction between the constraints and automatic feedback. Since the evaluation of the experiments is done with respect to the available relevance judgments of the NIST experts, the improvement figures reported in this paper are only an approximation of the true performance indicators. Our experience with the analysis of the TREC-4 Interactive experiments [6] leads us to believe that the general improvement of retrieval performance achieved by constraint-controlled feedback would be more significant when computed within the actual user's evaluation system.

We expect the same to be the case with the results of CLTHES and CLCLUS, in which the constraints are also used to facilitate the final ranking of the retrieved documents. We speculate that the slight decrease in performance observed in CLTHES would in fact be interpreted as an improvement in the user's value system, since the ranking of documents controlled by user-specified constraints more strongly reflects the user's true relevance judgment criteria.

In summary, since the NL queries are created through interactive searches over the target database, it is not surprising that the experiment which uses only the initial queries (I) yields a reasonably high precision (Table 11). Applying fully automatic feedback to the initial queries increases recall significantly at the expense of precision, as can be seen from the retrieval results of experiment I1. Furthermore, a refined automatic feedback that includes the user's input in the form of constraints helps further improve recall and increase precision, as can be seen from experiment I2. In the experiment CLTHES we observe a slight decrease in precision when the user constraints are also used for final document ranking. We believe that this is due to user bias, in particular to the constraints over-fitting to the documents reviewed and judged relevant by the user during the initial, interactive query formulation process.

Table 11

  Experiment   Constraints1   Feedback   Constraints2   Recall   AvePrec   R-Prec   Front-end prec.
  I            no             no         no             2,
  I1           no             yes        no             3,
  I2           yes            yes        no             3,
  CLTHES       yes            yes        yes            3,

We wish to emphasize that our study of constraint-controlled feedback is primarily aimed at addressing the problem of limited user access to the target database. We do not expect that the use of constraints will outperform thesaurus-based query expansion when the user can manually select relevant documents for automatic feedback. We also do not consider Boolean constraints, as they have been used in classical Boolean systems and in some recent work on special IR problems [7], to be a substitute for the NL representation of a query. Indeed, in our experiments, the NL component of the CLARIT compound query is conceptually much richer than the set of associated constraints. The constraints used in the final ranking of documents merely ensure that documents which responded to the NL query and contain features that, in the user's opinion, should be sufficient for characterizing relevant documents are presented at the top of the list. We assume that a user familiar with the topic has a particular information need that might be better addressed if constraints are used in this manner, but by no means would we recommend using constraints as a primary tool for retrieving documents. Indeed, because of the extreme sensitivity of constraints to the global conceptual structure of the database, we would not attempt to use user-specified constraints as a complete set of sufficient conditions for determining document relevance.

References

[1] Evans, David A. and Lefferts, Robert G. "Design and Evaluation of the CLARIT-TREC-2 System". In Harman, D. (Ed.), The Second Text REtrieval Conference (TREC-2). National Institute of Standards and Technology Special Publication 500-215. Washington, DC: U.S. Government Printing Office, 1994.

[2] Evans, David A., Milic-Frayling, Natasa, and Lefferts, Robert G. "CLARIT TREC-4 Experiments".
In Harman, D. (Ed.), The Fourth Text REtrieval Conference (TREC-4). NIST Special Publication 500-236. Washington, DC: U.S. Government Printing Office, 1996.

[3] Harman, D. (Ed.) (1993). The First Text REtrieval Conference (TREC-1). National Institute of Standards and Technology Special Publication 500-207, Gaithersburg, MD.

[4] Harman, D. (Ed.) (1994). The Second Text REtrieval Conference (TREC-2). National Institute of Standards and Technology Special Publication 500-215, Gaithersburg, MD.

[5] Harman, D. (Ed.) (1995). The Third Text REtrieval Conference (TREC-3). National Institute of Standards and Technology Special Publication 500-225, Gaithersburg, MD.

[6] Milic-Frayling, Natasa, Zhai, C., Tong, X., Mastroianni, M.P., Evans, D.A., and Lefferts, R.G. "CLARIT TREC-4 Interactive Experiments". In Harman, D. (Ed.), The Fourth Text REtrieval Conference (TREC-4). NIST Special Publication 500-236. Washington, DC: U.S. Government Printing Office, 1996.

[7] Hearst, Marti A. "Improving Full-Text Precision on Short Queries Using Simple Constraints". Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval (SDAIR '96), Las Vegas, Nevada, 1996.

Appendix: CLARIT TREC-5 Very Large Collection Experiments

The CLARIT Very Large Collection (VLC) experiments were performed with the goal of exploring the efficiency and effectiveness of the CLARIT IR System over large collections of data. In order to complete the task we indexed all the individual databases separately. The unique retrieval list was obtained by merging the results from retrieval over the individual databases. We submitted two runs for each task, the baseline and the main task:

  CLVLCBA - baseline (2G) with automatically created queries
  CLVLCBC - baseline (2G) with automatically created queries + manually created constraints
  CLVLCMA - main (4G) with automatically created queries
  CLVLCMC - main (4G) with automatically created queries + manually created constraints

In all four runs, the queries were constructed by applying CLARIT NL processing to the full description of the TREC topics. In the runs with constraints, CLVLCBC and CLVLCMC, we applied Boolean-type filters to the vector-space results of the initial queries to select documents for automatic feedback. In these experiments we used the same set of constraints that was designed for the CLARIT ad-hoc experiments, CLTHES and CLCLUS.

Evaluation of the VLC track experiments is based on the precision achieved at 20 retrieved documents. According to the answer key for the VLC track, CLARIT system performance is as follows:

Table 1: Precision at 20 documents

  Task        NL Queries     NL Queries + Constraints
  Baseline    CLVLCBA: 12    CLVLCBC: 292
  Main Task   CLVLCMA: 83    CLVLCMC: 92

For the experiments we used a DEC Alpha workstation with 128M of RAM. In Table 2 we summarize the system characteristics and performance statistics.

Table 2: System characteristics and performance statistics

                                                      Main Task   Baseline Task   Ratio (Main/Baseline)
  Data Structure Building Speed (Mb/hour)
  Disk Space Requirement                              8G          4G              2
  Memory Requirements (for retrieval)                 32M         22M             1.45
  Query Processing Speed, NL Queries (queries/hour)
  Query Processing Speed, NL Queries + Constraints
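Since both the VLC runs and the ad-hoc runs (Section 2.3) index the databases separately and merge the per-database results into a single ranked list, the following sketch shows one simple way such merging could be done. It assumes retrieval scores are directly comparable across databases; the actual CLARIT database-merging technique is not described in this paper, and all names here are illustrative.

```python
import heapq
from itertools import islice

def merge_result_lists(per_db_results, limit=1000):
    """per_db_results: lists of (score, doc_id) pairs, each already sorted
    by descending score. Returns the doc ids of the top `limit` documents
    overall, assuming scores are comparable across databases."""
    merged = heapq.merge(*per_db_results, key=lambda pair: -pair[0])
    return [doc_id for _, doc_id in islice(merged, limit)]

ap = [(0.91, "AP890101-0001"), (0.40, "AP890102-0007")]
wsj = [(0.75, "WSJ900101-0012")]
print(merge_result_lists([ap, wsj], limit=3))
# ['AP890101-0001', 'WSJ900101-0012', 'AP890102-0007']
```

In practice, scores from separately built indexes are rarely directly comparable, which is why dedicated database-merging techniques normalize or re-estimate scores before interleaving the lists.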


More information

TREC 2017 Dynamic Domain Track Overview

TREC 2017 Dynamic Domain Track Overview TREC 2017 Dynamic Domain Track Overview Grace Hui Yang Zhiwen Tang Ian Soboroff Georgetown University Georgetown University NIST huiyang@cs.georgetown.edu zt79@georgetown.edu ian.soboroff@nist.gov 1. Introduction

More information

Enhancing Internet Search Engines to Achieve Concept-based Retrieval

Enhancing Internet Search Engines to Achieve Concept-based Retrieval Enhancing Internet Search Engines to Achieve Concept-based Retrieval Fenghua Lu 1, Thomas Johnsten 2, Vijay Raghavan 1 and Dennis Traylor 3 1 Center for Advanced Computer Studies University of Southwestern

More information

Sec. 8.7 RESULTS PRESENTATION

Sec. 8.7 RESULTS PRESENTATION Sec. 8.7 RESULTS PRESENTATION 1 Sec. 8.7 Result Summaries Having ranked the documents matching a query, we wish to present a results list Most commonly, a list of the document titles plus a short summary,

More information

Automatically Generating Queries for Prior Art Search

Automatically Generating Queries for Prior Art Search Automatically Generating Queries for Prior Art Search Erik Graf, Leif Azzopardi, Keith van Rijsbergen University of Glasgow {graf,leif,keith}@dcs.gla.ac.uk Abstract This report outlines our participation

More information

CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation"

CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation" All slides Addison Wesley, Donald Metzler, and Anton Leuski, 2008, 2012! Evaluation" Evaluation is key to building

More information

Title Core TIs Optional TIs Core Labs Optional Labs. All None 1.1.4a, 1.1.4b, 1.1.4c, 1.1.5, WAN Technologies All None None None

Title Core TIs Optional TIs Core Labs Optional Labs. All None 1.1.4a, 1.1.4b, 1.1.4c, 1.1.5, WAN Technologies All None None None CCNA 4 Plan for Academy Student Success (PASS) CCNA 4 v3.1 Instructional Update # 2006-1 This Instructional Update has been issued to provide guidance to the Academy instructors on the flexibility that

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Overview of the INEX 2009 Link the Wiki Track

Overview of the INEX 2009 Link the Wiki Track Overview of the INEX 2009 Link the Wiki Track Wei Che (Darren) Huang 1, Shlomo Geva 2 and Andrew Trotman 3 Faculty of Science and Technology, Queensland University of Technology, Brisbane, Australia 1,

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Prof. Chris Clifton 27 August 2018 Material adapted from course created by Dr. Luo Si, now leading Alibaba research group 1 AD-hoc IR: Basic Process Information

More information

Query Expansion for Noisy Legal Documents

Query Expansion for Noisy Legal Documents Query Expansion for Noisy Legal Documents Lidan Wang 1,3 and Douglas W. Oard 2,3 1 Computer Science Department, 2 College of Information Studies and 3 Institute for Advanced Computer Studies, University

More information

Metadata Quality Assessment: A Phased Approach to Ensuring Long-term Access to Digital Resources

Metadata Quality Assessment: A Phased Approach to Ensuring Long-term Access to Digital Resources Metadata Quality Assessment: A Phased Approach to Ensuring Long-term Access to Digital Resources Authors Daniel Gelaw Alemneh University of North Texas Post Office Box 305190, Denton, Texas 76203, USA

More information

Scalable Trigram Backoff Language Models

Scalable Trigram Backoff Language Models Scalable Trigram Backoff Language Models Kristie Seymore Ronald Rosenfeld May 1996 CMU-CS-96-139 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 This material is based upon work

More information

Reducing Redundancy with Anchor Text and Spam Priors

Reducing Redundancy with Anchor Text and Spam Priors Reducing Redundancy with Anchor Text and Spam Priors Marijn Koolen 1 Jaap Kamps 1,2 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Informatics Institute, University

More information

International Journal for Management Science And Technology (IJMST)

International Journal for Management Science And Technology (IJMST) Volume 4; Issue 03 Manuscript- 1 ISSN: 2320-8848 (Online) ISSN: 2321-0362 (Print) International Journal for Management Science And Technology (IJMST) GENERATION OF SOURCE CODE SUMMARY BY AUTOMATIC IDENTIFICATION

More information

From Passages into Elements in XML Retrieval

From Passages into Elements in XML Retrieval From Passages into Elements in XML Retrieval Kelly Y. Itakura David R. Cheriton School of Computer Science, University of Waterloo 200 Univ. Ave. W. Waterloo, ON, Canada yitakura@cs.uwaterloo.ca Charles

More information

Face recognition algorithms: performance evaluation

Face recognition algorithms: performance evaluation Face recognition algorithms: performance evaluation Project Report Marco Del Coco - Pierluigi Carcagnì Institute of Applied Sciences and Intelligent systems c/o Dhitech scarl Campus Universitario via Monteroni

More information

CS6200 Information Retrieval. Jesse Anderton College of Computer and Information Science Northeastern University

CS6200 Information Retrieval. Jesse Anderton College of Computer and Information Science Northeastern University CS6200 Information Retrieval Jesse Anderton College of Computer and Information Science Northeastern University Major Contributors Gerard Salton! Vector Space Model Indexing Relevance Feedback SMART Karen

More information

Implementation Techniques

Implementation Techniques V Implementation Techniques 34 Efficient Evaluation of the Valid-Time Natural Join 35 Efficient Differential Timeslice Computation 36 R-Tree Based Indexing of Now-Relative Bitemporal Data 37 Light-Weight

More information

The Effectiveness of Thesauri-Aided Retrieval

The Effectiveness of Thesauri-Aided Retrieval The Effectiveness of Thesauri-Aided Retrieval Kazem Taghva, Julie Borsack, and Allen Condit Information Science Research Institute University of Nevada, Las Vegas Las Vegas, NV 89154-4021 USA ABSTRACT

More information

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE 15 : CONCEPT AND SCOPE 15.1 INTRODUCTION Information is communicated or received knowledge concerning a particular fact or circumstance. Retrieval refers to searching through stored information to find

More information

An Evaluation of Information Retrieval Accuracy. with Simulated OCR Output. K. Taghva z, and J. Borsack z. University of Massachusetts, Amherst

An Evaluation of Information Retrieval Accuracy. with Simulated OCR Output. K. Taghva z, and J. Borsack z. University of Massachusetts, Amherst An Evaluation of Information Retrieval Accuracy with Simulated OCR Output W.B. Croft y, S.M. Harding y, K. Taghva z, and J. Borsack z y Computer Science Department University of Massachusetts, Amherst

More information

Classifying Depositional Environments in Satellite Images

Classifying Depositional Environments in Satellite Images Classifying Depositional Environments in Satellite Images Alex Miltenberger and Rayan Kanfar Department of Geophysics School of Earth, Energy, and Environmental Sciences Stanford University 1 Introduction

More information

Classification of Procedurally Generated Textures

Classification of Procedurally Generated Textures Classification of Procedurally Generated Textures Emily Ye, Jason Rogers December 14, 2013 1 Introduction Textures are essential assets for 3D rendering, but they require a significant amount time and

More information

NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags

NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags Hadi Amiri 1,, Yang Bao 2,, Anqi Cui 3,,*, Anindya Datta 2,, Fang Fang 2,, Xiaoying Xu 2, 1 Department of Computer Science, School

More information

QueryLines: Approximate Query for Visual Browsing

QueryLines: Approximate Query for Visual Browsing MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com QueryLines: Approximate Query for Visual Browsing Kathy Ryall, Neal Lesh, Tom Lanning, Darren Leigh, Hiroaki Miyashita and Shigeru Makino TR2005-015

More information

Probabilistic Learning Approaches for Indexing and Retrieval with the. TREC-2 Collection

Probabilistic Learning Approaches for Indexing and Retrieval with the. TREC-2 Collection Probabilistic Learning Approaches for Indexing and Retrieval with the TREC-2 Collection Norbert Fuhr, Ulrich Pfeifer, Christoph Bremkamp, Michael Pollmann University of Dortmund, Germany Chris Buckley

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

NTUBROWS System for NTCIR-7. Information Retrieval for Question Answering

NTUBROWS System for NTCIR-7. Information Retrieval for Question Answering NTUBROWS System for NTCIR-7 Information Retrieval for Question Answering I-Chien Liu, Lun-Wei Ku, *Kuang-hua Chen, and Hsin-Hsi Chen Department of Computer Science and Information Engineering, *Department

More information

Inter and Intra-Document Contexts Applied in Polyrepresentation

Inter and Intra-Document Contexts Applied in Polyrepresentation Inter and Intra-Document Contexts Applied in Polyrepresentation Mette Skov, Birger Larsen and Peter Ingwersen Department of Information Studies, Royal School of Library and Information Science Birketinget

More information

Web Information Retrieval. Exercises Evaluation in information retrieval

Web Information Retrieval. Exercises Evaluation in information retrieval Web Information Retrieval Exercises Evaluation in information retrieval Evaluating an IR system Note: information need is translated into a query Relevance is assessed relative to the information need

More information

II TupleRank: Ranking Discovered Content in Virtual Databases 2

II TupleRank: Ranking Discovered Content in Virtual Databases 2 I Automatch: Database Schema Matching Using Machine Learning with Feature Selection 1 II TupleRank: Ranking Discovered Content in Virtual Databases 2 Jacob Berlin and Amihai Motro 1. Proceedings of CoopIS

More information

TREC-10 Web Track Experiments at MSRA

TREC-10 Web Track Experiments at MSRA TREC-10 Web Track Experiments at MSRA Jianfeng Gao*, Guihong Cao #, Hongzhao He #, Min Zhang ##, Jian-Yun Nie**, Stephen Walker*, Stephen Robertson* * Microsoft Research, {jfgao,sw,ser}@microsoft.com **

More information

Automated Cognitive Walkthrough for the Web (AutoCWW)

Automated Cognitive Walkthrough for the Web (AutoCWW) CHI 2002 Workshop: Automatically Evaluating the Usability of Web Sites Workshop Date: April 21-22, 2002 Automated Cognitive Walkthrough for the Web (AutoCWW) Position Paper by Marilyn Hughes Blackmon Marilyn

More information

Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University

Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University http://disa.fi.muni.cz The Cranfield Paradigm Retrieval Performance Evaluation Evaluation Using

More information

Information Retrieval

Information Retrieval Information Retrieval Lecture 7 - Evaluation in Information Retrieval Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 29 Introduction Framework

More information

Joining Collaborative and Content-based Filtering

Joining Collaborative and Content-based Filtering Joining Collaborative and Content-based Filtering 1 Patrick Baudisch Integrated Publication and Information Systems Institute IPSI German National Research Center for Information Technology GMD 64293 Darmstadt,

More information

CADIAL Search Engine at INEX

CADIAL Search Engine at INEX CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr

More information

Risk Minimization and Language Modeling in Text Retrieval Thesis Summary

Risk Minimization and Language Modeling in Text Retrieval Thesis Summary Risk Minimization and Language Modeling in Text Retrieval Thesis Summary ChengXiang Zhai Language Technologies Institute School of Computer Science Carnegie Mellon University July 21, 2002 Abstract This

More information

Optimal Query. Assume that the relevant set of documents C r. 1 N C r d j. d j. Where N is the total number of documents.

Optimal Query. Assume that the relevant set of documents C r. 1 N C r d j. d j. Where N is the total number of documents. Optimal Query Assume that the relevant set of documents C r are known. Then the best query is: q opt 1 C r d j C r d j 1 N C r d j C r d j Where N is the total number of documents. Note that even this

More information

Information Retrieval. Lecture 7 - Evaluation in Information Retrieval. Introduction. Overview. Standard test collection. Wintersemester 2007

Information Retrieval. Lecture 7 - Evaluation in Information Retrieval. Introduction. Overview. Standard test collection. Wintersemester 2007 Information Retrieval Lecture 7 - Evaluation in Information Retrieval Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1 / 29 Introduction Framework

More information

Search Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS]

Search Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS] Search Evaluation Tao Yang CS293S Slides partially based on text book [CMS] [MRS] Table of Content Search Engine Evaluation Metrics for relevancy Precision/recall F-measure MAP NDCG Difficulties in Evaluating

More information

Evaluation of Retrieval Systems

Evaluation of Retrieval Systems Evaluation of Retrieval Systems 1 Performance Criteria 1. Expressiveness of query language Can query language capture information needs? 2. Quality of search results Relevance to users information needs

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY

HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A PHARMACEUTICAL MANUFACTURING LABORATORY Proceedings of the 1998 Winter Simulation Conference D.J. Medeiros, E.F. Watson, J.S. Carson and M.S. Manivannan, eds. HEURISTIC OPTIMIZATION USING COMPUTER SIMULATION: A STUDY OF STAFFING LEVELS IN A

More information

21. Search Models and UIs for IR

21. Search Models and UIs for IR 21. Search Models and UIs for IR INFO 202-10 November 2008 Bob Glushko Plan for Today's Lecture The "Classical" Model of Search and the "Classical" UI for IR Web-based Search Best practices for UIs in

More information

18 April Re.: Exposure Draft, Improving the Structure of the Code of Ethics for Professional Accountants - Phase 1. Dear Mr.

18 April Re.: Exposure Draft, Improving the Structure of the Code of Ethics for Professional Accountants - Phase 1. Dear Mr. 18 April 2016 Mr. Ken Siong Technical Director International Ethics Standards Board for Accountants 529 Fifth Avenue, 6 th Floor New York NY 10017, USA submitted electronically through the IESBA website

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

ITERATIVE SEARCHING IN AN ONLINE DATABASE. Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ

ITERATIVE SEARCHING IN AN ONLINE DATABASE. Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ - 1 - ITERATIVE SEARCHING IN AN ONLINE DATABASE Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ 07962-1910 ABSTRACT An experiment examined how people use

More information

Image Access and Data Mining: An Approach

Image Access and Data Mining: An Approach Image Access and Data Mining: An Approach Chabane Djeraba IRIN, Ecole Polythechnique de l Université de Nantes, 2 rue de la Houssinière, BP 92208-44322 Nantes Cedex 3, France djeraba@irin.univ-nantes.fr

More information