The impact of query structure and query expansion on retrieval performance

Jaana Kekäläinen & Kalervo Järvelin, Department of Information Studies, University of Tampere. Published in Croft, W.B. & Moffat, A. & van Rijsbergen, C.J. & Wilkinson, R. & Zobel, J. (Eds.), Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR '98), Melbourne, Australia, August 24-28, New York, NY: ACM Press, pp

Abstract

The effects of query structures and query expansion (QE) on retrieval performance were tested with a best match retrieval system (INQUERY¹). Query structure means the use of operators to express the relations between search keys. Eight different structures were tested, representing weak structures (averages and weighted averages of the weights of the keys) and strong structures (e.g., queries with more elaborated search key relations). QE was based on concepts, which were first selected from a conceptual model, and then expanded by semantic relationships given in the model. The expansion levels were (a) no expansion, (b) a synonym expansion, (c) a narrower concept expansion, (d) an associative concept expansion, and (e) a cumulative expansion of all other expansions. With weak structures and Boolean structured queries, QE was not very effective. The best performance was achieved with one of the strong structures at the largest expansion level.

1 Introduction

Query formulation and expansion have been studied extensively because the selection of good search keys is difficult but crucial for good search results. Query formulation is typically based on search keys given by the user. The original query does not always give satisfactory results, so it may be reformulated by adding search keys, with or without reweighting, a process known as query expansion (QE, for short). QE may be based either on search results or on knowledge structures [10]. Relevance feedback is a QE method based on search results, which has often proved to be beneficial, but its effectiveness depends on queries, ranking function and number of relevant items known, i.e., on the quality of the search results [2], [3], [6], [11], [33].
QE based on knowledge structures (e.g., thesauri) does not depend on search output, but it has not been found unambiguously useful. Statistical, collection based QE methods have been prevalent in research. Jing and Croft [17] report that an automatically constructed association thesaurus improves retrieval performance. Voorhees [32] and Jones and others [18] tested general, intellectually constructed thesauri in QE. They found no notable improvement in retrieval performance. QE with a domain specific thesaurus was tested by Kristensen [21] and Järvelin and others [19]. They report that thesaurus based QE enhances retrieval performance both in Boolean and best match environments. Nevertheless, these studies do not consider the interaction of query structure, query expansion, and the number of search concepts and search keys. Thus, we suspect that the usefulness of different types of thesauri (i.e., collection dependent or domain specific, automatically or intellectually constructed) in QE cannot be settled without studying the structural aspects of queries.

Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or fee. SIGIR 98, Melbourne, Australia 1998 ACM /98 $5.00.

Characteristics of requests and queries are usually not reported in detail. Requests seem to be of the type 'conscious topical' [16], and often rather verbose, like early TREC topics. The effect of this is that queries have usually been broad, i.e., they include many search keys, since they are automatically produced from written requests.
They are not similar to real users' short queries, which typically include only a few words. Lu and Keefer [22] tested the effect of query size on retrieval effectiveness by reducing and expanding TREC queries. They report that reduction was detrimental but QE based on automatically constructed associative thesaurus enhanced retrieval performance. In addition, they found some evidence that the effectiveness of different indexing/retrieval methods was dependent on the number of search keys in queries. In other tests the broadness of unexpanded queries has been varied and, quite logically, shorter queries gain most from QE [17], [32]. In a best match retrieval system the retrieved documents are scored by combining (e.g., by product or sum) the weights of the keys appearing in a document and a query [20], [28]. The obvious ways to change the scoring are changing the index key weighting, changing the calculation of the scores, or weighting the search keys. Croft and Das [9] tested tf.idf weighting modifications and search key weighting with phrase information and QE. They found that search key weighting and QE improve performance. Their approach has similarities to ours, but they tested different types of weighting schemes, not different query structures, and the source of the QE vocabulary was different. In the literature a difference is typically made between natural language and Boolean queries, the former treating all search keys equally, the latter showing relations between the search keys. Because these relationships as the structure of a query are thought to be useful, Boolean operators are given interpretations in best match systems (e.g., [25], [29]). With query structure we refer to the calculation of scores from individual weights of the keys, which is guided by the query, explicitly shown by operators. 
¹ The INQUERY software was provided by the Information Retrieval Laboratory, University of Massachusetts Computer Science Department, Amherst, MA, USA.

The structure of queries may be described as weak (queries with a single operator, no differentiated relations between search keys) or strong (queries with several operators, different relationships between search keys). More precisely, strong query structures are based on facets. Each facet indicates one aspect of the request, and is represented by a set of concepts, which, in turn, are expressed

by a set of search keys. Some evidence of better performance for strong structured queries exists (e.g., [5], [15], [29]), though not in QE studies. Queries with different types of operators, e.g., 'soft' interpretations of the Boolean operators and types of 'probabilistic' operators, have been combined [4], [5], [24], [26]. This method leads to rather complicated structures, but improves performance. Because neither the number nor the overlap of search keys in different queries is reported, it is hard to judge the possible effects of the structure and broadness on performance.

In this study we analyse retrieval performance of different query formulation methods in a probabilistic retrieval system. The interaction and effects of the following variables are tested:

- the number of search concepts
- the number of search strings representing concepts
- QE with different semantic relationships
- the combination of weights in scoring, or query structures.

Query formulation is based on concepts, and QE is based on a thesaurus. First, concepts representing requests are selected from the thesaurus; second, queries are formulated in which the number of concepts and the number of search strings representing the concepts are varied by QE, and the query structure of strings for scoring is varied. This study, however, is not about thesauri but, instead, about query structures and expansion extent; the thesaurus is just a source for expansion keys. Our attempt is not to compare this approach to unaugmented human performance, nor do we present claims for a conceptual approach. In this study it is used to provide various kinds of expansions in a controlled way.

2 Method

2.1 Test environment

The test environment was a text database containing Finnish newspaper articles operated under the INQUERY retrieval system (version 3.1). The database contained articles published in three Finnish newspapers.
The average article length was 233 words, and typical paragraphs were two or three sentences in length. The database index contained all keys in their morphological basic forms. In the basic word form analysis all compound words were split into their component words in their morphological basic forms.

For the database there is a collection of 35 requests, which are 1-2 sentences long, in the form of written information need statements. For these requests there is a recall base of articles. It was collected by pooling the result sets of hundreds of different queries formulated from the requests in different studies, using both exact and partial match retrieval. For a set of tests concerning query structures, 30 requests were selected on the basis of their expandability, i.e., they provided possibilities for studying the interaction of the test parameters. For each request 78 different queries were formulated and run with a result set size of 50 documents (document cut-off value [DCV] 50). The pool of these result sets contained 7010 articles that were not in the recall base. Since these 2340 queries of varying structure and expansion, including various conceptual approaches, retrieved 7010 novel articles that yielded no more than 42 new relevant ones, we strongly believe that practically all relevant documents in the database have been identified. We thus believe that our recall estimates are valid. The number of known relevant documents for the 30 queries is

The INQUERY system was chosen for the test because it has a wide range of operators, including probabilistic interpretations of the Boolean operators, and it allows search key weighting. Moreover, INQUERY has shown good performance in several tests (e.g., [1], [12], [33]). INQUERY is based on Bayesian inference networks (see [1], [29]). Each key is assigned a belief value, which is approximated by the following tf.idf modification:

$$0.4 + 0.6 \cdot \frac{tf_{ij}}{tf_{ij} + 0.5 + 1.5 \cdot \frac{dl_j}{adl}} \cdot \frac{\log\left(\frac{N + 0.5}{df_i}\right)}{\log(N + 1.0)}$$

where
tf_ij = the frequency of the key i in the document j
dl_j = the length of document j (as a number of keys)
adl = average document length in the collection
N = collection size (as a number of documents)
df_i = number of documents containing key i.

The INQUERY query language provides a set of operators to specify relations between search keys. As with Boolean operators, it is possible to construct facets and mark relationships between concepts. The interpretations for the operators used in this study are given below:

$$P_{and}(Q_1, Q_2, \ldots, Q_n) = p_1 \cdot p_2 \cdot \ldots \cdot p_n$$
$$P_{or}(Q_1, Q_2, \ldots, Q_n) = 1 - (1 - p_1) \cdot (1 - p_2) \cdot \ldots \cdot (1 - p_n)$$
$$P_{sum}(Q_1, Q_2, \ldots, Q_n) = (p_1 + p_2 + \ldots + p_n) / n$$
$$P_{wsum}(w_s, w_1 Q_1, w_2 Q_2, \ldots, w_n Q_n) = w_s (w_1 p_1 + w_2 p_2 + \ldots + w_n p_n) / (w_1 + w_2 + \ldots + w_n)$$

where P denotes probability, Q_i is either a string or an INQUERY expression, p_i, i = 1...n, is the belief value of Q_i, w_i, i = 1...n, is the weight of Q_i, and w_s is a weight given for a clause [24], [29]. The probability for operands connected by the SYN operator is calculated by modifying the tf.idf function as follows:

$$0.4 + 0.6 \cdot \frac{\sum_{i \in S} tf_{ij}}{\sum_{i \in S} tf_{ij} + 0.5 + 1.5 \cdot \frac{dl_j}{adl}} \cdot \frac{\log\left(\frac{N + 0.5}{df_S}\right)}{\log(N + 1.0)}$$

where
tf_ij = the frequency of the key i in the document j
S = a set of search keys within the SYN operator
dl_j = the length of document j (as a number of keys)
adl = average document length in the collection
N = collection size (as a number of documents)
df_S = number of documents containing at least one key of the set S.

2.2 Concept based query formulation and expansion

Three levels (a conceptual, a linguistic and a string level) can be differentiated in query formulation [30].
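The belief function and operator interpretations of Section 2.1 can be illustrated with a minimal Python sketch. It is not INQUERY code: the function names are hypothetical, and the constants 0.4, 0.6, 0.5 and 1.5 are assumed from the commonly cited form of the INQUERY belief function.

```python
import math

def belief(tf, dl, adl, n_docs, df):
    """Belief of one key in one document (assumed INQUERY-style tf.idf form)."""
    tf_part = tf / (tf + 0.5 + 1.5 * dl / adl)
    idf_part = math.log((n_docs + 0.5) / df) / math.log(n_docs + 1.0)
    # With tf == 0 the product vanishes and the default belief 0.4 remains.
    return 0.4 + 0.6 * tf_part * idf_part

def syn_belief(tfs, dl, adl, n_docs, df_s):
    """SYN: the keys of the set S act as one key, so their frequencies are summed."""
    return belief(sum(tfs), dl, adl, n_docs, df_s)

def p_and(ps):
    """Probabilistic AND: product of the operand beliefs."""
    return math.prod(ps)

def p_or(ps):
    """Probabilistic OR: one minus the product of the complements."""
    return 1.0 - math.prod(1.0 - p for p in ps)

def p_sum(ps):
    """SUM: plain average of the operand beliefs."""
    return sum(ps) / len(ps)

def p_wsum(clause_weight, weights, ps):
    """WSUM: clause weight times the weighted average of the operand beliefs."""
    return clause_weight * sum(w * p for w, p in zip(weights, ps)) / sum(weights)

if __name__ == "__main__":
    # OR clauses in which every key is absent (belief 0.4 each):
    print(p_or([0.4] * 2), p_or([0.4] * 6))
```

In this sketch an OR clause of two absent keys scores about 0.64 and one of six absent keys about 0.95, so longer OR clauses score higher even when nothing matches.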
Differentiating between conceptual and linguistic levels has several advantages: it becomes obvious that a concept and a facet may be represented by several expressions; these expressions may be varied according to the search environment; the expressions may be selected from any language; the searcher may just select concepts without having to concern about search keys, which may be attached to each concept through a conceptual model and further automatically formed to a query. This idea is developed by Järvelin and others [19] to a thesaurus data model for QE. In the model concepts and their relations are represented at the conceptual level. Each concept is

represented by one or more natural language expressions at the linguistic level. The expressions for one concept are synonyms. At string level each expression is represented by one or more search strings, which also model, when needed, truncation and phrase component proximity. Matching in IR is done at the string, not at the linguistic level.

We adopted the thesaurus data model for query formulation and expansion. The test thesaurus was aimed for QE in a database of Finnish newspaper articles, thus, its language was Finnish. The vocabulary was collected from exhaustive Boolean queries constructed for the test requests in another study [27], and from dictionaries. The thesaurus included 832 concepts, 1345 expressions for the concepts, and 1558 search strings for the expressions. Concepts have hierarchic (generic, partitive and instance) and association relationships. Expressions representing concepts are each other's synonyms or quasi-synonyms, i.e., equivalence exists between expressions at the linguistic level. The most typical or obvious of the synonyms, in the linguistic sense, was chosen as the principal expression, the term, of the concept. The relations between concepts and between expressions are valid for the whole subject area, i.e., they are standard thesaurus or semantic relations. Each expression may be represented by several strings, which are spelling variants, or, in case of phrases, formed with different proximity operations. There is no variation caused by word truncation because the words are in their basic forms in the database index. The thesaurus is a database managed with the ExpansionTool, which is a tool for concept based query construction and expansion (see [19]).

In query formulation the researchers recognised facets representing different aspects of the request and selected concepts for the facets from the thesaurus in accordance with the request. Concepts were organised in facets according to a Boolean block search strategy, i.e.
concepts representing the same aspect form an OR block, which blocks are combined with the AND operator. Real searchers were not involved in the query formulation because this study seeks to test the effect of structural and representation parameters on the search performance, not to find out how real searchers would have used the thesaurus, nor to evaluate the performance of their queries. In this study the number of facets in a query is called complexity, and the total number of concepts in facets is coverage. The number of facets was fixed in each request, i.e., the queries included all aspects of the request. The coverage of an unexpanded query was also determined by the request. Coverage was varied in QE by adding semantically related concepts of the original concepts to the query, first narrower and then associative concepts (conceptual expansion). At the linguistic level concepts were represented by expressions. In the unexpanded query each concept was represented by its term. In synonym expansion all synonymous expressions of the term were added to the query. Expressions were then represented by search strings. All strings corresponding to an expression were always added to the query, thus, there is no expansion at the string level. The total number of search strings in a query is called broadness. Although the relation between expressions and strings was not one to one but one to many (1.2 strings per expression, on the average), we do not consider the number of expressions separately. Broadness was counted at the string level because matching is based on search strings. We refer to search strings as (search) keys. The different levels of query formulation and expansion are demonstrated with an example 1. Let us assume the following request: Storage of radioactive waste produced in nuclear power plants. Examples of risks and accidents. At the conceptual level four facets with altogether seven concepts are recognised from the request. 
But the conceptual level is an abstraction, because concepts cannot be discussed without names; thus, when the query is represented as a Boolean block search, it is at the linguistic level, and concepts are represented by terms. A sample conceptual query plan follows:

nuclear power plants AND radioactive waste AND storage AND (risk OR accident)

At the string level terms are replaced by search keys. The query is expressed with the syntax of the query language. Our sample query is first formulated into the Boolean query structure using the query language of INQUERY. An unexpanded query (E0) contains the search concepts selected on the basis of the request. The concepts are represented by terms. The sample unexpanded query is the following²:

#and(#3(nuclear power plant) #3(radioactive waste) storage #or(risk accident))

With the synonym expansion (Es), synonyms of the terms are added to the query. The sample query with the synonym expansion is the following:

#and(#or(#3(nuclear power plant) #3(nuclear station) #3(atomic power plant)) #or(#3(radioactive waste) #3(nuclear waste)) #or(storage store) #or(risk accident danger hazard))

In the narrower concept expansion (En), the expressions (terms and their synonyms) of the narrower concepts of the original search concept are added to the query. Thus, synonyms are also included in this expansion. The sample query is the following:

#and(#or(#3(nuclear power plant) #3(atomic power plant) #3(nuclear station) #3(atomic reactor) #3(nuclear reactor)) #or(#3(radioactive waste) #3(nuclear waste) #3(high active waste) #3(low active waste)) #or(storage store) #or(risk accident danger hazard disaster catastrophe))

In the next expansion (Ea), the expressions of the associative concepts of the original search concepts are added to the original unexpanded query; that is, expressions representing the narrower concepts are not included in this query.
The sample query with the associative concept expansion is as follows:

#and(#or(#3(nuclear power plant) #3(atomic power plant) #3(nuclear station) #3(nuclear energy) #3(nuclear power)) #or(#3(radioactive waste) #3(nuclear waste) #3(spent fuel) #3(fission product)) #or(storage store) #or(risk accident danger hazard))

All the previous expansions are accumulated in the largest expansion (Et). Practically, Et expansion is a bag of En and Ea expansions, because these include all search keys.

¹ Test queries were in Finnish. The translation of the example may not be exact, but should clarify the idea of the test. The example is shortened, i.e., the expanded sample queries are not complete with all search keys.
² NB. #3 is a proximity operator in the INQUERY query language. The keys within an #N operator must be found within N words of each other in the text in order to contribute to the document's belief value.

The sample query with the largest expansion is the following:

#and(#or(#3(nuclear power plant) #3(atomic power plant) #3(nuclear station) #3(atomic reactor) #3(nuclear reactor) #3(nuclear energy) #3(nuclear power)) #or(#3(radioactive waste) #3(nuclear waste) #3(high active waste) #3(low active waste) #3(spent fuel) #3(fission product)) #or(storage store) #or(risk accident danger hazard disaster catastrophe))

In highly agglutinative languages, like Finnish and German, compound words are often spelled together (e.g., ydinvoimalaitos, atomkraftwerk, which mean nuclear power plant). Compound word splitting makes all parts of the word searchable. In the test setting the database index contains both the whole compound words and their parts. Compound word splitting also has a query expansion effect. Many hierarchical relations in Finnish are based on compound words (e.g., tehdas -> paperitehdas -> hienopaperitehdas, which means factory -> paper mill -> paper mill producing fine paper). When compound words are split, the use of any component word as a search key will match all compound words that include the component. Typically searchers might use heads (like tehdas/mill) to get all the narrower concepts. Usually there is no way to control whether the search key matches only heads, or any other compound part, thus, false matches occur. However, as the compound words contain the search key, many of them have the right kind of association to the search key. To some extent, compound word splitting automatically causes narrower concept expansion and associative concept expansion in queries. This reduces the effects of these expansion types, but on the other hand the test setting more closely resembles other, less agglutinative languages.

2.3 Query structures

The structures used to combine the search keys are explained in the following. Examples of the different structures based on the sample request are given with the narrower concept expansion.
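The expansion levels of Section 2.2 can be illustrated with a small Python sketch that assembles the Boolean sample queries from facet data. It is not the ExpansionTool: the `FACETS` structure, the function names and the relation labels are hypothetical, and the data is the shortened English sample from the text. It assumes, as the sample queries suggest, that multi-word expressions become `#3(...)` phrases and that Es, En and Ea all include the synonyms.

```python
# Hypothetical miniature of the thesaurus entries behind the sample request.
FACETS = [
    {"term": ["nuclear power plant"],
     "syn": ["nuclear station", "atomic power plant"],
     "narrower": ["atomic reactor", "nuclear reactor"],
     "assoc": ["nuclear energy", "nuclear power"]},
    {"term": ["radioactive waste"],
     "syn": ["nuclear waste"],
     "narrower": ["high active waste", "low active waste"],
     "assoc": ["spent fuel", "fission product"]},
    {"term": ["storage"], "syn": ["store"]},
    {"term": ["risk", "accident"],
     "syn": ["danger", "hazard"],
     "narrower": ["disaster", "catastrophe"]},
]

# Relations added to the terms at each expansion level.
LEVELS = {
    "E0": (),
    "Es": ("syn",),
    "En": ("syn", "narrower"),
    "Ea": ("syn", "assoc"),
    "Et": ("syn", "narrower", "assoc"),
}

def render_key(expression):
    """Single words stay bare; multi-word expressions become #3 proximity phrases."""
    return expression if len(expression.split()) == 1 else f"#3({expression})"

def facet_keys(facet, level):
    """Terms of the facet plus the expressions added at the given level."""
    keys = list(facet["term"])
    for relation in LEVELS[level]:
        keys.extend(facet.get(relation, []))
    return keys

def facet_clause(facet, level):
    """One facet as an OR block; a single key needs no operator."""
    rendered = [render_key(key) for key in facet_keys(facet, level)]
    return rendered[0] if len(rendered) == 1 else "#or(" + " ".join(rendered) + ")"

def bool_query(facets, level):
    """Block search strategy: OR within facets, AND between facets."""
    return "#and(" + " ".join(facet_clause(f, level) for f in facets) + ")"

def broadness(facets, level):
    """Broadness: total number of search keys in the query."""
    return sum(len(facet_keys(f, level)) for f in facets)

if __name__ == "__main__":
    for level in LEVELS:
        print(level, broadness(FACETS, level), bool_query(FACETS, level))
```

With this toy data the E0 and Es outputs match the sample queries given in Section 2.2. Key order within a facet may differ from the paper's En, Ea and Et samples, which does not affect the OR score.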
SUM (average of the weights of keys) and WSUM1 (weighted average of the weights of keys) queries represented weak structures. An unexpanded SUM query was constructed of the original concepts of a request, and each concept was represented by a single key or a set of keys corresponding to the term but without phrases. In the expansions all expressions were added as single words, i.e., no phrases were included. (Other structures included phrases expressed with proximity operators.) Weighting original keys higher than expansion keys is a usual method in QE. In WSUM1 queries, expansion keys and the keys of equivalent expressions (i.e., synonyms for the terms of the original concepts) were given smaller weights (1) than the keys of the original concepts (2). In WSUM2 queries, keys were weighted according to the type of expansion they belonged to: the weight of term keys was 10, of synonym keys 9, of narrower concept keys 7, and of associative concept keys 5. Examples of these structures are given below:

SUM
#sum(nuclear power plant nuclear station atomic power plant atomic reactor nuclear reactor radioactive waste nuclear waste storage store risk accident danger hazard disaster catastrophe)

WSUM1
#wsum(1 2 #3(nuclear power plant) 1 #3(nuclear station) 1 #3(atomic power plant) 1 #3(atomic reactor) 1 #3(nuclear reactor) 2 #3(radioactive waste) 1 #3(nuclear waste) 2 storage 1 store 2 risk 2 accident 1 danger 1 hazard 1 disaster 1 catastrophe)

WSUM2
#wsum(1 10 #3(nuclear power plant) 9 #3(nuclear station) 9 #3(atomic power plant) 7 #3(atomic reactor) 7 #3(nuclear reactor) 10 #3(radioactive waste) 9 #3(nuclear waste) 10 storage 9 store 10 risk 10 accident 9 danger 9 hazard 7 disaster 7 catastrophe)

A query with Boolean operators (BOOL) was constructed using the block search strategy. It is a typical facet based query structure. The interpretations of Boolean operators are given above, i.e., queries were not strict Boolean queries.
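The two weighting schemes of the WSUM1 and WSUM2 structures described above can be sketched in Python as a lookup from the relation that contributed a key to its weight. The helper names and the `(key, relation)` input format are hypothetical; the weights are those stated in the text, and the leading `1` is the clause weight shown in the sample queries.

```python
# Weights by the relation that contributed the key (values from the text).
WSUM1_WEIGHTS = {"term": 2, "syn": 1, "narrower": 1, "assoc": 1}
WSUM2_WEIGHTS = {"term": 10, "syn": 9, "narrower": 7, "assoc": 5}

def render_key(expression):
    """Single words stay bare; multi-word expressions become #3 proximity phrases."""
    return expression if len(expression.split()) == 1 else f"#3({expression})"

def wsum_query(tagged_keys, weights, clause_weight=1):
    """Build a #wsum query from (key, relation) pairs; each key is preceded by its weight."""
    parts = [f"{weights[relation]} {render_key(key)}" for key, relation in tagged_keys]
    return f"#wsum({clause_weight} " + " ".join(parts) + ")"

if __name__ == "__main__":
    keys = [
        ("radioactive waste", "term"),
        ("nuclear waste", "syn"),
        ("storage", "term"),
        ("store", "syn"),
        ("disaster", "narrower"),
    ]
    print(wsum_query(keys, WSUM1_WEIGHTS))
    print(wsum_query(keys, WSUM2_WEIGHTS))
```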
Earlier studies by Turtle [29], Belkin and others [5] and Hull [15] provide evidence that queries with Boolean operators are more effective than weak structured queries in best match retrieval systems. An example of Boolean query structure is given above (see, Chapter 2.2). In a SUM-of-synonym-groups-query (SSYN) each facet formed a clause with the SYN operator. SYN clauses were combined with the SUM operator. All keys within the SYN operator are treated as instances of one key. ASYN queries were similar to SSYN queries, but SYN groups were combined with the probabilistic AND operator, i.e., a product of facets was calculated instead of an average. SSYN #sum( #syn (#3(nuclear power plant) #3(atomic power plant) #3(nuclear station) #3(atomic reactor) #3(nuclear reactor)) #syn(#3(radioactive waste) #3(nuclear waste)) #syn(storage store) #syn(risk accident danger hazard disaster catastrophe)) ASYN #and(#syn(#3(nuclear power plant) #3(atomic power plant) #3(nuclear station) #3(atomic reactor) #3(nuclear reactor)) #syn(#3(radioactive waste) #3(nuclear waste)) #syn(storage store) #syn(risk accident danger hazard disaster catastrophe)) The XSUM query structure was a modification of the Boolean structure: facets were combined with SUM operator instead of AND operator. An average is less discriminating than product, thus, a failure or a lack in expansion of one facet does not affect the total score dramatically. In OR clauses extra weight was given to term keys through structure. All expansion keys for one facet formed a SUM clause which was combined to an OR clause with the term key(s) of the facet. 
XSUM #sum(#or (#3(nuclear power plant) #sum(#3(atomic power plant) #3(nuclear station) #3(atomic reactor) #3(nuclear reactor))) #or(#3(radioactive waste) #sum(#3(nuclear waste))) #or(storage #sum(store)) #or(risk accident #sum(danger hazard disaster catastrophe))) An OSUM query was a combination of Boolean queries consisting of term keys of different facets and SYN clauses consisting of expansion keys for each facet. Boolean queries and SYN clauses were combined with the SUM operator. This structure gave extra weight for search strings representing the original, unexpanded query. This approach was similar to query combination explored in some earlier studies (e.g., [4], [5], [26]). The unexpanded OSUM query was identical to the unexpanded BOOL query. OSUM #sum(#and(#3(nuclear power plant) #3(radioactive waste) storage #or(risk accident)) #syn(#3(atomic power plant) #3(nuclear station) #3(atomic reactor) #3(nuclear reactor))

#syn(#3(nuclear waste)) #syn(store) #syn(danger hazard disaster catastrophe))

In all there were eight query structure types, three of them representing weak structures (SUM, WSUM1-2) and five representing strong structures which were also facet structures (BOOL, SSYN, ASYN, OSUM, XSUM). Queries with different structures were constructed without expansion (E0) and expanded at all four expansion levels (Es, En, Ea, Et), except WSUM2 queries, which were either unexpanded or fully expanded (E0/Et) because of the weighting scheme. This setting generated 35 structurally different queries for each of the 30 requests, i.e., in all 1050 queries were run and evaluated for the present study.

3 Results

The complexity of the queries ranged from 3 to 5, the average being 3.7. The average coverage of the unexpanded queries was 4.9. The broadness of unexpanded queries when no phrases were marked (i.e., SUM structure) was 6.1 on average, and for expanded queries without phrases, on average, as follows: Es 18.5, En 30.6, Ea 49.5, Et. The broadness of queries with phrases was, on average, E0 5.4, Es 14.1, En 24.4, Ea 42.1, Et 52.4.

The number of documents in the result set, or document cut-off value (DCV), was fixed to 50. Average precision was then calculated over 11 cut-off points (1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50). In addition, average precision at 10 recall points (10-100) and ranks of query structure and expansion type combinations were calculated.

DCV precision

With three query structures QE proved beneficial at all expansion levels (SSYN, ASYN, XSUM), and the largest expansion (Et) was the best (Table 1). The precision scores of SSYN and ASYN queries were very similar, and with QE notably better than the precision scores of the other structures. The effect of QE on OSUM and WSUM1 queries was mixed. The Es, En and Ea expansions decreased the average precision of OSUM queries compared to the E0 level, but the Et expansion increased precision.
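The DCV-based measure behind Table 1 can be sketched in Python as the mean of the precision values at the 11 cut-off points listed above. The function names are hypothetical, and dividing by the cut-off even when fewer documents were retrieved is an assumption of this sketch, not something the text states.

```python
CUTOFFS = (1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50)

def precision_at(ranked_ids, relevant_ids, k):
    """Share of the top-k ranked documents that are relevant (always divided by k)."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def average_dcv_precision(ranked_ids, relevant_ids, cutoffs=CUTOFFS):
    """Mean of the precision values at the given document cut-off values."""
    return sum(precision_at(ranked_ids, relevant_ids, k) for k in cutoffs) / len(cutoffs)

if __name__ == "__main__":
    ranking = [f"d{i}" for i in range(50)]   # a result set of DCV 50
    relevant = {"d0", "d2", "d7", "d30"}     # hypothetical recall base for one request
    print(average_dcv_precision(ranking, relevant))
```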
The En expansion decreased the precision of WSUM1 queries, otherwise QE was beneficial. QE at any level was detrimental for three query structures (SUM, WSUM2, BOOL). Weighting the original search keys (terms) higher than expansion keys gave middle range results. Giving equal weights to all expansion keys (WSUM1) was more effective than giving weights according to the semantic relations (WSUM2). Without expansion the SUM queries were the most effective (Table 1). This is somewhat unexpected: queries without phrase information gave better precision than the queries with phrases. An explanation might be that a phrase as the only expression for a concept was a too strict condition, single search keys were more effective. The splitting of compound words for the database index might also have an effect because single words match to the components of compound words. Another explanation might be that the repetitive search keys, which resulted from relaxing the phrases, were not removed from queries. In INQUERY this kind of repetition within SUM or WSUM structures gives extra weight to the repeated keys. However, the advantage was lost with expansions, because the expansion without phrases increases the possibility of arbitrary search key combinations. All expansions were unbalanced since SUM structure does not take into account the relative importance of the search keys, nor does it recognise concepts or facets. In the WSUM queries the balance in expansions was sought by weighting original search keys higher than expansion keys, as suggested in literature. The ranking is based on weighted averages of the search key weights. The weighting supported somewhat QE, but the overall performance was not markedly better than the performance of the SUM queries. Thus, the problem of arbitrary search key combinations remained. For Boolean structured queries the result can be explained with the interpretation given to the AND and OR operators. 
If a search key is not present in a document it gets a default weight of 0.4; if the key is present it is given the tf.idf weight. The weight for all keys within the OR operator is calculated as follows: $1 - (1 - p_1) \cdot (1 - p_2) \cdot \ldots \cdot (1 - p_n)$, where p_i is the weight of the search key i. The more search keys the OR clause includes, the higher weight it gets, whether the search keys are present or not. The weights of OR clauses are combined as a product. This leads to ranking where the OR clause or clauses with fewest keys are decisive, i.e., in the shortest OR clauses the presence or absence of search keys becomes decisive irrespective of the importance of the keys for the query.

Table 1. Average precision over DCVs 1-50 for the query structures at varying QE levels. (N=30) Columns: query structures SUM, WSUM1, WSUM2, BOOL, SSYN, ASYN, OSUM, XSUM; rows: QE levels E0, Es, En, Ea, Et.

Table 2. Average ranks based on average precision over DCVs. Columns: query structures SUM, WSUM1, WSUM2, BOOL, SSYN, ASYN, OSUM, XSUM; rows: QE levels E0, Es, En, Ea, Et.

The OSUM and XSUM structures give extra weight for original search keys through structure. In the OSUM queries the Boolean structured clauses consisting of the original search keys ruled the performance over the expansion keys

formed to SYN clauses by facets. This held until the largest expansion, but even then the performance did not improve very much. In the XSUM structure the effects of QE remain rather modest. The SSYN and ASYN structures combine weights of search keys of a facet in a SYN clause, in which all search keys are treated as instances of one key. An average or a product is then calculated of the weights of SYN clauses. The facet structure formed into a SYN clause is the decisive component; the operation used to combine the weights of SYN clauses does not seem so important.

Ranks and statistical significance

Hull (1996) points out that when the number of relevant documents is very different for different requests, the variability in precision may be somewhat misleading. In this test the number of relevant documents ranged from 6 to 98. Hull proposes that the unequal variance in the evaluation measure can be normalised by replacing the score for each method with its rank among the scores of the different methods for each request. Table 2 gives the average ranks based on average precision scores over DCVs. (NB. The best rank is 35, because the number of comparisons was five expansion levels times eight structures minus four missing cases, while three expansion levels are missing for WSUM2, and unexpanded queries for BOOL and OSUM, and WSUM1 and WSUM2 are identical.) To decide whether the differences between query structures with different coverage are statistically significant, we ran the Friedman test, which is based on ranks, i.e., a nonparametric alternative for comparing more than two related samples [8], [13]. The Friedman test showed that there was a significant difference between different structure and expansion combinations (p < 0.001). Pairwise comparisons were conducted to reveal the significant differences between combinations. Table 3 shows the differences when the best expansion was compared against others within each query structure.
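The rank normalisation and the Friedman statistic described above can be sketched in plain Python. This is an illustration, not the authors' analysis: the function names are hypothetical, tied scores share their mean rank, and the statistic is the textbook form without tie correction. As in Table 2, a higher rank means a better score.

```python
def rank_scores(scores):
    """Ranks within one request: 1 = lowest score, len(scores) = highest; ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def friedman(table):
    """table[r][m] is the score of method m on request r.

    Returns the mean rank of each method and the Friedman chi-square statistic
    (no tie correction), to be compared with chi-square on k - 1 degrees of freedom.
    """
    n, k = len(table), len(table[0])
    rank_rows = [rank_scores(row) for row in table]
    rank_sums = [sum(row[m] for row in rank_rows) for m in range(k)]
    statistic = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return [r / n for r in rank_sums], statistic

if __name__ == "__main__":
    # Hypothetical average precision of three query types on four requests.
    scores = [[0.21, 0.35, 0.50], [0.10, 0.12, 0.40], [0.33, 0.30, 0.61], [0.05, 0.20, 0.25]]
    print(friedman(scores))
```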
In all other structures, except XSUM, the difference between the best and the worst expansion was significant. Within the SUM and BOOL structures expansion led to several significantly worse combinations.

SUM BOOL SSYN ASYN OSUM
E0 > En Ea E0 >> En; Et >> E0 Et >> E0 Et > En Et E0 > Ea Et

Table 3. Results of the Friedman test pairwise comparisons (the expression x > y refers to p < 0.01; x >> y refers to p < 0.001).

The best expansions were then compared over the structures. Significant differences were found between the best SYN expansions and the following other groups: SSYN >> SUM, WSUM1-2, BOOL, OSUM, XSUM; ASYN > XSUM; ASYN >> SUM, WSUM1-2, BOOL, OSUM.

Precision at 10 recall levels

Figure 1 visualises the results. It is based on average precision at 10 recall points and shows the worst query structure and expansion combination, and the best expansion of each query structure type. The worst case is the query with Boolean structure with the narrower concepts expansion (BOOL/En). In the middle there is a group of very similarly behaving structure and expansion combinations, the best Boolean and OSUM being slightly worse than SUM, WSUM1, and XSUM. The best SSYN and ASYN queries are high above the others all the way, the difference being largest between % recall.

Figure 1. Precision-recall graph of a sample of query structure and QE combinations (axes: recall, precision; curves: SSYN/Et, WSUM/En, ASYN/Et, BOOL/E0, SUM/E0, WSUM/Et, XSUM/Et, BOOL/En).

In reading the results one should bear in mind that none of the structures or expansions was optimal for all requests. The smallest difference between the best and the worst query for one request was 14.4 percentage units; the largest was 89 percentage units. There were three requests for which the performance of unexpanded queries was better than the performance of any expanded query.
The number of known relevant documents for these queries was rather low (8, 13, 14), which may partly explain the phenomenon: the variation in expressions discussing the topic might not have been great. The synonym expansion worked best for five queries, the narrower concept expansion for five, the associative concept expansion for nine, and the largest expansion for 15 queries. (NB. Here the total number of queries is larger than 30 because some expansions gave the same result.) Among query structures, the SSYN and ASYN queries gave the highest precision for 18 requests, but at different expansion levels. SUM and WSUM1 were the best structures for four requests, OSUM and BOOL for three requests, and XSUM for one request. Expanded Boolean queries were the worst performers for 19 requests. For three requests the ASYN or SSYN structured query was the worst.

4 Discussion and Conclusions

We have tested concept based QE with different query structures in a best match retrieval system. The results show that the effects of QE on retrieval performance depend on the structure of the query. For weakly structured SUM and WSUM2 queries, and for Boolean structured queries, QE was detrimental. The expanded Boolean queries gave the worst precision scores overall. The effect of QE on WSUM1 and OSUM queries varied at different expansion levels, but the largest expansion, including narrower and associative concepts, and synonyms, yielded the best precision score. With other query structures QE improved performance, and the

largest expansion was the best. The combination of the SSYN or ASYN structures with the largest expansion (Et) achieved the highest average precision scores. These were also significantly better than the performance of any other query structure and expansion combination. In general, QE interacts with query structure: with a large expansion strong query structures seem necessary, but with a slight or no expansion weak structures perform well. Our results are supported in these Proceedings by Pirkola [23]. He shows in a cross-language IR environment that query structures have effects on retrieval performance.

Our results question earlier findings (e.g., [31], [32]) suggesting lower weighting of expansion keys. Our findings show that query structure is an essential factor in determining the effects of expansion. They also allow the view that expansion keys are not necessarily worse sources of evidence than the original request keys. This, again, is compatible with earlier findings ([18], [21]) about the small overlap of relevant document sets retrieved through different key combinations: relevant documents may have different vocabularies. QE within weak structures punishes relevant documents in scoring for the missing expansion keys. With strong (SYN) structures this does not happen, and relevant documents with different vocabularies score high.

It seems that the semantic division of relationships typical for thesauri was not particularly useful in QE. In most cases the best performance was obtained by the largest expansion including all semantic relationships. When the largest expansion did not work, the explanation might be that the thesaurus did not provide accurate relations and expressions for the concepts, that the request was very precise, or that the vocabulary of the topic was not very variable.
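The scoring effects discussed above, that weak structures punish documents for missing expansion keys and that an OR clause gains weight with every added key, can be illustrated with a small sketch. It is not INQUERY's implementation: the function names are hypothetical, the key weight of 0.7 is an invented example value, SUM is taken as a plain average of key weights, and only the default weight of 0.4 for absent keys and the OR and product formulas are taken from the text. SYN clauses are not modelled.

```python
from math import prod
from typing import Sequence

DEFAULT_WEIGHT = 0.4  # weight of a search key absent from the document (from the text)


def sum_score(weights: Sequence[float]) -> float:
    """Weak structure: plain average of the key weights."""
    return sum(weights) / len(weights)


def or_score(weights: Sequence[float]) -> float:
    """OR clause: 1 - (1 - p_1)(1 - p_2)...(1 - p_n)."""
    return 1.0 - prod(1.0 - p for p in weights)


def and_score(clause_scores: Sequence[float]) -> float:
    """Boolean query: the OR clause weights are combined as a product."""
    return prod(clause_scores)


# A document containing both original keys (hypothetical tf.idf weight 0.7 each)
# but none of four expansion keys.
original = [0.7, 0.7]
absent_expansion = [DEFAULT_WEIGHT] * 4

unexpanded = sum_score(original)                   # 0.7
expanded = sum_score(original + absent_expansion)  # about 0.5: the average is diluted

# OR clauses in which every key is absent still gain weight as they grow.
short_clause = or_score([DEFAULT_WEIGHT] * 2)      # about 0.64
long_clause = or_score([DEFAULT_WEIGHT] * 5)       # about 0.92

# In the product, the shorter clause pulls the score down the most.
boolean = and_score([short_clause, long_clause])

print(unexpanded, expanded, short_clause, long_clause, boolean)
```

Under these assumptions the same document drops from 0.7 to about 0.5 in the SUM query merely because the query was expanded, while a five-key OR clause with no matching keys already scores about 0.92, so the short clauses dominate the product.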
Since the largest expansion generally performed best, it seems that any keys associated with the original keys, regardless of their relationship type, should be considered for QE. In statistical thesaurus construction statistically associated keys are identified, but the type of the relationship cannot be ascertained. Statistical thesauri, combined with proper query structures, might supply us with a good query strategy. This is suggested by the results, though not shown by the present study, because no statistical thesaurus was compared. In a further study we will compare the expansion keys provided by a statistical and an intellectual thesaurus, as well as the effectiveness of QE based on both types of thesauri. Another theme to test is whether the effectiveness of relevance feedback could be enhanced by concept based QE before feedback.

Acknowledgements. This study was funded by Emil Aaltosen Säätiö, Pirkanmaan Kulttuurirahasto, and National Multimedia Project, KaMu-Tila.

References

[1] J. Allan, J. Callan, B. Croft, L. Ballesteros, J. Broglio, J. Xu & H. Shu. INQUERY at TREC 5. In: D.K. Harman & E.M. Voorhees (eds.) Information technology: The Fifth Text REtrieval Conference (TREC-5). Gaithersburg, MD: National Institute of Standards and Technology, 1997,
[2] M.M. Beaulieu, M. Gatford, X. Huang, S.E. Robertson, S. Walker & P. Williams. Okapi at TREC 5. In: D.K. Harman & E.M. Voorhees (eds.) Information technology: The Fifth Text REtrieval Conference (TREC-5). Gaithersburg, MD: National Institute of Standards and Technology, 1997,
[3] M. Beaulieu, S. Robertson & E. Rasmussen. Evaluating Interactive Systems in TREC. Journal of the American Society for Information Science, 47(1): 1996,
[4] N. Belkin, C. Cool, W.B. Croft & J.P. Callan. The effect of multiple query representations of information retrieval performance. In: R. Korfhage, E.M. Rasmussen & P. Willett (eds.) Proc. of the 16th International Conference on Research and Development in Information Retrieval.
New York, NY: ACM, 1993,
[5] N. Belkin, P. Kantor, E.A. Fox & J.A. Shaw. Combining evidence of multiple query representations for information retrieval. Information Processing & Management, 31(3): 1995,
[6] C. Buckley, G. Salton & J. Allan. The effect of adding relevance information in a relevance feedback environment. In: W.B. Croft & C.J. van Rijsbergen (eds.) Proc. of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY: ACM, 1994,
[7] J.P. Callan, W.B. Croft & S.M. Harding. The INQUERY retrieval system. In: Proceedings of the Third International Conference on Databases and Expert Systems Applications. Berlin: Springer Verlag, 1992,
[8] W.J. Conover. Practical nonparametric statistics. 2nd ed. New York: John Wiley & Sons,
[9] W.B. Croft & R. Das. Experiments with query acquisition and use in document retrieval systems. In: J-L. Vidick (ed.) Proc. of the 13th International Conference on Research and Development in Information Retrieval. Bruxelles: ACM, 1990,
[10] E.N. Efthimiadis. Query expansion. In: M.E. Williams (ed.) Annual Review of Information Science and Technology, vol. 31. Medford, NJ: Information Today, 1996,
[11] D.K. Harman. Relevance feedback revisited. In: N. Belkin, P. Ingwersen & A. Mark Pejtersen (eds.) Proc. of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY: ACM, 1992,
[12] D.K. Harman (s.a.). Overview of the fourth text retrieval conference (TREC-4) [online]. [Cited ] Available from: <URL:
[13] D. Hull. Using statistical testing in the evaluation of retrieval experiments. In: R. Korfhage, E.M. Rasmussen & P. Willett (eds.) Proc. of the 16th International Conference on Research and Development in Information Retrieval. New York, NY: ACM, 1993,
[14] D. Hull. Stemming algorithms: a case study for detailed evaluation. Journal of the American Society for Information Science, 47(1): 1996,
[15] D. Hull.
Using structured queries for disambiguation in cross-language information retrieval. In: AAAI Spring Symposium on Cross-Language Text and Speech Retrieval Electronic Working Notes [online]. Stanford University, March 24-26, [Cited ] Available from: <URL: papers/ hull3.ps>
[16] P. Ingwersen & P. Willett. An introduction to algorithmic and cognitive approaches for information retrieval. Libri, 45: 1995,
[17] Y. Jing & W.B. Croft. An association thesaurus for information retrieval [online] [Cited ] Available from: <URL: biblo.html#1>
[18] S. Jones, M. Gatford, S. Robertson, M. Hancock-Beaulieu & J. Secker. Interactive thesaurus navigation: Intelligence rules ok? Journal of the American Society for Information Science 46(1): 1995,
[19] K. Järvelin, J. Kristensen, T. Niemi, E. Sormunen & H. Keskustalo. A deductive data model for query expansion. In: H.-P. Frei, D. Harman, P. Schäuble & R. Wilkinson (eds.) Proc. of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY: ACM, 1996,
[20] P. Kantor. Information retrieval techniques. In: M.E. Williams (ed.) Annual Review of Information Science and Technology. Vol. 29. Medford, NJ: Learned Information, 1994,
[21] J. Kristensen. Expanding end-users query statements for free text searching with a search-aid thesaurus. Information Processing & Management, 29(6): 1993,
[22] X.A. Lu & R.B. Keefer. Query expansion/reduction and its impact on retrieval effectiveness. In: D.K. Harman (ed.) The Third Text REtrieval Conference (TREC 3). Gaithersburg, MD: National Institute of Standards and Technology, 1995,
[23] A. Pirkola. The effects of query structure and dictionary setups in dictionary-based cross-language information retrieval. In: Proc. of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Melbourne, Australia, August ,
[24] T.B. Rajashekar & W.B. Croft. Combining automatic and manual index representations in probabilistic retrieval. Journal of the American Society for Information Science, 46(4): 1995,
[25] G. Salton, E. Fox & H. Wu. Extended boolean information retrieval. Communications of the ACM, 26(11): 1983,
[26] J.A. Shaw & E.A. Fox. Combination of multiple searches. In: D.K. Harman (ed.) The Third Text REtrieval Conference (TREC 3). Gaithersburg, MD: National Institute of Standards and Technology, 1995,
[27] E. Sormunen. Vapaatekstihaun tehokkuus ja siihen vaikuttavat tekijät sanomalehtiaineistoa sisältävässä tekstikannassa [Free-text searching efficiency and factors affecting it in a newspaper article database].
Espoo, Finland: Valtion Teknillinen Tutkimuskeskus, VTT Julkaisuja 790, [In Finnish]
[28] K. Sparck Jones. Reflection on TREC. Information Processing & Management, 31(3): 1995,
[29] H.R. Turtle. Inference networks for document retrieval. Ph.D. Thesis. Computer and Information Science Department, University of Massachusetts. COINS Technical Report 90 92,
[30] UMLS. UMLS Knowledge Sources. 5th Experimental Edition. Bethesda, MD: National Library of Medicine,
[31] Y.-C. Wang, J. Vandendorpe & M. Evans. Relational thesauri in information retrieval. Journal of the American Society for Information Science, 36(1): 1985,
[32] E. Voorhees. Query expansion using lexical-semantic relations. In: W. Bruce Croft & C.J. van Rijsbergen (eds.) Proc. of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY: ACM, 1994,
[33] J. Xu & W.B. Croft. Query expansion using local and global document analysis. In: H.-P. Frei, D. Harman, P. Schäuble & R. Wilkinson (eds.) Proc. of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY: ACM, 1996, 4-11.
