A Voting Method for XML Retrieval

Gilles Hubert

1 IRIT/SIG-EVI, 118 route de Narbonne, Toulouse cedex 4
2 ERT34, Institut Universitaire de Formation des Maîtres, 56 av. de l'URSS, Toulouse
hubert@irit.fr

Abstract. This paper describes the retrieval approach proposed by the SIG/EVI group of the IRIT research centre in the INEX 2004 evaluation. The approach uses a voting method coupled with additional processes to answer content-only and content-and-structure queries. It is based on previous work we conducted in the context of automatic text categorization.

1 Introduction

The development of systems that search collections of XML (eXtensible Markup Language) documents [3] has become necessary as the use of XML grows. Consequently, a growing number of systems aim to retrieve relevant components from XML documents. XML retrieval systems need to take into account both content and structural aspects. Given the variety of proposed XML retrieval systems, it is worthwhile to evaluate their effectiveness. To that end, the INitiative for the Evaluation of XML retrieval (INEX) provides a testbed and scoring methods that allow participants to evaluate and compare their results.

The approaches underlying the systems participating in INEX can be classified into two categories [5]: model-oriented approaches and system-oriented approaches. Model-oriented approaches notably include approaches based on language models [11], [8], [1] or other probabilistic models [14], which obtained good results in 2003. System-oriented approaches extend textual document retrieval systems by adding XML-specific processing. Various systems in this category [10], [6], [13], [16] obtained good results in 2003.

In this paper, we present an IR approach initially applied to the automatic categorization of structured documents according to concept hierarchies, and the changes made to it for XML retrieval, notably within the context of INEX.

Section 2 is a short presentation of the 2004 edition of the INEX initiative. Section 3 presents the context in which the method was first defined and its first application within INEX in 2003. The changes made to this approach for INEX 2004 are described in section 4. Section 5 presents the submitted runs and the results obtained. In section 6 we conclude by analysing the experiment and considering future work.

2 The INEX initiative

2.1 Collection

The INEX documents correspond to approximately 12,000 articles of the IEEE Computer Society's publications from 1995 to 2002, marked up in XML. All the documents conform to the same DTD. The collection contains over eight million XML elements of varying length and granularity (e.g. title, paragraph or article).

2.2 Queries

INEX introduces two types of queries:
- CO (Content Only) queries describe the expected content of the XML elements to retrieve.
- CAS (Content and Structure) queries combine content and explicit references to the XML structure using a variant of XPath [4]. CAS topics contain indications about the structure of the expected XML elements and about the location of the expected content.

Both CO and CAS topics are made up of four parts: topic title, topic description, narrative and keywords.

Within the ad-hoc retrieval task, two types of tasks are defined: (1) the CO task, using CO queries, and (2) the VCAS task, using CAS queries, for which the structural constraints are considered as vague conditions.

3 A Voting method in information retrieval

The approach we propose is derived from a process we first defined for textual document categorisation [7], [2]. Document categorisation aims to link documents with pre-defined categories. Our approach focuses on categories organised as a taxonomy. The original aspect of our approach is that it relies on a voting principle instead of a classical similarity computation. The association of a text with categories is based on the Vector Voting method [12]. The voting process evaluates the importance of the association between a given text and a given category. This method is similar to the HVV (Hyperlink Vector Voting) method used in the Web context to compute the relevance of a Web page with regard to the web sites referring to it [9]. In our context, the initial strategy considers that the more the category terms appear in the text, the stronger the link between the text and this category. Thus, this method relies on terms describing each category and their automatic extraction from the document to be categorised. The result is a list of categories annotating each document.

For INEX 2003, this categorisation process was applied. Every XML component was processed as a complete document. Every topic was considered as a category of a flat taxonomy. The result was a list of topics corresponding to each XML component. It was then reversed and reordered to fit the INEX result format.

The results obtained for the submitted runs [15] led us to adapt the process to retrieval. The axes of this evolution were as follows:
- invert the voting process to estimate the relevance of each XML component with respect to each topic,
- modify the voting function to take into account the great variations in element size and to handle topics rather than categories,
- integrate the aggregation aspect of an XML element (i.e. elements composed of relevant elements),
- integrate structural constraint processing for CAS topics.

4 Evolution of the voting method within INEX 2004

4.1 INEX collection pre-processing

From the INEX collection point of view, the documents are considered as sets of text chunks identified by xpaths. For each XML component, concepts are extracted automatically and saved with the xpath identifying the XML component in which they appear and their number of occurrences in the component. Concept extraction notably involves stop word removal. Optionally, processes such as stemming with Porter's algorithm can be applied to the concepts. For the INEX 2004 experiments, all XML tags except text formatting tags (bold, italic, underline) were taken into account.

From the topic point of view, although our method can use all the parts constituting CO and CAS topics, we used only the title part for the INEX 2004 experiments, as requested. For both topic types, stop words are removed and, optionally, terms can be stemmed using Porter's algorithm.
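The pre-processing step can be illustrated with a minimal sketch. It is not the system's actual implementation: the stop word list, the formatting tag names (b, it, u), the positional xpath format and the function names are all assumptions made for the example, and stemming is left out.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny illustrative stop word list (assumption, not the list used by the system).
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}
# Hypothetical names for the bold/italic/underline formatting tags.
FORMATTING_TAGS = {"b", "it", "u"}


def extract_terms(text):
    """Lower-case the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def index_document(xml_text):
    """Map the xpath of every non-formatting element to its term counts."""
    index = {}

    def visit(element, path):
        # itertext() also yields the text of nested elements, so the text of
        # formatting tags is counted in the enclosing component.
        text = " ".join(element.itertext())
        index[path] = Counter(extract_terms(text))
        seen = Counter()
        for child in element:
            if child.tag in FORMATTING_TAGS:
                continue  # formatting tags are not indexed as components
            seen[child.tag] += 1
            visit(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    root = ET.fromstring(xml_text)
    visit(root, f"/{root.tag}[1]")
    return index


sample = (
    "<article><bdy>"
    "<p>XML retrieval of <b>XML</b> elements</p>"
    "<p>The voting method</p>"
    "</bdy></article>"
)
index = index_document(sample)
for xpath, counts in index.items():
    print(xpath, dict(counts))
```

Each component is stored under its own xpath, and the text of a nested component is also counted in every enclosing component, which is one way to treat every XML component as a text chunk.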

4.2 Voting function

The voting function must take into account the importance, in the XML element, of each term describing the topic and the importance of each term in the topic representation. We studied different voting functions; the one providing the best results is the following:

\[ \mathrm{Vote}(E,T) = \sum_{t \in T} F(t,E) \cdot \frac{F(t,T)}{S(T)} \]

where T is the topic and E is an XML element.

- F(t,E): this factor measures the importance of the term t in the XML element E. F(t,E) corresponds to the number of occurrences of the term t in the element E.
- F(t,T)/S(T): this factor measures the importance of the term t in the topic representation T. F(t,T) corresponds to the number of occurrences of the term t in the topic T and S(T) corresponds to the size (number of terms) of T.

The voting function thus combines two factors: the presence of a term in the element and the importance of this term in the topic.

4.3 Scoring function

The voting function is coupled with a third factor representing the importance of the topic presence within the XML element. The final function (scoring function), which computes the score of an XML element with regard to a given topic, is the following:

\[ \mathrm{Score}(E,T) = \mathrm{Vote}(E,T) \cdot f\left(\frac{NT(T,E)}{S(T)}\right) \]

where NT(T,E)/S(T) measures the presence rate of the terms representing the topic in the text (importance of the topic). S(T) corresponds to the number of terms in the topic representation T and NT(T,E) corresponds to the number of terms of the topic T that appear in the XML element E.
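The voting and scoring functions can be illustrated with a small sketch. It assumes that elements and topics are given as lists of already extracted terms, that S(T) is the length of the topic term list and that NT(T,E) counts distinct topic terms; the function names are hypothetical, and the exponential is used for f only as an example.

```python
import math
from collections import Counter


def vote(element_terms, topic_terms):
    """Vote(E,T): sum over the topic terms of F(t,E) * F(t,T) / S(T)."""
    f_element = Counter(element_terms)  # F(t,E)
    f_topic = Counter(topic_terms)      # F(t,T)
    size = len(topic_terms)             # S(T), assumes a non-empty topic
    return sum(f_element[t] * f_topic[t] / size for t in f_topic)


def presence_rate(element_terms, topic_terms):
    """NT(T,E) / S(T): share of the topic terms that occur in the element."""
    present = set(topic_terms) & set(element_terms)  # distinct terms (assumption)
    return len(present) / len(topic_terms)


def score(element_terms, topic_terms, f=math.exp):
    """Score(E,T) = Vote(E,T) * f(NT(T,E) / S(T))."""
    rate = presence_rate(element_terms, topic_terms)
    return vote(element_terms, topic_terms) * f(rate)


topic = ["xml", "retrieval", "voting"]
element = ["xml", "retrieval", "of", "xml", "elements"]
print(vote(element, topic), presence_rate(element, topic), score(element, topic))
```

In this example two of the three topic terms occur in the element, so the vote is (2 + 1 + 0) / 3 and the presence rate is 2/3.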

Applying a function f to the third factor (i.e. the presence rate NT(T,E)/S(T) of the terms representing the topic in the text) aims at varying the influence of this factor on the scoring function. We tried different functions f; for example, the initial function was the exponential:

\[ f\left(\frac{NT(T,E)}{S(T)}\right) = e^{\frac{NT(T,E)}{S(T)}} \]

4.4 Additional processes for both CO and CAS topics

The scoring function is completed with the notion of coverage. The aim of the coverage is to ensure that only documents in which the topic is sufficiently represented are selected for this topic. The coverage is a threshold corresponding to the percentage of terms from a topic that must appear in a text. For example, a coverage of 50% implies that at least half of the terms describing a topic have to appear in the text of a document for it to be selected.

\[ \mathrm{Score}(E,T) = \begin{cases} \mathrm{Vote}(E,T) \cdot f\left(\frac{NT(T,E)}{S(T)}\right) & \text{if } \frac{NT(T,E)}{S(T)} \geq CT \\ 0.0 & \text{otherwise} \end{cases} \]

where CT is a real constant (CT ≥ 0.0) corresponding to the coverage threshold.

The hierarchical structure of XML has to be taken into account. The hypothesis on which our system is based is that an element containing a component selected as relevant is also relevant. Our system takes this hypothesis into account by propagating the score of an element to the elements it is a component of. The score propagated to the composed elements is decreased by applying a reducing factor:

\[ \forall E_a \text{ ancestor of } E \text{ such that } d(E_a,E) \cdot \alpha < 1: \quad \mathrm{Score}(E_a,T) = \mathrm{Score}(E_a,T) + \left(1 - d(E_a,E) \cdot \alpha\right) \cdot \mathrm{Score}(E,T) \]

where α is a constant coefficient, E_a is an XML element, and d(E_a,E) is the distance between E_a and E in the xpath associated with E (e.g. in the xpath /article/bdy/s/ss1/p the distance between p and bdy is equal to 3, i.e. d(bdy,p)=3).

This process tends to consider a composed element less relevant than the elements it is composed of. However, an element composed of several relevant elements can obtain a score greater than that of one of its components. The hypothesis chosen for INEX is quite different, notably because of the two relevance dimensions: exhaustivity and specificity.
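The coverage threshold and the score propagation can be sketched as follows. This is an illustration under stated assumptions, not the system's code: scores are kept in a dictionary keyed by simple xpaths without positional indices, only the scores obtained before propagation are propagated (no cascading), and all names are hypothetical.

```python
import math


def apply_coverage(vote_value, rate, coverage_threshold, f=math.exp):
    """Return Vote * f(rate) if the presence rate reaches CT, else 0.0."""
    if rate >= coverage_threshold:
        return vote_value * f(rate)
    return 0.0


def propagate(scores, alpha):
    """Add each element's own score, reduced by distance, to its ancestors."""
    result = dict(scores)
    for xpath, own_score in scores.items():
        steps = xpath.strip("/").split("/")
        for distance in range(1, len(steps)):
            factor = 1 - distance * alpha
            if factor <= 0:  # only ancestors with d(E_a, E) * alpha < 1
                break
            ancestor = "/" + "/".join(steps[:-distance])
            result[ancestor] = result.get(ancestor, 0.0) + factor * own_score
    return result


scores = {"/article/bdy/s/ss1/p": 2.0, "/article/bdy/s": 1.0}
propagated = propagate(scores, alpha=0.1)
for xpath in sorted(propagated):
    print(xpath, round(propagated[xpath], 3))
```

With α = 0.1, the paragraph contributes 0.7 of its score to bdy (distance 3), and the section s ends up with a higher score than the paragraph because it adds the propagated share to its own score.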
Considering exhaustivity, a composed element is considered at least as relevant as the most relevant of its components. Considering specificity, the relevance of an element composed of several relevant components is less than or equal to the relevance of the most relevant component. It would be interesting to evaluate the impact of this difference in relevance propagation on the retrieval results of our system.

In addition, in INEX, the terms constituting a topic title can have either the prefix + or the prefix -. The sign + is used to emphasize a concept and the sign - denotes an unwanted concept. The + and - signs do not have strict semantics but just indicate preferences expressed by the topic's author. An element containing a term prefixed by - in the topic title can be judged relevant to the information need. In the same way, an element can be judged relevant to the information need even if it does not contain a term prefixed by + in the topic title. To take into account the possibility of prefixed terms, a coefficient is associated with each term. A coefficient is fixed for each case: term not prefixed, term with the prefix + and term with the prefix -.

\[ \mathrm{Vote}(E,T) = \sum_{t \in T} F(t,E) \cdot sc(t) \cdot \frac{F(t,T)}{S(T)} \]

where
- sc(t) = a if t has the prefix - in the topic,
- sc(t) = b if t has no prefix in the topic,
- sc(t) = c if t has the prefix + in the topic,
and a, b, c are real constants.

4.5 Specific processes for CAS topics

On the one hand, we take into account different types of constraints on content: structural constraints on the xpath of elements which are expected to contain keywords (e.g. about(.//p,'+authorization +"access control" +security')) and constraints on the year of the article (e.g. //yr <='2000'). These kinds of structural constraints on content cover all the constraints appearing in the CAS topics of INEX 2004. The voting method applied to CO topics has been extended to take such constraints into account as follows:

\[ \mathrm{Vote}(E,T) = \sum_{t \in T} F(t,E) \cdot (1 + \beta) \cdot \frac{F(t,T)}{S(T)} \]

where β > 0.0 if E matches a structural constraint defined on t, and β = 0.0 otherwise.

On the other hand, an additional step identifies the structural constraints on target elements indicated in CAS topics. All the structural constraints defined on target elements of topics are taken into account and stored to be processed in a post-voting step that enriches the results of the voting step. For VCAS evaluation, the target constraint specified in the topic does not have to be strictly satisfied. The constraint is rather regarded as a hint about the expected results, without eliminating the elements which do not satisfy it. To take these principles into account, the scores of the result elements that match the expected xpaths are increased. A factor is applied to the score of matching elements as follows:

\[ \text{if } R \text{ matches } X: \quad \mathrm{Score}(E,T) = \gamma \cdot \mathrm{Vote}(E,T) \cdot f\left(\frac{NT(T,E)}{S(T)}\right), \quad \gamma > 1.0 \]

where R is the location path (xpath) of the element E from the root of the document and X is the location path (xpath) defined as the target constraint in the topic.

5 Experiments

5.1 Experiment setup

Our experiments aim at evaluating the effectiveness of the changes made to the voting function and of the coefficient adjustments resulting from training performed on the INEX 2003 assessment testbed. The training phase only concerns the system processes applied to both CO and CAS topics.

Three runs based on the voting method were submitted to INEX 2004. Two runs were performed on CO topics and one run was performed on CAS topics. The runs on CO topics differ in the function f used in the scoring function.

The run labelled VTCO2004TC35xp400sC-515 uses the function:

\[ \mathrm{Score}(E,T) = \mathrm{Vote}(E,T) \cdot \varphi^{\frac{NT(T,E)}{S(T)}} \]

where ϕ=400.

The run labelled VTCO2004TC35p4sC-515 uses the function:

\[ \mathrm{Score}(E,T) = \mathrm{Vote}(E,T) \cdot \left(\frac{NT(T,E)}{S(T)}\right)^{\lambda} \]

where λ=4.
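The extensions described above (sign coefficients, the structural bonus β, the target boost γ and the two functions f of the CO runs) can be combined in a short sketch. Several points are assumptions made for the illustration: the paper gives the sc(t) and (1 + β) variants of the vote separately and they are multiplied here, the power form of the second function f is a reading of the run label, the coefficient values are only example defaults, and every function and parameter name is hypothetical.

```python
from collections import Counter

# Example coefficients a, b, c for the prefixes "-", none and "+" (illustrative).
SIGN_COEFFICIENTS = {"-": -5.0, "": 1.0, "+": 5.0}


def parse_title(title_terms):
    """Split prefixed terms such as '+security' into (term, sign) pairs."""
    parsed = []
    for raw in title_terms:
        sign = raw[0] if raw[0] in "+-" else ""
        parsed.append((raw.lstrip("+-"), sign))
    return parsed


def extended_vote(element_terms, title_terms, constrained_terms=(), beta=1.0):
    """Vote with the sign coefficient sc(t) and the structural bonus (1 + beta).

    constrained_terms holds the topic terms whose structural constraint is
    matched by the element (assumed to be computed elsewhere).
    """
    f_element = Counter(element_terms)
    topic = parse_title(title_terms)
    f_topic = Counter(term for term, _ in topic)
    size = len(topic)  # S(T)
    total = 0.0
    for term, sign in set(topic):
        bonus = 1 + beta if term in constrained_terms else 1.0
        total += f_element[term] * SIGN_COEFFICIENTS[sign] * bonus * f_topic[term] / size
    return total


def f_exponential(rate, phi=400.0):
    """Exponential variant: phi raised to the presence rate."""
    return phi ** rate


def f_power(rate, lam=4.0):
    """Power variant: presence rate raised to lambda (assumed form)."""
    return rate ** lam


def boost_target(score_value, matches_target, gamma=2.0):
    """Multiply the score by gamma when the element matches the target xpath."""
    return gamma * score_value if matches_target else score_value


title = ["+xml", "retrieval", "-database"]
element = ["xml", "xml", "retrieval", "database"]
v = extended_vote(element, title, constrained_terms={"retrieval"}, beta=1.0)
print(v, f_exponential(0.5), f_power(0.5), boost_target(v, True))
```

The two functions f behave very differently: the exponential variant never cancels a vote, whereas the power variant strongly penalises elements that contain only a small share of the topic terms.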

The run on CAS topics labelled VTCAS2004TC35xp200sC-515PP1 uses the function:

\[ \mathrm{Score}(E,T) = \mathrm{Vote}(E,T) \cdot \varphi^{\frac{NT(T,E)}{S(T)}} \]

where ϕ=200.

The coefficient taking into account structural predicates associated with searched concepts was fixed to 1.0 (i.e. the vote of an element regarding a given concept is doubled when the element matches the structural constraint associated with the concept). The coefficient taking into account structural predicates for expected results was fixed to 2.0 (i.e. the score of an element matching the structural predicate is doubled). The values of these two coefficients were fixed arbitrarily.

For all submitted runs the other parameters of the scoring function were the same:
- The coverage threshold was fixed to 35% (i.e. more than a third of the terms describing the topic must appear in the text to keep the XML component).
- The coefficients applied to take into account the signs + and -, used to emphasise a concept or to denote an unwanted one, were fixed to: +5.0 for concepts marked with + (the vote of these concepts increases the score of the elements in which they appear), -5.0 for concepts marked with - (the vote of these concepts reduces the score of the elements in which they appear), and 1.0 for unmarked concepts.
- The coefficient α used to propagate a component score through the hierarchical structure of the XML document was fixed to 0.1.

The values of the parameters are those which gave the best results during a training phase done with INEX 2003 CO topics.

5.2 Results

The following table shows the preliminary results of the three runs based on the voting method:

Table 1. Results of the 3 runs performed using the voting method

Run                            Aggregate score   Rank
VTCO2004TC35xp400sC-515                          /70
VTCO2004TC35p4sC-515                             /70
VTCAS2004TC35xp200sC-515PP1                      /51

The results of the two runs for CO topics are detailed in the following table:

Table 2. Detailed results of the 2 runs for CO topics

               VTCO2004TC35xp400sC-515       VTCO2004TC35p4sC-515
Quantisation   Average precision   Rank      Average precision   Rank
strict                             /                             /70
generalised                        /                             /70
so                                 /                             /70
s3_e                               /                             /70
s3_e                               /                             /70
e3_s                               /                             /70
e3_s                               /                             /70

For CO topics, the run which obtained the best results is the run labelled VTCO2004TC35xp400sC-515. The best measures were obtained with the e3s321 quantisation. Average precision is equal to , placing the run at the 10th rank. The run labelled VTCO2004TC35p4sC-515 obtained slightly lower values for most of the quantisations.

Only the best results obtained for CO topics are presented in the following graphs, that is to say run VTCO2004TC35xp400sC-515 for the e3s321 quantisation.

Fig. 1. Precision/Recall curve of the CO run labelled VTCO2004TC35xp400sC-515 for e3s321 quantisation

Fig. 2. Rank of the CO run labelled VTCO2004TC35xp400sC-515 for e3s321 quantisation

For CAS topics, the run VTCAS2004TC35xp200sC-515PP1 was ranked in 5th place. The results of the run are detailed in the following table:

Table 3. Detailed results of the run for CAS topics (VTCAS2004TC35xp200sC-515PP1)

Quantisation   Average precision   Rank
strict                             /51
generalised                        /51
so                                 /51
s3_e                               /51
s3_e                               /51
e3_s                               /51
e3_s                               /51

The best measures were obtained for the strict, e3s321 and e3s32 quantisations, for which the run is ranked 5th. The following figures present the results corresponding to the strict quantisation and the e3s321 quantisation.

Fig. 3. Precision/Recall curve of the VCAS run labelled VTCAS2004TC35xp200sC-515PP1 for strict quantisation

Fig. 4. Rank of the VCAS run labelled VTCAS2004TC35xp200sC-515PP1 for strict quantisation

Fig. 5. Precision/Recall curve of the VCAS run labelled VTCAS2004TC35xp200sC-515PP1 for e3s321 quantisation

Fig. 6. Rank of the VCAS run labelled VTCAS2004TC35xp200sC-515PP1 for e3s321 quantisation

6 Discussion and future work

Regarding the experiments performed and the results obtained, we can notice that:
- The functions and parameters chosen for the scoring method tend to favour exhaustivity rather than specificity. Indeed, the factor measuring the representation of the topic (i.e. NT(T,E)/S(T)) dominates in the scoring function, and this factor is related to exhaustivity. It would be interesting to modify the scoring function to increase the number of elements judged as relevant with regard to specificity.
- The measures obtained using INEX 2003 CO topics were globally better. This suggests that our scoring method is more effective on certain queries. It would be interesting to identify a class (or classes) of queries for which the function works better and a class (or classes) of queries for which it is less effective, and to understand why. The function could evolve to extend its effectiveness to other kinds of queries, or different functions could be applied to different query classes.
- The values of the coefficients applied for structural constraint matching were fixed arbitrarily. Additional experiments on INEX 2004 CAS topics will help us to adjust the values of these coefficients.
- We plan to evaluate the benefit of adding a relevance feedback process to our method. On the one hand, feedback from the first ranked elements of the assessments can be performed. This is the process chosen this year in the relevance feedback track. On the other hand, we plan to integrate a feedback process using the first ranked elements of a first search with our system.

Acknowledgments

Research outlined in this paper is part of the project QUEST: Query reformulation for structured document retrieval, PAI Alliance N 05768UJ. However, this publication only reflects the author's view.

References

1. Abolhassani, M., Fuhr, N.: Applying the Divergence from Randomness Approach for Content-Only Search in XML Documents. 26th European Conference on IR Research (ECIR), Lecture Notes in Computer Science vol (2004)
2. Augé, J., Englmeier, K., Hubert, G., Mothe, J.: Catégorisation automatique de textes basée sur des hiérarchies de concepts. 19ième Journées de Bases de Données Avancées (BDA), Lyon (2003) 69-87
3. Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E., Yergeau, F.: Extensible Markup Language (XML) 1.0 (Third Edition). W3C Recommendation (2004)
4. Clark, J., DeRose, S.: XML Path Language (XPath). W3C Recommendation (1999)
5. Fuhr, N., Malik, S., Lalmas, M.: Overview of the INitiative for the Evaluation of XML Retrieval (INEX). Proceedings of the Second INEX Workshop, Dagstuhl, Germany (2004)
6. Geva, S., Leo-Spork, M.: XPath Inverted File for Information Retrieval. INEX 2003 Workshop Proceedings (2003)
7. IRAIA: Getting Orientation in Complex Information Spaces as an Emergent Behaviour of Autonomous Information Agents. European Information Societies Technology, IST , ( )
8. Kamps, J., de Rijke, M., Sigurbjörnsson, B.: Length normalization in XML retrieval. Proceedings of the 27th International Conference on Research and Development in Information Retrieval (SIGIR), New York NY, USA (2004)
9. Li, Y.: Toward a qualitative search engine. IEEE Internet Computing, vol. 2, no. 4 (1998)
10. List, J., Mihajlovic, V., de Vries, A. P., Ramirez, G., Hiemstra, D.: The TIJAH XML-IR system at INEX. INEX 2003 Workshop Proceedings (2003)
11. Ogilvie, P., Callan, J.: Using Language Models for Flat Text Queries in XML Retrieval. Proceedings of the Second INEX Workshop, Dagstuhl, Germany (2004)
12. Pauer, B., Holger, P.: Statfinder. Document Package Statfinder, Vers. 1.8 (2000)
13. Pehcevski, J., Thom, J., Vercoustre, A.M.: Enhancing Content-And-Structure Information Retrieval using a Native XML Database. Proceedings of The First Twente Data Management Workshop on XML Databases and Information Retrieval (TDM'04), Enschede, The Netherlands (2004)
14. Piwowarski, B., Vu, H.-T., Gallinari, P.: Bayesian Networks and INEX'03. Proceedings of the Second INEX Workshop, Dagstuhl, Germany (2003)
15. Sauvagnat, K., Hubert, G., Boughanem, M., Mothe, J.: IRIT at INEX. Proceedings of the Second INEX Workshop, Dagstuhl, Germany (2003)
16. Trotman, A., O'Keefe, R. A.: Identifying and Ranking Relevant Document Elements. INEX 2003 Workshop Proceedings (2003)
