Identifying and Ranking Relevant Document Elements
|
|
- Isaac Lawson
- 5 years ago
- Views:
Transcription
1 Identifying and Ranking Relevant Document Elements Andrew Trotman and Richard A. O Keefe Department of Computer Science University of Otago Dunedin, New Zealand andrew@cs.otago.ac.nz, ok@otago.ac.nz ABSTRACT A method of indexing and searching structured documents for element retrieval is discussed. Documents are indexed using a modified inverted file retrieval system. Modified postings include pointers into a collection-wide document structure tree (the corpus tree) describing the structure of every document in the collection. Retrieval topics are converted into Boolean queries. Queries are used to identify relevant documents. Documents are then ranked using Okapi BM25 and finally relevant elements are identified using coverage. Search results are presented sorted first by document then coverage. The design is presented in the context of the second annual INEX workshop. 1. INTRODUCTION Otago first entered INEX [2] during its second year. There were three objectives: understand the participation process, gain access to this and last year s judgments, and create a baseline for comparing future experiments. Participation involved design of six topics, generation and submission of search results, and online judging of three topics. Of these, generating the results was the most problematic as it required software changes. The chosen retrieval engine was designed from the onset for retrieval of whole academic documents in XML [1]. A predecessor can be seen on BioMedNet and ChemWeb [4]. This engine, like that used in the IEEE digital library, returns relevance ranked lists of whole documents the natural (citable) unit of information in an academic environment. From experience, information vendors are not interested in converting their documents from propriety DTDs into a common DTD or any other format so software was needed to handle documents in heterogeneous formats. Boolean searching, field restricting and relevance ranking were already supported, so modifications focused on identifying and ranking document elements. The modified retrieval engine can be thought of as working in three parts. Candidate documents are identified using a Boolean query. Candidates are then ranked using Okapi BM25 [7]. Finally, relevant non-overlapping elements are identified and presented as the result. Although it is easier to understand in three parts, in fact the most relevant elements of the most relevant documents are computed in a single pass of the indexes. 2. INDEXING Much of the index design has already been described elsewhere [8]. Inverted file retrieval is used. There is one dictionary file and each dictionary term points to a single inverted list of postings. An unstructured inverted list is usually represented {<d 1, f 1 >, <d 2, f 2 >,, <d n, f n >} where d n is a document ordinal number and f n is the frequency of the given term in the given document. For structured retrieval, each <d n, f n > pair is replaced by the triple <d n, p n, f n >, where p n is a position in the document. When phrase or proximity searching is required, this triple is replaced with the triple <d n, p n, w n > where w n is the ordinal number of the term in the collection (starting from 0 at the start of the collection, incrementing by 1 for each term, not incrementing for tags, and not reset at the beginning of each record). On disk the postings are stored compressed. The p n value in each posting is a position in the corpus tree. The tagging structure for any one document represents a tree walk. Start at the root of the tree. When an open tag is encountered, the branch labelled with the tag name is followed downwards. When a close tag is encountered, the walk backtracks one branch. For a well-formed XML document, the walk will start and end at the root. This tree-walking property also holds for a collection of well-formed documents. The tree they collectively describe is called the corpus tree and can be built during single pass indexing. As each node is encountered for the first time, a branch is added to the tree and labelled with a unique ordinal identifier, p n. Terms can lie either at the nodes or the leaves of this tree. The corpus tree includes every single path in every single document, but is unlikely to match the structure of any one document. In Figure 1, three well-formed documents are given, as is the corpus tree for those documents. For clarity, the branches of the tree are labelled with which document they describe although this information is not computed and not stored.
2 sec[1]:2 <doc> <sec><p c= red >Fox in Sox</p></sec> </doc> <doc> <sec><p>the Cat in the Hat</p></sec> <sec><p>comes Back</p></sec> </doc> <doc> <sec><p>green Eggs</p> <p>and Ham</p></sec> </doc> doc[1]:1 p[2]:7 sec[2]:5 p[1]:6 Figure 1: Three documents and the corpus tree including every path through every document, but not matching the structure of any one document. For the purpose of this figure each document is marked white, gray, or black and each node with which documents include that path. Each node is numbered with the instance of the tag (e.g. p[2]) and the node id, p n (after the colon). Term p 1 p 2 p 3 d p1,1, d p1,2,, d p1,n w p1,1, w p1,2,, w p1,n d p2,1, d p2,2,, d p2,n w p2,1, w p2,2,, w p2,n d p3,1, d p3,2,, d p3,n w p3,1, w p3,2,, w p3,n Figure 2: The in-memory postings structure allows quick access to only those postings relevant to the required document elements. The inverted lists are built and processed using the structure represented in Figure 2. Postings for each term are ordered by increasing p n. Each p n points to the list of document ids (the d-sublist) and word ids (the w-sublist) found at that point in the tree. Each list is held in increasing order and compressed. To search the collection for a given term, each d- sublist is examined in turn. By doing so, documents may not be examined in turn. This does not matter so long as all documents that would be examined are examined. Further, whole documents may not be examined in turn this, too, does not matter as many ranking functions can be computed piecewise 1. To field-restrict a term, a restricted set of sublists is examined. The w-sublists are used for proximity searching. Storing and processing the postings in this way has computational advantages. For a field-restricted search, postings not pertaining to the restriction can be skipped. As postings are stored compressed, they need not even be decompressed. Word postings are used only for proximity searching. On disk the w-sublists are collected together and stored after all d-sublists. They are not even loaded from disk if not needed. 3. SEARCHING As the retrieval engine starts up, the corpus tree is loaded and an additional structure is created from it, the field list. For each instance of each tag, the list of nodes at or below that node is collected. For each tag, the same is collected. These lists are then merged and sorted. Table 1: The field list for the corpus tree given in Figure 1. Field {4} doc {1, 2, 3, 4, 5, 6, 7} doc[1] {1, 2, 3, 4, 5, 6, 7} p {3, 4, 6, 7} p[1] {3, 4, 6} p[2] {7} sec {2, 3, 4, 5, 6, 7} sec[1] {2, 3, 4, 7} sec[2] {5, 6} The field list for the Figure 1 corpus tree is given in Table 1. From this, a search restricted to sec requires postings at or below all sec nodes of the corpus tree, or where p n ={2, 3, 4, 5, 6, 7}. To 1 BM25 cannot, so the lists are merged then processed.
3 search in p[1], the postings are needed where p n ={3, 4, 6}. For a search restricted to p[1] in sec, these two lists are ANDed together (giving p n ={3, 4, 6}), and the members of this list are checked against the corpus tree to ensure they satisfy p[1] in sec and not sec in p[1]. Equivalence tag restrictions are also computed from the field list. The restrictions for each equivalent tag are ORed giving the equivalent restriction. If, for example, p[2] were equivalent in Table 1, the restriction would be p n ={4, 7}. Several extensions were added to support element and attribute retrieval: Attributes are now distinguished from tags by prefixing attributes with symbol. This symbol was chosen because it makes for easy parsing of INEX queries, which use the same symbol. The attribute value is considered to be content lying not only within the attribute, but also the tag. For example, <tag att= number > term </tag>, is equivalent to <tag> <@att> number </@att> term </tag>. In this way, a search for number in tag will succeed. Tags can now be identified not only by their name and path, but also by the tag instance. Where before it was only possible to restrict to paragraph for example, it is now possible to restrict to the second paragraph. Trotman [8] suggests the corpus tree will be small for real data. In this extended model this no longer holds true. In the TREC [3] Wall Street Journal collection there are only 20 nodes, for INEX there are 198,041 nodes after noise nodes are removed (4,789 with attributes and instances also removed). Table 2: Tags ignored during indexing. ariel en item-text ss art entry label stanza b enum large sub bi f li super bq it line tbody bu item math tf bui item-bold proof tfoot cen item-both rm tgroup colspec item-bullet rom thead couplet item-diamond row theorem dd item-letpara scp tmath ddhd item-mdash sgmlf tt dt item-numpara sgmlmath u dthd item-roman spanspec ub Many tags are used to mark elements too small to be relevant. An example of such a tag is ref, used to mark references in the text. This tag cannot be relevant to any topic as the contents are simply reference numbers. Some tags were used for visual appearance such as b used to mark text in bold. Others were used as typesetting hints such as art used to specify the size of an image. If any of these tags, or those in Table 2 were encountered during indexing, tagging was ignored (until the matching close tag), but the content still indexed. Tags in this group were hand selected even though automated systems for choosing such tags have been proposed [5]. 4. QUERY FORMATION The title of the topic is extracted and converted into a Boolean query. This query is used to determine which documents to retrieve. Ranking is computed from the postings for the search terms. For content and structure (CAS) topics, the target element is computed and stored for later use. The complete path for each about-function is computed by concatenating the about-path to the contextelement restricting it. All equivalent paths are then computed by permuting this path with the equivalence tags. This fully specified path now replaces the original about-path and the contextelement is removed. At this point, the topic has been transformed from INEX topic syntax into a query whereby each about-clause is Boolean separated and explicitly field restricted. Create mandatory by ANDing each mandatory term (+) Create optional by ORing each optional term Create exclusion by ORing each exclusion term (-) If all three sub-expressions are non-null, combine: mandatory AND (* OR optional) NOT exclusion If two sub-expressions are non-null, combine using one of: mandatory AND (* OR optional) optional NOT exclusion mandatory NOT exclusion If only one sub-expression is non-null, use one of: mandatory optional * NOT exclusion Where * finds all documents Figure 3: Algorithm to convert an about phrase into a Boolean expression. Examining the about-string, optional, mandatory (+), and exclusion (-) terms are allowed. These terms are converted into a Boolean expression. Optional terms are collected and converted into a sub-expression by ORing ( a b c a OR b OR
4 c ). Likewise, exclusion terms are also ORed. Mandatory terms are collected and ANDed ( +d +e +f d AND e AND f ). These three subexpressions are then combined to form a complete about-query. The whole algorithm is presented in figure 3. Separate about clauses are already Boolean separated so these operators are preserved. Finally, all context-elements must be satisfied so these are ANDed together. For content only (CO) topics, a Boolean expression is computed exactly as for one about-string using the algorithm presented in Figure RANKING The retrieval engine is a Boolean ranking hybrid. Result sets are computed in two parts; a bit-string of documents satisfying a strict interpretation of the query, and a set of accumulators holding document weights. 5.1 Document Ranking The Boolean expression constructed above is converted into a parse tree then evaluated. At each leaf, the posting are loaded and converted into a bitstring, one bit per document. If a given leaf in the parse tree is not tag-restricted, each posting is examined in turn, and the bit at position d n of each posting is set. Should the leaf be tag-restricted, only those postings for the given tags are examined (see Section 3) and converted. The bit-strings are combined at the nodes of the parse tree using the operator there. At the root of the tree, the bit-string has set bits for all documents exactly satisfying the query and unset for those that do not. The accumulator values are the sum of Okapi BM25 scores computed at each leaf of the parse tree. Scores are summed regardless of the operators in the parse tree. For AND and OR nodes scores are summed because the influence at these nodes is the sum of influences of the children. For NOT nodes, they are also summed. If a document is excluded from the result set, the accumulator value is irrelevant. If a document is not, it is either re-included through other terms (e.g. mammal OR (dog NOT cat)), or there is a double negative in the query (e.g. cat NOT (dog NOT cat)). In both cases, the document has successfully satisfied a query leaf so receives a positive weight. 5.2 Element Ranking and Selection The Boolean ranking hybrid engine was extended to include element ranking. Although whole documents are valid as results for CO topics, SCAS topics specify a target element. This targeting establishes the retrieval unit. If the target element is sec, this tag must be returned. It essentially directs the retrieval engine to search and rank each given tag instance separately. Wilkinson [9] suggests that ranking whole documents then extracting elements from these is a poor ranking strategy. The opposite may hold for this collection. A relevant element lies in the greater context of a relevant document. A relevant document will lie in a relevant journal, which, in turn, lies in a relevant collection. To this end, every paragraph of every section of every document is contextually placed so extracting elements from relevant documents may be a good approach. The coverage of any one posting is computed as those nodes in the document tree at or above the posting. Each posting is already annotated with a pointer into the tree, p n. To compute the coverage, the tree is traversed upwards from p n to the root. Coverage is computed for each document with respect to each search term. sec[1] doc[1] p[2] sec[2] p[1] Figure 4: Coverage of a term occurring at p[2]. The coverage includes all those nodes at and above the occurrence node; those parts of the tree that cover the term. In figure 4, the term ham occurs at p[2]. The coverage includes all nodes above that point in the document tree. In this example that is sec[1] and doc[1]; all nodes that cover the search term those highlighted in grey. For each document in the result set, the weighted coverage is computed as the covered branches of the document tree and how many search terms cover that branch. This is computed during a single pass of the indexes by storing the weighted coverage as part of each accumulator. For the query eggs and ham against the documents in Figure 1, the weighted coverage is shown in Figure 5. doc[1] and sec[1] have a weight of 2, while p[1] and p[2] each have a weight of 1.
5 1 sec[1] 2 doc[1] p[2] 2 1 sec[2] p[1] Figure 5: Weighted coverage of each node is the number of search terms that occur at or below that point in the tree. Weighted coverage is shown in the bottom right of those nodes with weights greater than zero. In any given document, the document root must have the highest weighted coverage, but this can be equal to that of other nodes. For CO topics, all branches of the document tree with coverage less than the root are pruned. The remaining leaves are presented as the result set for that document (in Figure 5, the result is //doc[1]//sec[1]). In this way, the most information dense elements in the document are considered most relevant and no part of any document is returned more than once (overlapping is eliminated). If a target element is specified in a SCAS topic, all non-target branches are pruned. From the remains, those branches with the highest weighted coverage are presented as the result set for that document. 5.3 Ranking summary Recall is determined by evaluation of the Boolean expression, documents are then ranked using Okapi BM25, and elements are selected by weighted coverage. As all the metrics needed for ranking are available at search time, the search and rank process is computed in a single pass of the postings. 6. RESULTS Evaluation results are presented in Table 3. Table 3: INEX performance measures Strict Precision Rank CO nd SCAS th CO-ng-o th CO-ng-s Unknown Not top 10 Generalized Precision Rank CO th SCAS th CO-ng-o st CO-ng-s th The retrieval engine performed badly using the INEX_EVAL measure. This is most likely because this measure treats each tag in a hierarchy as relevant but coverage eliminates overlapping tags the measure is inappropriate for this retrieval technique. Good results were shown when performance is measured using INEX_EVAL_NG. NG measures the ratio of relevant to irrelevant information returned. Coverage finds those parts of the document that contain most of the search terms. The correlation between information density and coverage is reflected in the result. The results show the best performance when generalized quantization is used. This suggests the ordering of the results is not optimal for strict quantization or the most relevant documents are not ranked before less relevant documents. This may be a consequence of sorting into document order before coverage order. 7. OTAGO AT INEX The participation process involved the design and contribution of six topics. Of these, four were selected for inclusion in the final topic set. Otago was assigned three of these to assess. The assessment took three people one week each; this was one week per topic. The retrieval engine described herein was used for designing the contributed topics. This was somewhat problematic as the topic parser was written at the same time the topics were being written, each with few examples. From the final CAS topic set, 19 required corrections, corrections finally running to 12 rounds! This suggests the topic syntax is unnecessarily complex. See our further contribution [6] for a discussion on a possible language to use for future workshops. The assessors were overburdened by the multitude of obviously irrelevant documents to assess. Examining some of these documents suggests many retrieval engines were aiming at high recall by retrieving any document containing any of the title terms. In particular, the word java appeared in one topic; this was a somewhat popular research area over the years included in the IEEE collection. The assessment task could be reduced by carefully designing topics (and retrieval engines) to avoid this problem. 8. CONCLUSIONS Element ranking was added to a Boolean ranking hybrid retrieval engine. Relevant documents were identified using Boolean searching. Documents were ranked using Okapi BM25. Finally coverage was used to rank elements within documents.
6 The results suggest coverage is a good method of identifying relevant and non-overlapping elements. Performance was best for generalized quantization, so ordering is not ideal. This may be a consequence of presenting results in document order. 9. ACKNOWLEDGEMENTS In addition to the authors, Yerin Yoo contributed to the assessment task. Without her contribution we would not have been able to complete the task. This work was supported by University of Otago Research Grant (UORG) funding. REFERENCES [1] Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E., Yergeau, F., & Cowan, J. (2003). Extensible markup language (XML) 1.1 W3C proposed recommendation. The World Wide Web Consortium. Available: [2003. [2] Fuhr, N., Gövert, N., Kazai, G., & Lalmas, M. (2002). INEX: Initiative for the evaluation of XML retrieval. In Proceedings of the ACM SIGIR 2000 Workshop on XML and Information Retrieval. [3] Harman, D. (1993). Overview of the first TREC conference. In Proceedings of the 16th ACM SIGIR Conference on Information Retrieval, (pp ). [4] Hitchcock, S., Quek, F., Carr, L., Hall, W., Witbrock, A., & Tarr, I. (1988). Towards universal linking for electronic journals. Serials Review, 24(1), [5] Kazai, G., & Rölleke, T. (2002). A scalable architecture for XML retrieval. In Proceedings of the 1st workshop of the initiative for the evaluation of XML retrieval (INEX), (pp ). [6] O'Keefe, R. A., & Trotman, A. (2003). The simplest query language that could possibly work. In Proceedings of the 2nd workshop of the initiative for the evaluation of XML retrieval (INEX). [7] Robertson, S. E., Walker, S., Beaulieu, M. M., Gatford, M., & Payne, A. (1995). Okapi at TREC-4. In Proceedings of the 4th Text REtrieval Conference (TREC-4), (pp ). [8] Trotman, A. (2003). Searching structured documents. Information Processing & Management, (to appear) doi: /s (03) , available on ScienceDirect since 6 June [9] Wilkinson, R. (1994). Effective retrieval of structured documents. In Proceedings of the 17th ACM SIGIR Conference on Information Retrieval, (pp ).
The Interpretation of CAS
The Interpretation of CAS Andrew Trotman 1 and Mounia Lalmas 2 1 Department of Computer Science, University of Otago, Dunedin, New Zealand andrew@cs.otago.ac.nz, 2 Department of Computer Science, Queen
More informationProcessing Structural Constraints
SYNONYMS None Processing Structural Constraints Andrew Trotman Department of Computer Science University of Otago Dunedin New Zealand DEFINITION When searching unstructured plain-text the user is limited
More informationFrom Passages into Elements in XML Retrieval
From Passages into Elements in XML Retrieval Kelly Y. Itakura David R. Cheriton School of Computer Science, University of Waterloo 200 Univ. Ave. W. Waterloo, ON, Canada yitakura@cs.uwaterloo.ca Charles
More informationMounia Lalmas, Department of Computer Science, Queen Mary, University of London, United Kingdom,
XML Retrieval Mounia Lalmas, Department of Computer Science, Queen Mary, University of London, United Kingdom, mounia@acm.org Andrew Trotman, Department of Computer Science, University of Otago, New Zealand,
More informationPassage Retrieval and other XML-Retrieval Tasks. Andrew Trotman (Otago) Shlomo Geva (QUT)
Passage Retrieval and other XML-Retrieval Tasks Andrew Trotman (Otago) Shlomo Geva (QUT) Passage Retrieval Information Retrieval Information retrieval (IR) is the science of searching for information in
More informationA Universal Model for XML Information Retrieval
A Universal Model for XML Information Retrieval Maria Izabel M. Azevedo 1, Lucas Pantuza Amorim 2, and Nívio Ziviani 3 1 Department of Computer Science, State University of Montes Claros, Montes Claros,
More informationComponent ranking and Automatic Query Refinement for XML Retrieval
Component ranking and Automatic uery Refinement for XML Retrieval Yosi Mass, Matan Mandelbrod IBM Research Lab Haifa 31905, Israel {yosimass, matan}@il.ibm.com Abstract ueries over XML documents challenge
More informationA Comparative Study Weighting Schemes for Double Scoring Technique
, October 19-21, 2011, San Francisco, USA A Comparative Study Weighting Schemes for Double Scoring Technique Tanakorn Wichaiwong Member, IAENG and Chuleerat Jaruskulchai Abstract In XML-IR systems, the
More informationPassage Retrieval and other XML-Retrieval Tasks
Passage Retrieval and other XML-Retrieval Tasks Andrew Trotman Department of Computer Science University of Otago Dunedin, New Zealand andrew@cs.otago.ac.nz Shlomo Geva Faculty of Information Technology
More informationOverview of the INEX 2009 Link the Wiki Track
Overview of the INEX 2009 Link the Wiki Track Wei Che (Darren) Huang 1, Shlomo Geva 2 and Andrew Trotman 3 Faculty of Science and Technology, Queensland University of Technology, Brisbane, Australia 1,
More informationRetrieval Quality vs. Effectiveness of Relevance-Oriented Search in XML Documents
Retrieval Quality vs. Effectiveness of Relevance-Oriented Search in XML Documents Norbert Fuhr University of Duisburg-Essen Mohammad Abolhassani University of Duisburg-Essen Germany Norbert Gövert University
More informationUniversity of Amsterdam at INEX 2010: Ad hoc and Book Tracks
University of Amsterdam at INEX 2010: Ad hoc and Book Tracks Jaap Kamps 1,2 and Marijn Koolen 1 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Faculty of Science,
More informationThe Utrecht Blend: Basic Ingredients for an XML Retrieval System
The Utrecht Blend: Basic Ingredients for an XML Retrieval System Roelof van Zwol Centre for Content and Knowledge Engineering Utrecht University Utrecht, the Netherlands roelof@cs.uu.nl Virginia Dignum
More informationA Fusion Approach to XML Structured Document Retrieval
A Fusion Approach to XML Structured Document Retrieval Ray R. Larson School of Information Management and Systems University of California, Berkeley Berkeley, CA 94720-4600 ray@sims.berkeley.edu 17 April
More informationPlan for today. CS276B Text Retrieval and Mining Winter Vector spaces and XML. Text-centric XML retrieval. Vector spaces and XML
CS276B Text Retrieval and Mining Winter 2005 Plan for today Vector space approaches to XML retrieval Evaluating text-centric retrieval Lecture 15 Text-centric XML retrieval Documents marked up as XML E.g.,
More informationHeading-aware Snippet Generation for Web Search
Heading-aware Snippet Generation for Web Search Tomohiro Manabe and Keishi Tajima Graduate School of Informatics, Kyoto Univ. {manabe@dl.kuis, tajima@i}.kyoto-u.ac.jp Web Search Result Snippets Are short
More informationXPath Inverted File for Information Retrieval
XPath Inverted File for Information Retrieval Shlomo Geva Centre for Information Technology Innovation Faculty of Information Technology Queensland University of Technology GPO Box 2434 Brisbane Q 4001
More informationAccessing XML documents: The INEX initiative. Mounia Lalmas, Thomas Rölleke, Zoltán Szlávik, Tassos Tombros (+ Duisburg-Essen)
Accessing XML documents: The INEX initiative Mounia Lalmas, Thomas Rölleke, Zoltán Szlávik, Tassos Tombros (+ Duisburg-Essen) XML documents Book Chapters Sections World Wide Web This is only only another
More informationRelevance in XML Retrieval: The User Perspective
Relevance in XML Retrieval: The User Perspective Jovan Pehcevski School of CS & IT RMIT University Melbourne, Australia jovanp@cs.rmit.edu.au ABSTRACT A realistic measure of relevance is necessary for
More informationArticulating Information Needs in
Articulating Information Needs in XML Query Languages Jaap Kamps, Maarten Marx, Maarten de Rijke and Borkur Sigurbjornsson Gonçalo Antunes Motivation Users have increased access to documents with additional
More informationApplying the IRStream Retrieval Engine to INEX 2003
Applying the IRStream Retrieval Engine to INEX 2003 Andreas Henrich, Volker Lüdecke University of Bamberg D-96045 Bamberg, Germany {andreas.henrich volker.luedecke}@wiai.unibamberg.de Günter Robbert University
More informationQuery Processing and Alternative Search Structures. Indexing common words
Query Processing and Alternative Search Structures CS 510 Winter 2007 1 Indexing common words What is the indexing overhead for a common term? I.e., does leaving out stopwords help? Consider a word such
More informationModern Information Retrieval
Modern Information Retrieval Chapter 13 Structured Text Retrieval with Mounia Lalmas Introduction Structuring Power Early Text Retrieval Models Evaluation Query Languages Structured Text Retrieval, Modern
More informationControlling Overlap in Content-Oriented XML Retrieval
Controlling Overlap in Content-Oriented XML Retrieval Charles L. A. Clarke School of Computer Science, University of Waterloo, Canada claclark@plg.uwaterloo.ca ABSTRACT The direct application of standard
More informationTHE weighting functions of information retrieval [1], [2]
A Comparative Study of MySQL Functions for XML Element Retrieval Chuleerat Jaruskulchai, Member, IAENG, and Tanakorn Wichaiwong, Member, IAENG Abstract Due to the ever increasing information available
More informationXML Information Set. Working Draft of May 17, 1999
XML Information Set Working Draft of May 17, 1999 This version: http://www.w3.org/tr/1999/wd-xml-infoset-19990517 Latest version: http://www.w3.org/tr/xml-infoset Editors: John Cowan David Megginson Copyright
More informationRMIT University at TREC 2006: Terabyte Track
RMIT University at TREC 2006: Terabyte Track Steven Garcia Falk Scholer Nicholas Lester Milad Shokouhi School of Computer Science and IT RMIT University, GPO Box 2476V Melbourne 3001, Australia 1 Introduction
More informationA Voting Method for XML Retrieval
A Voting Method for XML Retrieval Gilles Hubert 1 IRIT/SIG-EVI, 118 route de Narbonne, 31062 Toulouse cedex 4 2 ERT34, Institut Universitaire de Formation des Maîtres, 56 av. de l URSS, 31400 Toulouse
More informationXML Retrieval More Efficient Using Compression Technique
XML Retrieval More Efficient Using Compression Technique Tanakorn Wichaiwong and Chuleerat Jaruskulchai Abstract In this paper, we report experimental results of our approach for retrieval large-scale
More informationEfficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)
Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-
More informationCS54701: Information Retrieval
CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful
More informationLexisNexis PatentOptimizer Quick Reference Card. August, 13, 2010
LexisNexis PatentOptimizer Quick Reference Card August, 13, 2010 Contents How do I activate the LexisNexis PatentOptimizer Service?...1 How do I access service features?... 2 How do I submit feedback?...4
More informationAssignment No. 1. Abdurrahman Yasar. June 10, QUESTION 1
COMPUTER ENGINEERING DEPARTMENT BILKENT UNIVERSITY Assignment No. 1 Abdurrahman Yasar June 10, 2014 1 QUESTION 1 Consider the following search results for two queries Q1 and Q2 (the documents are ranked
More informationCS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University
CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression
More informationΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου
Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Σηµερινό ερώτηµα Typically we want to retrieve the top K docs (in the cosine ranking for the query) not totally order all docs in the corpus can we pick off docs
More informationFormulating XML-IR Queries
Alan Woodley Faculty of Information Technology, Queensland University of Technology PO Box 2434. Brisbane Q 4001, Australia ap.woodley@student.qut.edu.au Abstract: XML information retrieval systems differ
More informationComparative Analysis of Clicks and Judgments for IR Evaluation
Comparative Analysis of Clicks and Judgments for IR Evaluation Jaap Kamps 1,3 Marijn Koolen 1 Andrew Trotman 2,3 1 University of Amsterdam, The Netherlands 2 University of Otago, New Zealand 3 INitiative
More informationIndexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze
Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationSound and Complete Relevance Assessment for XML Retrieval
Sound and Complete Relevance Assessment for XML Retrieval Benjamin Piwowarski Yahoo! Research Latin America Santiago, Chile Andrew Trotman University of Otago Dunedin, New Zealand Mounia Lalmas Queen Mary,
More informationSound ranking algorithms for XML search
Sound ranking algorithms for XML search Djoerd Hiemstra 1, Stefan Klinger 2, Henning Rode 3, Jan Flokstra 1, and Peter Apers 1 1 University of Twente, 2 University of Konstanz, and 3 CWI hiemstra@cs.utwente.nl,
More informationEvaluating the effectiveness of content-oriented XML retrieval
Evaluating the effectiveness of content-oriented XML retrieval Norbert Gövert University of Dortmund Norbert Fuhr University of Duisburg-Essen Gabriella Kazai Queen Mary University of London Mounia Lalmas
More informationMelbourne University at the 2006 Terabyte Track
Melbourne University at the 2006 Terabyte Track Vo Ngoc Anh William Webber Alistair Moffat Department of Computer Science and Software Engineering The University of Melbourne Victoria 3010, Australia Abstract:
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More informationMore on indexing CE-324: Modern Information Retrieval Sharif University of Technology
More on indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Plan
More informationCSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation"
CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation" All slides Addison Wesley, Donald Metzler, and Anton Leuski, 2008, 2012! Evaluation" Evaluation is key to building
More informationA Practical Passage-based Approach for Chinese Document Retrieval
A Practical Passage-based Approach for Chinese Document Retrieval Szu-Yuan Chi 1, Chung-Li Hsiao 1, Lee-Feng Chien 1,2 1. Department of Information Management, National Taiwan University 2. Institute of
More informationJames Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!
James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation
More informationEvaluation Metrics. Jovan Pehcevski INRIA Rocquencourt, France
Evaluation Metrics Jovan Pehcevski INRIA Rocquencourt, France jovan.pehcevski@inria.fr Benjamin Piwowarski Yahoo! Research Latin America bpiwowar@yahoo-inc.com SYNONYMS Performance metrics; Evaluation
More informationCOSC 431 Information Retrieval. Phrase Search & Structured Search
COSC 431 Information Retrieval Phrase Search & Structured Search 1 Phrase Searching What are Structured documents Meta-data Structured documents Outline Searching Structured documents Meta-data Semi-structured
More informationInformation Retrieval. (M&S Ch 15)
Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion
More informationXML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson
Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Manning, Raghavan, and Schütze http://www.informationretrieval.org OVERVIEW Introduction Basic XML Concepts Challenges
More informationIndexing and Query Processing. What will we cover?
Indexing and Query Processing CS 510 Winter 2007 1 What will we cover? Key concepts and terminology Inverted index structures Organization, creation, maintenance Compression Distribution Answering queries
More informationindexing and query processing. The inverted le was constructed for the retrieval target collection which contains full texts of two years' Japanese pa
Term Distillation in Patent Retrieval Hideo Itoh Hiroko Mano Yasushi Ogawa Software R&D Group, RICOH Co., Ltd. 1-1-17 Koishikawa, Bunkyo-ku, Tokyo 112-0002, JAPAN fhideo,mano,yogawag@src.ricoh.co.jp Abstract
More informationBetter Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web
Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationVerbose Query Reduction by Learning to Rank for Social Book Search Track
Verbose Query Reduction by Learning to Rank for Social Book Search Track Messaoud CHAA 1,2, Omar NOUALI 1, Patrice BELLOT 3 1 Research Center on Scientific and Technical Information 05 rue des 03 frères
More informationEfficient query processing
Efficient query processing Efficient scoring, distributed query processing Web Search 1 Ranking functions In general, document scoring functions are of the form The BM25 function, is one of the best performing:
More informationRetrieval Evaluation
Retrieval Evaluation - Reference Collections Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, Chapter
More informationStructured Queries in XML Retrieval
Structured Queries in XML Retrieval Jaap Kamps 1,2 Maarten Marx 2 Maarten de Rijke 2 Börkur Sigurbjörnsson 2 1 Archives and Information Studies, University of Amsterdam, Amsterdam, The Netherlands 2 Informatics
More informationOverview of the INEX 2008 Ad Hoc Track
Overview of the INEX 2008 Ad Hoc Track Jaap Kamps 1, Shlomo Geva 2, Andrew Trotman 3, Alan Woodley 2, and Marijn Koolen 1 1 University of Amsterdam, Amsterdam, The Netherlands {kamps,m.h.a.koolen}@uva.nl
More informationDocument Structure Analysis in Associative Patent Retrieval
Document Structure Analysis in Associative Patent Retrieval Atsushi Fujii and Tetsuya Ishikawa Graduate School of Library, Information and Media Studies University of Tsukuba 1-2 Kasuga, Tsukuba, 305-8550,
More informationInformation Retrieval: Retrieval Models
CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models
More informationOverview of the INEX 2008 Ad Hoc Track
Overview of the INEX 2008 Ad Hoc Track Jaap Kamps 1, Shlomo Geva 2, Andrew Trotman 3, Alan Woodley 2, and Marijn Koolen 1 1 University of Amsterdam, Amsterdam, The Netherlands {kamps,m.h.a.koolen}@uva.nl
More informationImproving Suffix Tree Clustering Algorithm for Web Documents
International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal
More informationIncorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches
Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Masaki Eto Gakushuin Women s College Tokyo, Japan masaki.eto@gakushuin.ac.jp Abstract. To improve the search performance
More informationClassification and Comparative Analysis of Passage Retrieval Methods
Classification and Comparative Analysis of Passage Retrieval Methods Dr. K. Saruladha, C. Siva Sankar, G. Chezhian, L. Lebon Iniyavan, N. Kadiresan Computer Science Department, Pondicherry Engineering
More informationConfigurable Indexing and Ranking for XML Information Retrieval
Configurable Indexing and Ranking for XML Information Retrieval Shaorong Liu, Qinghua Zou and Wesley W. Chu UCL Computer Science Department, Los ngeles, C, US 90095 {sliu, zou, wwc}@cs.ucla.edu BSTRCT
More informationScore Region Algebra: Building a Transparent XML-IR Database
Vojkan Mihajlović Henk Ernst Blok Djoerd Hiemstra Peter M. G. Apers Score Region Algebra: Building a Transparent XML-IR Database Centre for Telematics and Information Technology (CTIT) Faculty of Electrical
More informationRelevancy Workbench Module. 1.0 Documentation
Relevancy Workbench Module 1.0 Documentation Created: Table of Contents Installing the Relevancy Workbench Module 4 System Requirements 4 Standalone Relevancy Workbench 4 Deploy to a Web Container 4 Relevancy
More informationBetter Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web
Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl
More informationAnalyzing the performance of top-k retrieval algorithms. Marcus Fontoura Google, Inc
Analyzing the performance of top-k retrieval algorithms Marcus Fontoura Google, Inc This talk Largely based on the paper Evaluation Strategies for Top-k Queries over Memory-Resident Inverted Indices, VLDB
More informationEfficiency vs. Effectiveness in Terabyte-Scale IR
Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval Stefan Büttcher Charles L. A. Clarke University of Waterloo, Canada November 17, 2005 1 2 3 4 5 6 What is Wumpus? Multi-user file system
More informationInstructor: Stefan Savev
LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information
More informationQuery Evaluation Strategies
Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Research (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa
More informationUsing XML Logical Structure to Retrieve (Multimedia) Objects
Using XML Logical Structure to Retrieve (Multimedia) Objects Zhigang Kong and Mounia Lalmas Queen Mary, University of London {cskzg,mounia}@dcs.qmul.ac.uk Abstract. This paper investigates the use of the
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationSearch Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson
Search Engines Informa1on Retrieval in Prac1ce Annotations by Michael L. Nelson All slides Addison Wesley, 2008 Indexes Indexes are data structures designed to make search faster Text search has unique
More informationCS646 (Fall 2016) Homework 1
CS646 (Fall 2016) Homework 1 Deadline: 11:59pm, Sep 28th, 2016 (EST) Access the following resources before you start working on HW1: Download the corpus file on Moodle: acm corpus.gz (about 90 MB). Check
More informationINEX a Broadly Accepted Data Set for XML Database Processing?
INEX a Broadly Accepted Data Set for XML Database Processing? Pavel Loupal and Michal Valenta Dept. of Computer Science and Engineering FEE, Czech Technical University Karlovo náměstí 13, 121 35 Praha
More informationHeading-Based Sectional Hierarchy Identification for HTML Documents
Heading-Based Sectional Hierarchy Identification for HTML Documents 1 Dept. of Computer Engineering, Boğaziçi University, Bebek, İstanbul, 34342, Turkey F. Canan Pembe 1,2 and Tunga Güngör 1 2 Dept. of
More informationCLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL
STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LVII, Number 4, 2012 CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL IOAN BADARINZA AND ADRIAN STERCA Abstract. In this paper
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationChallenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track
Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde
More informationInformativeness for Adhoc IR Evaluation:
Informativeness for Adhoc IR Evaluation: A measure that prevents assessing individual documents Romain Deveaud 1, Véronique Moriceau 2, Josiane Mothe 3, and Eric SanJuan 1 1 LIA, Univ. Avignon, France,
More informationWeb Information Retrieval. Lecture 4 Dictionaries, Index Compression
Web Information Retrieval Lecture 4 Dictionaries, Index Compression Recap: lecture 2,3 Stemming, tokenization etc. Faster postings merges Phrase queries Index construction This lecture Dictionary data
More informationStatic Pruning of Terms In Inverted Files
In Inverted Files Roi Blanco and Álvaro Barreiro IRLab University of A Corunna, Spain 29th European Conference on Information Retrieval, Rome, 2007 Motivation : to reduce inverted files size with lossy
More informationMultimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency
Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Ralf Moeller Hamburg Univ. of Technology Acknowledgement Slides taken from presentation material for the following
More informationCADIAL Search Engine at INEX
CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
Rashmi Gadbail,, 2013; Volume 1(8): 783-791 INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK EFFECTIVE XML DATABASE COMPRESSION
More informationMy name is Brian Pottle. I will be your guide for the next 45 minutes of interactive lectures and review on this lesson.
Hello, and welcome to this online, self-paced lesson entitled ORE Embedded R Scripts: SQL Interface. This session is part of an eight-lesson tutorial series on Oracle R Enterprise. My name is Brian Pottle.
More informationInformation Technology Department, PCCOE-Pimpri Chinchwad, College of Engineering, Pune, Maharashtra, India 2
Volume 5, Issue 5, May 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Adaptive Huffman
More informationTerm Frequency Normalisation Tuning for BM25 and DFR Models
Term Frequency Normalisation Tuning for BM25 and DFR Models Ben He and Iadh Ounis Department of Computing Science University of Glasgow United Kingdom Abstract. The term frequency normalisation parameter
More informationQuery Phrase Expansion using Wikipedia for Patent Class Search
Query Phrase Expansion using Wikipedia for Patent Class Search 1 Bashar Al-Shboul, Sung-Hyon Myaeng Korea Advanced Institute of Science and Technology (KAIST) December 19 th, 2011 AIRS 11, Dubai, UAE OUTLINE
More informationAdvanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016
Advanced Topics in Information Retrieval Learning to Rank Vinay Setty vsetty@mpi-inf.mpg.de Jannik Strötgen jannik.stroetgen@mpi-inf.mpg.de ATIR July 14, 2016 Before we start oral exams July 28, the full
More informationIITH at CLEF 2017: Finding Relevant Tweets for Cultural Events
IITH at CLEF 2017: Finding Relevant Tweets for Cultural Events Sreekanth Madisetty and Maunendra Sankar Desarkar Department of CSE, IIT Hyderabad, Hyderabad, India {cs15resch11006, maunendra}@iith.ac.in
More informationChapter 8. Evaluating Search Engine
Chapter 8 Evaluating Search Engine Evaluation Evaluation is key to building effective and efficient search engines Measurement usually carried out in controlled laboratory experiments Online testing can
More informationA Tree-based Inverted File for Fast Ranked-Document Retrieval
A Tree-based Inverted File for Fast Ranked-Document Retrieval Wann-Yun Shieh Tien-Fu Chen Chung-Ping Chung Department of Computer Science and Information Engineering National Chiao Tung University Hsinchu,
More informationXML ELECTRONIC SIGNATURES
XML ELECTRONIC SIGNATURES Application according to the international standard XML Signature Syntax and Processing DI Gregor Karlinger Graz University of Technology Institute for Applied Information Processing
More informationIntroduction to Information Retrieval (Manning, Raghavan, Schutze)
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures
More information