Identifying and Ranking Relevant Document Elements

Andrew Trotman and Richard A. O'Keefe
Department of Computer Science, University of Otago, Dunedin, New Zealand
andrew@cs.otago.ac.nz, ok@otago.ac.nz

ABSTRACT

A method of indexing and searching structured documents for element retrieval is discussed. Documents are indexed using a modified inverted file retrieval system. The modified postings include pointers into a collection-wide document structure tree (the corpus tree) describing the structure of every document in the collection. Retrieval topics are converted into Boolean queries, which are used to identify relevant documents. Documents are then ranked using Okapi BM25, and finally relevant elements are identified using coverage. Search results are presented sorted first by document, then by coverage. The design is presented in the context of the second annual INEX workshop.

1. INTRODUCTION

Otago first entered INEX [2] during its second year. There were three objectives: to understand the participation process, to gain access to this and last year's judgments, and to create a baseline for comparing future experiments.

Participation involved the design of six topics, the generation and submission of search results, and the online judging of three topics. Of these, generating the results was the most problematic as it required software changes.

The chosen retrieval engine was designed from the outset for retrieval of whole academic documents in XML [1]. A predecessor can be seen on BioMedNet and ChemWeb [4]. This engine, like that used in the IEEE digital library, returns relevance-ranked lists of whole documents, the natural (citable) unit of information in an academic environment. From experience, information vendors are not interested in converting their documents from proprietary DTDs into a common DTD or any other format, so software was needed to handle documents in heterogeneous formats.
Boolean searching, field restricting and relevance ranking were already supported, so modifications focused on identifying and ranking document elements.

The modified retrieval engine can be thought of as working in three parts. Candidate documents are identified using a Boolean query. Candidates are then ranked using Okapi BM25 [7]. Finally, relevant non-overlapping elements are identified and presented as the result. Although the process is easier to understand in three parts, the most relevant elements of the most relevant documents are in fact computed in a single pass of the indexes.

2. INDEXING

Much of the index design has already been described elsewhere [8]. Inverted file retrieval is used. There is one dictionary file and each dictionary term points to a single inverted list of postings.

An unstructured inverted list is usually represented as $\{\langle d_1, f_1 \rangle, \langle d_2, f_2 \rangle, \ldots, \langle d_n, f_n \rangle\}$, where $d_n$ is a document ordinal number and $f_n$ is the frequency of the given term in the given document. For structured retrieval, each $\langle d_n, f_n \rangle$ pair is replaced by the triple $\langle d_n, p_n, f_n \rangle$, where $p_n$ is a position in the document. When phrase or proximity searching is required, this triple is replaced with the triple $\langle d_n, p_n, w_n \rangle$, where $w_n$ is the ordinal number of the term in the collection (starting from 0 at the start of the collection, incrementing by 1 for each term, not incrementing for tags, and not reset at the beginning of each record). On disk the postings are stored compressed.

The $p_n$ value in each posting is a position in the corpus tree. The tagging structure of any one document represents a tree walk. Start at the root of the tree. When an open tag is encountered, the branch labelled with the tag name is followed downwards. When a close tag is encountered, the walk backtracks one branch. For a well-formed XML document, the walk will start and end at the root. This tree-walking property also holds for a collection of well-formed documents.
The tree they collectively describe is called the corpus tree and can be built during single-pass indexing. As each node is encountered for the first time, a branch is added to the tree and labelled with a unique ordinal identifier, $p_n$. Terms can lie at either the nodes or the leaves of this tree. The corpus tree includes every path in every document, but is unlikely to match the structure of any one document.

In Figure 1, three well-formed documents are given, as is the corpus tree for those documents. For clarity, the branches of the tree are labelled with the documents they describe, although this information is neither computed nor stored.
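The single-pass construction described above can be sketched in Python as follows. This is a minimal illustration and not the engine's actual code: the class name `CorpusTree`, its methods, and the use of `xml.etree.ElementTree` for parsing are all assumptions made for the example. Node labels follow the paper's tag[instance] notation, attributes are stored as `@name` children, and node ids are handed out in order of first encounter.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The three documents of Figure 1.
DOCUMENTS = [
    '<doc> <sec><p c="red">Fox in Sox</p></sec> </doc>',
    '<doc> <sec><p>the Cat in the Hat</p></sec> <sec><p>comes Back</p></sec> </doc>',
    '<doc> <sec><p>green Eggs</p> <p>and Ham</p></sec> </doc>',
]


class CorpusTree:
    """Hypothetical corpus tree: one node per distinct path seen in any document."""

    def __init__(self):
        self.children = {0: {}}  # node id -> {label: child id}; 0 is a virtual root
        self.parent = {}         # node id -> parent node id
        self.label = {}          # node id -> label such as "sec[2]" or "@c"

    def child(self, parent_id, label):
        """Return the id of the child with this label, adding a branch on first sight."""
        kids = self.children[parent_id]
        if label not in kids:
            node_id = len(self.parent) + 1  # next unused ordinal, i.e. p_n
            kids[label] = node_id
            self.children[node_id] = {}
            self.parent[node_id] = parent_id
            self.label[node_id] = label
        return kids[label]

    def add_document(self, xml_text):
        self._walk(ET.fromstring(xml_text), 0, Counter())

    def _walk(self, element, parent_id, sibling_counts):
        # The instance number counts same-named siblings within this document.
        sibling_counts[element.tag] += 1
        label = f"{element.tag}[{sibling_counts[element.tag]}]"
        node_id = self.child(parent_id, label)
        for name in element.attrib:  # attributes become @name children
            self.child(node_id, "@" + name)
        counts = Counter()
        for sub in element:
            self._walk(sub, node_id, counts)


tree = CorpusTree()
for text in DOCUMENTS:
    tree.add_document(text)
for node_id in sorted(tree.label):
    print(node_id, tree.label[node_id], "parent:", tree.parent[node_id])
```

Run on the Figure 1 documents, the sketch produces seven nodes, and the ids it assigns to doc[1], sec[1], sec[2], the p[1] below sec[2], and p[2] agree with those shown in the figure.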

<doc> <sec><p c="red">Fox in Sox</p></sec> </doc>
<doc> <sec><p>the Cat in the Hat</p></sec> <sec><p>comes Back</p></sec> </doc>
<doc> <sec><p>green Eggs</p> <p>and Ham</p></sec> </doc>

[Corpus tree diagram; node labels recoverable from the figure: doc[1]:1, sec[1]:2, sec[2]:5, p[1]:6, p[2]:7]

Figure 1: Three documents and the corpus tree, which includes every path through every document but does not match the structure of any one document. For the purpose of this figure each document is marked white, gray, or black, and each node is marked with the documents that include that path. Each node is numbered with the instance of the tag (e.g. p[2]) and the node id, $p_n$ (after the colon).

Term
  $p_1$:  d-sublist $d_{p_1,1}, d_{p_1,2}, \ldots, d_{p_1,n}$;  w-sublist $w_{p_1,1}, w_{p_1,2}, \ldots, w_{p_1,n}$
  $p_2$:  d-sublist $d_{p_2,1}, d_{p_2,2}, \ldots, d_{p_2,n}$;  w-sublist $w_{p_2,1}, w_{p_2,2}, \ldots, w_{p_2,n}$
  $p_3$:  d-sublist $d_{p_3,1}, d_{p_3,2}, \ldots, d_{p_3,n}$;  w-sublist $w_{p_3,1}, w_{p_3,2}, \ldots, w_{p_3,n}$

Figure 2: The in-memory postings structure allows quick access to only those postings relevant to the required document elements.

The inverted lists are built and processed using the structure represented in Figure 2. Postings for each term are ordered by increasing $p_n$. Each $p_n$ points to the list of document ids (the d-sublist) and word ids (the w-sublist) found at that point in the tree. Each list is held in increasing order and compressed.

To search the collection for a given term, each d-sublist is examined in turn. As a result, documents may not be examined in document order. This does not matter so long as all documents that would be examined are examined. Further, a whole document may not be examined at once; this, too, does not matter as many ranking functions can be computed piecewise (see footnote 1). To field-restrict a term, a restricted set of sublists is examined. The w-sublists are used for proximity searching.

Storing and processing the postings in this way has computational advantages. For a field-restricted search, postings not pertaining to the restriction can be skipped. As postings are stored compressed, they need not even be decompressed. Word postings are used only for proximity searching.
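As a rough illustration of the Figure 2 layout, the sketch below keeps, for each term, a list of sublists ordered by increasing node id and skips any sublist outside a field restriction. The names `POSTINGS` and `matching_documents` are hypothetical, the lists are left uncompressed, and the word ordinals are illustrative values that assume only element text is counted; the node ids are those of the Figure 1 corpus tree.

```python
from typing import Dict, List, Optional, Set, Tuple

# term -> [(p_n, d-sublist, w-sublist), ...] ordered by increasing p_n.
# Hand-built for a few terms of the Figure 1 documents (illustrative only).
POSTINGS: Dict[str, List[Tuple[int, List[int], List[int]]]] = {
    "in":   [(3, [1, 2], [1, 5])],  # "Fox in Sox", "the Cat in the Hat"
    "eggs": [(3, [3], [11])],       # "green Eggs"
    "back": [(6, [2], [9])],        # "comes Back"
    "ham":  [(7, [3], [13])],       # "and Ham"
}


def matching_documents(term: str, allowed_nodes: Optional[Set[int]] = None) -> List[int]:
    """Return the ids of documents containing term, optionally field-restricted.

    allowed_nodes is the set of corpus tree node ids that satisfy the field
    restriction; None means the search is unrestricted.
    """
    docs: Set[int] = set()
    for p_n, d_sublist, _w_sublist in POSTINGS.get(term, []):
        if allowed_nodes is not None and p_n not in allowed_nodes:
            continue  # skipped sublist: a real index would not even decompress it
        docs.update(d_sublist)
    return sorted(docs)


print(matching_documents("ham"))             # unrestricted search
print(matching_documents("ham", {3, 4, 6}))  # restricted to nodes 3, 4 and 6
print(matching_documents("back", {5, 6}))    # restricted to nodes 5 and 6
```

Because the restriction is applied per sublist rather than per posting, the cost of a field-restricted search depends on the number of sublists inspected, not on the total number of postings for the term.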
On disk the w-sublists are collected together and stored after all d-sublists. They are not even loaded from disk if not needed.

Footnote 1: BM25 cannot, so the lists are merged then processed.

3. SEARCHING

As the retrieval engine starts up, the corpus tree is loaded and an additional structure, the field list, is created from it. For each instance of each tag, the list of nodes at or below that node is collected. For each tag, the same is collected. These lists are then merged and sorted.

Table 1: The field list for the corpus tree given in Figure 1.

Field     $p_n$ values
@c        {4}
doc       {1, 2, 3, 4, 5, 6, 7}
doc[1]    {1, 2, 3, 4, 5, 6, 7}
p         {3, 4, 6, 7}
p[1]      {3, 4, 6}
p[2]      {7}
sec       {2, 3, 4, 5, 6, 7}
sec[1]    {2, 3, 4, 7}
sec[2]    {5, 6}

The field list for the Figure 1 corpus tree is given in Table 1. From this, a search restricted to sec requires postings at or below all sec nodes of the corpus tree, that is, where $p_n \in \{2, 3, 4, 5, 6, 7\}$. To search in p[1], the postings are needed where $p_n \in \{3, 4, 6\}$. For a search restricted to p[1] in sec, these two lists are ANDed together (giving $p_n \in \{3, 4, 6\}$), and the members of this list are checked against the corpus tree to ensure they satisfy p[1] in sec and not sec in p[1].

Equivalence tag restrictions are also computed from the field list. The restrictions for each equivalent tag are ORed, giving the equivalent restriction. If, for example, @c and p[2] were equivalent in Table 1, the restriction would be $p_n \in \{4, 7\}$.

Several extensions were added to support element and attribute retrieval:

- Attributes are now distinguished from tags by prefixing attributes with the @ symbol. This symbol was chosen because it makes for easy parsing of INEX queries, which use the same symbol.
- The attribute value is considered to be content lying not only within the attribute, but also within the tag. For example, <tag att="number">term</tag> is equivalent to <tag><@att>number</@att>term</tag>. In this way, a search for number in tag will succeed.
- Tags can now be identified not only by their name and path, but also by the tag instance. Where before it was only possible to restrict to paragraph, for example, it is now possible to restrict to the second paragraph.

Trotman [8] suggests the corpus tree will be small for real data. In this extended model this no longer holds true. In the TREC [3] Wall Street Journal collection there are only 20 nodes; for INEX there are 198,041 nodes after noise nodes are removed (4,789 with attributes and instances also removed).

Table 2: Tags ignored during indexing.

ariel      en             item-text   ss
art        entry          label       stanza
b          enum           large       sub
bi         f              li          super
bq         it             line        tbody
bu         item           math        tf
bui        item-bold      proof       tfoot
cen        item-both      rm          tgroup
colspec    item-bullet    rom         thead
couplet    item-diamond   row         theorem
dd         item-letpara   scp         tmath
ddhd       item-mdash     sgmlf       tt
dt         item-numpara   sgmlmath    u
dthd       item-roman     spanspec    ub

Many tags are used to mark elements too small to be relevant.
An example of such a tag is ref, used to mark references in the text. This tag cannot be relevant to any topic as the contents are simply reference numbers. Some tags were used for visual appearance, such as b, used to mark text in bold. Others were used as typesetting hints, such as art, used to specify the size of an image. If any of these tags, or those in Table 2, were encountered during indexing, the tagging was ignored (until the matching close tag), but the content was still indexed. Tags in this group were hand selected even though automated systems for choosing such tags have been proposed [5].

4. QUERY FORMATION

The title of the topic is extracted and converted into a Boolean query. This query is used to determine which documents to retrieve. Ranking is computed from the postings for the search terms.

For content and structure (CAS) topics, the target element is computed and stored for later use. The complete path for each about-function is computed by concatenating the about-path to the context-element restricting it. All equivalent paths are then computed by permuting this path with the equivalence tags. This fully specified path now replaces the original about-path and the context-element is removed. At this point, the topic has been transformed from INEX topic syntax into a query in which each about-clause is Boolean separated and explicitly field restricted.

Create mandatory by ANDing each mandatory term (+)
Create optional by ORing each optional term
Create exclusion by ORing each exclusion term (-)
If all three sub-expressions are non-null, combine:
    mandatory AND (* OR optional) NOT exclusion
If two sub-expressions are non-null, combine using one of:
    mandatory AND (* OR optional)
    optional NOT exclusion
    mandatory NOT exclusion
If only one sub-expression is non-null, use one of:
    mandatory
    optional
    * NOT exclusion
Where * finds all documents

Figure 3: Algorithm to convert an about phrase into a Boolean expression.
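The Figure 3 algorithm is small enough to sketch directly. The Python below is an illustrative reading of the figure, not the engine's parser: the function name `about_to_boolean` is hypothetical, terms are split on whitespace (so quoted phrases are not handled), the parentheses around each sub-expression are an added assumption to make grouping explicit, and returning `*` for an empty about-string is a choice the figure does not cover.

```python
def about_to_boolean(about: str) -> str:
    """Convert an about-string into a Boolean expression, following Figure 3."""
    mandatory, optional, exclusion = [], [], []
    for token in about.split():
        if token.startswith("+") and len(token) > 1:
            mandatory.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            exclusion.append(token[1:])
        else:
            optional.append(token)

    m = " AND ".join(mandatory)  # mandatory terms are ANDed
    o = " OR ".join(optional)    # optional terms are ORed
    x = " OR ".join(exclusion)   # exclusion terms are ORed

    if m and o and x:
        return f"({m}) AND (* OR {o}) NOT ({x})"
    if m and o:
        return f"({m}) AND (* OR {o})"
    if o and x:
        return f"({o}) NOT ({x})"
    if m and x:
        return f"({m}) NOT ({x})"
    if m:
        return m
    if o:
        return o
    if x:
        return f"* NOT ({x})"
    return "*"  # assumption: an empty about-string matches every document


print(about_to_boolean("a b c"))
print(about_to_boolean("+d +e +f"))
print(about_to_boolean("+d a -x"))
```

The `* OR optional` sub-expression matches every document, so in the combined forms the optional terms never restrict the Boolean result set; under this reading they matter only through the scores accumulated at their leaves.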
Within the about-string, optional, mandatory (+), and exclusion (-) terms are allowed. These terms are converted into a Boolean expression. Optional terms are collected and converted into a sub-expression by ORing ("a b c" becomes "a OR b OR c"). Likewise, exclusion terms are also ORed. Mandatory terms are collected and ANDed ("+d +e +f" becomes "d AND e AND f"). These three sub-expressions are then combined to form a complete about-query. The whole algorithm is presented in Figure 3.

Separate about-clauses are already Boolean separated so these operators are preserved. Finally, all context-elements must be satisfied so these are ANDed together.

For content only (CO) topics, a Boolean expression is computed exactly as for one about-string using the algorithm presented in Figure 3.

5. RANKING

The retrieval engine is a Boolean ranking hybrid. Result sets are computed in two parts: a bit-string of documents satisfying a strict interpretation of the query, and a set of accumulators holding document weights.

5.1 Document Ranking

The Boolean expression constructed above is converted into a parse tree then evaluated. At each leaf, the postings are loaded and converted into a bit-string, one bit per document. If a given leaf in the parse tree is not tag-restricted, each posting is examined in turn, and the bit at position $d_n$ of each posting is set. Should the leaf be tag-restricted, only those postings for the given tags are examined (see Section 3) and converted. The bit-strings are combined at the nodes of the parse tree using the operator there. At the root of the tree, the bit-string has set bits for all documents exactly satisfying the query and unset bits for those that do not.

The accumulator values are the sum of Okapi BM25 scores computed at each leaf of the parse tree. Scores are summed regardless of the operators in the parse tree. For AND and OR nodes scores are summed because the influence at these nodes is the sum of the influences of the children. For NOT nodes, they are also summed. If a document is excluded from the result set, the accumulator value is irrelevant. If a document is not, it is either re-included through other terms (e.g. mammal OR (dog NOT cat)), or there is a double negative in the query (e.g. cat NOT (dog NOT cat)). In both cases, the document has successfully satisfied a query leaf so receives a positive weight.

5.2 Element Ranking and Selection

The Boolean ranking hybrid engine was extended to include element ranking. Although whole documents are valid as results for CO topics, SCAS topics specify a target element. This targeting establishes the retrieval unit. If the target element is sec, this tag must be returned. It essentially directs the retrieval engine to search and rank each given tag instance separately.

Wilkinson [9] suggests that ranking whole documents then extracting elements from these is a poor ranking strategy. The opposite may hold for this collection. A relevant element lies in the greater context of a relevant document. A relevant document will lie in a relevant journal, which, in turn, lies in a relevant collection. To this end, every paragraph of every section of every document is contextually placed, so extracting elements from relevant documents may be a good approach.

The coverage of any one posting is computed as those nodes in the document tree at or above the posting. Each posting is already annotated with a pointer into the tree, $p_n$. To compute the coverage, the tree is traversed upwards from $p_n$ to the root. Coverage is computed for each document with respect to each search term.

Figure 4: Coverage of a term occurring at p[2]. The coverage includes all those nodes at and above the occurrence node; those parts of the tree that cover the term.

In Figure 4, the term ham occurs at p[2]. The coverage includes that node and all nodes above it in the document tree. In this example the nodes above are sec[1] and doc[1]; the nodes that cover the search term are those highlighted in grey.

For each document in the result set, the weighted coverage is computed as the covered branches of the document tree and how many search terms cover each branch.
This is computed during a single pass of the indexes by storing the weighted coverage as part of each accumulator. For the query eggs and ham against the documents in Figure 1, the weighted coverage is shown in Figure 5. doc[1] and sec[1] have a weight of 2, while p[1] and p[2] each have a weight of 1.
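The eggs and ham example can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the engine's implementation: the `PARENT` map hard-codes the Figure 1 corpus tree, the function names are hypothetical, and the per-document input is given directly as a mapping from each search term to the node ids at which it occurs, instead of being read from accumulators during a pass over the postings.

```python
from collections import Counter
from typing import Dict, Optional, Set

# Parent of each corpus tree node in Figure 1 (None marks the root).
PARENT: Dict[int, Optional[int]] = {1: None, 2: 1, 3: 2, 4: 3, 5: 1, 6: 5, 7: 2}
LABEL = {1: "doc[1]", 2: "sec[1]", 3: "p[1]", 4: "@c", 5: "sec[2]", 6: "p[1]", 7: "p[2]"}


def coverage(p_n: Optional[int]) -> Set[int]:
    """Nodes at or above p_n, found by walking upwards to the root."""
    covered: Set[int] = set()
    while p_n is not None:
        covered.add(p_n)
        p_n = PARENT[p_n]
    return covered


def weighted_coverage(term_nodes: Dict[str, Set[int]]) -> Counter:
    """For one document, count how many search terms cover each node."""
    weights: Counter = Counter()
    for nodes in term_nodes.values():
        covered: Set[int] = set()
        for p_n in nodes:
            covered |= coverage(p_n)
        weights.update(covered)  # each term adds at most 1 to a node
    return weights


# Third document of Figure 1: eggs occurs at node 3 (p[1]), ham at node 7 (p[2]).
weights = weighted_coverage({"eggs": {3}, "ham": {7}})
for node_id in sorted(weights):
    print(LABEL[node_id], weights[node_id])
```

The result assigns a weight of 2 to doc[1] and sec[1] and a weight of 1 to p[1] and p[2], as described in the text. A term that occurs several times below the same node still contributes only once to that node, which matches the definition of weighted coverage as the number of search terms, not occurrences.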

Figure 5: Weighted coverage of each node is the number of search terms that occur at or below that point in the tree. Weighted coverage is shown in the bottom right of those nodes with weights greater than zero.

In any given document, the document root must have the highest weighted coverage, but this can be equal to that of other nodes. For CO topics, all branches of the document tree with coverage less than the root are pruned. The remaining leaves are presented as the result set for that document (in Figure 5, the result is //doc[1]//sec[1]). In this way, the most information-dense elements in the document are considered most relevant and no part of any document is returned more than once (overlapping is eliminated).

If a target element is specified in a SCAS topic, all non-target branches are pruned. From the remains, those branches with the highest weighted coverage are presented as the result set for that document.

5.3 Ranking summary

Recall is determined by evaluation of the Boolean expression, documents are then ranked using Okapi BM25, and elements are selected by weighted coverage. As all the metrics needed for ranking are available at search time, the search and rank process is computed in a single pass of the postings.

6. RESULTS

Evaluation results are presented in Table 3.

Table 3: INEX performance measures

Strict        Precision   Rank
CO            ...         ...nd
SCAS          ...         ...th
CO-ng-o       ...         ...th
CO-ng-s       Unknown     Not top 10

Generalized   Precision   Rank
CO            ...         ...th
SCAS          ...         ...th
CO-ng-o       ...         ...st
CO-ng-s       ...         ...th

The retrieval engine performed badly using the INEX_EVAL measure. This is most likely because this measure treats each tag in a hierarchy as relevant but coverage eliminates overlapping tags; the measure is inappropriate for this retrieval technique.

Good results were shown when performance was measured using INEX_EVAL_NG. NG measures the ratio of relevant to irrelevant information returned. Coverage finds those parts of the document that contain most of the search terms.
The correlation between information density and coverage is reflected in the result.

The results show the best performance when generalized quantization is used. This suggests the ordering of the results is not optimal for strict quantization or the most relevant documents are not ranked before less relevant documents. This may be a consequence of sorting into document order before coverage order.

7. OTAGO AT INEX

The participation process involved the design and contribution of six topics. Of these, four were selected for inclusion in the final topic set. Otago was assigned three of these to assess. The assessment took three people one week each; this was one week per topic.

The retrieval engine described herein was used for designing the contributed topics. This was somewhat problematic as the topic parser was written at the same time the topics were being written, each with few examples. From the final CAS topic set, 19 required corrections, corrections finally running to 12 rounds! This suggests the topic syntax is unnecessarily complex. See our further contribution [6] for a discussion on a possible language to use for future workshops.

The assessors were overburdened by the multitude of obviously irrelevant documents to assess. Examining some of these documents suggests many retrieval engines were aiming at high recall by retrieving any document containing any of the title terms. In particular, the word java appeared in one topic; this was a somewhat popular research area over the years included in the IEEE collection. The assessment task could be reduced by carefully designing topics (and retrieval engines) to avoid this problem.

8. CONCLUSIONS

Element ranking was added to a Boolean ranking hybrid retrieval engine. Relevant documents were identified using Boolean searching. Documents were ranked using Okapi BM25. Finally coverage was used to rank elements within documents.

The results suggest coverage is a good method of identifying relevant and non-overlapping elements. Performance was best for generalized quantization, so ordering is not ideal. This may be a consequence of presenting results in document order.

9. ACKNOWLEDGEMENTS

In addition to the authors, Yerin Yoo contributed to the assessment task. Without her contribution we would not have been able to complete the task. This work was supported by University of Otago Research Grant (UORG) funding.

REFERENCES

[1] Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E., Yergeau, F., & Cowan, J. (2003). Extensible markup language (XML) 1.1 W3C proposed recommendation. The World Wide Web Consortium. Available: [2003.
[2] Fuhr, N., Gövert, N., Kazai, G., & Lalmas, M. (2002). INEX: Initiative for the evaluation of XML retrieval. In Proceedings of the ACM SIGIR 2000 Workshop on XML and Information Retrieval.
[3] Harman, D. (1993). Overview of the first TREC conference. In Proceedings of the 16th ACM SIGIR Conference on Information Retrieval, (pp ).
[4] Hitchcock, S., Quek, F., Carr, L., Hall, W., Witbrock, A., & Tarr, I. (1988). Towards universal linking for electronic journals. Serials Review, 24(1),
[5] Kazai, G., & Rölleke, T. (2002). A scalable architecture for XML retrieval. In Proceedings of the 1st workshop of the initiative for the evaluation of XML retrieval (INEX), (pp ).
[6] O'Keefe, R. A., & Trotman, A. (2003). The simplest query language that could possibly work. In Proceedings of the 2nd workshop of the initiative for the evaluation of XML retrieval (INEX).
[7] Robertson, S. E., Walker, S., Beaulieu, M. M., Gatford, M., & Payne, A. (1995). Okapi at TREC-4. In Proceedings of the 4th Text REtrieval Conference (TREC-4), (pp ).
[8] Trotman, A. (2003). Searching structured documents. Information Processing & Management, (to appear) doi: /s (03) , available on ScienceDirect since 6 June
[9] Wilkinson, R. (1994). Effective retrieval of structured documents. In Proceedings of the 17th ACM SIGIR Conference on Information Retrieval, (pp ).


More information

Formulating XML-IR Queries

Formulating XML-IR Queries Alan Woodley Faculty of Information Technology, Queensland University of Technology PO Box 2434. Brisbane Q 4001, Australia ap.woodley@student.qut.edu.au Abstract: XML information retrieval systems differ

More information

Comparative Analysis of Clicks and Judgments for IR Evaluation

Comparative Analysis of Clicks and Judgments for IR Evaluation Comparative Analysis of Clicks and Judgments for IR Evaluation Jaap Kamps 1,3 Marijn Koolen 1 Andrew Trotman 2,3 1 University of Amsterdam, The Netherlands 2 University of Otago, New Zealand 3 INitiative

More information

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Sound and Complete Relevance Assessment for XML Retrieval

Sound and Complete Relevance Assessment for XML Retrieval Sound and Complete Relevance Assessment for XML Retrieval Benjamin Piwowarski Yahoo! Research Latin America Santiago, Chile Andrew Trotman University of Otago Dunedin, New Zealand Mounia Lalmas Queen Mary,

More information

Sound ranking algorithms for XML search

Sound ranking algorithms for XML search Sound ranking algorithms for XML search Djoerd Hiemstra 1, Stefan Klinger 2, Henning Rode 3, Jan Flokstra 1, and Peter Apers 1 1 University of Twente, 2 University of Konstanz, and 3 CWI hiemstra@cs.utwente.nl,

More information

Evaluating the effectiveness of content-oriented XML retrieval

Evaluating the effectiveness of content-oriented XML retrieval Evaluating the effectiveness of content-oriented XML retrieval Norbert Gövert University of Dortmund Norbert Fuhr University of Duisburg-Essen Gabriella Kazai Queen Mary University of London Mounia Lalmas

More information

Melbourne University at the 2006 Terabyte Track

Melbourne University at the 2006 Terabyte Track Melbourne University at the 2006 Terabyte Track Vo Ngoc Anh William Webber Alistair Moffat Department of Computer Science and Software Engineering The University of Melbourne Victoria 3010, Australia Abstract:

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

More on indexing CE-324: Modern Information Retrieval Sharif University of Technology

More on indexing CE-324: Modern Information Retrieval Sharif University of Technology More on indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Plan

More information

CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation"

CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation" All slides Addison Wesley, Donald Metzler, and Anton Leuski, 2008, 2012! Evaluation" Evaluation is key to building

More information

A Practical Passage-based Approach for Chinese Document Retrieval

A Practical Passage-based Approach for Chinese Document Retrieval A Practical Passage-based Approach for Chinese Document Retrieval Szu-Yuan Chi 1, Chung-Li Hsiao 1, Lee-Feng Chien 1,2 1. Department of Information Management, National Taiwan University 2. Institute of

More information

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation

More information

Evaluation Metrics. Jovan Pehcevski INRIA Rocquencourt, France

Evaluation Metrics. Jovan Pehcevski INRIA Rocquencourt, France Evaluation Metrics Jovan Pehcevski INRIA Rocquencourt, France jovan.pehcevski@inria.fr Benjamin Piwowarski Yahoo! Research Latin America bpiwowar@yahoo-inc.com SYNONYMS Performance metrics; Evaluation

More information

COSC 431 Information Retrieval. Phrase Search & Structured Search

COSC 431 Information Retrieval. Phrase Search & Structured Search COSC 431 Information Retrieval Phrase Search & Structured Search 1 Phrase Searching What are Structured documents Meta-data Structured documents Outline Searching Structured documents Meta-data Semi-structured

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Manning, Raghavan, and Schütze http://www.informationretrieval.org OVERVIEW Introduction Basic XML Concepts Challenges

More information

Indexing and Query Processing. What will we cover?

Indexing and Query Processing. What will we cover? Indexing and Query Processing CS 510 Winter 2007 1 What will we cover? Key concepts and terminology Inverted index structures Organization, creation, maintenance Compression Distribution Answering queries

More information

indexing and query processing. The inverted le was constructed for the retrieval target collection which contains full texts of two years' Japanese pa

indexing and query processing. The inverted le was constructed for the retrieval target collection which contains full texts of two years' Japanese pa Term Distillation in Patent Retrieval Hideo Itoh Hiroko Mano Yasushi Ogawa Software R&D Group, RICOH Co., Ltd. 1-1-17 Koishikawa, Bunkyo-ku, Tokyo 112-0002, JAPAN fhideo,mano,yogawag@src.ricoh.co.jp Abstract

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

Verbose Query Reduction by Learning to Rank for Social Book Search Track

Verbose Query Reduction by Learning to Rank for Social Book Search Track Verbose Query Reduction by Learning to Rank for Social Book Search Track Messaoud CHAA 1,2, Omar NOUALI 1, Patrice BELLOT 3 1 Research Center on Scientific and Technical Information 05 rue des 03 frères

More information

Efficient query processing

Efficient query processing Efficient query processing Efficient scoring, distributed query processing Web Search 1 Ranking functions In general, document scoring functions are of the form The BM25 function, is one of the best performing:

More information

Retrieval Evaluation

Retrieval Evaluation Retrieval Evaluation - Reference Collections Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, Chapter

More information

Structured Queries in XML Retrieval

Structured Queries in XML Retrieval Structured Queries in XML Retrieval Jaap Kamps 1,2 Maarten Marx 2 Maarten de Rijke 2 Börkur Sigurbjörnsson 2 1 Archives and Information Studies, University of Amsterdam, Amsterdam, The Netherlands 2 Informatics

More information

Overview of the INEX 2008 Ad Hoc Track

Overview of the INEX 2008 Ad Hoc Track Overview of the INEX 2008 Ad Hoc Track Jaap Kamps 1, Shlomo Geva 2, Andrew Trotman 3, Alan Woodley 2, and Marijn Koolen 1 1 University of Amsterdam, Amsterdam, The Netherlands {kamps,m.h.a.koolen}@uva.nl

More information

Document Structure Analysis in Associative Patent Retrieval

Document Structure Analysis in Associative Patent Retrieval Document Structure Analysis in Associative Patent Retrieval Atsushi Fujii and Tetsuya Ishikawa Graduate School of Library, Information and Media Studies University of Tsukuba 1-2 Kasuga, Tsukuba, 305-8550,

More information

Information Retrieval: Retrieval Models

Information Retrieval: Retrieval Models CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models

More information

Overview of the INEX 2008 Ad Hoc Track

Overview of the INEX 2008 Ad Hoc Track Overview of the INEX 2008 Ad Hoc Track Jaap Kamps 1, Shlomo Geva 2, Andrew Trotman 3, Alan Woodley 2, and Marijn Koolen 1 1 University of Amsterdam, Amsterdam, The Netherlands {kamps,m.h.a.koolen}@uva.nl

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Masaki Eto Gakushuin Women s College Tokyo, Japan masaki.eto@gakushuin.ac.jp Abstract. To improve the search performance

More information

Classification and Comparative Analysis of Passage Retrieval Methods

Classification and Comparative Analysis of Passage Retrieval Methods Classification and Comparative Analysis of Passage Retrieval Methods Dr. K. Saruladha, C. Siva Sankar, G. Chezhian, L. Lebon Iniyavan, N. Kadiresan Computer Science Department, Pondicherry Engineering

More information

Configurable Indexing and Ranking for XML Information Retrieval

Configurable Indexing and Ranking for XML Information Retrieval Configurable Indexing and Ranking for XML Information Retrieval Shaorong Liu, Qinghua Zou and Wesley W. Chu UCL Computer Science Department, Los ngeles, C, US 90095 {sliu, zou, wwc}@cs.ucla.edu BSTRCT

More information

Score Region Algebra: Building a Transparent XML-IR Database

Score Region Algebra: Building a Transparent XML-IR Database Vojkan Mihajlović Henk Ernst Blok Djoerd Hiemstra Peter M. G. Apers Score Region Algebra: Building a Transparent XML-IR Database Centre for Telematics and Information Technology (CTIT) Faculty of Electrical

More information

Relevancy Workbench Module. 1.0 Documentation

Relevancy Workbench Module. 1.0 Documentation Relevancy Workbench Module 1.0 Documentation Created: Table of Contents Installing the Relevancy Workbench Module 4 System Requirements 4 Standalone Relevancy Workbench 4 Deploy to a Web Container 4 Relevancy

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Analyzing the performance of top-k retrieval algorithms. Marcus Fontoura Google, Inc

Analyzing the performance of top-k retrieval algorithms. Marcus Fontoura Google, Inc Analyzing the performance of top-k retrieval algorithms Marcus Fontoura Google, Inc This talk Largely based on the paper Evaluation Strategies for Top-k Queries over Memory-Resident Inverted Indices, VLDB

More information

Efficiency vs. Effectiveness in Terabyte-Scale IR

Efficiency vs. Effectiveness in Terabyte-Scale IR Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval Stefan Büttcher Charles L. A. Clarke University of Waterloo, Canada November 17, 2005 1 2 3 4 5 6 What is Wumpus? Multi-user file system

More information

Instructor: Stefan Savev

Instructor: Stefan Savev LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information

More information

Query Evaluation Strategies

Query Evaluation Strategies Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Research (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa

More information

Using XML Logical Structure to Retrieve (Multimedia) Objects

Using XML Logical Structure to Retrieve (Multimedia) Objects Using XML Logical Structure to Retrieve (Multimedia) Objects Zhigang Kong and Mounia Lalmas Queen Mary, University of London {cskzg,mounia}@dcs.qmul.ac.uk Abstract. This paper investigates the use of the

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson Search Engines Informa1on Retrieval in Prac1ce Annotations by Michael L. Nelson All slides Addison Wesley, 2008 Indexes Indexes are data structures designed to make search faster Text search has unique

More information

CS646 (Fall 2016) Homework 1

CS646 (Fall 2016) Homework 1 CS646 (Fall 2016) Homework 1 Deadline: 11:59pm, Sep 28th, 2016 (EST) Access the following resources before you start working on HW1: Download the corpus file on Moodle: acm corpus.gz (about 90 MB). Check

More information

INEX a Broadly Accepted Data Set for XML Database Processing?

INEX a Broadly Accepted Data Set for XML Database Processing? INEX a Broadly Accepted Data Set for XML Database Processing? Pavel Loupal and Michal Valenta Dept. of Computer Science and Engineering FEE, Czech Technical University Karlovo náměstí 13, 121 35 Praha

More information

Heading-Based Sectional Hierarchy Identification for HTML Documents

Heading-Based Sectional Hierarchy Identification for HTML Documents Heading-Based Sectional Hierarchy Identification for HTML Documents 1 Dept. of Computer Engineering, Boğaziçi University, Bebek, İstanbul, 34342, Turkey F. Canan Pembe 1,2 and Tunga Güngör 1 2 Dept. of

More information

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LVII, Number 4, 2012 CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL IOAN BADARINZA AND ADRIAN STERCA Abstract. In this paper

More information

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf

More information

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde

More information

Informativeness for Adhoc IR Evaluation:

Informativeness for Adhoc IR Evaluation: Informativeness for Adhoc IR Evaluation: A measure that prevents assessing individual documents Romain Deveaud 1, Véronique Moriceau 2, Josiane Mothe 3, and Eric SanJuan 1 1 LIA, Univ. Avignon, France,

More information

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression

Web Information Retrieval. Lecture 4 Dictionaries, Index Compression Web Information Retrieval Lecture 4 Dictionaries, Index Compression Recap: lecture 2,3 Stemming, tokenization etc. Faster postings merges Phrase queries Index construction This lecture Dictionary data

More information

Static Pruning of Terms In Inverted Files

Static Pruning of Terms In Inverted Files In Inverted Files Roi Blanco and Álvaro Barreiro IRLab University of A Corunna, Spain 29th European Conference on Information Retrieval, Rome, 2007 Motivation : to reduce inverted files size with lossy

More information

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Ralf Moeller Hamburg Univ. of Technology Acknowledgement Slides taken from presentation material for the following

More information

CADIAL Search Engine at INEX

CADIAL Search Engine at INEX CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY Rashmi Gadbail,, 2013; Volume 1(8): 783-791 INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK EFFECTIVE XML DATABASE COMPRESSION

More information

My name is Brian Pottle. I will be your guide for the next 45 minutes of interactive lectures and review on this lesson.

My name is Brian Pottle. I will be your guide for the next 45 minutes of interactive lectures and review on this lesson. Hello, and welcome to this online, self-paced lesson entitled ORE Embedded R Scripts: SQL Interface. This session is part of an eight-lesson tutorial series on Oracle R Enterprise. My name is Brian Pottle.

More information

Information Technology Department, PCCOE-Pimpri Chinchwad, College of Engineering, Pune, Maharashtra, India 2

Information Technology Department, PCCOE-Pimpri Chinchwad, College of Engineering, Pune, Maharashtra, India 2 Volume 5, Issue 5, May 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Adaptive Huffman

More information

Term Frequency Normalisation Tuning for BM25 and DFR Models

Term Frequency Normalisation Tuning for BM25 and DFR Models Term Frequency Normalisation Tuning for BM25 and DFR Models Ben He and Iadh Ounis Department of Computing Science University of Glasgow United Kingdom Abstract. The term frequency normalisation parameter

More information

Query Phrase Expansion using Wikipedia for Patent Class Search

Query Phrase Expansion using Wikipedia for Patent Class Search Query Phrase Expansion using Wikipedia for Patent Class Search 1 Bashar Al-Shboul, Sung-Hyon Myaeng Korea Advanced Institute of Science and Technology (KAIST) December 19 th, 2011 AIRS 11, Dubai, UAE OUTLINE

More information

Advanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016

Advanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016 Advanced Topics in Information Retrieval Learning to Rank Vinay Setty vsetty@mpi-inf.mpg.de Jannik Strötgen jannik.stroetgen@mpi-inf.mpg.de ATIR July 14, 2016 Before we start oral exams July 28, the full

More information

IITH at CLEF 2017: Finding Relevant Tweets for Cultural Events

IITH at CLEF 2017: Finding Relevant Tweets for Cultural Events IITH at CLEF 2017: Finding Relevant Tweets for Cultural Events Sreekanth Madisetty and Maunendra Sankar Desarkar Department of CSE, IIT Hyderabad, Hyderabad, India {cs15resch11006, maunendra}@iith.ac.in

More information

Chapter 8. Evaluating Search Engine

Chapter 8. Evaluating Search Engine Chapter 8 Evaluating Search Engine Evaluation Evaluation is key to building effective and efficient search engines Measurement usually carried out in controlled laboratory experiments Online testing can

More information

A Tree-based Inverted File for Fast Ranked-Document Retrieval

A Tree-based Inverted File for Fast Ranked-Document Retrieval A Tree-based Inverted File for Fast Ranked-Document Retrieval Wann-Yun Shieh Tien-Fu Chen Chung-Ping Chung Department of Computer Science and Information Engineering National Chiao Tung University Hsinchu,

More information

XML ELECTRONIC SIGNATURES

XML ELECTRONIC SIGNATURES XML ELECTRONIC SIGNATURES Application according to the international standard XML Signature Syntax and Processing DI Gregor Karlinger Graz University of Technology Institute for Applied Information Processing

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures

More information