Passage Retrieval and other XML-Retrieval Tasks. Andrew Trotman (Otago) Shlomo Geva (QUT)


Passage Retrieval

Information Retrieval Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet or intranets, for text, sound, images or data. (Wikipedia, 2006)

XML-Retrieval
Information retrieval from document collections in which at least one component is marked up in XML. What is a document? Yes, well, let's not go there. What is a component? Let's not go there either.

XML Element Retrieval
Aim: to identify a more focused result for a query than a whole document, that is, to identify a relevant fragment of a document. This involves identifying two factors:
- the location of the relevant fragment
- the size of the relevant fragment
where the fragment is an XML element.

An XML Document

<?xml version="1.0" encoding="utf-8"?>
<article>
  <name id="59186">Dreamland, Michigan</name>
  <conversionwarning>0</conversionwarning>
  <body>
    <redirectlink src="Torch Lake Township, Houghton County, Michigan">
      Torch Lake Township, Houghton County, Michigan
    </redirectlink>
  </body>
</article>
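As a sketch of what element retrieval works over, the article above can be parsed with Python's standard ElementTree; every element then becomes a candidate retrieval unit. The snippet simply enumerates them:

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0" encoding="utf-8"?>
<article>
  <name id="59186">Dreamland, Michigan</name>
  <conversionwarning>0</conversionwarning>
  <body>
    <redirectlink src="Torch Lake Township, Houghton County, Michigan">
      Torch Lake Township, Houghton County, Michigan
    </redirectlink>
  </body>
</article>"""

root = ET.fromstring(doc)

# Enumerate every element with its tag and (stripped) text content --
# element retrieval treats each of these as a candidate answer unit.
for elem in root.iter():
    text = (elem.text or "").strip()
    print(elem.tag, repr(text))
```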

Why XML Element Retrieval?
XML is used to identify semantic elements, and the user's information need is semantic. Given that XML is used correctly and the query is sensible, a reasonable assumption is that the best fragments are whole XML elements. The task is then to identify these elements.

Really?
Are XML elements really the best fragments? We draw our conclusions from other studies.

Agreement Levels
Pehcevski, Thom & Vercoustre (2005), on topics used in the INEX 2004 interactive experiments: assessors agreed only at the ends of the relevance scale (E0S0 and E3S3). This is also true of the Cystic Fibrosis collection.
Conclusion: obviously relevant and obviously not relevant are obvious; intermediate levels of relevance are debatable; the 10-point relevance scale isn't needed.

Agreement Levels
Trotman (2005): 12 topics double judged at INEX 2004. Pehcevski & Thom (2005): 5 topics double judged at INEX 2005.

Evaluation             Agreement ( / )
TREC 4 P/B             0.49
TREC 4 A/B             0.43
TREC 4 P/A             0.42
TREC 6                 0.33
INEX 2004 documents    0.27
INEX 2004 elements     0.16
INEX 2005 documents    0.39
INEX 2005 elements     0.24

Please, someone, Kappa these.
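The slide asks for kappa values. As a minimal sketch, Cohen's kappa corrects raw overlap for the agreement expected by chance; the judgment lists below are hypothetical, not the INEX or TREC data:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two assessors' binary relevance judgments.

    `a` and `b` are equal-length lists of 0/1 judgments over the same
    pooled documents. Kappa discounts the agreement two assessors
    would reach by chance, which raw overlap figures do not.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each assessor's marginal "relevant" rate.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:          # both assessors unanimous: define as 1
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical judgments, just to show the calculation:
j1 = [1, 1, 0, 0, 1, 0, 1, 0]
j2 = [1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(j1, j2), 3))   # -> 0.5
```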

Passages and Elements
At INEX 2005, judges:
- were presented with pool documents
- marked relevant passages in those documents
- set exhaustivity values for elements in the passages
The exhaustivity of an element was this score; the specificity was the proportion of its text highlighted.
Conclusion: judging passages is easier than judging elements.

Which Elements are Relevant?
Trotman & Lalmas (2006), INEX 2005 judgments: regardless of query (target element), most relevant elements were paragraphs; regardless of thorough / focused, most relevant elements were paragraphs.
Conclusion: assessors are identifying relevant sequences of paragraphs.

How Relevant?
Piwowarski, Trotman & Lalmas (2006), INEX 2005 judgments, examined the average specificity of elements and found:

Element     Average Specificity
Paragraph   0.94
Section     0.51
Body        0.15
Article     0.12

Conclusion: paragraphs are either relevant or not; judges are identifying collections of paragraphs.
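Specificity as the proportion of highlighted text can be computed directly from character offsets. A minimal sketch, assuming non-overlapping highlighted spans and illustrative offsets (not the INEX assessment format):

```python
def specificity(element_span, highlighted_spans):
    """Specificity of an element = fraction of its text highlighted
    as relevant by the assessor.

    Spans are (start, end) character offsets, end exclusive;
    highlighted spans are assumed non-overlapping. Illustrative
    reconstruction of the measure, not official INEX code.
    """
    e_start, e_end = element_span
    length = e_end - e_start
    covered = 0
    for h_start, h_end in highlighted_spans:
        # Length of the intersection of the highlight with the element.
        covered += max(0, min(e_end, h_end) - max(e_start, h_start))
    return covered / length

# A fully highlighted 100-char paragraph vs. a 1000-char section that
# contains only that highlight mirrors the pattern in the table above:
print(specificity((0, 100), [(0, 100)]))    # paragraph -> 1.0
print(specificity((0, 1000), [(0, 100)]))   # section   -> 0.1
```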

Elemental Passages
Piwowarski, Trotman & Lalmas (2006), INEX 2005 judgments. An elemental passage is a passage that is an element: it starts and ends on the boundaries of a single element. Found: 36% of passages were elemental; 64% of passages were not.
Conclusion: judges identified relevant passages, not relevant elements.
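The elemental test above is just a boundary check. A sketch under assumed data structures (element paths and character offsets are illustrative, not the INEX format):

```python
def is_elemental(passage, element_spans):
    """A passage is elemental if its character extent coincides exactly
    with the extent of some single element.

    `passage` is a (start, end) offset pair; `element_spans` maps
    element paths to their (start, end) extents. Names hypothetical.
    """
    return passage in element_spans.values()

elements = {
    "/article[1]/body[1]/p[1]": (0, 120),
    "/article[1]/body[1]/p[2]": (120, 300),
    "/article[1]/body[1]":      (0, 300),
}
print(is_elemental((0, 120), elements))   # True: exactly p[1]
print(is_elemental((50, 200), elements))  # False: crosses element boundaries
```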

Elements and Assessment
Ogilvie & Lalmas (2006), INEX 2005 judgments, examined the stability of metrics using specificity alone. They discovered that removing exhaustivity from the assessments leaves the relative performance of the search engines stable.
Conclusion: specificity alone is sufficient for assessment purposes, and specificity is based on highlighting passages, not elements.
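"Relative performance remains stable" is usually tested with a rank correlation between the two system orderings. A sketch of Kendall's tau over hypothetical system rankings (the system names and rankings are invented for illustration):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two orderings of the same systems: the
    usual way to check whether changing the assessments (e.g. dropping
    exhaustivity) leaves the relative ranking of search engines stable.
    Assumes both rankings are permutations of the same set, no ties.
    """
    assert set(rank_a) == set(rank_b)
    pos_b = {s: i for i, s in enumerate(rank_b)}
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # Pair ordered the same way in both rankings -> concordant.
        if pos_b[rank_a[i]] < pos_b[rank_a[j]]:
            concordant += 1
        else:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical rankings before and after dropping exhaustivity:
with_exh  = ["sysA", "sysB", "sysC", "sysD"]
spec_only = ["sysA", "sysC", "sysB", "sysD"]
print(round(kendall_tau(with_exh, spec_only), 3))   # one swapped pair
```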

Passage Agreement?
INEX 2005: Piwowarski, Trotman & Lalmas (2006) measured passage agreement; Pehcevski & Thom (2005) measured document and element agreement.

Evaluation           Agreement ( / )
Documents (binary)   0.39
Elements (exact)     0.24
Passages             0.23

XML Passage Retrieval
The case for passage retrieval is compelling:
- element agreement levels are higher with highlighting
- the relevant text is a collection of paragraphs, not an element
- assessment is stable with highlighting
- passage agreement levels look fine
Some problems vanish: too-small elements can't occur, and conversion of highlights to elements isn't needed.

Three New Tasks
Focused Retrieval: the identification of non-overlapping passages of text relevant to the user's information need, sorted by passage relevance.
Relevant in Context: the identification of non-overlapping passages of text relevant to the user's information need, sorted by document and sequential within each document.
Best in Context: the identification of the best entry point (BEP) within a document, sorted by document, one per document this time.

Transitioning to Passages
The transition from elements to passages can be gradual. To convert from an element to a passage is straightforward: use the element as a passage, and possibly merge adjacent passages. Clarke (2005) gives an XML range specification.
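The element-to-passage step above can be sketched directly: treat each retrieved element's character extent as a passage, then merge passages that touch. The offsets and the merging rule are illustrative assumptions, not the INEX definition:

```python
def elements_to_passages(element_spans):
    """Convert retrieved elements to passages: each element's
    (start, end) character extent is a passage, and passages that are
    adjacent or overlapping are merged into one.
    """
    merged = []
    for start, end in sorted(element_spans):
        if merged and start <= merged[-1][1]:   # adjacent or overlapping
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Two adjacent paragraphs and one separate section yield two passages:
print(elements_to_passages([(0, 120), (120, 300), (500, 800)]))
# -> [(0, 300), (500, 800)]
```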

Passages at TREC
TREC HARD 2003 and 2004; TREC Genomics 2006. Perhaps we should be sharing resources: the same documents in different formats? Metrics? TREC and INEX both already have passage metrics.

Other XML-Retrieval Tasks

The Performance Task
What are the best ranking algorithms for:
- The web? HITS? PageRank? Whatever Google actually uses?
- Unstructured text? BM25? Pivoted length-normalized cosine? Language models?
- XML? Suggestions from the floor, please (focused or thorough)

Five Years
How do we define best? The metrics change from year to year!
Work to be done: identify the current state of the art, change the methodology so we compare to that, and perform statistical tests.
Homework: trawl through old runs with new metrics.
Future: can we find a way to measure incremental improvements? Withhold some judgments for re-use in future years?

Multiple Document Formats
The premise is that XML is helpful in retrieval. Experiments on different granularities of XML might be performed to show this. Does XML actually help? Either reduce the density of tags or use the same document collection in different formats (XML / HTML / text).

Related Articles
Inserting and maintaining links in web pages and the Wikipedia is time-consuming and tedious. Can we identify methods of predicting which pages in a closed set should link to each other? The Wikipedia offers a unique environment: the pages are already cross-linked, so we can measure performance by similarity to the existing links (remove them, then try to predict them). Topics would be Wikipedia documents chosen at random.

Question Answering
The Wikipedia is an obvious collection for question answering. XML elements might directly contain the answers to questions (Wikipedia templates), or XML elements might be used to identify relevant parts of documents from which answers are extracted.

Conclusions
We see that passage retrieval is well suited to XML, while element retrieval is not; passage tasks are easy to specify; and passages are easier to judge. So XML retrieval should focus on passages, not elements. And there are plenty of other tasks too.