CS47300: Web Information Search and Management


CS47300: Web Information Search and Management
Prof. Chris Clifton, 27 August 2018
Material adapted from the course created by Dr. Luo Si, now leading the Alibaba research group.

Ad-hoc IR: Basic Process
[Diagram: an information need is represented as a query, documents are represented as indexed objects, a retrieval model matches the two to produce retrieved objects, and the retrieved objects are evaluated.]

Text Representation: Process of Indexing
- Document parser: extract useful fields and useful tokens (lex/yacc)
- Text preprocessing: stopword removal, stemming, phrase extraction, etc.
- Indexer: full-text indexing, producing a term dictionary, inverted lists, and document attributes

Text Representation: Inverted Lists
- Inverted lists are one of the most common indexing techniques.
- Source file: the collection organized by document.
- Inverted list file: the collection organized by term; one record per term, listing the documents that contain that term.
- Possible actions on inverted lists: OR, the union of lists; AND, the intersection of lists (see the sketch below).
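A minimal sketch of how a toy inverted index and the AND/OR list operations might look, assuming whitespace tokenization in place of the real parsing, stopword removal, and stemming steps described above; function and variable names are illustrative, not from the course.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Toy inverted index: term -> sorted list of doc ids (a posting list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # stand-in for real parsing/stemming
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def boolean_and(list_a, list_b):
    """AND = intersection of two posting lists."""
    return sorted(set(list_a) & set(list_b))

def boolean_or(list_a, list_b):
    """OR = union of two posting lists."""
    return sorted(set(list_a) | set(list_b))

docs = {1: "airport security measures", 2: "airport parking", 3: "security camera"}
index = build_inverted_index(docs)
print(boolean_and(index["airport"], index["security"]))  # [1]
print(boolean_or(index["airport"], index["security"]))   # [1, 2, 3]
```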

Text Representation: Inverted Lists
[Diagram: documents and their corresponding inverted lists.]
There are many engineering details:
- Updating inverted lists: deleting or inserting a term or a document
- Compression: a trade-off between I/O time and CPU time (one possible scheme is sketched below)
- Adding more information, such as term positions
- Take CS34800 and CS44800 for even more
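The slides leave compression as a pointer; as one concrete possibility (an assumption, not necessarily the scheme discussed in the course), a posting list can be stored as gaps between document ids, with each gap compressed by variable-byte encoding:

```python
def delta_encode(postings):
    """Store gaps between consecutive doc ids instead of the ids themselves;
    gaps are small numbers and compress well."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def vbyte_encode(number):
    """Variable-byte encoding: 7 data bits per byte, high bit marks the last byte."""
    bytes_out = []
    while True:
        bytes_out.insert(0, number % 128)
        if number < 128:
            break
        number //= 128
    bytes_out[-1] += 128          # set the stop bit on the final byte
    return bytes(bytes_out)

postings = [4, 10, 11, 200, 1025]
gaps = delta_encode(postings)      # [4, 6, 1, 189, 825]
encoded = b"".join(vbyte_encode(g) for g in gaps)
print(list(encoded))
```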

Ad-hoc IR: Basic Process
[Diagram repeated: information need → query; documents → indexed objects; retrieval model → retrieved objects → evaluation.]

Evaluation: What Do We Evaluate?
- Effectiveness: How do we define "effective"? Where can we find the correct answers?
- Efficiency: Retrieval speed? Storage space? Particularly important for large-scale real-world systems.
- Usability: What do real users really want? Is the user interface important to IR evaluation?

Evaluation Criteria
- Effectiveness: favor returned ranked lists with more relevant documents at the top.
- Objective measures: recall and precision, mean average precision, rank-based precision.
- For documents in a subset of a ranked list, if we know the truth:
  Precision = (relevant docs retrieved) / (retrieved docs)
  Recall = (relevant docs retrieved) / (relevant docs)
  (a brief worked example follows)

Evaluation: Ground Truth
- Question: how do we find all relevant documents? Difficult for the Web, but possible on a controllable corpus.
- Finding all relevant documents is hard: checking them one by one is infeasible.
- Judges may make inconsistent decisions (relevance judgment is subjective).
- The pooling process addresses this.
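For instance (an illustrative example, not from the slides): if a system retrieves 10 documents, 4 of them are relevant, and the collection contains 20 relevant documents in total, then Precision = 4/10 = 0.4 and Recall = 4/20 = 0.2.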

Evaluation: Inconsistent Judgment
- People may not agree on the right answer: some think a document is relevant to the query, others do not.
- Discussion among multiple judges reduces bias; judgments from multiple judges can also be combined, e.g., by majority vote.
- If a decision is hard for human judges, it is likely to be hard for an automatic system.

Evaluation: Pooling Strategy
- Retrieve documents using multiple methods.
- Judge the top n documents from each method; the judged set is the union of the top retrieved documents from all methods (see the sketch below).
- Problem: the judged relevant documents may not be complete.
- The total number of relevant documents can be estimated by random sampling.
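A minimal sketch of the pooling step, assuming each system's run is simply a ranked list of document ids; the pool depth and data layout are illustrative.

```python
def build_pool(runs, depth=100):
    """Pooling: take the union of the top-`depth` documents from each system's
    ranked run; only documents in the pool are judged by assessors."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool

runs = {
    "system_1": ["d3", "d7", "d1", "d9"],
    "system_2": ["d7", "d2", "d3", "d8"],
}
print(sorted(build_pool(runs, depth=3)))  # ['d1', 'd2', 'd3', 'd7']
```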

Evaluation: Pooling Strategy
[Diagram: ranked results from System 1 through System N are pooled for judgment.]

CS47300: Web Information Search and Management
Prof. Chris Clifton, 29 August 2018
Material adapted from the course created by Dr. Luo Si, now leading the Alibaba research group.

Unranked Measures
Precision = (# relevant retrieved) / (# retrieved)
Recall = (# relevant retrieved) / (# relevant)
F1 score = 2PR / (P + R)

Evaluating a Ranked List
- Precision-at-recall evaluation: evaluate precision at every relevant document (see the sketch below).
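A small sketch of these measures, assuming the retrieved ranking is a list of document ids and the relevant set is known from judgments; names are illustrative.

```python
def precision_recall_f1(retrieved, relevant):
    """Unranked (set-based) measures over a retrieved set and the relevant set."""
    hits = len(set(retrieved) & set(relevant))
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

def precision_at_each_relevant(ranking, relevant):
    """For a ranked list, record precision at the rank of every relevant document."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((rank, hits / rank))
    return points

ranking = ["d2", "d9", "d4", "d1", "d7"]
relevant = {"d2", "d4", "d8"}
print(precision_recall_f1(ranking, relevant))         # (0.4, 0.67, 0.5) approximately
print(precision_at_each_relevant(ranking, relevant))  # [(1, 1.0), (3, 0.67)] approximately
```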

Ranked Metrics: Single-Number Summaries
- Mean average precision (MAP): calculate precision at each relevant document; average over all precision values.
- 11-point interpolated average precision: calculate precision at standard recall points (e.g., 10%, 20%, ...), smooth the values, estimate the 0% point by interpolation, and average the results.

Evaluation: Single-Value Metrics
- Rank-based precision: calculate precision at the top-ranked documents (e.g., 5, 10, 15); desirable when users care most about top-ranked documents.
- Mean reciprocal rank (MRR): the reciprocal rank is 1/rank of the first relevant document; MRR averages the reciprocal rank over many queries (both are sketched below).
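A sketch of average precision, rank-based precision, and reciprocal rank, averaged over queries to give MAP and MRR; the per-query data layout is an assumption for illustration.

```python
def average_precision(ranking, relevant):
    """AP: precision at each relevant document, averaged over all relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranking, relevant, k):
    """Rank-based precision: fraction of the top-k documents that are relevant."""
    return sum(doc in relevant for doc in ranking[:k]) / k

def reciprocal_rank(ranking, relevant):
    """1/rank of the first relevant document (0 if none is retrieved)."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# MAP and MRR are the means of AP and RR over a set of queries.
queries = {
    "q1": (["d2", "d9", "d4"], {"d2", "d4"}),
    "q2": (["d5", "d1", "d3"], {"d3"}),
}
aps = [average_precision(r, rel) for r, rel in queries.values()]
rrs = [reciprocal_rank(r, rel) for r, rel in queries.values()]
print(sum(aps) / len(aps), sum(rrs) / len(rrs))  # MAP, MRR over the two queries
```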

Evaluation: Example
[Worked evaluation example shown on the slide.]

Evaluation: TREC
TREC collections with queries and relevance judgments:
- TREC CDs 1-5: 1.5 million docs, 5 GB; news and government reports (e.g., AP, WSJ, Dept. of Energy abstracts)
- TREC WT10g: crawled from the Web (open domain), 1.7 million docs, 10 GB
- TREC Terabyte: crawled from U.S. government Web pages, 25 million docs, 426 GB
All have more than 100 queries with relevance judgments.

Evaluation: TREC
TREC query (topic) example:
<title> airport security
<desc> Description: What security measures are in effect or are proposed to go into effect in airports?
<narr> Narrative: A relevant document could identify a specific airport and describe the security measures already in effect or proposed for use at that airport. Relevant items could also describe a failure of security that was cited as a contributing cause of a tragedy which came to pass or which was later averted. Comparisons between and among airports based on the effectiveness of the security of each are also relevant.

TREC relevance judgment example (topic id, document id, relevance):
451 WTX058-B50-85 0
451 WTX059-B06-411 0
451 WTX059-B07-154 0
451 WTX059-B09-203 0
451 WTX059-B11-245 0
451 WTX059-B30-262 1
451 WTX059-B37-11 0
451 WTX059-B37-149 1
451 WTX059-B37-217 0
451 WTX059-B37-268 0
451 WTX059-B37-27 0
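A small sketch of reading judgments in the three-column layout shown above (topic id, document id, relevance); note that official TREC qrels files usually carry an extra iteration column between the topic and document ids, so the exact format should be checked before use.

```python
from collections import defaultdict

def load_judgments(path):
    """Read 'topic doc_id relevance' lines; return {topic: set of relevant doc ids}."""
    relevant = defaultdict(set)
    with open(path) as f:
        for line in f:
            topic, doc_id, rel = line.split()
            if int(rel) > 0:
                relevant[topic].add(doc_id)
    return relevant

# e.g., load_judgments("qrels.txt")["451"] -> {'WTX059-B30-262', 'WTX059-B37-149'}
```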

Review to Date
- Basic concepts of information retrieval: task definition of ad-hoc IR; terminologies and concepts; overview of retrieval models
- Text representation: indexing; text preprocessing
- Evaluation: evaluation methodology; evaluation metrics