Performance Evaluation of Information Retrieval Systems

Similar documents
Optimizing Document Scoring for Query Retrieval

Query Clustering Using a Hybrid Query Similarity Measure

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Life Tables (Times) Summary. Sample StatFolio: lifetable times.sgp

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

CS47300: Web Information Search and Management

Problem Set 3 Solutions

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Active Contours/Snakes

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

An Entropy-Based Approach to Integrated Information Needs Assessment

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION

UB at GeoCLEF Department of Geography Abstract

Relevance Feedback Document Retrieval using Non-Relevant Documents

TN348: Openlab Module - Colocalization

KIDS Lab at ImageCLEF 2012 Personal Photo Retrieval

Information Retrieval

CSE 326: Data Structures Quicksort Comparison Sorting Bound

Simulation Based Analysis of FAST TCP using OMNET++

Module Management Tool in Software Development Organizations

The Effect of Similarity Measures on The Quality of Query Clusters

CSCI 5417 Information Retrieval Systems Jim Martin!

News. Recap: While Loop Example. Reading. Recap: Do Loop Example. Recap: For Loop Example

Machine Learning: Algorithms and Applications

ELEC 377 Operating Systems. Week 6 Class 3

Programming in Fortran 90 : 2017/2018

GSLM Operations Research II Fall 13/14

Wishing you all a Total Quality New Year!

A New Approach For the Ranking of Fuzzy Sets With Different Heights

Virtual Machine Migration based on Trust Measurement of Computer Node

Exercises (Part 4) Introduction to R UCLA/CCPR. John Fox, February 2005

Solutions to Programming Assignment Five Interpolation and Numerical Differentiation

Behavioral Model Extraction of Search Engines Used in an Intelligent Meta Search Engine

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

Intelligent Information Acquisition for Improved Clustering

CS 534: Computer Vision Model Fitting

SAO: A Stream Index for Answering Linear Optimization Queries

Private Information Retrieval (PIR)

Related-Mode Attacks on CTR Encryption Mode

Federated Search of Text-Based Digital Libraries in Hierarchical Peer-to-Peer Networks

Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta-Analytic Tool. Z. Krizan Employing Microsoft Excel 2007

Load Balancing for Hex-Cell Interconnection Network

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

Efficient Distributed File System (EDFS)

Lobachevsky State University of Nizhni Novgorod. Polyhedron. Quick Start Guide

Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros.

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Unsupervised Learning

K-means and Hierarchical Clustering

Sorting: The Big Picture. The steps of QuickSort. QuickSort Example. QuickSort Example. QuickSort Example. Recursive Quicksort

EVALUATION OF THE PERFORMANCES OF ARTIFICIAL BEE COLONY AND INVASIVE WEED OPTIMIZATION ALGORITHMS ON THE MODIFIED BENCHMARK FUNCTIONS

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

Data Mining: Model Evaluation

CE 221 Data Structures and Algorithms

Learning-Based Top-N Selection Query Evaluation over Relational Databases

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

CSE 326: Data Structures Quicksort Comparison Sorting Bound

A KIND OF ROUTING MODEL IN PEER-TO-PEER NETWORK BASED ON SUCCESSFUL ACCESSING RATE

Analysis of Continuous Beams in General

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

3. CR parameters and Multi-Objective Fitness Function

Keyword-based Document Clustering

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory

Today s Outline. Sorting: The Big Picture. Why Sort? Selection Sort: Idea. Insertion Sort: Idea. Sorting Chapter 7 in Weiss.

Shape Representation Robust to the Sketching Order Using Distance Map and Direction Histogram

Support Vector Machines

Unsupervised Learning and Clustering

Greedy Technique - Definition

Information Retrieval. (M&S Ch 15)

A Knowledge Management System for Organizing MEDLINE Database

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Report on On-line Graph Coloring

Adjustment methods for differential measurement errors in multimode surveys

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

Improving Web Image Search using Meta Re-rankers

Experiments in Text Categorization Using Term Selection by Distance to Transition Point

Research of Dynamic Access to Cloud Database Based on Improved Pheromone Algorithm

Priority-Based Scheduling Algorithm for Downlink Traffics in IEEE Networks

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions

A Novel Term_Class Relevance Measure for Text Categorization

Biostatistics 615/815

S1 Note. Basis functions.

A Generation Model to Unify Topic Relevance and Lexicon-based Sentiment for Opinion Retrieval

Intro. Iterators. 1. Access

An efficient iterative source routing algorithm

Fitting: Deformable contours April 26 th, 2018

Determining Fuzzy Sets for Quantitative Attributes in Data Mining Problems

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Effectiveness of Information Retraction

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System

LS-TaSC Version 2.1. Willem Roux Livermore Software Technology Corporation, Livermore, CA, USA. Abstract

Transcription:

Performance Evaluation of Information Retrieval Systems

Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE), who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong).

Why System Evaluation?

There are many retrieval models, algorithms, and systems; which one is the best? What is the best component for:
- Ranking function (dot-product, cosine, ...)
- Term selection (stopword removal, stemming, ...)
- Term weighting (TF, TF-IDF, ...)
How far down the ranked list will a user need to look to find some or all of the relevant documents?

Difficulties in Evaluating IR Systems

Effectiveness is related to the relevancy of retrieved items. Relevancy is not typically binary but continuous, and even when relevancy is treated as binary it can be a difficult judgment to make. Relevancy, from a human standpoint, is:
- Subjective: depends upon a specific user's judgment.
- Situational: relates to the user's current needs.
- Cognitive: depends on human perception and behavior.
- Dynamic: changes over time.

Human Labeled Corpora (Gold Standard)

Start with a corpus of documents and collect a set of queries for this corpus. Have one or more human experts exhaustively label the relevant documents for each query. This typically assumes binary relevance judgments and requires considerable human effort for large document/query corpora.

Precision and Recall

[Figure: Venn diagram over the entire document collection showing the retrieved set and the relevant set; their intersection is "retrieved and relevant", the remaining regions are "retrieved but irrelevant", "not retrieved but relevant", and "not retrieved and irrelevant".]

Precision is the ability to retrieve top-ranked documents that are mostly relevant. Recall is the ability of the search to find all of the relevant items in the corpus.

recall = (number of relevant documents retrieved) / (total number of relevant documents)

precision = (number of relevant documents retrieved) / (total number of retrieved documents)
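To make the two definitions concrete, here is a minimal Python sketch; the function name and the example document IDs are illustrative (not from the slides), and binary relevance judgments given as sets of document IDs are assumed.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall from binary relevance judgments.

    retrieved: document IDs returned by the system for one query
    relevant:  document IDs judged relevant in the gold standard
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                  # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative query: 10 documents retrieved, 6 relevant documents in the
# collection, 4 of which were retrieved.
p, r = precision_recall(range(1, 11), [1, 2, 4, 6, 15, 20])
print(p, r)   # 0.4 0.666...
```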

Determining Recall is Difficult

The total number of relevant items is sometimes not available. Two common workarounds: sample across the database and perform relevance judgments on the sampled items, or apply different retrieval algorithms to the same database for the same query and take the aggregate of the relevant items found as the total relevant set.

Trade-off between Recall and Precision

[Figure: precision (0 to 1) plotted against recall (0 to 1). A system near the upper left "returns relevant documents but misses many useful ones too"; a system near the lower right "returns most relevant documents but includes lots of junk"; the ideal sits at the upper right.]

Computing Recall/Precision Points

For a given query, produce the ranked list of retrievals. Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures. Mark each document in the ranked list that is relevant according to the gold standard, and compute a recall/precision pair for each position in the ranked list that contains a relevant document.

Computing Recall/Precision Points: An Example

n   doc #  relevant
1   588    x
2   589    x
3   576
4   590    x
5   986
6   592    x
7   984
8   988
9   578
10  985
11  103
12  591
13  772    x
14  990

Let the total number of relevant documents be 6. Check each new recall point:
R = 1/6 = 0.167; P = 1/1 = 1
R = 2/6 = 0.333; P = 2/2 = 1
R = 3/6 = 0.5;   P = 3/4 = 0.75
R = 4/6 = 0.667; P = 4/6 = 0.667
R = 5/6 = 0.833; P = 5/13 = 0.38
One relevant document is missing from the list, so the ranking never reaches 100% recall.

Interpolating a Recall/Precision Curve

Interpolate a precision value for each standard recall level:
r_j in {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, i.e. r_0 = 0.0, r_1 = 0.1, ..., r_10 = 1.0.
The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level between the j-th and (j+1)-th level:

P(r_j) = max { P(r) : r_j <= r <= r_{j+1} }

Interpolating a Recall/Precision Curve: An Example

[Figure: the interpolated precision/recall curve for the example above, with precision on the vertical axis and recall on the horizontal axis, both running from 0.0 to 1.0.]
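The recall/precision points of the example, and the interpolation at the eleven standard recall levels, can be reproduced with a short Python sketch. The helper names are made up, the ranked list is encoded as 0/1 relevance flags, and the interpolation below uses the common "maximum precision at any recall >= r_j" convention, a slight variant of the between-levels rule stated above.

```python
def rp_points(relevance_flags, total_relevant):
    """One (recall, precision) pair per rank that holds a relevant document."""
    points, hits = [], 0
    for rank, rel in enumerate(relevance_flags, start=1):
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

def interpolated_precision(points, levels=None):
    """Interpolated precision at the standard recall levels 0.0, 0.1, ..., 1.0."""
    if levels is None:
        levels = [j / 10 for j in range(11)]
    return {round(r, 1): max((p for rec, p in points if rec >= r), default=0.0)
            for r in levels}

# Relevance flags for the 14-document ranked list in the example above:
# relevant documents sit at ranks 1, 2, 4, 6, and 13; the corpus holds 6 in total.
flags = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
points = rp_points(flags, total_relevant=6)
print(points)                          # ends with (0.833, 0.3846...): recall 1.0 is never reached
print(interpolated_precision(points))  # precision 0.0 at recall levels 0.9 and 1.0
```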

Average Recall/Precision Curve

Typically, performance is averaged over a large set of queries: compute the average precision at each standard recall level across all queries, then plot the average precision/recall curve to evaluate overall system performance on a document/query corpus.

Compare Two or More Systems

The curve closest to the upper right-hand corner of the graph indicates the best performance.

[Figure: average precision/recall curves for two runs labeled "Stem" and "NoStem", with precision from 0 to 1 on the vertical axis and recall from 0.1 to 0.9 on the horizontal axis.]

Sample RP Curve for CF Corpus

[Figure: a sample recall/precision curve for the CF corpus.]

Problems with Recall/Precision

Recall/precision and its related measures need a pair of numbers, which is not very intuitive. Single-value measures include:
- R-precision
- F-measure
- E-measure
- Fallout rate
- ESL
- ASL

R-Precision

Precision at the R-th position in the ranking of results for a query that has R relevant documents. For the example ranked list above, R = number of relevant documents = 6, and four of the top 6 documents are relevant, so R-Precision = 4/6 = 0.67.

F-Measure

One measure of performance that takes into account both recall and precision is their harmonic mean:

F = 2PR / (P + R) = 2 / (1/R + 1/P)

Compared to the arithmetic mean, both recall and precision need to be high for the harmonic mean to be high.
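Both of these single-value measures are easy to compute from a judged ranked list. The sketch below (helper names invented, same example list as before) shows R-precision and the harmonic-mean F measure.

```python
def r_precision(relevance_flags, total_relevant):
    """Precision at rank R, where R is the number of relevant documents for the query."""
    return sum(relevance_flags[:total_relevant]) / total_relevant

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

flags = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]   # same example ranked list
print(r_precision(flags, total_relevant=6))           # 4/6 = 0.67, as in the slide
print(f_measure(0.75, 0.5))                           # 0.6: both values must be high for F to be high
```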

E Measure (parameterized F Measure)

A variant of the F measure that allows weighting the relative emphasis on precision versus recall:

E = (1 + beta^2) P R / (beta^2 P + R) = (1 + beta^2) / (beta^2/R + 1/P)

The value of beta controls the trade-off:
- beta = 1: precision and recall are weighted equally (E = F).
- beta > 1: recall is weighted more heavily.
- beta < 1: precision is weighted more heavily.

Fallout Rate

Problems with both precision and recall: the number of irrelevant documents in the collection is not taken into account, recall is undefined when there is no relevant document in the collection, and precision is undefined when no document is retrieved.

fallout = (number of non-relevant items retrieved) / (total number of non-relevant items in the collection)

Other Measures

Expected Search Length (ESL) [Cooper 1968]: the average number of documents that must be examined to retrieve a given number of relevant documents. With N the maximum number of relevant documents requested and e_i the expected search length when i relevant documents are required, the overall measure averages the per-request lengths:

ESL = (1/N) * sum_{i=1}^{N} e_i

Five Types of ESL

- Type 1: A user may just want the answer to a very specific factual question or a single statistic. Only one relevant document is needed to satisfy the search request.
- Type 2: A user may actually want only a fixed number, for example six, of documents relevant to a query.
- Type 3: A user may wish to see all documents relevant to the topic.
- Type 4: A user may want to sample a subject area, but wish to specify the ideal size of the sample as some proportion, say one-tenth, of the relevant documents.
- Type 5: A user may wish to read all relevant documents in case there are fewer than five, and exactly five in case more than five exist.

Other Measures (cont.)

Average Search Length (ASL) [Losee 1998]: the expected position of a relevant document in the ordered list of all documents. With N the total number of documents, Q the probability that the ranking is optimal (perfect), and A the expected proportion of all documents examined in order to reach the average position of a relevant document in an optimal ranking:

ASL = N [ Q A + (1 - Q)(1 - A) ]

Problems

While F-measure, E-measure, ESL, and ASL are single-value measurements, they are not easy to compute, or they are not intuitive, or the data required for the measure are typically not available (e.g., ASL). They do not work well in a web search environment.
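A minimal sketch of the parameterized measure and the fallout rate, following the formulas above (with this form, beta > 1 pulls the value toward recall); the function names and the example numbers are illustrative assumptions.

```python
def e_measure(precision, recall, beta=1.0):
    """Parameterized combination of precision and recall.

    E = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 reduces to the F measure.
    """
    if precision == 0 or recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def fallout(nonrelevant_retrieved, nonrelevant_in_collection):
    """Proportion of the collection's non-relevant documents that were retrieved."""
    return nonrelevant_retrieved / nonrelevant_in_collection

print(e_measure(0.75, 0.5, beta=1))   # 0.6, the harmonic-mean F value
print(e_measure(0.75, 0.5, beta=2))   # ~0.536, pulled toward the (lower) recall
print(fallout(6, 9994))               # illustrative: 6 junk hits out of 9994 non-relevant documents
```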

RankPower

We propose a single, effective measure for interactive information search systems such as the web. It takes into consideration both the placement of the relevant documents and the number of relevant documents in a set of retrieved documents for a given query.

Some Definitions

For a given query, N documents are returned. Among the N returned documents, R_N is the set of relevant documents, with |R_N| = C_N <= N. Each relevant document in R_N is placed at rank L_i. The average rank of the returned relevant documents is

R_avg(N) = (1/C_N) * sum_{i=1}^{C_N} L_i

Some Properties

The average rank is a function of two variables: the individual ranks of the relevant documents and the number of relevant documents. For a fixed C_N, the more documents listed earlier, the more favorable the value is (smaller values are favored). If the size of the returned set increases and the number of relevant documents in it also increases, the average rank increases (it is unbounded). In the ideal case where every single returned document is relevant, the average rank is simply (N + 1)/2.

RankPower Definition

RankPower(N) = R_avg(N) / C_N = ( sum_{i=1}^{C_N} L_i ) / C_N^2

RankPower Properties

- It is a decreasing function of N, since the rate of increase of the denominator (C_N^2) is faster than that of the numerator.
- It is bounded below by 1/2, so the measure can be used as a benchmark to compare different systems.
- It weighs placement very heavily (see the example below); documents placed earlier in the list are much favored.
- If two sets of returned documents have the same average rank, the one with more relevant documents is favored.

Examples

Compare two systems, each of which returns a list of the same number of documents. System A has two relevant documents listed 1st and 2nd, giving a RankPower of (1 + 2)/2 / 2 = 0.75. Consider the scenarios in which system B can match or surpass system A. If system B returns 3 relevant documents, then unless two of the three are listed 1st and 2nd it is less favored than A, since the two best remaining cases give (1 + 3 + 4)/3 / 3 = 0.89 and (2 + 3 + 4)/3 / 3 = 1, both greater (worse) than A's 0.75. If system B does not capture the 1st and 2nd places at all, it needs many more relevant documents in its result list to beat A.
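A minimal sketch of the RankPower computation as defined above, assuming the 1-based ranks of the relevant documents are known; the example reproduces the comparison of systems A and B.

```python
def rank_power(relevant_ranks):
    """RankPower(N) = average rank of the relevant documents divided by their count.

    relevant_ranks: 1-based positions of the relevant documents among the N
    returned results. Smaller values are better; the lower bound is 0.5.
    """
    count = len(relevant_ranks)
    if count == 0:
        return float("inf")            # no relevant document was returned
    average_rank = sum(relevant_ranks) / count
    return average_rank / count

# System A: two relevant documents, ranked 1st and 2nd.
print(rank_power([1, 2]))              # 1.5 / 2 = 0.75
# System B: three relevant documents, ranked 2nd, 3rd and 4th.
print(rank_power([2, 3, 4]))           # 3.0 / 3 = 1.0, i.e. worse than A
```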

Examples (cont.)

The RankPower measure was tested in a real web search environment. We compare the results of sending a set of queries to AltaVista and to MARS (one of our intelligent web search projects), limiting each run to the first 20 returned results.

            R_avg   C      RankPower
MARS        6.59    9.33   0.71
AltaVista   10.4    8.50   1.22

RankPower: A Variation

With R_avg(N) the average rank of the relevant documents among the N retrieved documents, C_N the count of relevant documents among the N retrieved documents, and S_i the position of the i-th relevant document:

RankPower(N) = R_avg(N) / C_N = ( sum_{i=1}^{C_N} S_i ) / C_N^2,  with RankPower(N) >= 0.5

RankPowerAlt(N) = ( C_N (C_N + 1) / 2 ) / ( sum_{i=1}^{C_N} S_i ),  with 0 <= RankPowerAlt(N) <= 1

(A small sketch of this variant follows at the end of this section.)

Subjective Relevance Measures

Novelty Ratio: the proportion of items retrieved and judged relevant by the user of which they were previously unaware; it measures the ability to find new information on a topic.

Coverage Ratio: the proportion of relevant items retrieved out of the total relevant documents known to the user prior to the search; it matters when the user wants to locate documents they have seen before (e.g., the budget report for Year 2000).
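A small sketch of the variant RankPower introduced above, with the same illustrative inputs as the earlier example; the function name is made up.

```python
def rank_power_alt(relevant_ranks):
    """Variant RankPower: C*(C + 1)/2 divided by the sum of the relevant ranks.

    Bounded between 0 and 1; the value is 1 exactly when the relevant
    documents occupy the top C positions of the ranking.
    """
    count = len(relevant_ranks)
    if count == 0:
        return 0.0
    return (count * (count + 1) / 2) / sum(relevant_ranks)

print(rank_power_alt([1, 2]))          # 3 / 3 = 1.0: both relevant documents lead the list
print(rank_power_alt([2, 3, 4]))       # 6 / 9 = 0.667
```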