T-61.6030 MULTIMEDIA RETRIEVAL SYSTEM EVALUATION

Similar documents
Image retrieval based on bag of images

Multimedia Information Retrieval

Evaluation and image retrieval

Data Mining Classification: Alternative Techniques. Imbalanced Class Problem

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Chapter 8. Evaluating Search Engine

Evaluating Machine Learning Methods: Part 1

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set.

Evaluating Machine-Learning Methods. Goals for the lecture

Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos

CS145: INTRODUCTION TO DATA MINING

Supporting Information

Large-Scale Validation and Analysis of Interleaved Search Evaluation

Use of Synthetic Data in Testing Administrative Records Systems

CCRMA MIR Workshop 2014 Evaluating Information Retrieval Systems. Leigh M. Smith Humtap Inc.

CSCI 5417 Information Retrieval Systems. Jim Martin!

Louis Fourrier Fabien Gaie Thomas Rolf

A Simple but Effective Approach to Video Copy Detection

Evaluating Classifiers

Deep Face Recognition. Nathan Sun

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Modern Retrieval Evaluations. Hongning Wang

Data Mining and Knowledge Discovery: Practice Notes

Further Thoughts on Precision

Adaptive Search Engines Learning Ranking Functions with SVMs

Hyperplane Ranking in. Simple Genetic Algorithms. D. Whitley, K. Mathias, and L. Pyeatt. Department of Computer Science. Colorado State University

Building Test Collections. Donna Harman National Institute of Standards and Technology

Learning Ranking Functions with SVMs

Evaluating Classifiers

From Passages into Elements in XML Retrieval

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

Learning Ranking Functions with SVMs

Category-level localization

Retrieval Evaluation. Hongning Wang

Classification Algorithms in Data Mining

Data Mining and Knowledge Discovery Practice notes 2

A Survey on Multimedia Information Retrieval System

Information Retrieval. Lecture 7

MEASURING CLASSIFIER PERFORMANCE

Data Mining and Knowledge Discovery: Practice Notes

Evaluating Segmentation

Binary Diagnostic Tests Clustered Samples

Pattern recognition (4)

Query-Sensitive Similarity Measure for Content-Based Image Retrieval

VK Multimedia Information Systems

Evaluating Classifiers

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback

Extended Performance Graphs for Cluster Retrieval

Text Categorization (I)

Information Retrieval

Information Retrieval. Lecture 7 - Evaluation in Information Retrieval. Introduction. Overview. Standard test collection. Wintersemester 2007

Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology

Content Based Image Retrieval: Survey and Comparison between RGB and HSV model

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

CS4491/CS 7265 BIG DATA ANALYTICS

CS4670: Computer Vision

Screening Design Selection

An efficiency comparison of two content-based image retrieval systems, GIFT and PicSOM

AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS

Fast or furious? - User analysis of SF Express Inc

CS249: ADVANCED DATA MINING

Knowledge Discovery and Data Mining

ISyE 6416 Basic Statistical Methods - Spring 2016 Bonus Project: Big Data Analytics Final Report. Team Member Names: Xi Yang, Yi Wen, Xue Zhang

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

Evaluation. David Kauchak cs160 Fall 2009 adapted from:

Overview of the INEX 2009 Link the Wiki Track

Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging

Large scale object/scene recognition

User Strategies in Video Retrieval: a Case Study

Query Processing and Alternative Search Structures. Indexing common words

Overview. Lecture 6: Evaluation. Summary: Ranked retrieval. Overview. Information Retrieval Computer Science Tripos Part II.

CS 584 Data Mining. Classification 3

Benchmarks, Performance Evaluation and Contests for 3D Shape Retrieval

Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval O. Chum, et al.

Grammar Rule Extraction and Transfer in Buildings

ECS 289H: Visual Recognition Fall Yong Jae Lee Department of Computer Science

CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation"

Object Recognition II

Machine Problem 8 - Mean Field Inference on Boltzman Machine

Experiment Design and Evaluation for Information Retrieval Rishiraj Saha Roy Computer Scientist, Adobe Research Labs India

Multimodal Information Spaces for Content-based Image Retrieval

Machine Learning Techniques for Data Mining

A Comparative Study Weighting Schemes for Double Scoring Technique

Algebraic Iterative Methods for Computed Tomography

Climbing the Kaggle Leaderboard by Exploiting the Log-Loss Oracle

Performance Measures for Multi-Graded Relevance

Similarity of Medical Images Computed from Global Feature Vectors for Content-based Retrieval

Relevance in XML Retrieval: The User Perspective

COMP 102: Computers and Computing

Unstructured Data. CS102 Winter 2019

humor... May 3, / 56

BENCHMARKING OF STILL IMAGE WATERMARKING METHODS: PRINCIPLES AND STATE OF THE ART

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

8/3/2017. Contour Assessment for Quality Assurance and Data Mining. Objective. Outline. Tom Purdie, PhD, MCCPM

Easy Samples First: Self-paced Reranking for Zero-Example Multimedia Search

Introduction to Statistical Learning Theory

Statistical Performance Comparisons of Computers

Information Retrieval

Transcription:

T-61.6030 MULTIMEDIA RETRIEVAL SYSTEM EVALUATION Pauli Ruonala pruonala@niksula.hut.fi 25.4.2008

Contents 1. Retrieval test material 2. Sources of retrieval errors 3. Traditional evaluation methods 4. Evaluation and machine learning 5. Statistical methods 6. Conclusions

Retrieval test material Researchers test their new methods with: own holiday pictures, historical images, school yearbooks, the Corel Photo-CD, TRECVID data sets, web images and videos - http://vaunut.org. There are no standardized test data sets in general use. Data sets need to be big (a few hundred images is not large enough) and should contain query tasks and correct answers (a ground truth set).

Efforts towards standard test collections: TRECVID: 133 hrs of news and documentary video with standard query types and answers. IAPR TC-12 (International Association for Pattern Recognition, Technical Committee 12) uses still images. ImageCLEF (Cross-Language Evaluation Forum) combines text and image retrieval: historical photographs with image captions, and also a medical image collection (X-rays, CT scans, etc.) with captions. INEX (Initiative for the Evaluation of XML Retrieval) studies XML retrieval systems.

Everyone uses their own data sets, so generalization of findings is restricted. The image type determines the best metric; there is no general search algorithm - one must learn to search. Example: try to find a tumor in an X-ray image. Even within the same data set, samples are not equal; truly random subject samples are rare.

Sources of retrieval errors The main multimedia retrieval problems are the selection of features and of metrics for those features. (Image: Night and day by Tom of Finland.) What exactly is it that your feature detector is detecting?

How can several feature detectors be combined optimally? A negative indication of Cheshire Cat fur does not indicate that the Cheshire Cat is absent. This is a decision-theory problem, and for example Charles Dodgson (alias Lewis Carroll) did not find an optimal solution.

Traditional evaluation methods Precision: P = |relevant items ∩ retrieved items| / |retrieved items| = (no. of relevant items retrieved) / (total no. of items retrieved), i.e. the proportion of items within a result set that are relevant. P = 100% means that all items retrieved are relevant. Because the ground truth set is essentially a random selection (or hard to replicate), a 100% result is partly random too. Some authors also use the name 'average precision' when precision is reported after 10, 20, 30, etc. items have been retrieved.
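A minimal Python sketch of the precision computation (Python is used here only for illustration; the retrieved list and relevant set are hypothetical examples, not from the slides):

# Precision: fraction of the retrieved items that are relevant.
retrieved = ["img_3", "img_7", "img_1", "img_9"]   # ranked result list (hypothetical)
relevant = {"img_1", "img_3", "img_5"}             # ground truth set (hypothetical)

hits = sum(1 for item in retrieved if item in relevant)
precision = hits / len(retrieved)
print(precision)   # 2 relevant items out of 4 retrieved -> 0.5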

The idea is to give measures for different types of users: 'precision at 10' corresponds to a user who is satisfied with any relevant object. It is the least stable of the evaluation measures; the total number of items in the set has a strong influence on precision at x. Recall: R = |relevant items ∩ retrieved items| / |relevant items| = (no. of relevant items retrieved) / (total no. of relevant items in the set), i.e. the proportion of relevant items that have been retrieved. R = 100% means all relevant items were retrieved.
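A matching sketch for recall and for precision at x, with the same hypothetical retrieved list and relevant set as above:

retrieved = ["img_3", "img_7", "img_1", "img_9"]
relevant = {"img_1", "img_3", "img_5"}

# Recall: fraction of all relevant items that were retrieved.
recall = sum(1 for item in retrieved if item in relevant) / len(relevant)

# Precision at x: precision computed over only the first x results.
def precision_at(x, retrieved, relevant):
    top = retrieved[:x]
    return sum(1 for item in top if item in relevant) / len(top)

print(recall)                                  # 2 of 3 relevant items found -> ~0.67
print(precision_at(2, retrieved, relevant))    # only img_3 in the top 2 -> 0.5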

Average precision AP = (1/N_r) · Σ_{i=1..k} (i / r_i), where k = number of relevant items the system has found, N_r = total number of relevant items in the set, and r_i = the rank the system gave to the i-th relevant item found. Several algorithms have been suggested. Example: a query produces a list of five items. The item the system thinks is most relevant is put first in the list, and the first item gets rank number 1. If only three of the items, the 1st, 3rd and 5th in the list, are relevant, then AP = (1/1 + 2/3 + 3/5) · (1/3) ≈ 0.76.
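The slide's example as a small Python sketch (the relevance pattern 1st/3rd/5th and N_r = 3 are taken from the example above):

# Average precision for one query: at the rank of each relevant item found,
# take (number of relevant items seen so far) / rank, then divide the sum
# by the total number of relevant items in the set (N_r).
def average_precision(result_is_relevant, n_relevant_total):
    ap, found = 0.0, 0
    for rank, is_rel in enumerate(result_is_relevant, start=1):
        if is_rel:
            found += 1
            ap += found / rank
    return ap / n_relevant_total

# Items at ranks 1, 3 and 5 are relevant, N_r = 3.
print(average_precision([True, False, True, False, True], 3))   # ~0.756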

Items at the top of the list have more weight, so average precision is sensitive to small changes in the ordering of items, and generalization performance suffers. Some authors call this Mean Average Precision. Mean Average Precision (MAP) is an overall performance measure: the mean of all the average precisions measured after all queries have been run.
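A sketch of MAP over two hypothetical queries (the average_precision helper from the previous sketch is repeated so the snippet stands alone):

def average_precision(result_is_relevant, n_relevant_total):
    ap, found = 0.0, 0
    for rank, is_rel in enumerate(result_is_relevant, start=1):
        if is_rel:
            found += 1
            ap += found / rank
    return ap / n_relevant_total

# Each query: (relevance flags of its ranked results, total relevant items N_r).
queries = [
    ([True, False, True, False, True], 3),     # AP ~ 0.76, the slide example
    ([False, True, False, False, False], 2),   # AP = 0.25
]
map_score = sum(average_precision(flags, n_r) for flags, n_r in queries) / len(queries)
print(map_score)   # mean of the per-query APs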

Precision/Recall graph Usually the P/R graph has sudden jumps. The graph gives an indication of how many items it is best to retrieve.
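The points of such a graph can be computed by evaluating precision and recall after each retrieved item; a small sketch with hypothetical relevance flags:

# Precision/recall pairs after 1, 2, 3, ... retrieved items.
# Plotting these pairs gives the (typically jumpy) P/R graph.
result_is_relevant = [True, False, True, False, True]   # hypothetical ranking
n_relevant_total = 3

found = 0
for cutoff, is_rel in enumerate(result_is_relevant, start=1):
    found += is_rel
    precision = found / cutoff
    recall = found / n_relevant_total
    print(cutoff, round(precision, 2), round(recall, 2))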

F-measure Weighted harmonic mean of precision (P) and recall (R); with equal weights it reduces to F = 2 · P · R / (P + R).
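As a one-line helper (the P and R values below are hypothetical):

# Balanced F-measure: harmonic mean of precision and recall.
def f_measure(p, r):
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

print(f_measure(0.5, 0.67))   # -> ~0.57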

Receiver operating characteristic (ROC) curve

The ROC curve plots TPR vs. FPR: true positives (TPR = true positive rate) against false positives (FPR = false positive rate). It is used also in medicine, radiology, data mining, etc. ROCArea is the area under the curve. A large ROCArea can be an indication of good retrieval performance. Rank normalization is used to even out the effects of different data set sizes in each query; several methods have been proposed, but none has come into general use.
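A sketch of how the ROC points and ROCArea can be computed from a ranked result list, assuming hypothetical relevance labels in rank order and approximating the area with the trapezoidal rule (not from the original slides):

# Walk down the ranked list one item at a time and record
# TPR = relevant items returned so far / all relevant items,
# FPR = non-relevant items returned so far / all non-relevant items.
labels = [1, 0, 1, 0, 1, 0, 0]      # 1 = relevant, given in rank order (hypothetical)
n_pos = sum(labels)
n_neg = len(labels) - n_pos

points = [(0.0, 0.0)]
tp = fp = 0
for label in labels:
    tp += label
    fp += 1 - label
    points.append((fp / n_neg, tp / n_pos))   # (FPR, TPR)

# ROCArea via the trapezoidal rule over the (FPR, TPR) points.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(points)
print(auc)   # ~0.75 for this hypothetical ranking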

Evaluation and machine learning Retrieval results can be improved with machine learning, but simple methods do not produce good results, because AP is computed from discrete values (the ranking of items). For example, if 99% of the items in the dataset are non-relevant, a learning algorithm soon finds the method that produces 99% correct results: classify all items as non-relevant. Good precision is easy to get by finding a good order of items, but this does not necessarily lead to good retrieval results. Optimizing ROCArea has shown that a big ROCArea is not necessarily proof of good precision.
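A tiny numeric illustration of the imbalance problem described above (the counts are hypothetical): a classifier that labels everything non-relevant reaches 99% accuracy yet retrieves nothing.

# 1000 items, of which only 10 are relevant (99% non-relevant).
n_items, n_relevant = 1000, 10

# Trivial "learner": classify every item as non-relevant.
correct = n_items - n_relevant        # every non-relevant item is classified "right"
accuracy = correct / n_items          # 0.99, looks good...
retrieved_relevant = 0                # ...but nothing relevant is ever returned
print(accuracy, retrieved_relevant / n_relevant)   # 0.99 accuracy, 0.0 recall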

Using precision for automatic feature-detector weight adjustment is difficult: recall might suffer, and gradient-based optimization methods cannot be used. A new method for optimizing average precision is available, with software from http://www.yisongyue.com. Statistical methods are used for comparing the performance of two systems. Recommended methods: the Wilcoxon signed rank test, the paired t-test and the paired sign test.

Example: Wilcoxon signed rank test. Make two queries, A and B, on the same data set using two different methods. The results are different; is the difference statistically significant? Formulate a null hypothesis: the difference is not statistically significant. Take one result from A and one from B and make a pair out of them, and arrange the rest in the same way. Compute the absolute differences (D) between the pairs and remove pairs where D == 0. Create an ordered list of the remaining pairs, smallest D first; if several pairs have the same D, replace their ranks with the mean of the tied rank values. Replace the differences with their rank values, retaining the sign.

The smallest D gets rank 1; R(D) denotes the rank value. Compute W+ = Σ_{D_i > 0} R(D_i) and W- = Σ_{D_i < 0} R(D_i), where the sums run over the pairs i. Take W- or W+, whichever is smaller, and use tables or a computer to obtain the p-value. For example, if p is 0.01, a difference at least this large would occur only about 1% of the time under the null hypothesis, so we can reject the null hypothesis (i.e. call the difference statistically significant) with about a 1% risk of error. The same test in Matlab: [p, h] = signrank(a, b); with the query results in vectors a and b (ranksum would instead run the unpaired rank-sum test).

If h == 1, it means rejection of the null hypothesis at the 5% significance level. The Wilcoxon test does not produce good results if many rank sums tie with each other. The Wilcoxon test does not care about the distribution of the errors or of the result values. The Student t-test is more vulnerable to deviations from the normal distribution and can be used when the number of pairs is larger than 50.
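A sketch of the signed-rank computation described above, using hypothetical per-query scores a and b for the two systems; the manual W+ / W- computation follows the slide's steps, and scipy.stats.wilcoxon is used for the p-value (Python/scipy here stands in for the Matlab call on the previous slide):

import numpy as np
from scipy.stats import rankdata, wilcoxon

# Per-query scores (e.g. average precision) of two systems, hypothetical values.
a = np.array([0.52, 0.61, 0.40, 0.75, 0.58, 0.33, 0.49, 0.66])
b = np.array([0.48, 0.55, 0.42, 0.70, 0.50, 0.30, 0.49, 0.60])

d = a - b
d = d[d != 0]                    # drop pairs where D == 0
ranks = rankdata(np.abs(d))      # smallest |D| gets rank 1, ties get the mean rank
w_plus = ranks[d > 0].sum()
w_minus = ranks[d < 0].sum()
print(w_plus, w_minus)

# The same test with scipy; a small p means the difference is statistically significant.
stat, p = wilcoxon(a, b)
print(stat, p)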

Conclusions Current evaluation methods are incompatible - there is no clear connection between, for example, ROCArea and MAP - and you are free to add your own. Evaluation methods can help in selecting appropriate features and metrics for the features, but there are difficulties. There are no solved problems - only some niche problems have a working solution. Nobody has measured user satisfaction. Human similarity judgments do not follow any metric; they are user- and task-dependent. Humans attach meanings to pictures which cannot yet be modelled.

References
Henning Müller, Wolfgang Müller, David McG. Squire, Stéphane Marchand-Maillet, Thierry Pun. Performance evaluation in content-based image retrieval: overview and proposals. Pattern Recognition Letters 22 (2001).
Alexander G. Hauptmann, Michael G. Christel. Successful Approaches in the TREC Video Retrieval Evaluations. In International Multimedia Conference, New York, NY, USA, 2004.
Michael S. Lew, Nicu Sebe, Chabane Djeraba, Ramesh Jain. Content-based Multimedia Information Retrieval: State of the Art and Challenges. ACM Transactions on Multimedia Computing, Communications and Applications, Feb 2006.
Mathias Lux, Gisela Dösinger, Günter Beham. Empirical Studies in Multimedia Retrieval Evaluation. In BTW Workshops 2007: 199-217, Aachen, Germany.
Yisong Yue, Thomas Finley, Filip Radlinski, Thorsten Joachims. A Support Vector Method for Optimizing Average Precision. Proceedings of SIGIR 2007. http://www.yisongyue.com

Thank you!