Multimedia Information Retrieval


1 Multimedia Information Retrieval Prof Stefan Rüger Multimedia and Information Systems Knowledge Media Institute The Open University

2 Multimedia Information Retrieval
1. What are multimedia queries?
2. Fingerprinting
3. Metadata & piggy-back retrieval
4. Automated image annotation
5. Visual content-based retrieval I
6. Visual content-based retrieval II
7. Evaluation
   - Simple metrics
   - Evaluation campaigns
   - Multilabel evaluation for image annotation
8. Browsing, search and geography
[Slides for simple metrics and evaluation campaigns partially prepared by Suzanne Little]

3 Evaluation How do we know if our MIR system is effective? Why do we care about quantifying performance? "If you cannot measure it, you cannot improve it." - Lord Kelvin

4 What is performance? Be specific:
- Processing speed (database sense)
- Precision (web search, casual user)
- Recall (e.g., secret service user)
- Other derived measures: accuracy, DCG, ...
- Memory usage (resource awareness)
- Workflow integration (how does the retrieval task support the overall user task?)
- Sense of empowerment (just to mention a whacky but justifiable one!)
- Usability / user friendliness

5 Cranfield Paradigm [Diagram: ground-truth data is split into train, tune and test sets; a query set with relevance judgements is run against the (M)IR system, and the resulting results lists are scored with evaluation metrics.]

6 Small, unbalanced data? Use cross-validation: randomise the data and divide it into folds, train and test 4 times, and average all metrics (4-fold cross-validation). The extreme case is leave-one-out (test set size = 1).
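The fold logic above is short enough to write out. A minimal sketch, assuming placeholder train and evaluate callables that are not part of the slides:

import random

def k_fold_cross_validation(data, k, train, evaluate):
    """Randomise the data, divide it into k folds, train k times and
    average the evaluation metric over the k held-out test folds."""
    items = list(data)
    random.shuffle(items)                     # randomise the data ...
    folds = [items[i::k] for i in range(k)]   # ... and divide it into k folds
    scores = []
    for i in range(k):
        test_fold = folds[i]
        train_items = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(train_items)
        scores.append(evaluate(model, test_fold))
    return sum(scores) / k                    # average all metrics

# 4-fold cross-validation as on the slide; leave-one-out is k = len(data).
# score = k_fold_cross_validation(my_data, 4, my_train, my_evaluate)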

7 Model selection requires a validation set. [Chart comparing the test-set performance of many candidate models, e.g. model M2178] Best performance on the test set? Could be by chance! Must report the performance of the best model on a separate validation set!

8 Relevance? Find me pictures of triumph [example images omitted]

9 Retrieved-set-based metrics

                Relevant                Irrelevant
Retrieved       True Positive (tp)      False Positive (fp)
Not Retrieved   False Negative (fn)     True Negative (tn)

Precision (P) = fraction of retrieved documents that are relevant = tp / (tp + fp)
Recall (R) = fraction of relevant documents that are retrieved = tp / (tp + fn)
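To make the table concrete, a minimal sketch that computes precision and recall from a retrieved set and a relevant set (the document IDs are made up for illustration):

def precision_recall(retrieved, relevant):
    """Precision = fraction of retrieved items that are relevant;
    recall = fraction of relevant items that are retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)                        # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 3 of the 4 retrieved documents are relevant,
# out of 6 relevant documents in the whole collection.
print(precision_recall({1, 2, 3, 4}, {1, 2, 3, 5, 6, 7}))   # (0.75, 0.5)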

10 Accuracy? Why is accuracy rarely used for information retrieval?

11 Precision or Recall? Is precision or recall more useful/important - if I am doing a web search on Tokyo accommodation? - if I am a paralegal researching case precedents? How could I make a system with 100% recall?

12 F measure: a compromise between precision & recall. The F1-measure is the harmonic mean of P and R: F1 = 2PR / (P + R). More generally, F_β = (1 + β²) PR / (β² P + R) weights recall β times as much as precision.
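A short sketch of the formula (β = 1 gives the balanced F1; the numbers reuse the hypothetical precision/recall values from the earlier sketch):

def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favours
    recall, beta < 1 favours precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.75, 0.5))   # F1 = 0.6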

13 Exercise An IR system returns 8 relevant documents and 10 irrelevant documents. There are a total of 20 relevant documents in the collection. Calculate the precision and recall.
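(For checking: the system retrieves 8 + 10 = 18 documents, 8 of which are relevant, so P = 8/18 ≈ 0.44 and R = 8/20 = 0.40.)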

14 Ranked Retrieval Which is better? There are 5 relevant documents.
System A: 1. Relevant, 2. Relevant, 3. Irrelevant, 4. Irrelevant, 5. Relevant, 6. Relevant - Precision = 4/6 ≈ 0.67, Recall = 4/5 = 0.80
System B: 1. Relevant, 2. Irrelevant, 3. Relevant, 4. Relevant, 5. Relevant, 6. Irrelevant - Precision = 4/6 ≈ 0.67, Recall = 4/5 = 0.80
Precision and recall over the whole retrieved set cannot tell the two apart.

15 Ranked Retrieval Metrics - Precision at rank n (P@n) - Properties of precision/recall graphs - Mean average precision

16 Ranked Retrieval Which is better? There are 5 relevant documents.
System A: Relevant, Relevant, Irrelevant, Irrelevant, Relevant, Relevant
System B: Relevant, Irrelevant, Relevant, Relevant, Relevant, Irrelevant

            P@1     P@2     P@3     P@4     P@5
System A    1.00    1.00    0.67    0.50    0.60
System B    1.00    0.50    0.67    0.75    0.80
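The P@n values in the table can be reproduced with a few lines (a sketch; the boolean lists encode the two rankings above):

def precision_at(ranking, n):
    """Fraction of the top-n results that are relevant.
    `ranking` is a list of booleans, True = relevant."""
    return sum(ranking[:n]) / n

system_a = [True, True, False, False, True, True]
system_b = [True, False, True, True, True, False]
for n in range(1, 6):
    print(n, precision_at(system_a, n), precision_at(system_b, n))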

17 Average Precision
System A: 1. Relevant (P = 1), 2. Relevant (P = 1), 3. Irrelevant, 4. Irrelevant, 5. Relevant (P = 0.6), 6. Relevant (P = 0.67)
AP = (1 + 1 + 0.6 + 0.67) / 4 ≈ 0.82
System B: 1. Relevant (P = 1), 2. Irrelevant, 3. Relevant (P = 0.67), 4. Relevant (P = 0.75), 5. Relevant (P = 0.8), 6. Irrelevant
AP = (1 + 0.67 + 0.75 + 0.8) / 4 ≈ 0.80

18 Ranked Retrieval Which is better? There are 5 relevant documents.
System A: Relevant, Relevant, Irrelevant, Irrelevant, Relevant, Relevant - AP ≈ 0.82
System B: Relevant, Irrelevant, Relevant, Relevant, Relevant, Irrelevant - AP ≈ 0.80
Average precision rewards System A for retrieving its relevant documents earlier.

19 Mean average precision (MAP) Computes the mean of AP over the observed queries. Corresponds to the area under the precision/recall curve (averaged over queries).
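A minimal sketch of average precision as computed on slide 17 (averaging the precision at each relevant rank over the number of relevant documents retrieved) and of MAP as its mean over queries. Note that TREC-style AP divides by the total number of relevant documents in the collection instead, which would lower both values here.

def average_precision(ranking):
    """Average of the precision values at the ranks where a relevant
    document appears (divided by the number of relevant documents
    retrieved, as on slide 17)."""
    precisions, relevant_so_far = [], 0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            relevant_so_far += 1
            precisions.append(relevant_so_far / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """MAP: mean of AP over the observed queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

system_a = [True, True, False, False, True, True]    # AP ≈ 0.82
system_b = [True, False, True, True, True, False]    # AP ≈ 0.80
print(average_precision(system_a), average_precision(system_b))
print(mean_average_precision([system_a, system_b]))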

20 Precision/recall curve [plot of precision (y-axis) against recall (x-axis)]

21 P@n as an optimisation target:
- Jumps numerically in relatively large steps
- Step-changes with small changes to the search engine
- Step-changes with small changes of n
- Not well suited for training, even if it is the ultimate evaluation measure!
MAP is often better suited for training (despite P@n being used for evaluation!)

22 The Dark Side of Evaluation...
- Overfitting to limited training data: unbalanced, fragile system
- Unrealistic training data
- Difficulty in finding training data
- Comparison and competition
- Numbers, not users

23 Evaluation Campaigns TRECVID ImageCLEF MediaEval MIREX

24 TRECVid
Organised by NIST with support from other U.S. government agencies.
Objective is to encourage research in video retrieval tasks by:
- providing a large test collection
- providing uniform, independent and external scoring procedures
- providing a forum for comparing results
In the spirit of the Cranfield procedure, it tries to model real-world tasks (or components of tasks). Tasks and datasets have become increasingly harder over the years.

25 TRECVid: typical tasks
- Semantic indexing (annotation; new: concept pairs)
- Known-item search (text description of a video clip in the collection)
- Interactive surveillance event detection (airport surveillance)
- Instance search (find a person, object or place by example)
- Multimedia event detection
- Multimedia event recounting (text describing the evidence)
- Shot boundary detection (retired some time ago)

26 TRECVid: typical cycle
- Feb: task specifications and call for participation; apply for participation and sign data permissions
- Mar: final paper of previous year due
- Apr: complete guidelines
- May/June: training data made available
- July: test data plus challenges made available
- Aug/Sep: participants return their results to NIST
- Sep/Oct: NIST evaluates results from participants; NIST organises the workshop, speaker list and agenda
- Nov: TRECVid workshop in Gaithersburg, MD

27 TRECVid example queries
- Find shots of a road taken from a moving vehicle through the front window
- Find shots of a person talking behind a microphone
- Find shots of a street scene at night

28 ImageCLEF
CLEF = Cross Language Evaluation Forum. The process is modelled on TREC. ImageCLEF started in 2003.
Tasks:
- Image retrieval (queries in different languages)
- Medical image annotation
- Annotation of photographs
- Geographic retrieval (GeoCLEF)
- Video retrieval (VideoCLEF/MediaEval)

29 Search Engine Quality?
System issues: indexing speed, scalability, robustness, query expressiveness.
User issues: diversity, responsiveness, happiness, usability? The interface vs IR performance.

30 Performance measures for multilabel evaluation
[Example: an image with ground truth GT = {Landscape, Outdoor, Day}; three systems annotate it with {Landscape, Outdoor, Day}, {Citylife, Outdoor, Day} and {Buildings, Indoor, Night}.]
1) Evaluation per concept: precision, recall, F-measure, accuracy
2) Evaluation per media item: the same, but one can also compare sets of labels - fully correct, partly correct, fully wrong. What about similar annotations?
[Nowak, Lukashevich, Dunker and Rüger: Performance measures for multilabel evaluation, MIR 2010, pp 35-44; slides 30-42 prepared by Nowak]

31 Types of evaluation measures
Concept-based measures: Precision (Pc), Recall (Rc), F-measure (Fc), Area Under Curve (AUC), Equal Error Rate (EER), Mean Average Precision (MAP)
Example-based measures: Precision (Pe), Recall (Re), F-measure (Fe), Accuracy, Alpha evaluation, Hierarchical Score (HS), Ontology Score (OS)
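To make the two families concrete, here is a hedged sketch of the F-measure in both flavours (the labels and items are made up, and the measures used in the study may differ in detail): concept-based measures compute the metric per concept across all images and average over concepts, while example-based measures compute it per image over its label set and average over images.

def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def concept_based_f1(gt, pred, concepts):
    """Fc: compute F1 per concept across all items, then average over concepts."""
    scores = []
    for c in concepts:
        tp = sum(1 for g, p in zip(gt, pred) if c in g and c in p)
        fp = sum(1 for g, p in zip(gt, pred) if c not in g and c in p)
        fn = sum(1 for g, p in zip(gt, pred) if c in g and c not in p)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

def example_based_f1(gt, pred):
    """Fe: compute F1 per media item (label-set overlap), then average over items."""
    scores = []
    for g, p in zip(gt, pred):
        tp, fp, fn = len(g & p), len(p - g), len(g - p)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

# Hypothetical annotations for two photos.
gt = [{"Landscape", "Outdoor", "Day"}, {"Indoor", "Night"}]
pred = [{"Citylife", "Outdoor", "Day"}, {"Indoor", "Day"}]
concepts = {"Landscape", "Outdoor", "Day", "Citylife", "Indoor", "Night"}
print(concept_based_f1(gt, pred, concepts), example_based_f1(gt, pred))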

32 Ontology Score (OS)
[Diagram of the evaluation procedure: a vocabulary/ontology yields a cost map between concepts (e.g. Landscape-City: 0.67, Landscape-Indoor: 0.87, Landscape-StillLife: 0.71) and annotator agreements per concept (e.g. Sunny: 0.88, Aesthetic: 0.75); ground-truth and predicted label sets are matched, e.g. G = {Landscape, Outdoor, Day} vs P = {City, Indoor, Day, Plant}.]

33 Ontology Score (OS)
1) Depth-dependent, distance-based misclassification costs (cost map):
   - sum the link costs on the shortest path between two labels
   - link costs are cut in half for each deeper level of the hierarchy
   - maximum cost of a path between two labels = 1
2) Ontology-based penalty: penalty for violations of ontology knowledge
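As an illustration of the depth-dependent cost idea only (a sketch under assumptions: the concrete hierarchy, the base cost and the exact weighting scheme are invented here and do not reproduce the paper's cost map):

# Hypothetical concept hierarchy: child -> parent (None marks the root).
PARENT = {"Scenery": None, "Indoor": "Scenery", "Outdoor": "Scenery",
          "Landscape": "Outdoor", "Citylife": "Outdoor", "StillLife": "Indoor"}

def ancestors(label):
    """Path from a label up to the root, including the label itself."""
    path = [label]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def depth(label):
    return len(ancestors(label)) - 1

def link_cost(child, base_cost=0.5):
    """Cost of the link between `child` and its parent, halved per deeper level."""
    return base_cost / (2 ** (depth(child) - 1))

def misclassification_cost(a, b):
    """Sum the link costs on the shortest path between two labels
    (via their lowest common ancestor), capped at 1."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    common = next(x for x in anc_a if x in anc_b)   # lowest common ancestor
    path = anc_a[:anc_a.index(common)] + anc_b[:anc_b.index(common)]
    return min(1.0, sum(link_cost(node) for node in path))

print(misclassification_cost("Landscape", "Citylife"))   # siblings: cheap
print(misclassification_cost("Landscape", "StillLife"))  # across the tree: dearer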

34 Ontology Score (OS)
3) Annotator agreements a(l): factor of annotation consensus per concept
Matching procedure and final score: [formulas not reproduced in this transcription]
Hierarchical Score (HS): OS without step 2)

35 Study setup
- 7 example-based and 6 concept-based evaluation measures
- 73 runs of ImageCLEF 2009, plus 10 random runs
- ImageCLEF 2009 Photo Annotation Task: annotation of 53 visual concepts in consumer photos
- MIR Flickr 25,000 Image Dataset: training set of 5,000 photos + EXIF data + ground truth; test set of 13,000 photos + EXIF data
- Photo Tagging Ontology
- 19 research groups, 73 run configurations

36 Results Concept-based vs. example-based Precision, Recall, F-Measure

37 Results Concept-based vs. example-based Precision, Recall, F-Measure (continued)

38 Results AUC, EER and MAP are based on ranked annotations (confidence values rather than binary annotations). EER and AUC disadvantage systems that only make binary decisions. MAP aligns with Pc and EER, but is more discriminative.

39 Results Example-based evaluation: HS, OS, Alpha/Accuracy

40 Results Example-based evaluation: HS, OS, Alpha/Accuracy (continued)

41 Results Influence of the number of annotations on the measures. OS: random runs (10%, 20%, 30%) close to the label density LD (0.17) get slightly better results. Fc + Fe: a high LD leads to slightly better results, which is critical in the case of Fc.

42 Conclusions for image annotation tasks
Example-based evaluation:
- HS not suitable (random runs score well)
- OS promising; tends to advantage rule-conforming systems
- Pe, Re and Fe give higher scores than their concept-based variants
- Alpha evaluation: distribution of values; best results for alpha = 1 (accuracy) and alpha = 0.5
Concept-based evaluation:
- EER and AUC disadvantage systems with binary decisions
- Rc and Fc not suitable (random runs score well)
- MAP: good characteristics, stable (with respect to random runs and label density)
Recommendation:
- Example-based evaluation: OS, Fe, Accuracy
- Concept-based evaluation: MAP
