CCRMA MIR Workshop 2014: Evaluating Information Retrieval Systems. Leigh M. Smith, Humtap Inc. leigh@humtap.com

Basic system overview: Segmentation (frames, onsets, beats, bars, chord changes, etc.) → Feature Extraction (time-based, spectral energy, MFCCs, etc.) → Analysis / Decision Making (classification, clustering, etc.)

Overview: MIR data preparation; training and test data; the overfitting problem; cross-validation; evaluation metrics: precision, recall, F-measure, ROC, AUC.

Content Format impacts all levels of the system: data volume, storage options, analysis, DSP, DB design, etc. Systems may or may not maintain the original source content (vs. metadata). Systems may preserve several formats of source and metadata (n-tier). This is typically a given situation, rather than a design option.

Audio Content Formats. Audio-based: properties and volume of the source recordings; are MP3/AAC/WMA decoders needed? MIDI-based: problems with MIDI and assumptions to make; human-performed vs. quantized MIDI. Score-image based: useful, but not treated here; genre-specific. Formal language-based: SCORE, SMDL, Smoke, MusicXML, etc.

Database Technology. Database designs: consider the application requirements and design. Relational DBs (e.g. MySQL/Oracle/PostgreSQL): fixed, table-formatted data; few data types (number, string, date, ...); one or more indices per table (part of the DB design, application-specific, impacts performance); cross-table indexing and joins. Schema-less NoSQL (MongoDB, Cassandra, DynamoDB): each record can differ, which suits large or variable-length feature vectors. Graph DBs (neo4j): social-graph oriented; schema-less, but models relationships between entities; enables fast retrieval of cascaded relationships.
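
As an illustration of the schema-less point above, here is a minimal sketch (not from the slides) of storing two track documents with different feature fields in the same MongoDB collection via pymongo; the database name, collection name, fields, and values are all hypothetical.

```python
# Hypothetical sketch: schema-less storage of variable feature vectors in MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # assumes a local MongoDB server
tracks = client["mir_demo"]["tracks"]               # hypothetical DB and collection

# Two records with different fields and different-length feature vectors.
tracks.insert_one({"title": "track_a", "mfcc_means": [1.2, 0.3, -0.7]})
tracks.insert_one({"title": "track_b", "chroma": [0.1] * 12, "bpm": 96})

for doc in tracks.find({}, {"_id": 0}):             # retrieve all, hide the object id
    print(doc)
```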

Media data: historically images, now also video and audio. Volume (large single items) and format: often items with unknown or variable structure. Both content and metadata are required for usage. Scalability of storage: cloud storage, accessed via a web service (HTTP) API. Common online providers: Amazon Simple Storage Service (S3), rackspace.com, etc.

Data preparation ("eat your greens"). Examine your data at every chance (means, max, min, std, NaNs, Infs). Sanity check: visualize the data when possible to see patterns and check that it makes sense. Eliminate noisy data. Data preparation steps: cleaning (open up and examine the data, handle missing values); relevance/feature analysis (remove irrelevant or redundant attributes); data transformation (generalize or normalize the data).
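
A minimal sketch of such a sanity check, assuming the features are collected into a NumPy array X (rows = examples, columns = features); the array here is synthetic, for illustration only.

```python
import numpy as np

def summarize_features(X):
    """Print per-feature statistics and flag missing or non-finite values."""
    print("shape:", X.shape)
    print("min: ", np.nanmin(X, axis=0))
    print("max: ", np.nanmax(X, axis=0))
    print("mean:", np.nanmean(X, axis=0))
    print("std: ", np.nanstd(X, axis=0))
    print("NaNs:", np.isnan(X).sum(axis=0))
    print("Infs:", np.isinf(X).sum(axis=0))

# Synthetic example: 100 examples of 4 features, with a couple of bad values injected.
X = np.random.randn(100, 4)
X[3, 1] = np.nan
X[7, 2] = np.inf
summarize_features(X)
```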

Training and test data. An overfit model matches every training example (it is now overtrained). Training error is also known as class loss. Generalization: the goal is to classify new, unseen data; the goal is NOT to fit the training data perfectly. An overfit model will not generalize well, and will make errors. Rule of thumb: favor simpler, more general solutions.

A bad evaluation metric: how many training examples are classified correctly? [Figure: training error and validation error plotted against the number of training cycles; the best predictive model lies where the validation error is lowest. Image from Wikipedia, "Overfitting".]

Overfitting

Training and test data: Training, Validation, and Test sets. Partition randomly so that the relative proportion of files in each category is preserved in each set (Weka and Netlab have sampling code). Cross-validation: repeated partitioning; reduces false measures caused by data variability within sets. Warnings: don't test (or at least, don't optimize) with training data, and don't train on test data.
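
A minimal sketch of such a stratified partition using scikit-learn's train_test_split (an alternative to the Weka/Netlab sampling code mentioned above); the feature matrix and labels here are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 13)           # e.g. 13 features per audio file
y = np.random.randint(0, 3, size=1000)  # e.g. Other / Music / Speech labels

# 60% train, 20% validation, 20% test, preserving class proportions via stratify.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

print(len(y_train), len(y_val), len(y_test))  # 600 200 200
```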

Cross-validation: accuracy on held-out ("test") examples, over repeated train/test iterations.
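
A minimal sketch of repeated train/test iterations via stratified k-fold cross-validation in scikit-learn; the classifier choice and the synthetic data are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.random.randn(1000, 13)
y = np.random.randint(0, 3, size=1000)

clf = KNeighborsClassifier(n_neighbors=5)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy score per held-out fold; report the mean and spread.
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print("fold accuracies:", np.round(scores, 3))
print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```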

Evaluation Measures.
True positive: correct. The classifier correctly predicted something in its list of known positives.
False positive (Type I error): incorrect, a false alarm. The classifier said something was positive when it is actually negative; e.g. the error light flashes, but no error actually occurred. Rejecting the null hypothesis when the null hypothesis is true.
True negative: correct. The classifier correctly rejected something that is actually negative.
False negative (Type II error): absent, a miss. The classifier did not hit for a known positive result; e.g. an error actually occurred, but no error light flashed. Failing to reject the null hypothesis when the null hypothesis is false.

Confusion Matrix/Contingency Table
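
A minimal sketch of building such a table with scikit-learn's confusion_matrix; the ground-truth and predicted labels below are hypothetical.

```python
from sklearn.metrics import confusion_matrix

y_true = ["music", "music", "speech", "other", "speech", "music", "other"]
y_pred = ["music", "speech", "speech", "other", "speech", "music", "music"]

labels = ["other", "music", "speech"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(labels)
print(cm)  # rows = ground truth, columns = classification
```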

Evaluation Measures (C. J. van Rijsbergen, 1979). High accuracy is good. Precision (positive predictive value): a low value means a high false-positive rate, the classifier is "hitting" all the time; a high value means a low false-positive rate, no extraneous hits. Recall (sensitivity): a low value means a high false-negative rate, the classifier is missing good hits; a high value means a low false-negative rate, the classifier is great at negative discrimination, returning a negative only when it should. F-measure: a blend of precision and recall (a weighted harmonic mean); a high value is good.

Precision. A metric from information retrieval: how relevant are the retrieved results? Precision = #TP / (#TP + #FP). In MIR, this may involve precision at some threshold in a ranked result list. Mnemonic: Precision = Prediction measure; its errors are false Positives.

Recall. How complete are the retrieved results? Recall = #TP / (#TP + #FN), i.e. the number actually correct divided by the number annotated (i.e. known to be correct). It is governed by deletions (the ratio of false negatives).

F-measure. A combined measure of precision and recall (their harmonic mean): F = 2 · Precision · Recall / (Precision + Recall). It treats precision and recall as equally important.
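
A minimal sketch computing the three measures directly from hypothetical true-positive, false-positive, and false-negative counts, following the definitions above.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-measure from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts for one class of a retrieval task.
p, r, f = precision_recall_f(tp=90, fp=10, fn=30)
print("precision=%.3f recall=%.3f F=%.3f" % (p, r, f))  # precision=0.900 recall=0.750 F=0.818
```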

Accuracy metric summary, from T. Fawcett, "An introduction to ROC analysis".

Example Results - Confusion Matrix. Music/Speech/Other classification. Score: 2163/2450 correct (0 additional partial matches) of 2761 files attempted to read. Precision = 0.8814, Recall = 0.8829, F-Measure = 0.8821.
Confusion matrix (rows = ground truth, columns = classification):
          Other  Music  Speech
  Other:    431     68     110
  Music:     17    775      18
  Speech:    46     28     957
Recall by class: Other: 0.7077, Music: 0.9568, Speech: 0.9282. Mean class recall: 0.8642.
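
The per-class recall and mean class recall above can be recomputed from the confusion matrix itself; a minimal sketch using the numbers from this slide:

```python
import numpy as np

labels = ["Other", "Music", "Speech"]
cm = np.array([[431,  68, 110],    # rows = ground truth
               [ 17, 775,  18],    # columns = classification
               [ 46,  28, 957]])

recall_per_class = cm.diagonal() / cm.sum(axis=1)
for name, r in zip(labels, recall_per_class):
    print("%s: %.4f" % (name, r))                            # 0.7077, 0.9568, 0.9282
print("Mean class recall: %.4f" % recall_per_class.mean())   # 0.8642
```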

ROC Graph: the receiver operating characteristic curve. A richer method of comparing model performance than classification accuracy alone. It plots the true positive rate vs. the false positive rate for different classifier threshold parameter settings, depicting the relative trade-offs between true positives (benefits) and false positives (costs).

ROC plot for discrete (binary) classifiers. [Figure: ROC plot with "perfect classification" at the top-left corner, "better" toward the upper left, "worse" toward the lower right, the diagonal marking a random guess, and example classifiers compared (C < B, A < D).] Each output of a discrete classifier is either right or wrong, so a discrete classifier appears as a single point on the ROC plot; each point corresponds to a confusion matrix. The "northwest" is better. The best sub-region may be task-dependent (a conservative or a liberal classifier may be preferable).
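
A minimal sketch of the single ROC point for a discrete binary classifier, computed from hypothetical confusion-matrix counts.

```python
def roc_point(tp, fp, tn, fn):
    """Return (false positive rate, true positive rate) for one discrete classifier."""
    tpr = tp / (tp + fn)  # true positive rate (benefit)
    fpr = fp / (fp + tn)  # false positive rate (cost)
    return fpr, tpr

print(roc_point(tp=80, fp=15, tn=85, fn=20))  # (0.15, 0.8)
```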

ROC curves for probabilistic/tunable classifiers: plot TP-rate/FP-rate points for different thresholds of one classifier. Here, the curve indicates that a threshold of 0.505 is not optimal (0.54 is better).
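
A minimal sketch of sweeping the threshold and plotting the resulting ROC curve with scikit-learn and matplotlib; the ground-truth labels and classifier scores here are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.45, 0.9, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, scores)
plt.plot(fpr, tpr, marker="o", label="classifier")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```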

Area under the ROC curve (AUC). Compute the AUC to compare different classifiers across parameter spaces. AUC = the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Note that a higher AUC is not always better for a particular problem.
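
A minimal sketch checking the pairwise-ranking interpretation of AUC against scikit-learn's roc_auc_score, using the same kind of hypothetical labels and scores as above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.45, 0.9, 0.3])

auc = roc_auc_score(y_true, scores)

# Brute-force estimate: fraction of (positive, negative) pairs ranked correctly.
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print("AUC = %.3f, pairwise estimate = %.3f" % (auc, np.mean(pairs)))  # both 0.920
```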