Evaluation Metrics (Classifiers). CS229 Section, Anand Avati

Topics: Why metrics? Binary classifiers. Metrics: rank view, thresholding, confusion matrix. Point metrics: accuracy, precision, recall / sensitivity, specificity, F-score. Summary metrics: AU-ROC, AU-PRC, log-loss. Class imbalance. Failure scenarios for each metric. Multi-class. Only math necessary: counting and ratios!

Why are metrics important? - The training objective (cost function) is only a proxy for the real-world objective. - Metrics help capture a business goal as a quantitative target (not all errors are equal). - They help organize ML team effort towards that target. - Generally in the form of improving that metric on the dev set. - Useful to quantify the gap between: - Desired performance and baseline (to estimate effort initially). - Desired performance and current performance. - Measure progress over time (No Free Lunch Theorem). - Useful for lower-level tasks and debugging (like diagnosing bias vs. variance). - Ideally the training objective would be the metric, but that is not always possible. Still, metrics are useful and important for evaluation.

Binary Classification. X is the input, Y is the binary output (0/1). The model is Yhat = h(X). Two types of models: models that output a categorical class directly (K-Nearest Neighbors, Decision Tree), and models that output a real-valued score (SVM, Logistic Regression). The score could be a margin (SVM) or a probability (LR, NN); we need to pick a threshold. We focus on this second type (the first type can be interpreted as an instance of it).

Score-based models. (Figure: examples placed on a scoring line running from Score = 1 at the top to Score = 0 at the bottom; green dots = positively labelled examples, gray dots = negatively labelled examples.) Prevalence = #positives / (#positives + #negatives). Consider the score to be the probability output of Logistic Regression; the position of each example on the scoring line is based on the model output. For most of the metrics, only the relative rank / ordering of positives and negatives matters. Prevalence decides the class imbalance. If there are too many examples to plot, plot class-wise histograms instead. The rank view is useful to start error analysis (start with the topmost negative example, or the bottom-most positive example).

Score-based models: Classifier. (Figure: a threshold Th = 0.5 on the scoring line; examples scoring above the threshold are predicted positive, those below are predicted negative.) Only after picking a threshold do we have a classifier. Once we have a classifier, we can obtain all the point metrics of that classifier.
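As a minimal sketch of this step (with made-up scores and labels, not the slide's figure), thresholding a score model's output turns it into a classifier:

```python
import numpy as np

# Hypothetical scores (e.g. probabilities from logistic regression) and
# ground-truth labels: 1 = positive ("green"), 0 = negative ("gray").
scores = np.array([0.95, 0.85, 0.80, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10])
y_true = np.array([1,    1,    0,    1,    0,    1,    0,    0,    1,    0])

prevalence = y_true.mean()                  # #positives / (#positives + #negatives)

threshold = 0.5
y_pred = (scores >= threshold).astype(int)  # predict positive iff score >= threshold

print("prevalence:", prevalence)
print("predictions at Th = 0.5:", y_pred)
```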

Point metrics: Confusion Matrix. (Figure: 2x2 matrix with columns Label Positive / Label Negative and rows Predict Positive / Predict Negative, at Th = 0.5.) All point metrics can be derived from the confusion matrix. The confusion matrix captures all the information about a classifier's performance, but it is not a scalar! Properties: - The total sum is fixed (population). - The column sums are fixed (class-wise population). - The quality of the model and the selected threshold decide how the columns are split into rows. - We want the diagonal to be heavy and the off-diagonals to be light.
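A small sketch (toy labels and predictions, assumed for illustration) of tallying the 2x2 confusion matrix and checking the properties above:

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])  # toy ground-truth labels
y_pred = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])  # toy predictions at some threshold

TP = np.sum((y_pred == 1) & (y_true == 1))  # green predicted green
FP = np.sum((y_pred == 1) & (y_true == 0))  # gray predicted green
FN = np.sum((y_pred == 0) & (y_true == 1))  # green predicted gray
TN = np.sum((y_pred == 0) & (y_true == 0))  # gray predicted gray

# Properties: the total is fixed, the column (class-wise) sums are fixed.
assert TP + FP + FN + TN == len(y_true)
assert TP + FN == np.sum(y_true == 1)
assert FP + TN == np.sum(y_true == 0)

# Rows: Predict Positive / Predict Negative; columns: Label Positive / Label Negative.
print(np.array([[TP, FP],
                [FN, TN]]))
```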

Point metrics: True Positives. TP = green examples correctly identified as green.

Point metrics: True Negatives. TN = gray examples correctly identified as gray. Trueness is along the diagonal.

Point metrics: False Positives. FP = gray examples falsely identified as green.

Point metrics: False Negatives. FN = green examples falsely identified as gray.

FP and FN are also called Type-I and Type-II errors, respectively.

Point metrics: Accuracy. Accuracy = what fraction of all predictions did we get right? Accuracy = (TP + TN) / (TP + TN + FP + FN). Improving accuracy is equivalent to minimizing a 0-1 loss function.

Point metrics: Precision. Precision = quality of positive predictions: Precision = TP / (TP + FP). Also called PPV (Positive Predictive Value).

Point metrics: Positive Recall (Sensitivity). Recall (sensitivity) = what fraction of all greens did we pick out? Recall = TP / (TP + FN). The terminology comes from lab tests: how sensitive is the test in detecting disease? Somewhat related to precision (both recall and precision involve TP). Trivial 100% recall = pull everybody above the threshold. Trivial 100% precision = push everybody below the threshold except the green on top (hopefully no gray above it!). Striving for good precision with 100% recall = pulling up the lowest green as high as possible in the ranking. Striving for good recall with 100% precision = pushing down the topmost gray as low as possible in the ranking.

Point metrics: Negative Recall (Specificity). Negative recall (specificity) = what fraction of all negatives in the population were we able to keep away? Specificity = TN / (TN + FP). Note: sensitivity and specificity operate in different sub-universes (the positive and negative label columns, respectively).

Point metrics: F score. Precision and recall are summarized into a single score. F1-score: harmonic mean of precision and recall, F1 = 2 * Precision * Recall / (Precision + Recall). G-score: geometric mean of precision and recall.
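Putting the point metrics together, a minimal sketch (with assumed counts, since the slide's running example values did not survive extraction):

```python
# Confusion-matrix counts, assumed for illustration.
TP, TN, FP, FN = 8, 7, 3, 2

accuracy    = (TP + TN) / (TP + TN + FP + FN)  # fraction of all predictions that are correct
precision   = TP / (TP + FP)                   # quality of positive predictions (PPV)
recall      = TP / (TP + FN)                   # sensitivity: fraction of all positives found
specificity = TN / (TN + FP)                   # negative recall: fraction of negatives kept away
f1          = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"acc={accuracy:.2f}  pr={precision:.2f}  rec={recall:.2f}  "
      f"spec={specificity:.2f}  f1={f1:.2f}")
```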

Point metrics: Changing the threshold. Changing the threshold (e.g. from 0.5 to 0.6) results in a new confusion matrix and new values for some of the metrics. Many threshold values are redundant: any two thresholds that fall between the same two consecutively ranked examples produce the same classifier. Number of effective thresholds = number of examples + 1 (fewer if there are tied scores).

Threshold scanning. (Table: the threshold is scanned from 1.00 down to 0.00, and each row lists TP, TN, FP, FN, accuracy, precision, recall, specificity, and F1 at that threshold; the exact values were lost in extraction.) As we scan through all possible effective thresholds, we explore all the possible values the metrics can take on for the given model. Each row is specific to a threshold; the table is specific to the model (different model = different effective thresholds). Observations: sensitivity is monotonic, and specificity is monotonic in the opposite direction. Precision is somewhat correlated with specificity, and is not monotonic. There is a tradeoff between sensitivity and specificity, and between precision and recall.
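A sketch of that scan (hypothetical scores and labels standing in for the slide's example): each distinct score induces one effective threshold, and each threshold yields one row of point metrics.

```python
import numpy as np

scores = np.array([0.95, 0.85, 0.80, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10])
y_true = np.array([1,    1,    0,    1,    0,    1,    0,    0,    1,    0])

# Effective thresholds: one above every score (nothing predicted positive),
# then one per distinct score, scanned from high to low.
thresholds = np.concatenate(([scores.max() + 0.05], np.unique(scores)[::-1]))

print(" thr  TP TN FP FN   acc  prec   rec  spec")
for th in thresholds:
    y_pred = (scores >= th).astype(int)
    TP = np.sum((y_pred == 1) & (y_true == 1))
    TN = np.sum((y_pred == 0) & (y_true == 0))
    FP = np.sum((y_pred == 1) & (y_true == 0))
    FN = np.sum((y_pred == 0) & (y_true == 1))
    acc  = (TP + TN) / len(y_true)
    prec = TP / (TP + FP) if TP + FP else float("nan")  # undefined when nothing is predicted positive
    rec  = TP / (TP + FN)
    spec = TN / (TN + FP)
    print(f"{th:4.2f}  {TP:2d} {TN:2d} {FP:2d} {FN:2d}  {acc:.2f}  {prec:.2f}  {rec:.2f}  {spec:.2f}")
```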

How to summarize the trade-off? {Precision, Specificity} vs. Recall/Sensitivity. We may not want to pick a threshold up front, but we still want to summarize the overall model performance.

Summary metrics: ROC (rotated version). (Figure: ROC curve drawn alongside the scoring line from Score = 1 to Score = 0.) AUC = Area Under the Curve, also called the C-statistic (concordance score). It represents how well the results are ranked: if you pick a positive example at random and a negative example at random, AUC = the probability that the positive example is ranked above the negative example. It is a measure of the quality of discrimination. Some people plot TPR (= sensitivity) vs. FPR (= 1 - specificity). Thresholds are points on this curve; each point represents one set of point metrics. The diagonal line corresponds to random guessing.
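A minimal sketch of that ranking interpretation (toy scores assumed; sklearn's roc_auc_score is used only as a cross-check): AUC computed directly as the fraction of positive-negative pairs ranked correctly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

scores = np.array([0.95, 0.85, 0.80, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10])
y_true = np.array([1,    1,    0,    1,    0,    1,    0,    0,    1,    0])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# AUC as P(random positive is scored above random negative), ties counted as 1/2.
diff = pos[:, None] - neg[None, :]
auc_pairwise = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(auc_pairwise, roc_auc_score(y_true, scores))  # the two values agree
```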

Summary metrics: PRC. (Figure: precision-recall curve alongside the scoring line.) The precision-recall curve represents a different tradeoff; think of ranked search-engine results. The right end of the curve cannot be lower than the prevalence. Area under the PRC = Average Precision. Intuition: if we randomly pick a threshold, what is the expected precision? While specificity monotonically decreases with increasing recall (the number of correctly classified negative examples can only go down), precision can be jagged: a run of positives in the ranked sequence increases both precision and recall (so the curve climbs slowly), but a run of negatives drops the precision without increasing recall. Hence, by construction, the curve has only slow climbs and vertical drops.
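A sketch of the average-precision computation (same assumed toy scores; average_precision_score from sklearn is only a cross-check, and the two agree here because there are no tied scores):

```python
import numpy as np
from sklearn.metrics import average_precision_score

scores = np.array([0.95, 0.85, 0.80, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10])
y_true = np.array([1,    1,    0,    1,    0,    1,    0,    0,    1,    0])

# Rank the examples by score, descending (the "search results" view).
labels = y_true[np.argsort(-scores)]

# Average precision = mean of precision@k, taken at the rank of each positive.
ranks = np.arange(1, len(labels) + 1)
precision_at_k = np.cumsum(labels) / ranks
ap_by_hand = precision_at_k[labels == 1].mean()

print(ap_by_hand, average_precision_score(y_true, scores))
```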

Summary metrics: Log-Loss motivation. (Figure: Model A and Model B scoring the same data set, shown on two scoring lines from Score = 1 to Score = 0.) Two models score the same data set: is one of them better than the other? Somehow the second model feels better.

Summary metrics: Log-Loss. These two model outputs have the same ranking, and therefore the same AU-ROC, AU-PRC, and accuracy! The metrics so far fail to capture this. Define the gain as the probability the model placed on the true label: Gain = p(x)*y + (1 - p(x))*(1 - y). Log-loss rewards confident correct answers and heavily penalizes confident wrong answers. exp(-E[log-loss]) (equivalently, exp(E[log Gain])) is the geometric mean of the gains, and lies in [0, 1]. Log-loss is gaining popularity as an evaluation metric (Kaggle). Accuracy optimizes for: the model gets $1 for a correct answer and $0 for a wrong one. Log-loss: the model makes a bet of $0.xy that its prediction is correct; if correct, it gets $0.xy, otherwise it gets $(1 - 0.xy). This incentive structure is well suited to producing calibrated models. The take-home is the geometric mean of the earned amounts. The low bar for exp(-E[log-loss]) is 0.5 (always betting 0.5). Log-loss on the validation set measures how much this betting engine earned.
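A minimal sketch of this (with assumed predicted probabilities; sklearn's log_loss is only a cross-check), showing the gain, the log-loss, and exp(-mean log-loss) as the geometric mean of the gains:

```python
import numpy as np
from sklearn.metrics import log_loss

p      = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3])  # predicted P(y = 1), assumed for illustration
y_true = np.array([1,   1,   0,   1,   0,   0])

gain = p * y_true + (1 - p) * (1 - y_true)  # probability the model put on the true label

ll = -np.mean(np.log(gain))  # log-loss = mean negative log of the gain
gm_of_gains = np.exp(-ll)    # geometric mean of the gains, in (0, 1]

print(ll, log_loss(y_true, p), gm_of_gains)
# Always betting 0.5 gives log-loss = log(2) ~ 0.693 and a geometric-mean gain of 0.5 (the low bar).
```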

Class Imbalance: Problems. Symptom: prevalence < 5% (no strict definition). Metrics: may not be meaningful. Learning: may not focus on minority-class examples at all (the majority class can overwhelm logistic regression, and to a lesser extent SVM). Scenarios: mostly retrieval-like, generally one of these patterns: - High precision is a hard constraint, optimize recall (e.g. search engine results, grammar correction): intolerant to FP. - High recall is a hard constraint, optimize precision (e.g. medical tests): intolerant to FN.

Class Imbalance: Metrics (pathological cases). Accuracy: blindly predict the majority class. Log-Loss: the majority class can dominate the loss. AUROC: easy to keep AUC high by scoring most negatives very low. AUPRC: somewhat more robust than AUROC, but has other challenges (what kind of interpolation? AUCNPR?). In general: Accuracy << AUROC << AUPRC. For accuracy, the prevalence of the majority class should be the low bar! AUROC example: at low prevalence (say 10%), if the top-scored examples are all negatives, followed by all of the positives, followed by the rest of the negatives, the AUROC remains high even though the ranking is practically useless.
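A quick sketch of the accuracy pathology (synthetic data with an assumed 2% prevalence): blindly predicting the majority class already clears a very high accuracy bar.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
prevalence = 0.02                              # 2% positives: heavily imbalanced (assumed)
y_true = (rng.random(n) < prevalence).astype(int)

y_pred = np.zeros(n, dtype=int)                # "model" that blindly predicts the majority class

accuracy = np.mean(y_pred == y_true)
recall = 0.0                                   # it never predicts the minority class
print(f"accuracy = {accuracy:.3f}, recall = {recall}")  # accuracy ~0.98, recall = 0
```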

Multi-class (a few remarks). The confusion matrix becomes NxN (we still want a heavy diagonal and light off-diagonals). Most metrics (except accuracy) are generally analysed as multiple 1-vs-many (one-vs-rest) problems. Class imbalance is common (in both the absolute and the relative sense). Cost-sensitive learning techniques (which also help with binary imbalance): assign a $$ value to each block of the confusion matrix and incorporate those values into the loss function.
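A minimal sketch (toy three-class labels assumed) of the NxN confusion matrix and the per-class one-vs-rest precision/recall derived from it:

```python
import numpy as np

n_classes = 3
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])  # toy multi-class labels
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0, 2, 0])  # toy predictions

# NxN confusion matrix: rows = true class, columns = predicted class.
cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1
print(cm)  # heavy diagonal = good

# One-vs-rest point metrics per class.
for k in range(n_classes):
    tp = cm[k, k]
    fn = cm[k, :].sum() - tp  # true class k predicted as something else
    fp = cm[:, k].sum() - tp  # other classes predicted as k
    print(f"class {k}: precision = {tp / (tp + fp):.2f}, recall = {tp / (tp + fn):.2f}")
```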

Thank You!