Model's Performance Measures

Model's Performance Measures
- Evaluating the performance of a classifier (Section 4.5 of course book)
- Taking into account misclassification costs
- Class imbalance problem (Section 5.7 of course book)
TNM033: Introduction to Data Mining

Evaluation of a Classifier
- How predictive is the model we learned? Which performance measure to use?
- Natural performance measure for classification problems: error rate on a test set
  - Success: the instance's class is predicted correctly
  - Error: the instance's class is predicted incorrectly
  - Error rate: proportion of errors made over the whole set of instances
  - Accuracy: proportion of correctly classified instances over the whole set of instances
  - accuracy = 1 - error rate
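As a minimal sketch (not part of the original slides; the labels below are invented), error rate and accuracy can be computed directly from the true and predicted labels of a test set:

# Minimal sketch: error rate and accuracy on a small, made-up test set.
y_true = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]
y_pred = ["yes", "no", "no",  "yes", "yes", "no", "yes", "no"]

errors = sum(t != p for t, p in zip(y_true, y_pred))
error_rate = errors / len(y_true)   # proportion of misclassified instances
accuracy = 1 - error_rate           # proportion of correctly classified instances

print(f"error rate = {error_rate:.3f}, accuracy = {accuracy:.3f}")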

Evaluation of a Classifier
- Is the accuracy measure enough to evaluate the performance of a classifier?
- Accuracy can be of little help if classes are severely unbalanced
  - Classifiers are biased to predict the majority class well
    - e.g. decision trees based on the information gain measure
    - e.g. pruning rules for the rare class can be difficult, resulting in rules involving a large number of attributes and low coverage (overfitting)
- Predicting one of the classes correctly may be more important
  - How frequently are records of the positive (+) class correctly classified?

Confusion Matrix

                          PREDICTED
                      Class=Yes   Class=No
ACTUAL  Class=Yes         a           b
        Class=No          c           d

a: TP (true positive)     b: FN (false negative)
c: FP (false positive)    d: TN (true negative)
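A small sketch of building this matrix, assuming scikit-learn is available and reusing the invented labels from above; the labels argument fixes the row/column order so that "yes" comes first, giving the same a/b/c/d layout as in the table:

from sklearn.metrics import confusion_matrix

y_true = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]
y_pred = ["yes", "no", "no",  "yes", "yes", "no", "yes", "no"]

# Rows = actual class, columns = predicted class; listing "yes" first gives
# cm = [[a, b], [c, d]] = [[TP, FN], [FP, TN]].
cm = confusion_matrix(y_true, y_pred, labels=["yes", "no"])
(tp, fn), (fp, tn) = cm
print(cm)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)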

Other Metrics

                          PREDICTED
                      Class=Yes   Class=No
ACTUAL  Class=Yes       a (TP)      b (FN)
        Class=No        c (FP)      d (TN)

- Precision (p) = TP / (TP + FP)
- Recall (r) = TP / (TP + FN) = TPR
- F-measure (F) = 2rp / (r + p)
  - F tends to be closer to the smaller of precision and recall

Example: A good medical test
(Figure: regions of TPs, TNs, FNs and one FP.)
- Accuracy: high
- Sensitivity (R): high for the + class, i.e. few FNs
- Specificity (TNR): high for the negative (-) class, i.e. few FPs
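Continuing the same sketch (scikit-learn assumed, labels invented), the metrics above can be computed as follows; specificity is derived from the confusion-matrix counts:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]
y_pred = ["yes", "no", "no",  "yes", "yes", "no", "yes", "no"]

p = precision_score(y_true, y_pred, pos_label="yes")   # TP / (TP + FP)
r = recall_score(y_true, y_pred, pos_label="yes")      # TP / (TP + FN), a.k.a. sensitivity / TPR
f = f1_score(y_true, y_pred, pos_label="yes")          # 2rp / (r + p)

(tp, fn), (fp, tn) = confusion_matrix(y_true, y_pred, labels=["yes", "no"])
specificity = tn / (tn + fp)                           # TNR: how well the "no" class is recognised

print(f"precision={p:.2f} recall={r:.2f} F={f:.2f} specificity={specificity:.2f}")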

Example: A bad medical test
(Figure: regions of FPs and TPs.)
- Accuracy: very high
- Sensitivity (R): 100% for the + class, i.e. zero FNs
- Specificity (TNR): 0%
- Classifies correctly half of the records

Example: A better medical test
(Figure: regions of TNs, FPs and TPs.)
- Accuracy: high
- Sensitivity (R): 100% for the + class, i.e. zero FNs
- Specificity (TNR): 50%
  - Bad at classifying records of the - class, FPR = 50%

Example: classify documents (about X or not)
(Figure: a large region of TNs, plus regions of TPs and FPs.)
- Accuracy: very high
- Sensitivity (R): 100% for the + class, i.e. zero FNs
- Specificity (TNR): very high for the - class, i.e. few FPs
- Bad classifier: about half of the documents classified as being about X have nothing to do with X

Example: classify documents (about X or not)
- Precision = TP / (TP + FP)
- Recall (R) = TP / (TP + FN)
(Figure: regions of TPs, FPs and TNs; FNs = 0.)
- Accuracy: very high
- Recall (R): 100% for the + class, i.e. zero FNs
- Precision: 50% (low)
  - Too many FPs compared to the TPs
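To make the point concrete with invented counts (not from the slides): suppose there are 10,000 documents, only 100 of which are really about X, and the classifier flags 200 of them. Accuracy and specificity look excellent, yet half of the flagged documents are irrelevant:

# Hypothetical corpus: 10,000 documents, 100 actually about X.
tp, fn = 100, 0        # every relevant document is found (recall = 100%)
fp, tn = 100, 9800     # but just as many irrelevant documents are flagged

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 0.99  -> "very high"
recall      = tp / (tp + fn)                    # 1.00  -> zero FNs
specificity = tn / (tn + fp)                    # ~0.99 -> few FPs relative to the many TNs
precision   = tp / (tp + fp)                    # 0.50  -> half the flagged documents are not about X

print(accuracy, recall, specificity, precision)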

Example: classify documents (about X or not)
- Precision = TP / (TP + FP)
- Recall (R) = TP / (TP + FN)
(Figure: regions of TNs, FNs and TPs; no FPs.)
- Accuracy: very high
- Recall (R): very high for the + class, i.e. few FNs
- Precision: 100%
  - No FPs

Classification Costs
- Associate a cost with misclassification
  - Class +: with brain tumour
  - Class -: without brain tumour
- An FN error can be highly costly: a patient with a brain tumour is missed and may die
- An FP error is not so serious: a healthy patient is sent to extra screening unnecessarily
- A classifier biased toward the + class is preferred

Cost-Sensitive Learning
- How to build a classifier that takes the misclassification costs into account?
- FN errors cost more than FP errors: C(No|Yes) > C(Yes|No)

Cost Matrix

                          PREDICTED
    C(i|j)            Class=Yes    Class=No
ACTUAL  Class=Yes     C(Yes|Yes)   C(No|Yes)
        Class=No      C(Yes|No)    C(No|No)

- C(i|j): cost of misclassifying a class j example as class i
- How to use the cost matrix?
  - Cost-sensitive evaluation of classifiers (average cost of the model)
  - Cost-sensitive classification
  - Cost-sensitive learning
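For cost-sensitive classification, one standard recipe (sketched here with invented costs; the slides do not prescribe a particular rule) is to predict the class with the smallest expected cost under the classifier's posterior probabilities:

import numpy as np

# C[i][j] = cost of predicting class i when the true class is j (0 = Yes, 1 = No).
# Invented values: an FN (predict No, truth Yes) costs 100, an FP costs 1.
C = np.array([[  0, 1],
              [100, 0]])

def cost_sensitive_predict(p_yes):
    """Return the class with the smallest expected cost, given P(Yes | x)."""
    posterior = np.array([p_yes, 1.0 - p_yes])   # [P(Yes|x), P(No|x)]
    expected_cost = C @ posterior                # expected cost of predicting Yes, of predicting No
    return "Yes" if expected_cost[0] <= expected_cost[1] else "No"

# With FN errors 100x more expensive than FP errors, even a low P(Yes|x) triggers a Yes.
print(cost_sensitive_predict(0.05))    # -> Yes
print(cost_sensitive_predict(0.005))   # -> No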

Cost-sensitive Evaluation

Cost matrix:
                          PREDICTED
    C(i|j)            Class=Yes   Class=No
ACTUAL  Class=Yes        -1          100
        Class=No          1            0

Model M1:
                          PREDICTED
                      Class=Yes   Class=No
ACTUAL  Class=Yes        150          40
        Class=No          60         250
Accuracy = 80%, Cost = 3910

Model M2:
                          PREDICTED
                      Class=Yes   Class=No
ACTUAL  Class=Yes        250          45
        Class=No           5         200
Accuracy = 90%, Cost = 4255

Cost-sensitive Learning
- Cost information can be incorporated into classification
  - Decision trees
    - Cost-sensitive splitting criteria
    - Cost-sensitive pruning
- General way to make any learning method cost-sensitive
  - Generate data samples with different proportions of Yes (+) and No (-) instances
  - Ex: + instances are increased by a factor of 10, i.e. oversample the Yes class
  - The classifier is skewed toward avoiding errors on the Yes instances: fewer false negatives than false positives
- Read the paper about MetaCost (available in WEKA)
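A quick recomputation of the numbers above, as a sketch; it assumes the Yes class is listed first and that rows are the actual class in both the cost and confusion matrices:

import numpy as np

# Cost matrix from the slide: C(Yes|Yes) = -1, C(No|Yes) = 100, C(Yes|No) = 1, C(No|No) = 0.
cost = np.array([[-1, 100],
                 [ 1,   0]])

# Confusion matrices of the two models (rows = actual, columns = predicted).
M1 = np.array([[150,  40],
               [ 60, 250]])
M2 = np.array([[250,  45],
               [  5, 200]])

for name, cm in (("M1", M1), ("M2", M2)):
    accuracy = np.trace(cm) / cm.sum()
    total_cost = np.sum(cm * cost)     # weight each cell of the confusion matrix by its cost
    print(name, accuracy, total_cost)  # M1: 0.80, 3910   M2: 0.90, 4255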

Cost-Sensitive Learning
- Oversample the positive class
- FN errors cost more than FP errors: C(No|Yes) > C(Yes|No)

Class Imbalance Problem
- Find models that describe a rare class
- Why? Correct classification of the rare class may have greater value than correct classification of the majority class
- Problems:
  - Classifiers tend to be biased toward the majority class
  - Highly specialized patterns are susceptible to the presence of noise
  - The accuracy (error rate) measure can be misleading
- Example: consider a 2-class problem
  - Number of - class examples = 9990
  - Number of + class examples = 10
  - If the model predicts everything to be the - class, then accuracy is 9990/10000 = 99.9%
- How can we tackle the problem?
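The slide's arithmetic written out: a model that always predicts the majority class reaches 99.9% accuracy while never finding a single rare-class instance (recall = 0):

# 2-class problem from the slide: 9,990 "-" examples and 10 "+" examples.
n_neg, n_pos = 9990, 10

# A trivial model that predicts "-" for everything:
tp, fn = 0, n_pos      # every "+" instance is missed
tn, fp = n_neg, 0      # every "-" instance is correct

accuracy = (tp + tn) / (n_pos + n_neg)   # 0.999 -> looks excellent
recall = tp / (tp + fn)                  # 0.0   -> the rare class is never found
print(accuracy, recall)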

Class Imbalance Problem
- Oversampling of the rare class (+)
  - Problem: more sensitive to noise and overfitting
- Undersampling of the majority class (-)
  - Problem: useful cases of the - class may not be chosen for training
- Cost-sensitive learning
  - C(No|Yes) > C(Yes|No) (cost of FN higher than cost of FP)
  - Use the MetaCost algorithm
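For the two sampling strategies, a minimal sketch using scikit-learn's resample utility (the class labels, sizes and the choice of scikit-learn are my own, not from the slides):

import numpy as np
from sklearn.utils import resample

# Made-up imbalanced training set: 95 "-" (majority) and 5 "+" (rare) instances.
X = np.arange(100).reshape(-1, 1)
y = np.array(["-"] * 95 + ["+"] * 5)
X_pos, X_neg = X[y == "+"], X[y == "-"]

# Oversampling the rare class: draw "+" instances with replacement up to the "-" count.
X_pos_over = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=0)

# Undersampling the majority class: draw "-" instances without replacement down to the "+" count.
X_neg_under = resample(X_neg, replace=False, n_samples=len(X_pos), random_state=0)

print(len(X_pos_over), len(X_neg_under))   # 95 and 5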

Oversampling
(figure)

Undersampling
(figure)

Weka: Sampling
- Preprocess → supervised → instance
- Resample
  - noReplacement = false, biasToUniformClass = 1.0
  - Undersamples the majority class and oversamples the rare class
- SpreadSubsample
  - distributionSpread = 1 (undersamples the majority class)
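As a rough Python illustration (not Weka code; the dataset is invented) of the effect the slides describe for Resample with biasToUniformClass = 1: draw a new training set of the same size in which every class gets an equal share, which simultaneously undersamples the majority class and oversamples the rare one.

import numpy as np

rng = np.random.default_rng(0)

# Invented imbalanced training set: 9,990 "-" instances and 10 "+" instances.
y = np.array(["-"] * 9990 + ["+"] * 10)
X = np.arange(len(y)).reshape(-1, 1)

classes = np.unique(y)                      # ["+", "-"]
per_class = len(y) // len(classes)          # equal share of the original size for each class

idx = np.concatenate([
    rng.choice(np.flatnonzero(y == c), size=per_class, replace=True)
    for c in classes
])
X_bal, y_bal = X[idx], y[idx]

print({c: int((y_bal == c).sum()) for c in classes})   # 5000 instances of each class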