Data Mining: Classifier Evaluation. CSCI-B490 Seminar in Computer Science (Data Mining)



Predictor Evaluation
1. Question: how good is our algorithm, and how will we estimate its performance?
2. Question: what performance measure should we use?
   Classification: we will use accuracy.
   Regression: we will use mean squared error and the R² measure.
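
These measures can be sketched concretely (in Python rather than the course's MATLAB; the data values below are made up purely for illustration):

```python
# Hypothetical classification labels, for illustration only.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

# Classification: accuracy = fraction of correctly classified points.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical regression targets and predictions.
t = [2.0, 3.0, 5.0, 4.0]
p = [2.5, 2.5, 5.5, 4.0]

# Regression: mean squared error ...
mse = sum((a - b) ** 2 for a, b in zip(t, p)) / len(t)

# ... and the R^2 measure: 1 - (residual sum of squares / total sum of squares).
mean_t = sum(t) / len(t)
ss_tot = sum((a - mean_t) ** 2 for a in t)
ss_res = sum((a - b) ** 2 for a, b in zip(t, p))
r2 = 1 - ss_res / ss_tot
```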

How to Make a Random Split? Dataset D: 5 data points. Use a random number generator!

>> x = rand(1, 5);
>> [a b] = sort(x)
a = 0.1576  0.4854  0.8003  0.9572  0.9706
b = 1  4  5  3  2
>> p1 = b(1 : 2 : length(b))
p1 = 1  5  2
>> p2 = b(2 : 2 : length(b))
p2 = 4  3

The odd positions of the shuffled index vector b form partition 1; the even positions form partition 2.
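
The same idea can be sketched in Python (the slides use MATLAB); the seed is fixed only so the split is reproducible, and the resulting partitions are illustrative:

```python
import random

random.seed(0)  # fixed seed so the split is reproducible
n = 5
x = [random.random() for _ in range(n)]              # like x = rand(1, 5)
b = sorted(range(1, n + 1), key=lambda i: x[i - 1])  # like [a b] = sort(x)
p1 = b[0::2]   # odd positions of the shuffled order -> partition 1
p2 = b[1::2]   # even positions -> partition 2
```

Whatever the random values turn out to be, p1 and p2 are always disjoint and together cover all 5 data points.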

4-Fold Cross-Validation. Dataset D: 20 data points, randomly and evenly split into 4 non-overlapping partitions.
Partition 1. Data points: 1, 3, 5, 15, 16
Partition 2. Data points: 6, 10, 11, 14, 18
Partition 3. Data points: 4, 9, 12, 19, 20
Partition 4. Data points: 2, 7, 8, 13, 17

4-Fold Cross-Validation. Step 1: use partition 1 as the test set and partitions 2, 3, and 4 as the training set. Pretend the class (target) is not known for the data points in the test set. Use the training set to predict the class (target) for the data points in the test set, then count the number of misses (classification) or the error (regression).

4-Fold Cross-Validation. Step 2: use partition 2 as the test set and partitions 1, 3, and 4 as the training set. As before, predict the class (target) for the test points using only the training set, and count the misses or the error.

4-Fold Cross-Validation. Step 3: use partition 3 as the test set and partitions 1, 2, and 4 as the training set. Predict the class (target) for the test points and count the misses or the error.

4-Fold Cross-Validation. Step 4: use partition 4 as the test set and partitions 1, 2, and 3 as the training set. Predict the class (target) for the test points and count the misses or the error.
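
The four steps above can be written as a single loop; here is a Python sketch with a hypothetical `train_and_predict` callback standing in for any classifier:

```python
import random

def k_fold_cv(points, labels, k, train_and_predict):
    """Estimate accuracy by k-fold cross-validation.

    train_and_predict(train_idx, test_idx) must return predicted labels
    for the points in test_idx; it stands in for any classifier.
    """
    idx = list(range(len(points)))
    random.shuffle(idx)                       # random split ...
    folds = [idx[i::k] for i in range(k)]     # ... into k even, non-overlapping partitions
    misses = 0
    for i in range(k):
        test = folds[i]                       # step i: fold i is the test set
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        preds = train_and_predict(train, test)
        misses += sum(labels[j] != p for j, p in zip(test, preds))
    return 1 - misses / len(points)

# Toy check with a "classifier" that always predicts class 0
# (hypothetical data: 16 zeros and 4 ones, as in the slides' example).
labels = [0] * 16 + [1] * 4
acc = k_fold_cv(list(range(20)), labels, 4, lambda tr, te: [0] * len(te))
```

With the always-0 predictor the estimate is exactly the majority-class fraction, 16/20, whatever the random split.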

4-Fold Cross-Validation. After all four steps, each of the 20 data points has a true class and a predicted class. For example: true class 1, predicted class 1; true class 0, predicted class 0; ...; true class 1, predicted class 0.

Confusion Matrix (Binary Class). Let n_ij denote the number of data points whose predicted class is i and whose true class is j:

              true class 0   true class 1
predicted 0       n00            n01
predicted 1       n10            n11

For example, n01 is the number of data points whose true class was 1 but whose predicted class was 0.

Accuracy = (n00 + n11) / (n00 + n01 + n10 + n11)
Error = 1 - Accuracy

Classification Accuracy and Error. Accuracy = N_correct / N, where N_correct is the number of correctly classified data points and N is the total number of data points. Error = 1 - N_correct / N.
Another naming convention:
n00 = number of true negatives
n01 = number of false negatives
n10 = number of false positives
n11 = number of true positives

More on Accuracy (Binary Class Case)
sn = n11 / (n11 + n01): sensitivity, the accuracy on data points whose class is 1. Also called the true positive rate.
sp = n00 / (n00 + n10): specificity, the accuracy on data points whose class is 0. Also called the true negative rate.
1 - sn = n01 / (n11 + n01): false negative rate.
1 - sp = n10 / (n00 + n10): false positive rate.
Accuracy_B = (sn + sp) / 2: balanced-sample accuracy.

More on Accuracy (Binary Class Case)
rc = n11 / (n11 + n01): recall, the accuracy on data points whose class is 1. Same as sensitivity or the true positive rate.
pr = n11 / (n11 + n10): precision, the accuracy on data points that were predicted as 1. Also called the positive predictive value.
1 - rc = n01 / (n11 + n01): false negative rate.
1 - pr = n10 / (n11 + n10): false discovery rate. Important: it is very different from the false positive rate!
F_β = (1 + β²) · pr · rc / (β² · pr + rc): F-measure.
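
These definitions can be checked numerically; the Python sketch below uses hypothetical confusion-matrix counts and β = 1 for the F-measure:

```python
# Counts from a hypothetical confusion matrix, n_ij = (predicted i, true j):
n00, n01, n10, n11 = 10, 1, 4, 5   # TN, FN, FP, TP

sn = n11 / (n11 + n01)             # sensitivity / recall / true positive rate
sp = n00 / (n00 + n10)             # specificity / true negative rate
pr = n11 / (n11 + n10)             # precision / positive predictive value
fpr = n10 / (n00 + n10)            # false positive rate = 1 - sp
fdr = n10 / (n11 + n10)            # false discovery rate = 1 - pr
balanced = (sn + sp) / 2           # balanced-sample accuracy

beta = 1.0                         # F1: precision and recall weighted equally
f_beta = (1 + beta**2) * pr * sn / (beta**2 * pr + sn)
```

Note how the false discovery rate (4/9) and the false positive rate (4/14) differ even on this small example.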

Trivial and Random Classifiers. Suppose 16 data points have class 0 (the majority class) and 4 data points have class 1 (the minority class).
Trivial classifier: always predict the majority class. Its accuracy is 16/20 = 80%.
Random classifier: predict class 0 with probability 0.8 and class 1 with probability 0.2. Its expected accuracy is 0.8² + 0.2² = 0.68, i.e. 68%.
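
A quick simulation in Python confirms the 68% figure for the random classifier; the class proportion 0.8 comes from the slide, while the seed and sample size are arbitrary:

```python
import random

random.seed(1)
p0 = 0.8          # fraction of class-0 points, as in the slide
n = 200_000       # large sample so the estimate is stable

# Random classifier: predict 0 with probability p0, 1 otherwise,
# independently of the (also random) true class.
correct = 0
for _ in range(n):
    true = 0 if random.random() < p0 else 1
    pred = 0 if random.random() < p0 else 1
    correct += true == pred
est = correct / n  # should be close to 0.8**2 + 0.2**2 = 0.68
```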

N-Fold Cross-Validation. Is there a chance that our accuracy estimate depends on the random partition we made at the beginning? Yes! One way to overcome this is to generate several random splits into folds, estimate the accuracy for each split, and finally average these estimates.

Leave-One-Out Performance Estimation. Leave-one-out is the extreme case of N-fold cross-validation in which N equals the number of data points in the data set D: at step i, data point i alone is the test set and all remaining points form the training set.
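
A minimal Python sketch of leave-one-out, with a hypothetical `predict_one` callback (here a majority-vote stand-in) in place of a real classifier:

```python
def leave_one_out(labels, predict_one):
    """Leave-one-out: each data point is the test set exactly once.

    predict_one(train_idx, i) returns the predicted label for point i;
    it stands in for any classifier trained on the points in train_idx.
    """
    n = len(labels)
    misses = 0
    for i in range(n):
        train = [j for j in range(n) if j != i]   # all points except i
        misses += labels[i] != predict_one(train, i)
    return 1 - misses / n

# Illustrative data (16 zeros, 4 ones) and a majority-vote "classifier"
# that predicts whichever class is most common in the training labels.
labels = [0] * 16 + [1] * 4
majority = lambda tr, i: max(set(labels[j] for j in tr),
                             key=[labels[j] for j in tr].count)
acc = leave_one_out(labels, majority)
```

Every class-1 point is misclassified (the training majority is always 0), so the estimate is 16/20.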

In Short... 1. Whatever N you use for N-fold cross-validation, every data point will be used exactly once as a test point. 2. Each data point will thus have one true value and one predicted value. 3. From these two values we compute the accuracy. 4. Terminology: we always say that we are estimating the classifier's performance. If the data set is large enough and representative, this will be a good estimate. The truth is, we can only hope that our test cases will be representative and that our estimate is meaningful.

Error Bars on Accuracy Estimates. For example, suppose the data set contains n = 20 data points and the confusion matrix is

              true class 0   true class 1
predicted 0       10              1
predicted 1        4              5

Accuracy = 15/20 = 0.75
σ_Accuracy = sqrt(Accuracy · (1 - Accuracy) / n) = sqrt(0.75 · 0.25 / 20) ≈ 0.10

How to Report Accuracy Estimates. For example, suppose the estimated accuracy is 0.75. Let's see what the error bar is for n = 10, 100, and 1000:
n = 10: sqrt(0.75 · 0.25 / 10) ≈ 0.14
n = 100: sqrt(0.75 · 0.25 / 100) ≈ 0.04
n = 1000: sqrt(0.75 · 0.25 / 1000) ≈ 0.01
For n = 100 we would report the accuracy as: Accuracy = 0.75 ± 0.04.
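
The error bars above follow from a one-line helper (a Python sketch; the function name is ours):

```python
from math import sqrt

def accuracy_error_bar(acc, n):
    """Binomial standard error of an accuracy estimate over n test points."""
    return sqrt(acc * (1 - acc) / n)

# Error bars for an estimated accuracy of 0.75 at several test-set sizes.
bars = {n: accuracy_error_bar(0.75, n) for n in (10, 100, 1000)}
```

Quadrupling... rather, multiplying the test-set size by 100 shrinks the error bar only by a factor of 10, since the standard error scales as 1/sqrt(n).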