ECLT 5810 Evaluation of Classification Quality


Reference: Data Mining: Practical Machine Learning Tools and Techniques, by I. Witten, E. Frank, and M. Hall, Morgan Kaufmann

Testing and Error

Error rate: the proportion of errors made over the whole set of instances.
Test set: a set of independent instances that played no part in forming the classifier.
Assumption: both the training data and the test data are representative samples of the underlying problem.

ECLT 5810 Classification Evaluation 2

Holdout estimation

The holdout method reserves a certain amount of the data for testing and uses the remainder for training. Usually one third is held out for testing and the rest used for training.
Problem: the samples might not be representative. For example, a class might be entirely missing from the test data.
An advanced version uses stratification, which ensures that each class is represented with approximately equal proportions in both subsets.
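A stratified holdout split can be sketched in plain Python as follows. This is a minimal illustration, not the textbook's code; the function name stratified_holdout and the test_fraction parameter are our own choices.

```python
import random

def stratified_holdout(instances, labels, test_fraction=1/3, seed=0):
    """Split (instance, label) pairs into train/test sets, keeping
    each class represented in approximately equal proportions."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    train, test = [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)                            # randomize within each class
        cut = int(round(len(xs) * test_fraction))  # per-class test share
        test.extend((x, y) for x in xs[:cut])
        train.extend((x, y) for x in xs[cut:])
    return train, test

# Example: 90 instances of class "a", 10 of class "b".
# Both subsets end up with roughly a 9:1 class ratio.
X = list(range(100))
y = ["a"] * 90 + ["b"] * 10
train, test = stratified_holdout(X, y)
```

Because the split is done per class, the rare class "b" cannot go missing from the test set, which is exactly the failure mode stratification guards against.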

Repeated holdout method

The holdout estimate can be made more reliable by repeating the process with different subsamples. In each iteration, a certain proportion is randomly selected for training (possibly with stratification). The error rates from the different iterations are averaged to yield an overall error rate. This is called the repeated holdout method.
Still not optimal: the different test sets overlap. Can we prevent overlapping?
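A minimal sketch of repeated holdout, assuming a caller-supplied train_and_eval(train, test) function that fits a classifier and returns its test-set error rate. The callback, and the majority-class classifier used in the example, are hypothetical illustrations, not part of the lecture material.

```python
import random

def holdout_error(data, train_and_eval, test_fraction=1/3, rng=None):
    """One holdout run: shuffle, split, return the test-set error rate."""
    rng = rng or random.Random()
    data = data[:]
    rng.shuffle(data)
    cut = int(len(data) * test_fraction)
    test, train = data[:cut], data[cut:]
    return train_and_eval(train, test)

def repeated_holdout(data, train_and_eval, repeats=10, seed=0):
    """Average the error rate over several holdout runs with
    different random subsamples."""
    rng = random.Random(seed)
    errors = [holdout_error(data, train_and_eval, rng=rng)
              for _ in range(repeats)]
    return sum(errors) / len(errors)

def majority_error(train, test):
    """Toy classifier: predict the most common training label."""
    labels = [y for _, y in train]
    pred = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y != pred) / len(test)

data = [(i, "a") for i in range(80)] + [(i, "b") for i in range(20)]
avg_err = repeated_holdout(data, majority_error)
```

Note that across the ten runs the randomly drawn test sets can (and usually do) share instances, which is the overlap problem cross-validation removes.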

Cross-validation

Cross-validation avoids overlapping test sets.
First step: the data is split into k subsets of equal size.
Second step: each subset in turn is used for testing and the remainder for training.
This is called k-fold cross-validation. Often the subsets are stratified before the cross-validation is performed. The error estimates are averaged to yield an overall error estimate.

Cross-validation

Split the available data set into k equal partitions P1, ..., Pk:

    Training set          Testing set    Accuracy
    P2, ..., Pk           P1             A1
    P1, P3, ..., Pk       P2             A2
    ...                   ...            ...
    P1, P2, ..., Pk-1     Pk             Ak

Average accuracy: A = (A1 + A2 + ... + Ak) / k
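The k-fold procedure described above can be sketched in Python. As before, train_and_eval(train, test) stands in for any real learner and the majority-class classifier is just a hypothetical placeholder.

```python
import random

def cross_validation_error(data, train_and_eval, k=10, seed=0):
    """k-fold cross-validation: each fold is used exactly once for
    testing, so the test sets never overlap."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    folds = [data[i::k] for i in range(k)]   # k roughly equal partitions
    errors = []
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        errors.append(train_and_eval(train, test))
    return sum(errors) / k                   # average over the k folds

def majority_error(train, test):
    """Toy classifier: predict the most common training label."""
    labels = [y for _, y in train]
    pred = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y != pred) / len(test)

data = [(i, "a") for i in range(80)] + [(i, "b") for i in range(20)]
cv_err = cross_validation_error(data, majority_error, k=10)
```

Because every instance appears in exactly one test fold, the k error estimates are computed on disjoint test sets.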

More on cross-validation

The standard method for evaluation is stratified ten-fold cross-validation. Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate, and there is also some theoretical evidence for this. Stratification reduces the estimate's variance.
Even better: repeated stratified cross-validation, e.g. ten-fold cross-validation repeated ten times with the results averaged (this further reduces the variance).

Binary Classification

For each testing instance, there are only four possible situations:
predicted: yes, actual: yes
predicted: yes, actual: no
predicted: no, actual: yes
predicted: no, actual: no
The contingency table records the total number of testing instances for each situation:

                        Predicted Class
                        YES               NO
    Actual Class  YES   True Positive     False Negative
                  NO    False Positive    True Negative

Binary Classification

                        Predicted Class
                        YES                    NO
    Actual Class  YES   True Positive (TP)     False Negative (FN)
                  NO    False Positive (FP)    True Negative (TN)

Error rate = (FP + FN) / (TP + FN + FP + TN)
Accuracy rate = 1 - Error rate
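Computing the four counts and the error rate from predictions is a short exercise; a minimal sketch (the function names are our own):

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Tally the four outcomes of a binary classifier."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1              # predicted yes, actual yes
        elif a == positive:
            fn += 1              # predicted no, actual yes
        elif p == positive:
            fp += 1              # predicted yes, actual no
        else:
            tn += 1              # predicted no, actual no
    return tp, fn, fp, tn

def error_rate(tp, fn, fp, tn):
    return (fp + fn) / (tp + fn + fp + tn)

actual    = ["yes", "yes", "no", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "yes", "no", "yes"]
tp, fn, fp, tn = confusion_counts(actual, predicted)
```

Here two of the six instances are misclassified (one false negative, one false positive), so the error rate is 2/6 and the accuracy 4/6.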

A Marketing Application Scenario

In a direct mailing business, a promotional offer is mass-mailed to a million households (1,000,000). Suppose the response rate is 0.1% (i.e., 1,000 respondents).
With a random selection of a subset of 100,000 households for mailing, the number of respondents is 100.
Suppose a data mining method is used to select the 100,000 households instead, and the response rate is 0.4% (400 respondents).

Undesirable Effect of Accuracy

Random Prediction

                        Predicted Class
    Actual Class        YES        NO         Total
    YES                 100        900        1,000
    NO                  99,900     899,100    999,000
    Total               100,000    900,000    1,000,000

Accuracy = 0.8992 (Error = 0.1008)

A Data Mining Method

                        Predicted Class
    Actual Class        YES        NO         Total
    YES                 400        600        1,000
    NO                  99,600     899,400    999,000
    Total               100,000    900,000    1,000,000

Accuracy = 0.8998 (Error = 0.1002)

Although the data mining method finds four times as many respondents, its accuracy is barely higher than that of random selection, because the overwhelming majority class (non-respondents) dominates the accuracy measure.

Lift Factor

The random response rate is 0.1% (i.e., 100 respondents per 100,000 mailed). The response rate of a certain data mining method is 0.4% (400 respondents). This increase in response rate is known as the lift factor (a factor of 4 in this case, since 0.4% / 0.1% = 4).
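The lift computation is a simple ratio of response rates; a minimal sketch using the figures from the mailing scenario (the function name lift_factor is our own):

```python
def lift_factor(targeted_respondents, targeted_total,
                baseline_respondents, baseline_total):
    """Ratio of the model's response rate to the baseline response rate."""
    model_rate = targeted_respondents / targeted_total
    base_rate = baseline_respondents / baseline_total
    return model_rate / base_rate

# Random selection yields 100 respondents per 100,000 mailed;
# the data mining method yields 400 per 100,000.
lift = lift_factor(400, 100_000, 100, 100_000)
```

Unlike raw accuracy, the lift factor clearly exposes the fourfold improvement that the accuracy figures on the previous slide obscure.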