Classification. Instructor: Wei Ding


Classification Part II
Instructor: Wei Ding
Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Practical Issues of Classification
- Underfitting and Overfitting
- Missing Values
- Costs of Classification

Underfitting and Overfitting (Example)
500 circular and 500 triangular data points.
Circular points: 0.5 <= sqrt(x1^2 + x2^2) <= 1
Triangular points: sqrt(x1^2 + x2^2) > 1 or sqrt(x1^2 + x2^2) < 0.5

Underfitting and Overfitting
Overfitting: once the tree becomes too large, its test error rate begins to increase even though its training error rate continues to decrease.
Underfitting: when the model is too simple, both training and test errors are large.

Understand the overfitting phenomenon
The training error of a model can be reduced by increasing the model complexity.
The leaf nodes of the tree can be expanded until it perfectly fits the training data.
The training error for such a complex tree could be zero, but the test error can be large because the tree may contain nodes that accidentally fit some of the noise points in the training data.
Such nodes degrade the performance of the tree because they do not generalize well to the test examples.

Overfitting due to Noise
The decision boundary is distorted by noise points.

Overfitting due to Lack of Representative Samples
Problem: a small number of training records means a lack of representative samples in the training data.
Learning algorithms continue to refine their models even when few training records are available.

Overfitting due to Insufficient Examples
The lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region.
An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.

Notes on Overfitting
Overfitting results in decision trees that are more complex than necessary.
Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
We need new ways of estimating errors.

Overfitting and the Multiple Comparison Procedure
Model overfitting is related to a methodology known as the multiple comparison procedure.
The probability that one given stock analyst predicts the market correctly at least 8 out of 10 times in the next ten trading days: very low.
The probability that at least one of 50 stock analysts predicts correctly at least 8 out of 10 times: very high.

Overfitting and the Multiple Comparison Procedure
During decision tree induction, attribute x can be added to the tree if the observed gain, Δ(T0, Tx), is greater than some predefined threshold α.
In practice, more than one test condition is available and the decision tree algorithm must choose the best attribute x_max from a set of candidates, {x1, x2, ..., xk}, to partition the data.
In this situation, the algorithm is actually using a multiple comparison procedure to decide whether a decision tree should be extended.
Unless the gain function or threshold α is modified to account for k, the algorithm may inadvertently add spurious nodes to the model, which leads to model overfitting.

Estimating Generalization Errors
Re-substitution errors: error on training (Σ e(t))
Generalization errors: error on testing (Σ e'(t))
Methods for estimating generalization errors:
- Optimistic approach: e'(t) = e(t)
- Pessimistic approach: for each leaf node, e'(t) = e(t) + 0.5; total errors: e'(T) = e(T) + N × 0.5 (N: number of leaf nodes)
  For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances): training error = 10/1000 = 1%; generalization error = (10 + 30 × 0.5)/1000 = 2.5%
- Reduced error pruning (REP): uses a validation data set to estimate generalization error
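The pessimistic estimate above is easy to reproduce; a minimal sketch (the function name is my own, not from the slides):

```python
def pessimistic_error(train_errors, n_leaves, n_instances, penalty=0.5):
    # Training errors plus a fixed 0.5 penalty per leaf node,
    # divided by the number of training instances.
    return (train_errors + penalty * n_leaves) / n_instances

# The slides' example: 30 leaves, 10 training errors, 1000 instances.
train_error = 10 / 1000                      # 1% training error
gen_error = pessimistic_error(10, 30, 1000)  # (10 + 30*0.5)/1000 = 2.5%
```

The penalty term makes larger trees look worse, which is exactly how the estimate discourages overly complex models.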

Occam's Razor
Given two models with similar generalization errors, one should prefer the simpler model over the more complex model.
For complex models, there is a greater chance that the model was fitted accidentally by errors in the data.
Therefore, one should include model complexity when evaluating a model.

How to Address Overfitting
Pre-Pruning (Early Stopping Rule)
Stop the algorithm before it becomes a fully-grown tree.
Typical stopping conditions for a node:
- Stop if all instances belong to the same class.
- Stop if all the attribute values are the same.
More restrictive conditions:
- Stop if the number of instances is less than some user-specified threshold.
- Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test).
- Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).

How to Address Overfitting
Post-pruning
Grow the decision tree to its entirety.
Trim the nodes of the decision tree in a bottom-up fashion.
If the generalization error improves after trimming, replace the sub-tree with a leaf node whose class label is determined from the majority class of instances in the sub-tree.

Example of Post-Pruning
Root node: Class = Yes: 20, Class = No: 10
Training error (before splitting) = 10/30
Pessimistic error (before splitting) = (10 + 0.5)/30 = 10.5/30
Split on A into A1, A2, A3, A4:
A1: Class = Yes: 8, Class = No: 4
A2: Class = Yes: 3, Class = No: 4
A3: Class = Yes: 4, Class = No: 1
A4: Class = Yes: 5, Class = No: 1
Training error (after splitting) = 9/30
Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30
11/30 > 10.5/30, so PRUNE the split.
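The prune-or-keep decision in this example can be checked with a short sketch (helper name is hypothetical; counts follow the slide's example: 10 errors at one leaf before, 9 errors across four leaves after):

```python
def pessimistic_error(train_errors, n_leaves, n_instances, penalty=0.5):
    # Pessimistic estimate from the slides: 0.5 penalty per leaf.
    return (train_errors + penalty * n_leaves) / n_instances

# Before splitting: a single leaf, 10 training errors out of 30.
before = pessimistic_error(10, 1, 30)   # 10.5/30
# After splitting on A: four leaves (A1..A4), 9 training errors.
after = pessimistic_error(9, 4, 30)     # 11/30
prune = after >= before                 # True: the split looks worse, so prune
```

Note that the split lowers the training error (9 < 10) yet raises the pessimistic estimate, which is what triggers pruning.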

Examples of Post-Pruning
Case 1: sub-trees with class counts C0: 11, C1: 3 and C0: 2, C1: 4
Case 2: sub-trees with class counts C0: 14, C1: 3 and C0: 2, C1: 2
Optimistic error? Don't prune in either case.
Pessimistic error? Don't prune case 1, prune case 2.
Reduced error pruning? Depends on the validation set.

Model Evaluation
Metrics for Performance Evaluation: how to evaluate the performance of a model?
Methods for Performance Evaluation: how to obtain reliable estimates?
Methods for Model Comparison: how to compare the relative performance among competing models?

Metrics for Performance Evaluation
Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc.
Confusion matrix:

                     PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   a          b
CLASS    Class=No    c          d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)

Metrics for Performance Evaluation

                     PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   a (TP)     b (FN)
CLASS    Class=No    c (FP)     d (TN)

Most widely-used metric:
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Limitation of Accuracy
Consider a 2-class problem:
Number of Class 0 examples = 9990
Number of Class 1 examples = 10
If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%.
Accuracy is misleading because the model does not detect any class 1 example.
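A minimal sketch of this failure mode, treating class 1 as the positive class (the function name is my own):

```python
def accuracy(tp, fn, fp, tn):
    # Fraction of correctly classified records.
    return (tp + tn) / (tp + fn + fp + tn)

# 9990 class-0 and 10 class-1 examples; a model that always
# predicts class 0 misses every class-1 example...
acc = accuracy(tp=0, fn=10, fp=0, tn=9990)
# ...yet still reaches 99.9% accuracy.
```

This is why the following slides introduce cost-sensitive measures, precision, and recall for imbalanced problems.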

Cost Matrix

                     PREDICTED CLASS
C(i|j)               Class=Yes    Class=No
ACTUAL   Class=Yes   C(Yes|Yes)   C(No|Yes)
CLASS    Class=No    C(Yes|No)    C(No|No)

C(i|j): cost of misclassifying a class j example as class i.

Computing Cost of Classification
Cost matrix:
C(i|j)    +     -
   +     -1   100
   -      1     0

Model M1:            Model M2:
        +     -             +     -
   +  150    40        +  250    45
   -   60   250        -    5   200

Accuracy = 80%       Accuracy = 90%
Cost = 3910          Cost = 4255
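The two cost totals can be verified mechanically; a sketch (the nested-dictionary layout is my own choice, not from the slides):

```python
def total_cost(confusion, cost):
    # Sum count(actual, predicted) * cost(actual, predicted)
    # over all four cells of the confusion matrix.
    return sum(confusion[a][p] * cost[a][p]
               for a in confusion for p in confusion[a])

cost = {'+': {'+': -1, '-': 100}, '-': {'+': 1, '-': 0}}
m1   = {'+': {'+': 150, '-': 40}, '-': {'+': 60, '-': 250}}
m2   = {'+': {'+': 250, '-': 45}, '-': {'+': 5,  '-': 200}}

total_cost(m1, cost)  # 3910: M1 is cheaper despite its lower accuracy
total_cost(m2, cost)  # 4255
```

Because false negatives cost 100 here, M2's extra five false negatives outweigh its accuracy advantage.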

Cost vs Accuracy
Count:
                     PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   a          b
CLASS    Class=No    c          d
N = a + b + c + d

Cost:
                     PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   p          q
CLASS    Class=No    q          p

Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p
Accuracy = (a + d)/N
Cost = p(a + d) + q(b + c)
     = p(a + d) + q(N − a − d)
     = qN − (q − p)(a + d)
     = N [q − (q − p) × Accuracy]

Cost-Sensitive Measures
Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
Precision is biased towards C(Yes|Yes) and C(Yes|No).
Recall is biased towards C(Yes|Yes) and C(No|Yes).
F-measure is biased towards all except C(No|No).
Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
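These measures translate directly into code; a sketch with illustrative counts (tp = 150, fp = 60, fn = 40 are example numbers, not a claim from the slides):

```python
def precision(tp, fp):
    # Fraction of predicted positives that are truly positive.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of actual positives that are predicted positive.
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    # Harmonic mean of precision and recall: 2rp/(r+p) = 2a/(2a+b+c).
    return 2 * tp / (2 * tp + fp + fn)

p = precision(150, 60)      # 150/210 ≈ 0.714
r = recall(150, 40)         # 150/190 ≈ 0.789
f = f_measure(150, 60, 40)  # 300/400 = 0.75
```

The two F-measure formulas in the slide are algebraically identical, which the sketch can confirm numerically.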

Interpretation of Precision and Recall

                     PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   a (TP)     b (FN)
CLASS    Class=No    c (FP)     d (TN)

Precision determines the fraction of records that actually turn out to be positive in the group the classifier has declared a positive class. The higher the precision, the lower the number of false positive errors committed by the classifier.
Recall measures the fraction of positive examples correctly predicted by the classifier. Classifiers with large recall have very few positive examples misclassified as the negative class.

Model Evaluation
Metrics for Performance Evaluation: how to evaluate the performance of a model?
Methods for Performance Evaluation: how to obtain reliable estimates?
Methods for Model Comparison: how to compare the relative performance among competing models?

Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
Performance of a model may depend on other factors besides the learning algorithm:
- Class distribution
- Cost of misclassification
- Size of training and test sets

Learning Curve
The learning curve shows how accuracy changes with varying sample size.
It requires a sampling schedule for creating the learning curve:
- Arithmetic sampling (Langley et al.)
- Geometric sampling (Provost et al.)
Effects of small sample size:
- Bias in the estimate
- Variance of the estimate

Methods of Estimation
Holdout: reserve 2/3 for training and 1/3 for testing.
Random subsampling: repeated holdout.
Cross validation: partition the data into k disjoint subsets.
- k-fold: train on k−1 partitions, test on the remaining one.
- Leave-one-out: k = n.
Stratified sampling: oversampling vs undersampling.
Bootstrap: sampling with replacement.

Model Evaluation
Metrics for Performance Evaluation: how to evaluate the performance of a model?
Methods for Performance Evaluation: how to obtain reliable estimates?
Methods for Model Comparison: how to compare the relative performance among competing models?
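Among the estimation methods above, k-fold cross validation is the one most easily shown in code; a minimal sketch (fold assignment by stride is my own choice, and real implementations usually shuffle first):

```python
def k_fold_splits(n, k):
    """Partition record indices 0..n-1 into k disjoint folds and
    yield (train, test) index lists: each fold is the test set once,
    and the remaining k-1 folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test
```

With n = 10 and k = 5, every record appears in exactly one test fold, so each record is used for both training and testing across the k rounds.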

ROC (Receiver Operating Characteristic)
Developed in the 1950s for signal detection theory to analyze noisy signals.
Characterizes the trade-off between positive hits and false alarms.
The ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis).
The performance of each classifier is represented as a point on the ROC curve; changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point.

ROC Curve
- 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive
At threshold t: TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88

ROC Curve
(TP rate, FP rate):
(0,0): declare everything to be the negative class
(1,1): declare everything to be the positive class
(1,0): ideal
Diagonal line: random guessing
Below the diagonal line: the prediction is the opposite of the true class

Using ROC for Model Comparison
No model consistently outperforms the other:
M1 is better for small FPR; M2 is better for large FPR.
Area Under the ROC Curve (AUC):
Ideal: area = 1
Random guess: area = 0.5

How to Construct an ROC Curve

Instance  P(+|A)  True Class
1         0.95    +
2         0.93    +
3         0.87    -
4         0.85    -
5         0.85    -
6         0.85    +
7         0.76    -
8         0.53    +
9         0.43    -
10        0.25    +

Use a classifier that produces a posterior probability P(+|A) for each test instance.
Sort the instances according to P(+|A) in decreasing order.
Apply a threshold at each unique value of P(+|A).
Count the number of TP, FP, TN, FN at each threshold.
TP rate, TPR = TP/(TP + FN)
FP rate, FPR = FP/(FP + TN)

How to Construct an ROC Curve

Threshold >=  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
Class         +     -     +     -     -     -     +     -     +     +
TP            5     4     4     3     3     3     3     2     2     1     0
FP            5     5     4     4     3     2     1     1     0     0     0
TN            0     0     1     1     2     3     4     4     5     5     5
FN            0     1     1     2     2     2     2     3     3     4     5
TPR           1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR           1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

Plotting (FPR, TPR) at each threshold gives the ROC curve.
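The threshold sweep above can be sketched in a few lines (one simplification: ties at the same score are grouped into a single ROC point here, whereas the table breaks them into separate columns):

```python
def roc_points(scored):
    """scored: list of (posterior P(+|A), true label) pairs.
    Sweep a threshold over the unique scores; classify as '+'
    when score >= threshold, then count TP and FP."""
    pos = sum(1 for _, y in scored if y == '+')
    neg = len(scored) - pos
    points = []
    for t in sorted({s for s, _ in scored} | {1.0}, reverse=True):
        tp = sum(1 for s, y in scored if s >= t and y == '+')
        fp = sum(1 for s, y in scored if s >= t and y == '-')
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    return points

# The ten instances from the slide's table.
data = [(0.95, '+'), (0.93, '+'), (0.87, '-'), (0.85, '-'),
        (0.85, '-'), (0.85, '+'), (0.76, '-'), (0.53, '+'),
        (0.43, '-'), (0.25, '+')]
pts = roc_points(data)  # runs from (0, 0) up to (1, 1)
```

The extreme thresholds reproduce the corner points of the previous slide: threshold 1.00 yields (0, 0) and threshold 0.25 yields (1, 1).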

Test of Significance
Given two models:
Model M1: accuracy = 85%, tested on 30 instances
Model M2: accuracy = 75%, tested on 5000 instances
Can we say M1 is better than M2?
How much confidence can we place on the accuracy of M1 and M2?
Can the difference in performance be explained as a result of random fluctuations in the test set?
Test the statistical significance of the observed deviation; estimate the confidence interval of a given model's accuracy.

Confidence Interval for Accuracy
Prediction can be regarded as a Bernoulli trial.
A Bernoulli trial has 2 possible outcomes; the possible outcomes for a prediction are correct or wrong.
A collection of Bernoulli trials has a binomial distribution: x ~ Bin(N, p), where x is the number of correct predictions.
E.g.: toss a fair coin 50 times; how many heads would turn up? Expected number of heads = N × p = 50 × 0.5 = 25.
Given x (the number of correct predictions) or, equivalently, acc = x/N, and N (the number of test instances), can we predict p (the true accuracy of the model)?

Confidence Interval for Accuracy
For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1 − p)/N:
P( −Z_{α/2} < (acc − p) / sqrt(p(1 − p)/N) < Z_{1−α/2} ) = 1 − α
Confidence interval for p:
p = ( 2·N·acc + Z_{α/2}² ± Z_{α/2} · sqrt( Z_{α/2}² + 4·N·acc − 4·N·acc² ) ) / ( 2·(N + Z_{α/2}²) )

Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
N = 100, acc = 0.8
Let 1 − α = 0.95 (95% confidence); from the probability table, Z_{α/2} = 1.96.

1−α    Z
0.99   2.58
0.98   2.33
0.95   1.96
0.90   1.65

N         50     100    500    1000   5000
p(lower)  0.670  0.711  0.763  0.774  0.789
p(upper)  0.888  0.866  0.833  0.824  0.811
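The interval formula above can be checked numerically; a sketch (the function name is my own):

```python
import math

def acc_confidence_interval(acc, n, z):
    """Bounds for the true accuracy p given observed accuracy acc
    on n test instances, using the slide's normal approximation."""
    center = 2 * n * acc + z * z
    spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

lo, hi = acc_confidence_interval(0.8, 100, 1.96)
# lo ≈ 0.711, hi ≈ 0.867: the N = 100 row of the table, up to rounding
```

As the table shows, the interval tightens as N grows: the same 80% accuracy is far more trustworthy on 5000 instances than on 50.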

Comparing Performance of Two Models
Given two models, say M1 and M2, which is better?
M1 is tested on D1 (size = n1), with error rate e1.
M2 is tested on D2 (size = n2), with error rate e2.
Assume D1 and D2 are independent.
If n1 and n2 are sufficiently large, then approximately:
e1 ~ N(μ1, σ1²), e2 ~ N(μ2, σ2²)
with estimated variance σ̂i² = e_i(1 − e_i)/n_i.

To test whether the performance difference is statistically significant, let d = e1 − e2.
d ~ N(d_t, σ_t²), where d_t is the true difference.
Since D1 and D2 are independent, their variances add up:
σ_t² = σ1² + σ2² ≈ σ̂1² + σ̂2² = e1(1 − e1)/n1 + e2(1 − e2)/n2
At the (1 − α) confidence level:
d_t = d ± Z_{α/2} · σ̂_t

An Illustrative Example
Given:
M1: n1 = 30, e1 = 0.15
M2: n2 = 5000, e2 = 0.25
d = |e2 − e1| = 0.10 (2-sided test)
σ̂_d² = 0.15(1 − 0.15)/30 + 0.25(1 − 0.25)/5000 = 0.0043
At the 95% confidence level, Z_{α/2} = 1.96:
d_t = 0.100 ± 1.96 × sqrt(0.0043) = 0.100 ± 0.128
The interval contains 0, so the difference may not be statistically significant.

Comparing Performance of Two Algorithms
Each learning algorithm may produce k models:
L1 may produce M11, M12, ..., M1k
L2 may produce M21, M22, ..., M2k
If the models are generated on the same test sets D1, D2, ..., Dk (e.g., via cross-validation):
For each set, compute d_j = e_{1j} − e_{2j}.
d_j has mean d_t and variance σ_t².
Estimate:
σ̂_t² = Σ_{j=1}^{k} (d_j − d̄)² / (k(k − 1))
d_t = d̄ ± t_{1−α, k−1} · σ̂_t
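The two-model comparison above reduces to a few lines of arithmetic; a sketch (the function name is my own):

```python
import math

def diff_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true error-rate difference of
    two models tested on independent sets (normal approximation,
    95% level by default)."""
    d = abs(e2 - e1)
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2  # variances add up
    half = z * math.sqrt(var)
    return d - half, d + half

lo, hi = diff_interval(0.15, 30, 0.25, 5000)
# ≈ (-0.028, 0.228): the interval contains 0, so the observed
# 10-point difference may not be statistically significant at 95%
```

Note that almost all of the variance comes from M1's tiny test set of 30 instances; with n1 = 3000 the same difference would be significant.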