Lecture Notes for Chapter 4 Part III. Introduction to Data Mining


Transcription:

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4, Part III. Introduction to Data Mining by Tan, Steinbach, Kumar. Adapted by Qiang Yang (2010). (Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004)

Practical Issues of Classification: Underfitting and Overfitting; Missing Values; Costs of Classification.

Underfitting and Overfitting (Example). 500 circular and 500 triangular data points. Circular points: 0.5 <= sqrt(x1^2 + x2^2) <= 1. Triangular points: sqrt(x1^2 + x2^2) > 1 or sqrt(x1^2 + x2^2) < 0.5.

Underfitting and Overfitting. Underfitting: when the model is too simple, both training and test errors are large. Overfitting: as the model grows more complex, training error keeps decreasing while test error starts to increase.

Overfitting due to Noise. The decision boundary is distorted by noise points.

Notes on Overfitting. Overfitting results in decision trees that are more complex than necessary. Training error no longer provides a good estimate of how well the tree will perform on previously unseen records. We need new ways of estimating errors.

Estimating Generalization Errors. Re-substitution errors: error on the training data (sum of e(t)). Generalization errors: error on testing (sum of e'(t)). Methods for estimating generalization errors: Optimistic approach: e'(t) = e(t). Pessimistic approach: for each leaf node, e'(t) = e(t) + 0.5; total error count: e'(T) = e(T) + N x 0.5 (N: number of leaf nodes). For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances): training error = 10/1000 = 1%; generalization error = (10 + 30 x 0.5)/1000 = 2.5%. Reduced error pruning (REP): uses a validation data set to estimate the generalization error.
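The pessimistic estimate above can be sketched in a few lines of Python (not part of the original slides; the function name is illustrative). It reproduces the slide's 30-leaf example:

```python
# Pessimistic generalization-error estimate for a decision tree:
# e'(T) = (training errors + 0.5 * number of leaves) / number of instances

def pessimistic_error(train_errors, n_leaves, n_instances, penalty=0.5):
    """Return (training error rate, pessimistic generalization estimate)."""
    train_rate = train_errors / n_instances
    gen_rate = (train_errors + penalty * n_leaves) / n_instances
    return train_rate, gen_rate

train, gen = pessimistic_error(train_errors=10, n_leaves=30, n_instances=1000)
print(train)  # 0.01  -> 1% training error
print(gen)    # 0.025 -> 2.5% pessimistic generalization estimate
```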

Occam's Razor. Given two models with similar generalization errors, one should prefer the simpler model over the more complex one: for complex models, there is a greater chance that the model was fitted accidentally by errors in the data. Therefore, one should include model complexity when evaluating a model.

Minimum Description Length (MDL). [Figure: a labeled table (X, y) is compressed by a decision tree with internal nodes A?, B?, C?, which is then used to transmit the labels for an otherwise unlabeled table.] Cost(Model, Data) = Cost(Data | Model) + Cost(Model). Cost is the number of bits needed for encoding; we should search for the least costly model. Cost(Data | Model) encodes the errors on the training data. Cost(Model) estimates model complexity, i.e., future error.

How to Address Overfitting in Decision Trees. Pre-pruning (early stopping rule): stop the algorithm before it grows a fully-grown tree. Typical stopping conditions for a node: stop if all instances belong to the same class; stop if all the attribute values are the same. More restrictive conditions: stop if the number of instances is less than some user-specified threshold; stop if the class distribution of the instances is independent of the available features (e.g., using a chi-squared test); stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).

How to Address Overfitting: Post-pruning. Grow the decision tree to its entirety, then trim the nodes of the tree in a bottom-up fashion. If the generalization error improves after trimming, replace the sub-tree by a leaf node. Heuristic: the class label of the leaf node is determined from the majority class of the instances in the sub-tree. Generalization error count = error count + 0.5 x N, where N is the number of leaf nodes. This is a heuristic used in some algorithms; there are other approaches based on statistics.

Post-Pruning Based on Leaves. Before splitting on A (one leaf): Class = Yes: 20, Class = No: 10. Training error (before splitting) = 10/30. Pessimistic error (before splitting) = (10 + 1 x 0.5)/30 = 10.5/30. After splitting on A (four children):

  A1: Class = Yes: 8, Class = No: 4
  A2: Class = Yes: 3, Class = No: 4
  A3: Class = Yes: 4, Class = No: 1
  A4: Class = Yes: 5, Class = No: 1

Training error (after splitting) = 9/30. Pessimistic error (after splitting) = (9 + 4 x 0.5)/30 = 11/30. Post-pruning decision: PRUNE!
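The prune/don't-prune test can be sketched directly from the slide's numbers (an illustrative Python fragment, not from the slides):

```python
# Pessimistic post-pruning test: compare the pessimistic error of a single
# leaf against the total over the would-be children; prune if the single
# leaf's estimate is no worse.

def pessimistic_errors(class_counts, penalty=0.5):
    """class_counts: list of (yes, no) pairs, one per leaf.
    Returns total misclassifications plus a 0.5 penalty per leaf."""
    errors = sum(min(yes, no) for yes, no in class_counts)
    return errors + penalty * len(class_counts)

before = pessimistic_errors([(20, 10)])                       # one leaf
after = pessimistic_errors([(8, 4), (3, 4), (4, 1), (5, 1)])  # four leaves
print(before, after)          # 10.5 11.0
print("PRUNE!" if before <= after else "don't prune")
```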

Examples of Post-pruning. Case 1: sub-tree with leaves (C0: 11, C1: 3) and (C0: 2, C1: 4). Case 2: sub-tree with leaves (C0: 14, C1: 3) and (C0: 2, C1: 2). Using the optimistic error: don't prune in either case. Using the pessimistic error: don't prune case 1, prune case 2.

Data Fragmentation. The number of instances gets smaller as you traverse down the tree, and the number of instances at the leaf nodes could be too small to make any statistically significant decision. Solution: require the number of instances per leaf node to be >= a user-given value n.

Decision Trees: Feature Construction. [Figure: two classes, Class = + and Class = -, separated by the oblique boundary x + y = 1; the test condition x + y < 1 distinguishes them.] A test condition may involve multiple attributes, but such tests are hard to automate; finding better node-test features is a difficult research issue.

Model Evaluation. Metrics for Performance Evaluation: how to evaluate the performance of a model? Methods for Performance Evaluation: how to obtain reliable estimates? Methods for Model Comparison: how to compare the relative performance among competing models?

Metrics for Performance Evaluation. Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc. Confusion matrix (counts or percentages):

                      PREDICTED CLASS
                      Class=Yes   Class=No
  ACTUAL  Class=Yes   a           b
  CLASS   Class=No    c           d

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative).

Metrics for Performance Evaluation. Most widely-used metric: Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN).

Limitation of Accuracy. Consider a 2-class problem: number of Class 0 examples = 9990, number of Class 1 examples = 10. If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%. Accuracy is misleading because the model does not detect any Class 1 example.
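The slide's imbalanced example is easy to check numerically (an illustrative sketch, not from the slides):

```python
# A model that always predicts class 0 on a 9990/10 split reaches 99.9%
# accuracy while detecting no class-1 examples at all.
n0, n1 = 9990, 10
tp, fn = 0, n1          # every class-1 example is missed
tn, fp = n0, 0          # every class-0 example is correct

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_class1 = tp / (tp + fn)
print(accuracy)       # 0.999
print(recall_class1)  # 0.0
```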

Cost Matrix. C(i|j): cost of misclassifying a class j example as class i.

                      PREDICTED CLASS
  C(i|j)              Class=Yes    Class=No
  ACTUAL  Class=Yes   C(Yes|Yes)   C(No|Yes)
  CLASS   Class=No    C(Yes|No)    C(No|No)

Applications: medical diagnosis, customer segmentation.

Computing the Cost of Classification.

  Cost matrix:          PREDICTED CLASS
  C(i|j)                +      -
  ACTUAL       +        -1     100
  CLASS        -        1      0

  Model M1:             PREDICTED CLASS
                        +      -
  ACTUAL       +        150    40
  CLASS        -        60     250
  Accuracy = 80%, Cost = 3910

  Model M2:             PREDICTED CLASS
                        +      -
  ACTUAL       +        250    45
  CLASS        -        5      200
  Accuracy = 90%, Cost = 4255
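The slide's cost figures follow from an elementwise product of the confusion matrix with the cost matrix (an illustrative Python sketch, not from the slides):

```python
# Rows are the actual class (+, -), columns the predicted class (+, -).
cost_matrix = [[-1, 100],
               [ 1,   0]]

def total_cost(confusion, costs):
    return sum(confusion[i][j] * costs[i][j] for i in range(2) for j in range(2))

def accuracy(confusion):
    total = sum(sum(row) for row in confusion)
    return (confusion[0][0] + confusion[1][1]) / total

m1 = [[150, 40], [60, 250]]
m2 = [[250, 45], [5, 200]]
print(accuracy(m1), total_cost(m1, cost_matrix))  # 0.8 3910
print(accuracy(m2), total_cost(m2, cost_matrix))  # 0.9 4255
```

M2 has higher accuracy yet higher cost, which is the point of the slide: the cost of a false negative (100) dwarfs everything else.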

Information Retrieval Measures. Using the confusion matrix above (a, b, c, d): Precision: p = a / (a + c). Recall: r = a / (a + b). F-measure: F = 2rp / (r + p) = 2a / (2a + b + c). Let C be cost (it can be a count in our example). Precision is biased towards C(Yes|Yes) and C(Yes|No); recall is biased towards C(Yes|Yes) and C(No|Yes); the F-measure is biased towards all cells except C(No|No).
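The three measures can be computed directly from the cells a, b, c (a short sketch, not from the slides; the counts are borrowed from model M1 above for illustration):

```python
# Precision, recall, and F-measure from confusion-matrix cells
# (a = TP, b = FN, c = FP), as defined on the slide.
def prf(a, b, c):
    p = a / (a + c)              # precision
    r = a / (a + b)              # recall
    f = 2 * r * p / (r + p)      # harmonic mean; equals 2a / (2a + b + c)
    return p, r, f

p, r, f = prf(a=150, b=40, c=60)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.714 0.789 0.75
```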

Model Evaluation. Metrics for Performance Evaluation: how to evaluate the performance of a model? Methods for Performance Evaluation: how to obtain reliable estimates? Methods for Model Comparison: how to compare the relative performance among competing models?

Methods of Estimation. Holdout: reserve 2/3 of the data for training and 1/3 for testing. Cross-validation: partition the data into k disjoint subsets; k-fold: train on k-1 partitions, test on the remaining one; leave-one-out: k = n.
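A minimal k-fold split can be sketched without any libraries (illustrative code, not from the slides; real implementations usually shuffle first):

```python
# Partition indices 0..n-1 into k disjoint folds; each fold serves once as
# the test set while the remaining k-1 folds form the training set.
def kfold_indices(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]  # k disjoint subsets
    for i, test in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

for train, test in kfold_indices(n=6, k=3):
    print(test, train)
# Each index appears in exactly one test fold; with k = n this is leave-one-out.
```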

Test of Significance (Sections 4.5 and 4.6 of the TSK book). Given two models: Model M1: accuracy = 85%, tested on 30 instances; Model M2: accuracy = 75%, tested on 5000 instances. Can we say M1 is better than M2? How much confidence can we place in the accuracy of M1 and M2? Can the difference in performance be explained as a result of random fluctuations in the test set?

Confidence Interval for Accuracy. Each prediction can be regarded as a Bernoulli trial: a Bernoulli trial has 2 possible outcomes, and the possible outcomes for a prediction are correct or wrong. A collection of Bernoulli trials has a Binomial distribution: x ~ Bin(N, p), where x is the number of correct predictions. E.g., toss a fair coin 50 times; how many heads would turn up? Expected number of heads = N x p = 50 x 0.5 = 25. Given x (the number of correct predictions) or, equivalently, acc = x/N, and N = the number of test instances, can we predict p (the true accuracy of the model)?

Confidence Interval for Accuracy. For large N, acc has a normal distribution with mean p and variance p(1-p)/N. Let 1 - alpha be the confidence level:

  P( -Z_{alpha/2} < (acc - p) / sqrt(p(1-p)/N) < Z_{1-alpha/2} ) = 1 - alpha

[Figure: the standard normal curve with Area = 1 - alpha between -Z_{alpha/2} and Z_{1-alpha/2}.] Solving for p gives the confidence interval:

  p = ( 2N x acc + Z^2 +/- Z x sqrt( Z^2 + 4N x acc - 4N x acc^2 ) ) / ( 2(N + Z^2) ),  where Z = Z_{alpha/2}.

Confidence Interval for Accuracy. Consider a model that produces an accuracy of 80% when evaluated on 100 test instances: N = 100, acc = 0.8. Let 1 - alpha = 0.95 (95% confidence). From the probability table, Z_{alpha/2} = 1.96.

  1-alpha:  0.99   0.98   0.95   0.90
  Z:        2.58   2.33   1.96   1.65

  N:         50     100    500    1000   5000
  p(lower):  0.670  0.711  0.763  0.774  0.789
  p(upper):  0.888  0.866  0.833  0.824  0.811
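The interval formula from the previous slide can be evaluated directly; this sketch (not from the slides) reproduces the table above up to rounding:

```python
import math

# Confidence interval for the true accuracy p, given observed accuracy acc
# on N test instances and critical value z (z = 1.96 for 95% confidence).
def accuracy_ci(acc, n, z=1.96):
    center = 2 * n * acc + z * z
    spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

for n in (50, 100, 500, 1000, 5000):
    lo, hi = accuracy_ci(acc=0.8, n=n)
    print(n, round(lo, 3), round(hi, 3))
# e.g. N=100 gives roughly (0.711, 0.867); the interval tightens around
# acc = 0.8 as N grows, matching the slide's table.
```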

ROC (Receiver Operating Characteristic). Page 298 of the TSK book. Many applications care about ranking (producing a queue from the most likely to the least likely). Which ranking order is better? ROC was developed in the 1950s for signal detection theory, to analyze noisy signals; it characterizes the trade-off between positive hits and false alarms. The ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis). The performance of each classifier is represented as a point on the ROC curve; changing the algorithm's threshold, the sample distribution, or the cost matrix changes the location of the point.

How to Construct an ROC Curve. Use a classifier that produces a posterior probability P(+|A) for each test instance A. Sort the instances according to P(+|A) in decreasing order, apply a threshold at each unique value of P(+|A), and count the number of TP, FP, TN, FN at each threshold. TP rate: TPR = TP/(TP+FN). FP rate: FPR = FP/(FP+TN).

  Instance  P(+|A)  True Class
  1         0.95    +
  2         0.93    +
  3         0.87    -
  4         0.85    -
  5         0.85    -
  6         0.85    +
  7         0.76    -
  8         0.53    +
  9         0.43    -
  10        0.25    +

P(+|A) is predicted by the classifier; the true class is the ground truth.

How to Construct an ROC Curve (continued).

  Class:        +     -     +     -     -     -     +     -     +     +
  Threshold >=  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
  TP:           5     4     4     3     3     3     3     2     2     1     0
  FP:           5     5     4     4     3     2     1     1     0     0     0
  TN:           0     0     1     1     2     3     4     4     5     5     5
  FN:           0     1     1     2     2     2     2     3     3     4     5
  TPR:          1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
  FPR:          1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

[Figure: the resulting ROC curve.]
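The threshold sweep above can be reproduced in a few lines (an illustrative sketch, not from the slides; it uses one threshold per unique score, whereas the table steps through the tied 0.85 scores one at a time):

```python
# Construct ROC points from the slide's scores and true labels by sweeping
# a threshold over each unique score and computing (FPR, TPR).
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']
pos = labels.count('+')           # 5 positives
neg = labels.count('-')           # 5 negatives

points = []
for t in sorted(set(scores)) + [1.00]:
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '+')
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '-')
    points.append((fp / neg, tp / pos))   # (FPR, TPR)

print(points)
# Starts at (1.0, 1.0) (everything predicted positive) and ends at
# (0.0, 0.0) (everything predicted negative), matching the table.
```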

Using ROC for Model Comparison. Neither model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR. Area Under the ROC Curve (AUC): ideal classifier: area = 1; random guess: area = 0.5.

ROC Curve. Points are (TPR, FPR): (0,0): declare everything to be the negative class; (1,1): declare everything to be the positive class; (1,0): ideal. Diagonal line: random guessing. Below the diagonal line: the prediction is the opposite of the true class.