Part I. Classification & Decision Trees. Week 4, October 19, 2012. Based in part on slides from the textbook and slides of Susan Holmes.


Classification

Problem description. We are given a data matrix X with either continuous or discrete variables, such that each row $X_i \in \mathcal{F}$, and a set of labels Y with each $Y_i \in \mathcal{L}$. For a k-class problem, $\#\mathcal{L} = k$ and we can think of $\mathcal{L} = \{1, \ldots, k\}$. Our goal is to find a classifier $f : \mathcal{F} \to \mathcal{L}$ that allows us to predict the label of a new observation given a new set of features.

A supervised problem. Classification is a supervised problem. Usually, we use a subset of the data, the training set, to learn or estimate the classifier, yielding $\hat{f} = \hat{f}_{\mathrm{training}}$. The performance of $\hat{f}$ is measured by applying it to each case in the test set and computing the total loss
$$\sum_{j \in \mathrm{test}} L\bigl(\hat{f}_{\mathrm{training}}(X_j),\, Y_j\bigr).$$
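To make this workflow concrete, here is a minimal R sketch, not code from the slides: it assumes the rpart package, uses the built-in iris data (which the slides return to later), and picks an arbitrary 2/3 train, 1/3 test split.

```r
## Minimal sketch of the supervised workflow: learn f-hat on a training
## set, then measure its error on a held-out test set.
library(rpart)

set.seed(1)
n <- nrow(iris)
train <- sample(seq_len(n), size = round(2 * n / 3))   # 2/3 train, 1/3 test

## Learn the classifier f-hat from the training cases only
fit <- rpart(Species ~ ., data = iris[train, ], method = "class")

## Apply f-hat to the test cases and compute the misclassification rate (0-1 loss)
pred <- predict(fit, newdata = iris[-train, ], type = "class")
mean(pred != iris$Species[-train])
```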

Examples of classification tasks
Predicting whether a tumor is benign or malignant.
Classifying credit card transactions as fraudulent or legitimate.
Predicting the type of a given tumor among several types.
Categorizing a document or news story as one of {finance, weather, sports, etc.}.

Common techniques
Decision tree based methods
Rule-based methods
Discriminant analysis
Memory-based reasoning
Neural networks
Naïve Bayes
Support vector machines

Classification trees

Applying a decision tree rule
[Figure slides: a test record is walked down the tree, applying one attribute test at a time, until it reaches a leaf and is assigned that leaf's class label.]

Decision boundary for tree
[Figures: decision tree for iris data using all features; decision tree for iris data using petal.length, petal.width; the resulting regions in the petal.length, petal.width plane, with petal length (0 to 8) on the horizontal axis and petal width (0 to 3) on the vertical axis.]
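As a hedged illustration (not the exact code behind the slide figures), an rpart tree fit on just the two petal features can be used to display the axis-parallel decision regions in that plane.

```r
## Sketch: tree on the two petal features and its rectangular decision regions
library(rpart)

fit2 <- rpart(Species ~ Petal.Length + Petal.Width, data = iris,
              method = "class")

## Predict on a grid to visualise the axis-parallel regions
grid <- expand.grid(Petal.Length = seq(0, 8, by = 0.05),
                    Petal.Width  = seq(0, 3, by = 0.05))
grid$class <- predict(fit2, newdata = grid, type = "class")
plot(grid$Petal.Length, grid$Petal.Width, col = as.integer(grid$class),
     pch = ".", xlab = "Petal length", ylab = "Petal width")
points(iris$Petal.Length, iris$Petal.Width, col = as.integer(iris$Species))
```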

Decision boundary for tree
Figure: trees have trouble capturing structure that is not parallel to the axes.

Hunt's algorithm (generic structure)
Let $D_t$ be the set of training records that reach a node $t$.
If $D_t$ contains only records that belong to the same class $y_t$, then $t$ is a leaf node labeled as $y_t$.
If $D_t = \emptyset$, then $t$ is a leaf node labeled by the default class $y_d$.
If $D_t$ contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
This splitting procedure is what can vary between different tree-learning algorithms...
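The following is a toy R sketch of this generic structure, not the tree or rpart implementation. The helper best_split is a hypothetical caller-supplied function that names the attribute to test (or returns NULL to stop), and only categorical multi-way splits are handled.

```r
## Toy sketch of Hunt's algorithm for categorical features.
hunt <- function(data, label, default, best_split) {
  if (nrow(data) == 0)                       # D_t empty: leaf with default class
    return(list(leaf = TRUE, class = default))
  labels <- data[[label]]
  if (length(unique(labels)) == 1)           # all records in one class y_t: leaf
    return(list(leaf = TRUE, class = labels[1]))
  feat <- best_split(data)                   # attribute test for this node
  if (is.null(feat))                         # no useful split: majority-class leaf
    return(list(leaf = TRUE, class = names(which.max(table(labels)))))
  children <- lapply(split(data, data[[feat]]),   # one child per attribute value
                     hunt, label = label,
                     default = names(which.max(table(labels))),
                     best_split = best_split)
  list(leaf = FALSE, feature = feat, children = children)
}
```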

Different splits: ordinal / nominal
Figure: binary or multi-way?

Issues
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
What is the best split: what criterion do we use? The previous example chose first to split on Refund...
How to split the records: binary or multi-way? The previous example split Taxable Income at 80K...
When do we stop: should we continue until each node is pure, if possible? The previous example stopped with all leaf nodes completely homogeneous...

Different splits: continuous
Figure: binary or multi-way?

Choosing a variable to split on
Figure: which variable should we start the splitting on?

Choosing a variable to split on

Choosing the best split
We need some numerical criterion to choose among possible splits, and the criterion should favor homogeneous or pure nodes. Common cost functions: Gini index; entropy / deviance / information; misclassification error.

GINI index
Suppose we have k classes and node t has frequencies $p_t = (p_{1,t}, \ldots, p_{k,t})$.
Criterion:
$$GINI(t) = \sum_{(j,j') \in \{1,\ldots,k\}:\, j \neq j'} p_{j,t}\, p_{j',t} = 1 - \sum_{j=1}^{k} p_{j,t}^2.$$
Maximized when every $p_{j,t} = 1/k$, with value $1 - 1/k$. Minimized when all records belong to a single class. Minimizing GINI will favour pure nodes...

Gain in GINI index for a potential split
Suppose t is to be split into j new child nodes $(t_l)_{1 \le l \le j}$. Each child node has a count $n_l$ and a vector of frequencies $(p_{1,t_l}, \ldots, p_{k,t_l})$, hence its own GINI index $GINI(t_l)$. The gain in GINI index for this split is
$$\mathrm{Gain}\bigl(GINI,\, t \to (t_l)_{1 \le l \le j}\bigr) = GINI(t) - \frac{\sum_{l=1}^{j} n_l\, GINI(t_l)}{\sum_{l=1}^{j} n_l}.$$
The greedy algorithm chooses the biggest gain in GINI index among a list of possible splits.
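A small R sketch of these two formulas (the function and argument names are mine; a node is represented by its vector of raw class counts):

```r
## Gini index of a node from its class counts, and the gain for a split.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

gini_gain <- function(parent_counts, child_counts_list) {
  n <- sapply(child_counts_list, sum)                       # child sizes n_l
  gini(parent_counts) - sum(n * sapply(child_counts_list, gini)) / sum(n)
}

gini(c(5, 5))    # 0.5 = 1 - 1/k for k = 2 classes: the maximum
gini(c(10, 0))   # 0: a pure node, the minimum
```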

Figure: decision tree for iris data using all features, grown with the GINI criterion.

Entropy / Deviance / Information
Suppose we have k classes and node t has frequencies $p_t = (p_{1,t}, \ldots, p_{k,t})$.
Criterion:
$$H(t) = -\sum_{j=1}^{k} p_{j,t} \log p_{j,t}.$$
Maximized when every $p_{j,t} = 1/k$, with value $\log k$. Minimized (value 0) when all records belong to a single class. Minimizing entropy will favour pure nodes...

Figure: decision tree for iris data using all features, grown with the entropy criterion.

Gain in entropy for a potential split
Suppose t is to be split into j new child nodes $(t_l)_{1 \le l \le j}$. Each child node has a count $n_l$ and a vector of frequencies $(p_{1,t_l}, \ldots, p_{k,t_l})$, hence its own entropy $H(t_l)$. The gain in entropy for this split is
$$\mathrm{Gain}\bigl(H,\, t \to (t_l)_{1 \le l \le j}\bigr) = H(t) - \frac{\sum_{l=1}^{j} n_l\, H(t_l)}{\sum_{l=1}^{j} n_l}.$$
The greedy algorithm chooses the biggest gain in H among a list of possible splits.
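The entropy version, mirroring the Gini sketch above (again an illustration, with 0·log 0 taken as 0):

```r
## Entropy of a node and the information gain of a split.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                 # drop empty classes so 0 * log 0 contributes 0
  -sum(p * log(p))
}

entropy_gain <- function(parent_counts, child_counts_list) {
  n <- sapply(child_counts_list, sum)
  entropy(parent_counts) - sum(n * sapply(child_counts_list, entropy)) / sum(n)
}
```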

Different criteria: GINI, H, MC

Misclassification error
Suppose we have k classes and node t has frequencies $p_t = (p_{1,t}, \ldots, p_{k,t})$. The mode is
$$\hat{k}(t) = \mathrm{argmax}_k\, p_{k,t}.$$
Criterion:
$$\mathrm{MisclassificationError}(t) = 1 - p_{\hat{k}(t),t}.$$
It is not as smooth in $p_t$ as GINI or H, and can be more difficult to optimize numerically.

Choosing the split for a continuous variable

Misclassification error example
Suppose the parent has 10 cases: {7D, 3R}. A candidate split produces two nodes: {3D, 0R} and {4D, 3R}. The gain in misclassification error is 0, but the gain in GINI is $0.42 - 0.343 > 0$. Similarly, entropy will also show an improvement... (checked numerically below).
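Plugging the slide's {7D, 3R} example into the sketches above confirms the claim; misclass is a hypothetical helper for the misclassification error of a node, and gini_gain / entropy_gain are the functions defined earlier.

```r
## The {7D, 3R} parent split into {3D, 0R} and {4D, 3R}.
misclass <- function(counts) 1 - max(counts) / sum(counts)

parent   <- c(D = 7, R = 3)
children <- list(c(D = 3, R = 0), c(D = 4, R = 3))

## Misclassification error is 0.3 before and (3*0 + 7*(3/7))/10 = 0.3 after,
## so the gain is exactly 0 ...
misclass(parent) - sum(sapply(children, sum) * sapply(children, misclass)) / 10

## ... but Gini drops from 0.42 to about 0.343 (gain about 0.077),
## and entropy also shows a strictly positive gain.
gini_gain(parent, children)
entropy_gain(parent, children)
```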

Stopping training
As trees get deeper, or if splits are multi-way, the number of data points per leaf node drops very quickly. Trees that are too deep tend to overfit the data. A common strategy is to prune the tree by removing some internal nodes.
Figure: underfitting corresponds to the left-hand side of the curve, overfitting to the right.

Cost-complexity pruning (tree library)
Given a criterion Q, like H or GINI, we define the cost-complexity of a tree T with terminal nodes $(t_j)_{1 \le j \le m}$ as
$$C_\alpha(T) = \sum_{j=1}^{m} n_j\, Q(t_j) + \alpha m.$$
Given a large tree $T_L$ we might compute $C_\alpha(T)$ for any subtree T of $T_L$. The optimal tree is defined as
$$\hat{T}_\alpha = \mathrm{argmin}_{T \subseteq T_L}\, C_\alpha(T),$$
and it can be found by weakest-link pruning. See The Elements of Statistical Learning for more...

Pre-pruning (rpart library)
These methods stop the algorithm before the tree is fully grown. Examples:
Stop if all instances belong to the same class (kind of obvious).
Stop if the number of instances is less than some user-specified threshold. Both tree and rpart have rules like this.
Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test).
Stop if expanding the current node does not improve the impurity measure (e.g., Gini or information gain). This relates to cp in rpart.
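A hedged sketch of how pruning looks in practice with rpart on iris: grow an overly large tree by setting cp = 0, inspect the cp table, and prune back. Choosing the cp that minimizes the cross-validated error is one common rule of thumb, not the only one.

```r
## Grow a deep tree, then prune it back using the cp table.
library(rpart)

big <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0.0, minsplit = 2))

printcp(big)    # table of cp values vs cross-validated error (xerror)

best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
pruned  <- prune(big, cp = best_cp)
```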

Figure: training and test error as a function of cp.

Metrics for performance evaluation
Confusion matrix:

                          PREDICTED CLASS
                          Class=Yes    Class=No
ACTUAL    Class=Yes       a (TP)       b (FN)
CLASS     Class=No        c (FP)       d (TN)

Measures of performance
The most widely used metric, and the simplest, is accuracy:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \mathrm{SMC}(\mathrm{Actual}, \mathrm{Predicted}) = 1 - \text{Misclassification Rate}.$$

Accuracy isn't everything
Consider an unbalanced 2-class problem with #1's = 10 and #0's = 9990. Simply labelling everything 0 yields 99.9% accuracy, but this classifier misses all of class 1.
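A quick R check of the imbalanced example (the class sizes 10 and 9990 are taken from the slide):

```r
## Predicting everything as class 0 is 99.9% accurate here, yet useless for class 1.
actual    <- factor(c(rep(1, 10), rep(0, 9990)), levels = c(1, 0))
predicted <- factor(rep(0, 10000), levels = c(1, 0))

conf <- table(Predicted = predicted, Actual = actual)
conf                          # TP = 0, FN = 10, FP = 0, TN = 9990
sum(diag(conf)) / sum(conf)   # accuracy = 0.999, but every class-1 case is missed
```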

Cost matrix

                          PREDICTED CLASS
C(i|j)                    Class=Yes       Class=No
ACTUAL    Class=Yes       C(Yes|Yes)      C(No|Yes)
CLASS     Class=No        C(Yes|No)       C(No|No)

C(i|j): the cost of classifying a class j example as class i.

Measures of performance
With a cost matrix, the classification rule changes to
$$\mathrm{Label}(p, C) = \mathrm{argmin}_i \sum_j C(i \mid j)\, p_j.$$
Accuracy is equivalent to cost if $C(Y|Y) = C(N|N) = c_1$ and $C(Y|N) = C(N|Y) = c_2$.

Computing the cost of classification
Cost matrix:

C(i|j)        predicted +    predicted -
actual +          -1             100
actual -           1               0

Model M1 (Accuracy = 80%, Cost = 3910):

              predicted +    predicted -
actual +          150             40
actual -           60            250

Model M2 (Accuracy = 90%, Cost = 4255):

              predicted +    predicted -
actual +          250             45
actual -            5            200

Measures of performance: other common ones
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP} = TNR$$
$$\mathrm{Sensitivity} = \mathrm{Recall} = \frac{TP}{TP + FN} = TPR$$
$$F = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} = \frac{2\, TP}{2\, TP + FN + FP}$$
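The two cost figures can be reproduced by multiplying each confusion matrix element-wise by the cost matrix and summing; this sketch lays both out with actual class in rows and predicted class in columns, as above.

```r
## Cost of classification for models M1 and M2 under the slide's cost matrix.
cost_matrix <- matrix(c(-1, 100, 1, 0), nrow = 2, byrow = TRUE,
                      dimnames = list(actual = c("+", "-"),
                                      predicted = c("+", "-")))
m1 <- matrix(c(150, 40, 60, 250), nrow = 2, byrow = TRUE,
             dimnames = dimnames(cost_matrix))
m2 <- matrix(c(250, 45, 5, 200), nrow = 2, byrow = TRUE,
             dimnames = dimnames(cost_matrix))

sum(cost_matrix * m1)   # 3910 : accuracy 80%, but the lower cost
sum(cost_matrix * m2)   # 4255 : accuracy 90%, but the higher cost
```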

Measures of performance
Precision emphasizes $P(p = Y, a = Y)$ and $P(p = Y, a = N)$. Recall emphasizes $P(p = Y, a = Y)$ and $P(p = N, a = Y)$. Also, $FPR = 1 - TNR$ and $FNR = 1 - TPR$.

Measuring performance
We have done some simple training / test splits to see how well our classifier is doing. More accurately, this procedure measures how well our algorithm for learning the classifier is doing. How well this works may depend on:
Model: are we using the right type of classifier model?
Cost: is our algorithm sensitive to the cost of misclassification?
Data size: do we have enough data to learn a model?

Learning curve
Figure: as the amount of data increases, our estimate of accuracy improves, as does the variability of our estimate...
A learning curve shows how accuracy changes with varying sample size. It requires a sampling schedule, e.g. arithmetic sampling (Langley et al.) or geometric sampling (Provost et al.). The effect of a small sample size is bias in the estimate and greater variance of the estimate.

Estimating performance
Holdout: split into test and training sets (e.g. 1/3 test, 2/3 training).
Random subsampling: repeated replicates of holdout, averaging the results.
Cross-validation: partition the data into K disjoint subsets; for each subset S_i, train on all but S_i, then test on S_i (see the sketch below).
Stratified sampling: it may be helpful to sample so that the Yes/No classes are roughly balanced in the training data.
0.632 bootstrap: combine the training error and the bootstrap error...
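As an illustration of the cross-validation recipe (K = 5, rpart on iris; the fold assignment is random, so results vary with the seed):

```r
## K-fold cross-validation by hand for an rpart classifier.
library(rpart)

set.seed(1)
K <- 5
folds <- sample(rep(1:K, length.out = nrow(iris)))   # random fold labels

errors <- sapply(1:K, function(k) {
  fit  <- rpart(Species ~ ., data = iris[folds != k, ], method = "class")
  pred <- predict(fit, newdata = iris[folds == k, ], type = "class")
  mean(pred != iris$Species[folds == k])             # test error on fold k
})
mean(errors)   # cross-validated estimate of the misclassification rate
```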