Week 4 Based in part on slides from textbook, slides of Susan Holmes Part I Classification & Decision Trees October 19, 2012 1 / 1 2 / 1 Classification Classification Problem description We are given a data matrix X with either continuous or discrete variables such that each row X i F and a set of labels Y L. For a k-class problem, #L = k and we can think of L = {1,..., k}. Our goal is to find a classifier f : F L that allows us to predict the label of a new observation given a new set of features. A supervised problem Classification is a supervised problem. Usually, we use a subset of the data, the training set to learn or estimate the classifier yielding ˆf = ˆf training. The performance of ˆf is measured by applying it to each case in the test set and computing L(ˆf training (X j ), Y j ) j test 3 / 1 4 / 1
Classification Classification Examples of classification tasks Predicting whether a tumor is benign or malignant. Classifying credit card transactions as fraudulent or legitimate. Predicting the type of a given tumor among several types. Cateogrizing a document or news story as one of {finance, weather, sports, etc.} 5 / 1 6 / 1 Classification Classification trees Common techniques Decision Tree based Methods Rule-based Methods Discriminant Analysis Memory based reasoning Neural Networks Naïve Bayes Support Vector Machines 7 / 1 8 / 1
Classification trees Applying a decision tree rule 9 / 1 10 / 1 Applying a decision tree rule Applying a decision tree rule 11 / 1 12 / 1
Applying a decision tree rule Applying a decision tree rule 13 / 1 14 / 1 Applying a decision tree rule Applying a decision tree rule 15 / 1 16 / 1
Decision boundary for tree Decision tree for iris data using all features Decision tree for iris data using petal.length, petal.width 17 / 1 Regions in petal.length, petal.width plane 18 / 1 3.0 2.5 2.0 Petal width 1.5 1.0 0.5 0.0 0.5 0 1 2 3 4 5 6 7 8 Petal length 19 / 1 20 / 1
Decision boundary for tree Figure : Trees have trouble capturing structure not parallel to axes 21 / 1 22 / 1 Hunt s algorithm (generic structure) Let D t be the set of training records that reach a node t If D t contains records that belong the same class y t, then t is a leaf node labeled as y t. If D t =, then t is a leaf node labeled by the default class, y d. If D t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset. This splitting procedure is what can vary for different tree learning algorithms... 23 / 1 24 / 1
Different splits: ordinal / nominal Issues Greedy strategy: Split the records based on an attribute test that optimizes certain criterion. What is the best split: What criterion do we use? Previous example chose first to split on Refund... How to split the records: Binary or multi-way? Previous example split Taxable Income at 80K... When do we stop? Should we continue until each node if possible? Previous example stopped with all nodes being completely homogeneous... Figure : Binary or multi-way? 25 / 1 26 / 1 Different splits: continuous Choosing a variable to split on Figure : Binary or multi-way? Figure : Which should we start the splitting on? 27 / 1 28 / 1
Choosing a variable to split on Choosing the best split Need some numerical criterion to choose among possible splits. Criterion should favor homogeneous or pure nodes. Common cost functions: Gini Index Entropy / Deviance / Information Misclassification Error 29 / 1 30 / 1 GINI Index Suppose we have k classes and node t has frequencies p t = (p 1,t,..., p k,t ). Criterion GINI (t) = (j,j ) {1,...,k}:j j p j,t p j,t = 1 Maximized when p j,t = 1/k with value 1 1/k l j=1 Minimized when all records belong to a single class. Minimizing GINI will favour pure nodes... p 2 j,t. Gain in GINI Index for a potential split Suppose t is to be split into j new child nodes (t l ) 1 l j. Each child node has a count n l and a vector of frequencies (p 1,tl,..., p k,tl ). Hence they have their own GINI index, GINI (t l ). The gain in GINI Index for this split is Gain(GINI, t (t l ) 1 l j ) = GINI (t) j l=1 n lgini (t l ) j l=1 n. l Greedy algorithm chooses the biggest gain in GINI index among a list of possible splits. 31 / 1 32 / 1
Decision tree for iris data using all features with GINI Entropy / Deviance / Information Suppose we have k classes and node t has frequencies p t = (p 1,t,..., p k,t ). Criterion H(t) = k p j,t log p j,t j=1 Maximized when p i,t = 1/k with value log k Minimized when one class has no records in it. Minimizing entropy will favour pure nodes... Decision tree for iris data using all features with Entropy 33 / 1 34 / 1 Gain in entropy for a potential split Suppose t is to be split into j new child nodes (t l ) 1 l j. Each child node has a count n l and a vector of frequencies (p 1,tl,..., p k,tl ). Hence they have their own entropy H(t l ). The gain in entropy for this split is j l=1 Gain(H, t (t l ) 1 l j ) = H(t) n lh(t l ) j l=1 n. l Greedy algorithm chooses the biggest gain in H among a list of possible splits. 35 / 1 36 / 1
Different criteria: GINI, H, MC Misclassification Error Suppose we have k classes and node t has frequencies p t = (p 1,t,..., p k,t ). The mode is Criterion ˆk(t) = argmax p k,t. k Misclassification Error(t) = 1 pˆk(t),t Not smooth in p t as GINI, H, can be more difficult to optimize numerically. 37 / 1 38 / 1 Choosing the split for a continuous variable Misclassification Error Example: suppose parent has 10 cases: {7D, 3R} A candidate split produces two nodes: {3D, 0R}, {4D, 3R}. The gain in MC is 0, but gain in GINI is 0.42 0.342 > 0. Similarly, entropy will also show an improvement... 39 / 1 40 / 1
Stopping training As trees get deeper, or if splits are multi-way the number of data points per leaf node drops very quickly. Trees that are too deep tend to overfit the data. A common strategy is to prune the tree by removing some internal nodes. 41 / 1 Figure : Underfitting corresponds to the left-hand side, overfit to the right 42 / 1 Cost-complexity pruning (tree library) Given a criterion Q like H or GINI, we define the cost-complexity of a tree with terminal nodes (t j ) 1 j m C α (T ) = m n j Q(t j ) + αm j=1 Given a large tree T L we might compute C α (T ) for any subtree T of T L. The optimal tree is defined as ˆT α = argmin T T L C α (T ). Pre-pruning (rpart library) These methods stop the algorithm before it becomes a fully-grown tree. Examples Stop if all instances belong to the same class (kind of obvious). Stop if number of instances is less than some user-specified threshold. Both tree, rpart have rules like this. Stop if class distribution of instances are independent of the available features (e.g., using χ 2 test) Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain). This relates to cp in rpart. Can be found by weakest-link pruning. See Elements of Statistical Learning for more... 43 / 1 44 / 1
Training and test error as a function of cp Metrics for Performance Evaluation PREDICTED CLASS l Class=Yes Class=Yes a ACTUAL (TP) CLASS Class=No c (FP) Most widely-used metric: Class=No b (FN) d (TN) 45 / 1 Tan,Steinbach, Kumar Introduction to 4/18/2004 46 / 1 Measures of performance Simplest is accuracy TP + TN Accuracy = TP + TN + FP + FN = SMC(Actual, Predicted) = 1 Misclassification Rate Accuracy isn t everything Consider an unbalanced 2-class problem with # 1 s=10, # 0 s=9990. Simply labelling everything 0 yields 99.9% accuracy. But, this classifier misses all class 1. 47 / 1 48 / 1
Cost Matrix PREDICTED CLASS C(i j) Class=Yes Class=No ACTUAL Class=Yes C(Yes Yes) C(No Yes) CLASS Class=No C(Yes No) C(No No) Measures of performance Classification rule changes to Label(p, C) = argmin i C(i j)p j Accuracy is the same as cost if C(Y Y ) = C(N N) = c 1, C(Y N) = C(N Y ) = c 2. j C(i j): Cost of misclassifying class j example as class i Tan,Steinbach, Kumar Introduction to 4/18/2004 49 / 1 50 / 1 Computing Cost of Classification Model M 1 ACTUAL CLASS Cost Matrix ACTUAL CLASS PREDICTED CLASS Accuracy = 80% Cost = 3910 + - + 150 40-60 250 PREDICTED CLASS C(i j) + - + -1 100-1 0 Model M 2 ACTUAL CLASS PREDICTED CLASS + - + 250 45-5 200 Accuracy = 90% Cost = 4255 Tan,Steinbach, Kumar Introduction to 4/18/2004 Measures of performance Other common ones TP Precision = TP + FP TN Specificity = TN + FP = TNR TP Sensitivity = Recall = TP + FN = TPR 2 Recall Precision F = Recall + Precision 2 TP = 2 TP + FN + FP 51 / 1 52 / 1
Measures of performance Precision emphasizes P(p = Y, a = Y ) & P(p = Y, a = N). Recall emphasizes P(p = Y, a = Y ) & P(p = N, a = Y ). FPR = 1 TNR FNR = 1 TPR. Measure of performance We have done some simple training / test splits to see how well our classifier is doing. More accurately, this procedure measures how well our algorithm for learning the classifier is doing. How well this works may depend on Model: Are we using the right type of classifier model? Cost: Is our algorithm sensitive to the cost of misclassification? Data size: Do we have enough data to learn a model? 53 / 1 54 / 1 Learning Curve Figure : As data increases, our estimate of accuracy improves, as does the variability of our estimate... Tan,Steinbach, Kumar Introduction to 4/18/2004 l l Learning curve shows how accuracy changes with varying sample size Requires a sampling schedule for creating learning curve: l Arithmetic sampling (Langley, et al) l Geometric sampling (Provost et al) Effect of small sample size: - Bias in the estimate - Variance of estimate Estimating performance Holdout: Split into test and training (e.g. 1/3 test, 2/3 training). Random subsampling: Repeated replicates of holdout, averaging results. Cross validation: Partition data into K disjoint subsets. For each subset S i, train on all but S i, then test on S i. Stratified sampling: May be helpful to sample so Y/N class is roughly equal in training data. 0.632 Bootstrap: Combine training error and bootstrap error... 55 / 1 56 / 1