Classification and Regression Trees


Classification and Regression Trees
Matthew S. Shotwell, Ph.D.
Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, USA
March 16, 2018

Introduction
- trees partition the feature space into a set of rectangles
- fit a simple model in each partition (e.g., a constant)
- start with CART: Classification And Regression Trees
- quantitative response $Y$ and inputs $X_1$ and $X_2$, all with support in $[0, 1]$

Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap. 9

- top-left panel: partition lines are parallel to the axes (e.g., $X_1 = c$), but the partition cannot be described by a tree
- top-right panel: partitions are recursive and can be described by the tree at bottom-left
- trees have nodes (root, internal, and terminal) and branches
- terminal nodes are also called leaf nodes

FIGURE 9.2. Partitions and CART. Top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. Top left panel shows a general partition that cannot be obtained from recursive binary splitting. Bottom left panel shows the tree corresponding to the partition in the top right panel, and a perspective plot of the prediction surface appears in the bottom right panel.

Fitting recursive binary trees
- first split the space into two regions
- model the response by the mean of $Y$ in each region
- choose the feature and split point to achieve the best fit
- one or both resulting regions are split into two more regions
- repeat until some stopping rule is applied

Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap. 9

The tree in the top-right panel is constructed as follows:
1. split at $X_1 = t_1$
2. for $X_1 \le t_1$, split at $X_2 = t_2$
3. for $X_1 > t_1$, split at $X_1 = t_3$
4. for $X_1 > t_3$, split at $X_2 = t_4$

This partitions the space into 5 regions $R_1, \ldots, R_5$; the predictor is the mean of $Y$ in each region:

$$\hat{f}(x) = \sum_{m=1}^{5} \hat{c}_m \, I(x \in R_m), \qquad \text{where} \quad \hat{c}_m = \frac{\sum_{i=1}^{N} I(x_i \in R_m)\, y_i}{\sum_{i=1}^{N} I(x_i \in R_m)}$$

The bottom-left panel is the tree representation of the top-right partition, and the bottom-right panel is the surface of $\hat{f}(x)$. (Figure 9.2 shown again; see caption above.)

Fitting (growing) regression trees

Regression trees have this form:

$$f(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m)$$

Under squared-error loss, conditional on $R_m$, $\hat{c}_m$ is the sample mean of $Y$ in region $R_m$. Finding the partitions $R_m$ is more difficult. Greedy algorithm: consider a splitting variable $j$ and split point $s$, and define the resulting regions as

$$R_1(j, s) = \{X : X_j \le s\} \quad \text{and} \quad R_2(j, s) = \{X : X_j > s\}$$

The splitting task is then to find $j$ and $s$ that minimize:

$$\mathrm{RSS}(j, s) = \sum_{x_i \in R_1(j,s)} (y_i - \hat{c}_1)^2 + \sum_{x_i \in R_2(j,s)} (y_i - \hat{c}_2)^2$$

Fitting (growing) regression trees

The splitting task is then to find $j$ and $s$ that minimize:

$$\mathrm{RSS}(j, s) = \sum_{x_i \in R_1(j,s)} (y_i - \hat{c}_1)^2 + \sum_{x_i \in R_2(j,s)} (y_i - \hat{c}_2)^2$$

- this may seem difficult at first, but $s$ is only identifiable up to $N - 1$ distinct values
- for a given $j$, $\mathrm{RSS}(j, s)$ is constant for $s$ between consecutive ordered values of $X_j$
- for tractable $N$ and $p$, one can simply enumerate all $\mathrm{RSS}(j, s)$ (a sketch follows below)
- split at the optimal $j$ and $s$, then repeat within each split
- proceed this way until a stopping rule is triggered
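To make the enumeration concrete, here is a minimal R sketch (not from the slides) of the exhaustive search for a single best split under squared-error loss; the function name best_split and the fake data are my own illustration.

```r
## Exhaustive search for the best single split (j, s) under squared-error loss.
## `x` is a numeric matrix of features, `y` a numeric response (hypothetical names).
best_split <- function(x, y) {
  best <- list(rss = Inf, j = NA, s = NA)
  for (j in seq_len(ncol(x))) {
    v <- sort(unique(x[, j]))
    if (length(v) < 2) next
    cand <- (v[-1] + v[-length(v)]) / 2  # midpoints between consecutive distinct values
    for (s in cand) {
      left  <- y[x[, j] <= s]
      right <- y[x[, j] >  s]
      rss   <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, j = j, s = s)
    }
  }
  best
}

## fake data: the response jumps when the first feature crosses 0.5
set.seed(1)
x <- matrix(runif(200), ncol = 2)
y <- ifelse(x[, 1] > 0.5, 2, 0) + rnorm(100, sd = 0.1)
best_split(x, y)  # should pick j = 1 with s near 0.5
```

Growing a full tree amounts to applying this search recursively within each resulting region until a stopping rule is triggered.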

Stopping rules
- tree complexity is a tuning parameter
- one approach: grow a large tree, stopping only when a minimum node size (i.e., a minimum number of $x_i \in R_m$) is reached, then prune the tree back using a cost-complexity criterion
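In the rpart package (introduced at the end of these notes), such stopping controls are set through rpart.control; a brief sketch with illustrative values, not recommendations:

```r
library(rpart)

## grow a deliberately large tree: small complexity penalty and small node sizes
## (illustrative values only)
ctrl <- rpart.control(minsplit  = 5,     # smallest node rpart will attempt to split
                      minbucket = 2,     # smallest allowed terminal node
                      cp        = 0.001) # complexity parameter; small value => large tree
## pass via rpart(..., control = ctrl), then prune afterwards (see the example at the end)
```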

Pruning

(Figure 9.2 shown again; see caption above: the recursive binary partition, the corresponding tree, and the prediction surface.)

Cost-complexity pruning
- let $T \subset T_0$ be a sub-tree obtained by pruning $T_0$
- let $|T|$ be the number of terminal nodes
- let $N_m = \sum_{i=1}^{N} I(x_i \in R_m)$ be the node size
- let $\hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i$
- let $Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2$ be the lack of fit (LOF)
- the cost-complexity criterion is

$$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$$

Cost-complexity pruning

$$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$$

- pruning increases cost (lack of fit) but lowers complexity
- for a given $\alpha$, find the sub-tree $T_\alpha$ that minimizes $C_\alpha(T)$
- greedy algorithm (weakest-link pruning):
  1. collapse the internal node that produces the smallest increase in $C_\alpha(T)$
  2. continue until just one node remains
  3. select among the resulting sequence of trees; $T_\alpha$ must be part of this sequence
- $\alpha$ is selected using, e.g., cross-validation
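As a rough illustration of the criterion itself, here is a small R sketch that evaluates $C_\alpha(T)$ for a fixed assignment of observations to terminal nodes; cost_complexity, y, and leaf are hypothetical names, and this is not a full weakest-link pruning implementation.

```r
## C_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T|   (squared-error lack of fit)
## `y` is the response; `leaf` assigns each observation to a terminal node.
cost_complexity <- function(y, leaf, alpha) {
  lof <- tapply(y, leaf, function(ym) sum((ym - mean(ym))^2))  # N_m * Q_m(T) per node
  sum(lof) + alpha * length(lof)
}

## toy usage: two terminal nodes, alpha = 1
cost_complexity(y = c(1, 2, 3, 10, 11, 12), leaf = c(1, 1, 1, 2, 2, 2), alpha = 1)
## = (2 + 2) + 1 * 2 = 6

## weakest-link pruning would repeatedly collapse the internal node whose removal
## increases the lack-of-fit term the least, giving a nested sequence of sub-trees.
```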

Classification trees
- for $k = 1, \ldots, K$ classes, let $\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$
- apply a loss-function-specific classification rule $k(m)$ (e.g., classify to the majority class under zero-one loss)
- need a new metric of node impurity and LOF (for cost-complexity pruning):
  - misclassification rate: $1 - \hat{p}_{m k(m)}$
  - Gini index: $\sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$
  - cross-entropy (deviance): $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$
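These three impurity measures can be written directly in terms of the node proportions $\hat{p}_{mk}$; a minimal R sketch (the function names are my own):

```r
## p: vector of class proportions within a node (sums to 1)
misclass <- function(p) 1 - max(p)                          # misclassification rate
gini     <- function(p) sum(p * (1 - p))                    # Gini index
entropy  <- function(p) -sum(ifelse(p > 0, p * log(p), 0))  # cross-entropy / deviance

## example: proportions (0.75, 0.25) in a two-class node
misclass(c(0.75, 0.25))  # 0.25
gini(c(0.75, 0.25))      # 0.375
```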

FIGURE 9.3 (Elements of Statistical Learning, 2nd Ed., Hastie, Tibshirani & Friedman 2009, Chap. 9). Node impurity measures for two-class classification, as a function of the proportion $p$ in class 2. Cross-entropy has been scaled to pass through $(0.5, 0.5)$.

Why prefer misclassification vs. Gini vs. cross-entropy?
- say we have a two-class problem with 400 observations in each class, denoted (400, 400)
- consider two candidate splits:
  - $s_1$: (300, 100) and (100, 300)
  - $s_2$: (200, 400) and (200, 0)
- both splits have misclassification rate 0.25 (assuming a zero-one loss classifier), but the second split produces a pure node, which is preferable in many cases
- both Gini and cross-entropy give preference to the split with the pure node (checked numerically below); thus they are often used in growing trees
- misclassification is often used for pruning
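Checking the claim numerically with the impurity sketches above, weighting each child node's impurity by its share of the 800 observations (one common convention):

```r
## impurity functions as in the sketch above
misclass <- function(p) 1 - max(p)
gini     <- function(p) sum(p * (1 - p))

## weighted impurity of a split: `counts` is a list of per-node class counts
wimp <- function(counts, impurity) {
  n <- sapply(counts, sum)
  sum(n / sum(n) * sapply(counts, function(cnt) impurity(cnt / sum(cnt))))
}

s1 <- list(c(300, 100), c(100, 300))
s2 <- list(c(200, 400), c(200, 0))
wimp(s1, misclass); wimp(s2, misclass)  # both 0.25
wimp(s1, gini);     wimp(s2, gini)      # 0.375 vs 0.333...: Gini prefers s2
```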

Gini index
- consider classifying training observations $x_i \in R_m$ to class $k$ with probability $\hat{p}_{mk}$
- then the training error rate of this rule in node $m$ is $\sum_{k \neq k'} \hat{p}_{mk} \hat{p}_{mk'}$, which is the Gini index

Problems with trees
- instability: sampling variability in the tree structure
- lack of smoothness of the prediction surface in feature space
- categorical features: there are $2^{q-1} - 1$ partitions of a categorical predictor with $q$ values into 2 parts; the search can be simplified for binary outcomes using Gini or entropy loss, and for quantitative outcomes using squared-error loss

R package rpart
- the R package rpart implements recursive partitioning for classification and regression trees
- there is a great vignette written by Terry Therneau and others
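A minimal sketch of the usual rpart workflow, assuming the built-in mtcars data purely for illustration: grow a large tree, inspect the cross-validated cost-complexity table, and prune at the cp value with the smallest cross-validated error.

```r
library(rpart)

## regression tree (method = "anova") on the built-in mtcars data
fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(cp = 0.001, minsplit = 5))

printcp(fit)  # cross-validated error for each value of the complexity parameter cp
plotcp(fit)   # plot of cross-validated error against cp

## prune back to the cp with the smallest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

Here cp plays the role of the penalty $\alpha$ (rescaled by rpart), and printcp/plotcp report the cross-validation used to select it.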