Chapter 5. Tree-based Methods


Wei Pan, Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455. Email: weip@biostat.umn.edu. PubH 7475/8475. © Wei Pan

Classification And Regression Trees (CART) (ESL 9.2): Breiman et al (1984); C4.5 (Quinlan 1993).

Main idea: approximate any f(x) by a piece-wise constant ˆf(x). Use recursive partitioning (Fig 9.2): 1) partition the x space into two regions R_1 and R_2 by x_j < c_j; 2) partition R_1 and R_2; 3) then their sub-regions, ... until the model fits the data well.

ˆf(x) = ∑_m c_m I(x ∈ R_m), which can be represented as a (decision) tree.
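As a minimal illustration of recursive partitioning (my own sketch, not from the slides), the R code below fits a piece-wise constant regression tree to simulated one-dimensional data with the rpart package; the simulation setup and variable names are assumptions for the example.

```r
## Minimal sketch (assumed example): fit a piece-wise constant
## regression tree by recursive binary partitioning.
library(rpart)

set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- ifelse(x < 3, 1, ifelse(x < 7, 4, 2)) + rnorm(n, sd = 0.5)  # step function + noise
dat <- data.frame(x = x, y = y)

## rpart grows the tree by greedy binary splitting (x_j < s vs x_j >= s)
fit <- rpart(y ~ x, data = dat, method = "anova")
print(fit)                                          # each terminal node reports its constant c_m
plot(x[order(x)], predict(fit)[order(x)], type = "s")  # the fitted piece-wise constant f-hat
```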

[Figure 9.2 from The Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap 9: Partitions and CART. The top right panel shows a partition of a two-dimensional feature space (X_1, X_2) by recursive binary splitting at cutpoints t_1, ..., t_4 into regions R_1, ..., R_5, as used in CART, applied to some fake data. The top left panel shows a general partition that cannot be obtained from recursive binary splitting.]

Regression Tree

Y: continuous. Key: 1) determine the splitting variables and split points (e.g. x_j < s), giving R_1, R_2, ...; 2) determine c_m in each R_m.

In 1), use a sequential or greedy search: for each j and s, take R_1(j,s) = {x : x_j < s}, R_2(j,s) = {x : x_j ≥ s}, and solve

min_{j,s} [ min_{c_1} ∑_{X_i ∈ R_1(j,s)} (Y_i − c_1)² + min_{c_2} ∑_{X_i ∈ R_2(j,s)} (Y_i − c_2)² ].

In 2), given R_1 and R_2, ĉ_k = Ave(Y_i : X_i ∈ R_k) for k = 1, 2.

Repeat the process on R_1 and R_2 respectively, ...

When to stop? We have to stop when all the Y_i's in R_m are equal, or there are too few of them; the tree size controls the model complexity!
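To make the greedy search in step 1) concrete, here is a small sketch (my own illustration, not from the slides) that scans all candidate split points s for a single predictor x and returns the s minimizing the two-region residual sum of squares; extending it to all predictors j just adds an outer loop.

```r
## Sketch of the greedy split search for one predictor (assumed example).
## For each candidate s the best constants are the region means,
## so the criterion is RSS(left) + RSS(right).
best_split <- function(x, y) {
  s_cand <- sort(unique(x))
  s_cand <- (head(s_cand, -1) + tail(s_cand, -1)) / 2   # midpoints between observed values
  rss <- sapply(s_cand, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(split = s_cand[which.min(rss)], rss = min(rss))
}

## e.g. best_split(dat$x, dat$y) returns the first CART split point on x
```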

A strategy: first grow a large tree, then prune it. Cost-complexity criterion for a tree T:

C_α(T) = RSS(T) + α|T| = ∑_m ∑_{X_i ∈ R_m} (Y_i − ĉ_m)² + α|T|,

where |T| is the number of terminal nodes (leaves) and α > 0 is a tuning parameter to be determined by CV.
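In R, this grow-then-prune strategy with CV over α is what rpart implements through its complexity parameter cp (a rescaled version of α). The sketch below, which continues the simulated data `dat` from the earlier example and is therefore an assumption on my part, grows a large tree and prunes it back at the cp value minimizing the cross-validated error; the one-standard-error rule in Fig 9.4 would instead pick the smallest tree within one SE of that minimum.

```r
## Grow a large tree, then cost-complexity prune it using the
## cross-validated error reported in the cp table (sketch; assumed data `dat`).
library(rpart)

big <- rpart(y ~ x, data = dat, method = "anova",
             control = rpart.control(cp = 0.0, minsplit = 5, xval = 10))
printcp(big)                                  # cp table: tree size vs 10-fold CV error

cp_best <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
pruned  <- prune(big, cp = cp_best)           # subtree minimizing the CV estimate
```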

[Figure 9.4 from The Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap 9: misclassification rate vs. tree size. Results for the spam example. The blue curve is the 10-fold cross-validation estimate of misclassification rate as a function of tree size, with standard error bars. The minimum occurs at a tree size with about 17 terminal nodes (using the one-standard-error rule). The orange curve is the test error, which tracks the CV error quite closely. The cross-validation is indexed by values of α, shown along the top axis; the tree sizes shown along the bottom axis refer to |T_α|, the size of the original tree indexed by α.]

Classification Tree

Y_i ∈ {1, 2, ..., K}. Classify obs's in node m to the majority class:

ˆp_mk = (1/n_m) ∑_{X_i ∈ R_m} I(Y_i = k),  k(m) = arg max_k ˆp_mk.

Impurity measure Q_m(T) (we used squared error in regression trees):

1. Misclassification error: (1/n_m) ∑_{X_i ∈ R_m} I(Y_i ≠ k(m)) = 1 − ˆp_{m,k(m)}.
2. Gini index: ∑_{k=1}^{K} ˆp_mk (1 − ˆp_mk).
3. Cross-entropy or deviance: −∑_{k=1}^{K} ˆp_mk log ˆp_mk.

For K = 2, 1-3 reduce to 1 − max(ˆp, 1 − ˆp), 2ˆp(1 − ˆp), and −ˆp log ˆp − (1 − ˆp) log(1 − ˆp), where ˆp is the proportion in the second class. They look similar; see Fig 9.3.

Example: ex5.1.r
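As a quick check of how similar these impurity measures are for K = 2 (the comparison plotted in Fig 9.3), the following sketch, which is just an illustration of mine, computes and plots all three as functions of ˆp.

```r
## Node impurity for a two-class node as a function of p = proportion in class 2
## (sketch for comparison, in the spirit of ESL Fig 9.3).
p <- seq(0.001, 0.999, length.out = 200)

misclass <- 1 - pmax(p, 1 - p)                  # misclassification error
gini     <- 2 * p * (1 - p)                     # Gini index
entropy  <- -p * log(p) - (1 - p) * log(1 - p)  # cross-entropy / deviance

matplot(p, cbind(misclass, gini, entropy), type = "l", lty = 1,
        xlab = "p", ylab = "impurity")
legend("topright", c("misclassification", "Gini", "entropy"), col = 1:3, lty = 1)
```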

Advantages:
1. Easy to incorporate unequal losses of misclassifications: (1/n_m) ∑_{X_i ∈ R_m} w_i I(Y_i ≠ k(m)) with w_i = C_k if Y_i = k.
2. Handling missing data: use a surrogate splitting variable/value at each node (to best approximate the selected one).

Extensions:
1. May use non-binary splits;
2. A linear combination of multiple variables as a splitting variable: more flexible, but better?

+: easy interpretation (decision trees!)
−: unstable due to greedy search and discontinuity; predictive performance not the best.

R packages tree, rpart; commercial CART. Other implementations: C4.5/C5.0; FIRM by Prof Hawkins (U of M), to detect interactions; programs by Prof Loh's group (UW-Madison), for count, survival, ... data, with a regression model in each terminal node; ...
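Both advantages have direct counterparts in rpart, sketched below; the particular loss matrix and surrogate settings are illustrative assumptions of mine, not values from the slides, and the example uses the kyphosis data shipped with rpart.

```r
## Sketch: unequal misclassification losses and surrogate splits in rpart
## (illustrative settings; kyphosis data comes with the rpart package).
library(rpart)

## loss[i, j] = cost of classifying a true class-i case as class j (diagonal = 0);
## classes are ordered ("absent", "present"), so this (arbitrary) matrix makes
## missing a "present" case 4 times as costly as a false alarm.
loss <- matrix(c(0, 1,
                 4, 0), nrow = 2, byrow = TRUE)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             parms   = list(loss = loss),
             control = rpart.control(maxsurrogate = 5,   # keep up to 5 surrogate splits per node
                                     usesurrogate = 2))  # route missing x's via surrogates
summary(fit)   # reports the primary and surrogate splits at each node
```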

[Figure from The Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap 9: the pruned classification tree for the spam data. Each node is labeled with its predicted class (email or spam) and its misclassified/total counts (e.g. 600/1536 at the root); the splits are on word/character frequencies and capital-letter run statistics, such as ch$ < 0.0555, remove < 0.06, hp < 0.405, george < 0.15, CAPAVE < 2.907, free < 0.065, edu < 0.045.]

[Figure 9.6 from The Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap 9: sensitivity vs. specificity. ROC curves for the classification rules fit to the spam data. Curves that are closer to the northeast corner represent better classifiers. In this case the GAM classifier dominates the trees. The weighted tree achieves better sensitivity for higher specificity than the unweighted tree. The numbers in the legend represent the area under each curve: Tree (0.95), GAM (0.98), Weighted Tree (0.90).]

Application: personalized medicine

Also called subgroup analysis (or precision medicine): to identify subgroups of patients that would benefit most from a treatment.
Statistical problem: detect a (qualitative) treatment-predictor interaction!
Quantitative interactions: treatment effects differ in magnitude but are in the same direction; qualitative interactions: they differ in direction.
Many approaches ... one of them is to use trees.
Prof Loh's GUIDE: http://www.stat.wisc.edu/~loh/guide.html
An example: http://onlinelibrary.wiley.com/doi/10.1002/sim.6454/abstract
Another example: https://www.ncbi.nlm.nih.gov/pubmed/24983709
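As a toy illustration of how a tree can reveal a qualitative treatment-predictor interaction (this simulation and its variable names are my own assumption, not part of the course material), one can simulate a biomarker that flips the sign of the treatment effect and fit a regression tree on the treatment indicator and the biomarker; the tree is expected to split first on the biomarker and then on treatment within each subgroup, with opposite treatment effects on the two sides.

```r
## Toy sketch: a tree picking up a qualitative trt-biomarker interaction
## (simulated data; variable names are illustrative assumptions).
library(rpart)

set.seed(2)
n   <- 1000
trt <- rbinom(n, 1, 0.5)          # randomized treatment indicator
x   <- rnorm(n)                   # biomarker
## treatment helps when x > 0 and harms when x <= 0: a qualitative interaction
y   <- 1 + ifelse(x > 0, 2, -2) * trt + rnorm(n)

fit <- rpart(y ~ trt + x, data = data.frame(y, trt, x), method = "anova")
print(fit)   # expected pattern: split on x, then on trt within each side,
             # with opposite estimated treatment effects in the two x-subgroups
```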