Fitting Classification and Regression Trees Using Statgraphics and R. Presented by Dr. Neil W. Polhemus


Classification and Regression Trees: machine learning methods used to construct predictive models from data. They recursively partition the data space using simple binary decisions and are commonly portrayed as a tree with a split at each decision node.

Example: Fisher's Iris Data. [Classification tree for Species: the root split Petal.length<2.45 isolates setosa (p=1.0); further splits on Petal.width<1.75, Petal.length<4.95, and Sepal.length<5.15 lead to leaves predicting versicolor (p=0.8 and p=1.0) and virginica (p=0.666667, p=0.833333, and p=1.0).]

Basic Model Structure. Y: the variable to be predicted. If categorical, we construct a classification tree; if continuous, we construct a regression tree. X1, X2, ..., Xp: predictor variables, which may be either categorical or continuous.
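For example, in R (a minimal sketch using the tree package cited in the references; the built-in iris and mtcars data sets merely stand in for your own data):

library(tree)

# Classification tree: Species is a factor, as in Fisher's iris data above.
class_fit <- tree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                  data = iris)

# Regression tree: a continuous response (mpg from mtcars, used only as an example).
reg_fit <- tree(mpg ~ ., data = mtcars)

summary(class_fit)   # reports misclassification error and residual mean deviance
summary(reg_fit)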

Partitioning Algorithm. Start at the root node. Amongst all variables Xj, find the split that minimizes the resulting average within-node deviance. For a continuous variable, the split is of the form Xj < c. For a discrete variable, the split divides the possible values into 2 distinct groups. If one or more stopping criteria are met after the split, stop. Otherwise, consider splitting the child nodes.
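The split search can be illustrated with a toy R function. This is purely illustrative, not the tree package's implementation, and it handles only a single continuous predictor in a regression tree:

# For each candidate cutpoint X < c, compute the summed within-node deviance of the
# two child nodes and keep the cutpoint that minimizes it.
best_split <- function(x, y) {
  dev <- function(v) sum((v - mean(v))^2)           # within-node deviance (regression)
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2     # midpoints between observed values
  scores <- sapply(cuts, function(cp) dev(y[x < cp]) + dev(y[x >= cp]))
  list(cut = cuts[which.min(scores)], deviance = min(scores))
}

# Example: best single split of mpg on horsepower in mtcars
best_split(mtcars$hp, mtcars$mpg)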

RMS Titanic

Sample Data File: n = 1,309 observations (passengers only). Source: Frank Harrell and Thomas Cason, University of Virginia, http://biostat.mc.vanderbilt.edu/twiki/pub/main/datasets/titanic.html
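A hedged sketch of loading the data in R; the file name "titanic3.csv" is an assumption and should be replaced by whatever export of the Harrell/Cason data set you downloaded:

# Read the Titanic passenger data (file name assumed) and inspect its structure.
titanic <- read.csv("titanic3.csv", stringsAsFactors = TRUE)
str(titanic)   # expect 1309 rows with survived, pclass, sex, age, fare, sibsp, parch, embarked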

Data Input

Analysis Options

Partitioning Options

Node Impurity. Measure the impurity in a tree using the residual mean deviance (RMD): RMD = D / (n - k), where n = number of observations in the training set and k = number of leaves. For classification trees, the deviance is D = -2 * sum over i of log(p_i,j), where p_i,j = proportion of data of the same type as observation i at its assigned leaf j. For regression trees, D = sum over i of (Y_i - Yhat_i)^2, where Yhat_i = predicted value of observation i.
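In R, the residual mean deviance reported by summary() for a fitted tree can be reproduced by hand. This sketch assumes the usual layout of the tree object's frame component:

library(tree)
fit <- tree(Species ~ ., data = iris)

leaves <- fit$frame$var == "<leaf>"
D <- sum(fit$frame$dev[leaves])   # total deviance, summed over the terminal nodes
k <- sum(leaves)                  # number of leaves
n <- nrow(iris)                   # observations in the training set
D / (n - k)                       # residual mean deviance (RMD)
summary(fit)                      # prints "Residual mean deviance" for comparison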

Analysis Window

Decision Tree. [Classification tree for survived, with splits sex=female, pclass=3, age<9.5, fare<23.0875, sibsp=3,4,5, and pclass=2,3.]
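A hedged R counterpart of this tree, assuming the titanic data frame loaded earlier and treating survived and pclass as factors; the exact splits may differ slightly from the Statgraphics output:

library(tree)
titanic$survived <- factor(titanic$survived)   # make the response categorical
titanic$pclass   <- factor(titanic$pclass)     # treat passenger class as a factor
fit <- tree(survived ~ sex + pclass + age + fare + sibsp + parch + embarked,
            data = titanic)
plot(fit)
text(fit, pretty = 0)   # label splits with factor level names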

Decision Tree Options

Tree Structure. *Note: based on complete cases only.

Node Probabilities. *Note: based on complete cases only.

Classification Table. *Note: based on both complete and partial cases.

Training and Validation Sets. May separate the data into 2 sets: a training set used to build the tree, and a test set used to estimate the tree's misclassification percentages.
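One way to sketch this in R; the 70/30 split proportion and the seed are arbitrary choices, not part of the webinar:

set.seed(123)                                                   # arbitrary seed for reproducibility
train <- sample(nrow(titanic), round(0.7 * nrow(titanic)))      # 70/30 split (assumption)
fit_train <- tree(survived ~ sex + pclass + age + fare + sibsp + parch + embarked,
                  data = titanic, subset = train)
pred <- predict(fit_train, newdata = titanic[-train, ], type = "class")
mean(pred != titanic$survived[-train], na.rm = TRUE)            # test-set misclassification rate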

Compare Results

Pruning Options Reduces the complexity of the tree by removing branches.

Pruning by Cross-validation. Runs a 10-fold cross-validation experiment: builds 10 trees, leaving out 10% of the data each time, and averages the results, so all of the observations are used for both training and validation. Can be used to determine the optimal size for the tree by increasing the number of leaves until warnings appear.

Pruning Example. First, reduce the minimum within-node deviance required for splitting in order to fit a complex tree.
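In R, the analogous step is to relax the stopping rules through tree.control(); the mindev value below is an illustrative choice:

# Grow a deliberately overgrown tree by lowering mindev, the minimum within-node
# deviance (as a fraction of the root deviance) required to attempt a split.
big_fit <- tree(survived ~ sex + pclass + age + fare + sibsp + parch + embarked,
                data = titanic,
                control = tree.control(nobs = nrow(titanic), mindev = 0.001))
sum(big_fit$frame$var == "<leaf>")   # number of leaves in the overgrown tree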

Pruning Example. [The resulting overgrown tree for survived contains dozens of splits on sex, pclass, age, fare, sibsp, parch, and embarked; far too many leaves to be readable.]

Pruning Example. Now reduce the number of leaves to 10 and select cross-validation. [The pruned tree for survived splits on sex=female, pclass=3, age<9.5, fare<23.0875, fare<32.0896, sibsp=3,4,5, pclass=2,3, age<32.25, and age<54.5.]
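The corresponding R step is to prune the overgrown tree back to the requested size, for example with prune.misclass (a sketch):

# Snip off the least important branches until only 10 leaves remain.
pruned10 <- prune.misclass(big_fit, best = 10)
plot(pruned10)
text(pruned10, pretty = 0)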

10 Leaves is Too Complex

Finding Optimal Number of Leaves Step 1: Copy script from Statgraphics to Word and modify last line.

Finding Optimal Number of Leaves Step 2: Copy modified script to R and run it. Look for size with minimum deviance.
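A sketch of what that R script amounts to, using cv.tree from the tree package; the seed is arbitrary, and the deviance curve will vary from run to run:

# 10-fold cross-validation over the sequence of pruned subtrees.
set.seed(123)
cv_results <- cv.tree(big_fit, FUN = prune.misclass, K = 10)
plot(cv_results$size, cv_results$dev, type = "b",
     xlab = "Number of leaves", ylab = "Cross-validated misclassifications")
cv_results$size[which.min(cv_results$dev)]   # candidate optimal tree size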

Prune Tree

Final Tree. [The final tree for survived splits on sex=female, pclass=3, age<9.5, and sibsp=3,4,5.]

Predict Additional Cases To make predictions for additional cases, add them to the bottom of the original data, leaving the cell for Y blank.
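The R equivalent is predict() on a data frame of new cases. This sketch copies a few existing rows to get the right column types; the name final_fit and the pruned size of 5 leaves are assumptions standing in for the final tree shown above:

final_fit <- prune.misclass(big_fit, best = 5)   # assumed size of the final tree
new_cases <- titanic[1:3, ]                      # copy rows to inherit column types
new_cases$survived <- NA                         # response left blank for the new cases
predict(final_fit, newdata = new_cases, type = "class")   # predicted class
predict(final_fit, newdata = new_cases)                    # class probabilities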

Predictions and Residuals

Example 2: World Bank Demographics

Decision Tree for Life Expectancy. [Regression tree splitting on Fertility.Rate<3.3, GDP.per.Capita<11068.0, Fertility.Rate<4.585, GDP.per.Capita<4282.75, Female.percentage<49.97, Pop..Density<31.005, Pop..Density<40.37, and Age.Dependency.Ratio<72.925; leaf predictions range from about 50.5 to 79.5 years.]
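A hedged sketch of fitting this regression tree in R, assuming a data frame named worldbank whose column names match those shown in the tree (both the name and the columns are assumptions):

# Regression tree: Life.Expectancy is continuous, so tree() fits by least squares.
wb_fit <- tree(Life.Expectancy ~ Fertility.Rate + GDP.per.Capita + Female.percentage +
                 Pop..Density + Age.Dependency.Ratio, data = worldbank)
plot(wb_fit)
text(wb_fit)
summary(wb_fit)   # residual mean deviance for the regression tree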

References
StatFolios and data files: www.statgraphics.com/webinars
R package tree (2015): https://cran.r-project.org/web/packages/tree/tree.pdf
Classic text: Breiman, L., Friedman, J., Stone, C.J. and Olshen, R.A. (1998). Classification and Regression Trees. Wadsworth.