Fitting Classification and Regression Trees Using Statgraphics and R Presented by Dr. Neil W. Polhemus
Classification and Regression Trees
Machine learning methods for constructing predictive models from data. They recursively partition the data space using simple binary decisions, and are commonly portrayed as a tree with a split at each decision node.
Example: Fisher's Iris Data
[Tree diagram: the root split Petal.length<2.45 yields setosa (p=1.0); further splits on Petal.width<1.75, Petal.length<4.95, and Sepal.length<5.15 yield virginica leaves (p=0.666667, 0.833333, 1.0) and versicolor leaves (p=0.8, 1.0).]
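A tree like the one on this slide can be fit with the R `tree` package that the deck references later. This is a minimal sketch, assuming the package is installed; `iris` ships with R.

```r
# Sketch: fit a classification tree to Fisher's iris data with the
# 'tree' package (assumes install.packages("tree") has been run).
library(tree)

# Species is categorical, so tree() builds a classification tree
# from the four sepal/petal measurements.
iris.tr <- tree(Species ~ ., data = iris)

summary(iris.tr)              # misclassification rate and residual mean deviance
plot(iris.tr); text(iris.tr)  # draw the tree with split labels at each node
```

The printed splits correspond to the decision nodes shown in the diagram.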
Basic Model Structure
Y: variable to be predicted. If categorical, we construct a classification tree; if continuous, we construct a regression tree.
X 1, X 2, ..., X p: predictor variables. May be either categorical or continuous.
Partitioning Algorithm
Start at the root node. Amongst all predictor variables X j, find the split that minimizes the resulting average within-node deviance. For a continuous variable, the split is of the form X j < c. For a categorical variable, the split divides the possible values into 2 distinct groups. If one or more stopping criteria are met after the split, stop; otherwise, consider splitting the child nodes.
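The split search for a single continuous predictor can be illustrated directly. This sketch is not the `tree` package's internal code; it is a hypothetical helper that exhaustively scans candidate cutpoints for a regression response, using the sum of squared deviations from each child's mean as the node deviance.

```r
# Illustrative sketch of the split search: for a continuous predictor x,
# try each midpoint between adjacent observed values as the cut c, and
# score the split x < c by the total deviance of the two child nodes.
best.split <- function(x, y) {
  vals <- sort(unique(x))
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2   # candidate cutpoints
  dev  <- sapply(cuts, function(c) {
    left <- y[x < c]; right <- y[x >= c]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cut = cuts[which.min(dev)], deviance = min(dev))
}

# Example: best single split of Sepal.Length on Petal.Length.
best.split(iris$Petal.Length, iris$Sepal.Length)
</imports>
```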
RMS Titanic
Sample Data File n=1,309 observations (passengers only) Source: Frank Harrell and Thomas Cason, University of Virginia http://biostat.mc.vanderbilt.edu/twiki/pub/main/datasets/titanic.html
Data Input
Analysis Options
Partitioning Options
Node Impurity
Measure the impurity of a tree using the residual mean deviance (RMD): RMD = D / (n - k), where n = number of observations in the training set and k = number of leaves.
Classification trees: D = -2 Σ log p(i,j), where p(i,j) = proportion of observations of the same class as observation i at its assigned leaf j.
Regression trees: D = Σ (Y i - Ŷ i)², where Ŷ i = predicted value of observation i.
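The residual mean deviance reported by `summary.tree()` can be reproduced from this formula. A sketch for the classification case, using iris since it ships with R:

```r
# Sketch: compute the residual mean deviance D / (n - k) by hand for a
# classification tree, where D = -2 * sum(log p[i,j]).
library(tree)
tr <- tree(Species ~ ., data = iris)

probs <- predict(tr)   # matrix of leaf class proportions, one row per observation
# p[i,j]: the proportion at observation i's leaf for its own class.
p.i <- probs[cbind(seq_len(nrow(iris)), as.integer(iris$Species))]

D <- -2 * sum(log(p.i))                 # total deviance
k <- sum(tr$frame$var == "<leaf>")      # number of leaves
D / (nrow(iris) - k)                    # residual mean deviance, as in summary(tr)
```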
Analysis Window
Decision Tree survived sex=female pclass=3 age<9.5 fare<23.0875 sibsp=3,4,5 pclass=2,3
Decision Tree Options
Tree Structure
Note: based on complete cases only.
Node Probabilities
Note: based on complete cases only.
Classification Table
Note: based on both complete and partial cases.
Training and Validation Sets
May separate the data into 2 sets: a training set used to build the tree, and a test set used to estimate the tree's misclassification percentages.
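A train/test split in R might look like the following sketch. It uses iris rather than the Titanic file so it is self-contained; the 70/30 split proportion is an arbitrary choice for illustration.

```r
# Sketch: hold out a random 30% test set and estimate the
# misclassification rate of a tree fit on the remaining 70%.
library(tree)
set.seed(1)                                    # reproducible split
idx   <- sample(nrow(iris), size = 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

tr   <- tree(Species ~ ., data = train)        # built on training set only
pred <- predict(tr, newdata = test, type = "class")
mean(pred != test$Species)                     # estimated misclassification rate
```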
Compare Results
Pruning Options Reduces the complexity of the tree by removing branches.
Pruning by Cross-validation
Runs a 10-fold cross-validation experiment: builds 10 trees, leaving out 10% of the data each time, and averages the results. Uses all of the observations for both training and validation. Can be used to determine the optimal size for the tree by increasing the number of leaves until warning messages appear.
Pruning Example
First reduce the minimum within-node deviance stopping criterion so that a deliberately complex tree is fit.
Pruning Example
[Overgrown tree diagram for survived, with dozens of splits on sex, pclass, age, fare, sibsp, parch, and embarked.]
Pruning Example
Now reduce the number of leaves to 10 and select cross-validation. [Tree diagram with 10 leaves: splits on sex=female, pclass=3, age<9.5, fare<23.0875, fare<32.0896, sibsp=3,4,5, pclass=2,3, age<32.25, and age<54.5.]
10 Leaves is Too Complex
Finding Optimal Number of Leaves Step 1: Copy script from Statgraphics to Word and modify last line.
Finding Optimal Number of Leaves Step 2: Copy modified script to R and run it. Look for size with minimum deviance.
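The script run in R is of the kind sketched below: `cv.tree()` from the `tree` package performs 10-fold cross-validation over subtree sizes, and the size with minimum deviance is read off. Shown here with iris for self-containment.

```r
# Sketch: cross-validate over tree sizes and find the size with
# minimum cross-validated deviance.
library(tree)
tr <- tree(Species ~ ., data = iris)

set.seed(1)                               # CV folds are random
cv <- cv.tree(tr, FUN = prune.misclass)   # deviance here = CV misclassifications
plot(cv$size, cv$dev, type = "b",
     xlab = "Number of leaves", ylab = "CV deviance")
cv$size[which.min(cv$dev)]                # optimal number of leaves
```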
Prune Tree
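Once the optimal size is chosen, the tree is cut back with `prune.misclass()` (equivalently, `prune.tree()` with `method = "misclass"`). A sketch, where the 4-leaf choice is an assumption for illustration:

```r
# Sketch: prune a fitted classification tree back to a chosen number
# of leaves, keeping the best subtree of that size.
library(tree)
tr <- tree(Species ~ ., data = iris)

pruned <- prune.misclass(tr, best = 4)    # best = desired number of leaves
summary(pruned)                           # fewer leaves, slightly higher deviance
plot(pruned); text(pruned)
```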
Final Tree
[Tree diagram predicting survived, with splits on sex=female, pclass=3, age<9.5, and sibsp=3,4,5.]
Predict Additional Cases To make predictions for additional cases, add them to the bottom of the original data, leaving the cell for Y blank.
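Appending blank-Y rows is the Statgraphics workflow; in R the same thing is done with `predict()` on a new data frame. A sketch with a made-up iris-style case:

```r
# Sketch: score a new case with a fitted tree via predict(); no need to
# append it to the original data file.
library(tree)
tr <- tree(Species ~ ., data = iris)

newcase <- data.frame(Sepal.Length = 6.1, Sepal.Width = 2.9,
                      Petal.Length = 4.7, Petal.Width = 1.4)   # hypothetical flower
predict(tr, newdata = newcase, type = "class")   # predicted species
predict(tr, newdata = newcase)                   # class proportions at its leaf
```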
Predictions and Residuals
Example 2: World Bank Demographics
Decision Tree: Life Expectancy
[Tree diagram with splits on Fertility.Rate<3.3, GDP.per.Capita<11068.0, Fertility.Rate<4.585, GDP.per.Capita<4282.75, Female.percentage<49.97, Pop..Density<31.005, Pop..Density<40.37, and Age.Dependency.Ratio<72.925; leaf predictions range from 50.515 to 79.518 years.]
References
StatFolios and data files are at: www.statgraphics.com/webinars
R package tree (2015): https://cran.r-project.org/web/packages/tree/tree.pdf
Classic text: Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984) Classification and Regression Trees. Wadsworth.