CARTWARE Documentation

CARTWARE is a collection of R functions written for Classification and Regression Tree (CART) analysis of ecological data sets. All of these functions make use of existing R functions, and the core cart() function is simply a convenience wrapper for the rpart() function contained in the rpart library.

Brad Compton, Research Associate
Department of Natural Resources Conservation
University of Massachusetts, Amherst

Date Last Updated: 13 November, 2006

Functions:

Cartware.R includes 23 separate functions, but most of these are sub-functions called from other primary functions (i.e., the user would not normally call the sub-functions directly). The help files in this document cover only the primary cartware functions:

cart
monte.cart
rhist
var.importance

Function: cart

Description:
This is the primary cartware function. cart() is simply a convenience wrapper for the rpart() function in the rpart library. cart() computes both classification and regression trees, depending on the type of dependent variable. If the dependent variable is categorical (i.e., designated a factor in R), a classification tree will be created; otherwise a regression tree will be created. In addition to all of the arguments available in rpart(), the cart() function includes a couple of additional arguments to generate a smoothed relative error plot (aka the cp plot) and automate tree pruning based on a 10-fold cross-validation and the 1-SE rule.

Usage:
cart(formula, data, ..., control=usual, pick=TRUE, prune=0, textsize=0.75, smooth=1, margin=.06)

Dependency:
library(rpart)

Arguments:

formula: required formula (symbolic description) of the model to be fit. A typical model has the form 'response ~ terms', where 'response' is the dependent variable or response vector (numeric or character) and 'terms' is a series of terms which specifies a linear predictor for 'response', e.g., y~x1+x2+... Note, rpart does not allow for interaction terms (e.g., x1*x2). Alternatively, all the variables in the data= object can be included as independent variables by specifying a period (.) on the right-hand side of the equation; in this case, the data object should not include the dependent variable. Note, for a classification tree the term on the left-hand side of the equation must be a factor.

data: required data frame in which to interpret the variables named in the formula. Note, the data frame can include continuous, categorical and/or count variables.

...: optional parameters for the splitting function, including parameters for specifying priors, costs and the splitting criterion (see help(rpart) for details).

control: optional controls on details of the 'rpart' algorithm. See help(rpart.control) for details on the parameters that can be controlled. To change these controls, specify control=rpart.control(put list of controls here) as an argument in the cart() call. See the example below. The default control settings are specified by the function usual(). Type usual to see the current settings. [default = usual]

pick: optional logical (TRUE or FALSE) indicating whether to use a 10-fold cross-validation and the 1-SE rule for picking the right-sized tree. [default = TRUE]

prune: optional level for pruning the tree. The level specified is the value of the complexity parameter (cp). [default = 0; no pruning]

textsize: optional text size for the cart plot. [default = 0.75]

smooth: optional smoothing parameter (positive integer value) for the relative error plot (aka cp plot). The smoothing parameter specifies the number of times the V-fold cross-validation is conducted to estimate the relative error for trees of various sizes. If smooth = m, the relative error is computed as the average across the m V-fold cross-validations. If m >> 1, the cp plot will be smoothed over many runs rather than reflecting a single random subsetting of the data. [default = 1; no smoothing]

margin: optional width of the margin for the tree plot. [default = .06]

Details:
cart() is simply a wrapper for rpart(). See the help files for rpart() for more details.

Value:
Returns an object of class 'rpart', a superset of class 'tree', which is a list containing several components. See help(rpart.object) for details on the list components.

B. Compton, August 30, 2006

References:
See references in rpart().

Examples:
turtle <- read.csv('byturtle.csv', header=TRUE)
grp <- as.factor(turtle[,3])
y <- turtle[,6:30]
z <- cart(grp~., data=y, method='class', parms=list(split='gini'),
  control=rpart.control(minsplit=2, minbucket=1), pick=FALSE)
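For readers who want to see what pick=TRUE does under the hood, the following sketch applies the 1-SE rule directly to a standard rpart() fit. It uses the kyphosis data set that ships with rpart rather than the byturtle.csv file above; cart() performs equivalent steps internally, so this is only an illustration of the rule, not the cartware code.

library(rpart)

# Grow a full classification tree with 10-fold cross-validation
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = 'class', control = rpart.control(cp = 0, xval = 10))

# 1-SE rule: choose the smallest tree whose cross-validated error is within
# one standard error of the minimum cross-validated error
cp.tab <- fit$cptable
best   <- which.min(cp.tab[, 'xerror'])
thresh <- cp.tab[best, 'xerror'] + cp.tab[best, 'xstd']
pick   <- which(cp.tab[, 'xerror'] <= thresh)[1]   # smallest qualifying tree
pruned <- prune(fit, cp = cp.tab[pick, 'CP'])

plotcp(fit)   # the (unsmoothed) relative error plot, analogous to cart()'s cp plot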

Function: monte.cart

Description:
Computes a P-value for a tree, based on Monte Carlo resampling. Specifically, monte.cart() conducts a randomization test of the null hypothesis that the overall correct classification rate of the tree is no greater than expected if there were no real group structure to the data.

Usage:
monte.cart(model, data, size, n = 1000, ifplot=FALSE, ifcartplot=TRUE, control=usual, parms, ...)

Dependency:
library(rpart)

Arguments:

model: required rpart model; same as formula in cart() and rpart().

data: required data frame in which to interpret the variables named in the model formula. Note, the data frame can include continuous, categorical and/or count variables.

size: required tree size to test; must be a positive integer.

n: optional number of random permutation trees to create; must be a positive integer (the more the better). [default = 1000]

ifplot: optional logical (TRUE or FALSE) indicating whether to produce a density plot of the random permutation distribution of the correct classification rate (CCR) under the null hypothesis (curve) vs. the CCR of the original tree (triangle). [default = FALSE]

ifcartplot: optional logical (TRUE or FALSE) indicating whether to produce a tree plot for the original model based on the full data set. Note, this is the same tree that is produced by cart() or rpart() for the same model. [default = TRUE]

control: optional controls on details of the 'rpart' algorithm. See the control argument in cart(). [default = usual]

parms: optional parameters for the rpart splitting function, including parameters for specifying priors, costs and the splitting criterion (see help(rpart) for details). [default = use rpart defaults]

...: optional arguments to rpart() and rpart.control(); see the corresponding help files for details. [default = use all function defaults]

Details:
The monte.cart() function works by permuting the grouping variable only. In this manner, the correlation structure of the observation vectors is maintained, but the group membership is made random with respect to the discriminating variables. monte.cart() then compares the original correct classification rate (CCR) with the distribution of CCRs from the random trees. If the original tree could easily have been produced by chance, we should expect a high P. On the other hand, if there is little chance that the original tree could have been produced from a data set without any real structure, then we should get a small P. A small P, say <0.05, indicates that the observed CCR is significantly better than chance.

Value:
Prints a summary of the model, including its correct classification rate, Kappa or Tau statistic, tree size, and confusion matrix, as well as a summary of the permutation test result. Two p-values are reported. The first p-value is the proportion of permuted CCRs greater than the observed CCR. The second p-value is the proportion of the kernel density curve greater than the observed CCR. The kernel density curve is simply a smoothed version of the actual permutation distribution. These p-values should be similar in most cases. In addition, monte.cart() returns a vector with the permuted CCRs.

B. Compton, August 30, 2006

References:
None.

Examples:
turtle <- read.csv('byturtle.csv', header=TRUE)
grp <- as.factor(turtle[,3])
y <- turtle[,6:30]
y.grp <- cbind(grp, y)
monte.cart(grp~strlen+ub, data=y.grp, method='class', parms=list(split='gini'),
  n=100, size=3, ifplot=TRUE)
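The permutation logic is easy to reproduce by hand with plain rpart(), which may help readers who do not have cartware installed. The sketch below uses the kyphosis data set from rpart and a fixed formula; unlike monte.cart(), it does not constrain the permuted trees to a fixed size, so it is only an approximation of the test described above.

library(rpart)

# Correct classification rate (CCR) of an rpart classification fit on its training data
ccr <- function(fit, y) mean(predict(fit, type = 'class') == y)

fit.obs <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = 'class')
ccr.obs <- ccr(fit.obs, kyphosis$Kyphosis)

# Null distribution: permute the grouping variable only, keeping the
# correlation structure of the predictors intact
set.seed(1)
ccr.null <- replicate(999, {
  perm <- kyphosis
  perm$Kyphosis <- sample(perm$Kyphosis)
  ccr(rpart(Kyphosis ~ Age + Number + Start, data = perm, method = 'class'),
      perm$Kyphosis)
})

# Proportion of permuted CCRs at least as large as the observed CCR
p.value <- (sum(ccr.null >= ccr.obs) + 1) / (length(ccr.null) + 1)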

Function: rhist

Description:
Creates a barplot (classification tree) or histogram (regression tree) for each leaf in the tree.

Usage:
rhist(a, digits=4)

Dependency:
library(rpart)

Arguments:

a: required rpart object, the result of running rpart() or cart().

digits: optional number of digits for rounding values in the histogram (regression tree). [default = 4]

Details:
rhist() is simply a wrapper for barplot() and hist(). It produces a barplot of class membership for each node in a classification tree, giving the frequency of observations in each class, and a histogram of the dependent variable for each node in a regression tree, giving the frequency of observations in each bin. It also reports the expected (or predicted) value for each node (y-hat). This function is useful for better understanding the composition of samples in each node.

Value:
None. Produces barplots or histograms corresponding to the nodes in the tree, from left to right, in a new graphics window.

B. Compton, August 30, 2006

References:
None.

Examples:
turtle <- read.csv('byturtle.csv', header=TRUE)
grp <- as.factor(turtle[,3])
y <- turtle[,6:30]
z <- cart(grp~., data=y, method='class', parms=list(split='gini'),
  control=rpart.control(minsplit=2, minbucket=1), pick=FALSE)
rhist(z)
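For a classification tree, the per-leaf composition that rhist() plots can be approximated directly from a standard rpart fit using its where component, which records the leaf each observation falls into. The sketch below again uses the kyphosis data and illustrates the idea only; it is not the rhist() code.

library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = 'class')

# fit$where gives, for each observation, the row of fit$frame for its leaf;
# terminal nodes are the rows of fit$frame with var == '<leaf>'
leaves <- which(fit$frame$var == '<leaf>')

op <- par(mfrow = c(1, length(leaves)))
for (l in leaves) {
  tab <- table(kyphosis$Kyphosis[fit$where == l])
  barplot(tab, main = paste('node', rownames(fit$frame)[l]))
}
par(op)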

Function: var.importance

Description:
Computes variable importance for classification and regression trees.

Usage:
var.importance(x, chatter = FALSE)

Dependency:
library(rpart)

Arguments:

x: required rpart object, typically derived by running rpart() or cart().

chatter: optional logical (TRUE or FALSE) indicating whether to print details of the variable importance calculations at each node in the tree. [default = FALSE]

Details:
While a parsimonious tree and accurate prediction are often the primary goal, we are sometimes also interested in how each of the measured variables, including those that did not make it into the final tree, performs in terms of its ability to predict. For this purpose, it is useful to compute a measure of variable importance. Variable importance is determined by calculating, for each variable at each node, the change in impurity attributable to the best surrogate split on that variable. Impurity is measured using the Gini index in classification trees and deviance in regression trees. The sum of these deltas across all nodes for each variable, M(x_m), is a measure of how much total decrease in impurity can be achieved by each variable. Since only the relative magnitudes of the M(x_m) are interesting, the actual measures of importance are given as normalized quantities.

Note that variable importance is calculated for the tree structure given in the rpart object specified (x). The structure of the tree (i.e., the depth of the tree and the splits at each node) is determined by the criteria specified for growing and pruning the tree. Changing the tree criteria will usually result in a different tree, and this will affect the variable importance measures. In particular, changing the number of surrogate splitters considered at each node can have a significant effect on variable importance.

BEWARE: the computed importance values for classification trees do not agree exactly with those produced by the commercial CART(tm) software, for reasons that we have not been able to figure out, but they are very close. Moreover, the importance values for classification trees are only approximately correct when priors are proportional to group sample sizes (the default) and misclassification costs have not been specified (the default), although the results match CART(tm) for regression trees. So, use this function with some caution.

Value:
Returns a data frame containing two columns: column 1 is the variable (including any variable that was evaluated as a surrogate splitter at any node in the tree); column 2 is the normalized relative importance value. The most important variable is always given a score of 100 and all other values are relative to it.

B. Compton, August 30, 2006

References:
None.

Examples:
turtle <- read.csv('byturtle.csv', header=TRUE)
grp <- as.factor(turtle[,3])
y <- turtle[,6:30]
z <- cart(grp~., data=y, method='class', parms=list(split='gini'), smooth=30, pick=TRUE)
var.importance(z)
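Recent versions of the rpart package store a closely related quantity in the fitted object itself (the variable.importance component, which sums the impurity decrease over primary and surrogate splits for each variable). Rescaling it so the top variable scores 100 gives output in the same spirit as var.importance(); the sketch below, using the kyphosis data, is meant only as a rough cross-check, not a replacement for the cartware calculation.

library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = 'class')

# Impurity-decrease importance stored by rpart, rescaled so the top variable = 100
imp <- fit$variable.importance
round(100 * imp / max(imp), 1)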