Data Science with R: Decision Trees with Rattle

Graham.Williams@togaware.com
9th June 2014

Visit http://onepager.togaware.com/ for more OnePageR modules.

In this module we use the weather dataset to explore the building of decision tree models in Rattle (Williams, 2014). The required packages for this module are:

  library(rpart)   # Popular recursive partitioning decision tree algorithm
  library(rattle)  # Graphical user interface for building decision trees

As we work through this chapter, new R commands will be introduced. Be sure to review each command's documentation and understand what the command does. You can ask for help using the ? command, as in:

  ?read.csv

We can obtain documentation on a particular package using the help= option of library():

  library(help=rattle)

This chapter is intended to be hands on. To learn effectively, you are encouraged to have R running (e.g., in RStudio) and to run all the commands as they appear here. Check that you get the same output, and that you understand it. Try some variations. Explore.

Copyright 2013-2014 Graham Williams. You can freely copy, distribute, or adapt this material, as long as the attribution is retained and derivative work is provided under the same license.
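Once the packages are installed, the Rattle GUI used throughout this module is started from the R console. A minimal sketch:

```r
# Load the packages and start the Rattle GUI. All of the
# point-and-click steps in this module are then carried out
# in the Rattle window that opens.
library(rpart)   # the decision tree algorithm Rattle uses
library(rattle)  # the graphical user interface itself

rattle()         # opens the Rattle window
```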

1 Loading the Data

1. On the Data tab, click the Execute button to load the default weather dataset (it is loaded after clicking Yes).

2. Note that we can click the Filename chooser box to find other datasets. Assuming we have just loaded the default weather dataset, we are taken to the folder containing the CSV data files provided with Rattle.

3. Load the audit dataset.

4. Note that the variable TARGET_Adjusted is selected as the Target variable, and that the variable ID is identified as having the role of Ident(ifier).

5. Note also that the variable RISK_Adjustment is given the role of Risk (based on its name). For now, give it the role of an Input variable.

6. Choose to Partition the data. Leave the 70/15/15 percentage split in the Partition text box as it is, and ensure the Partition checkbox is checked. This results in a random 70% of the dataset being used for training, or building, our models. A 15% sample of the dataset is used for validation while we build and compare different decision trees using different parameters. The final 15% is for testing.

7. Exercise: Research the issue of selecting training/validation/testing datasets. Why is it important to partition the data in this way when data mining? Explain in a paragraph or two.

8. Exercise: Compare this approach to partitioning a dataset with the concept of cross-fold validation. Report on this in one or two paragraphs.

9. Be sure you have clicked the Execute button while on the Data tab, to ensure that the sampling, for example, has taken place.

Copyright 2013-2014 Graham@togaware.com Module: DTreesG Page: 1 of 10
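The 70/15/15 partition that Rattle performs can be sketched in plain R. The data frame and seed below are illustrative stand-ins, not part of the module:

```r
set.seed(42)                        # reproducible sampling
ds <- data.frame(x = rnorm(100))    # stand-in for the loaded dataset

n     <- nrow(ds)
train <- sample(n, 0.70 * n)        # 70% of rows for training
rest  <- setdiff(seq_len(n), train)
valid <- sample(rest, 0.15 * n)     # 15% for validation
test  <- setdiff(rest, valid)       # remaining 15% for testing

c(length(train), length(valid), length(test))
# → 70 15 15
```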

2 Exploring the Data

10. Click the View button to become a little familiar with the data.

11. Exercise: Which values of the variable TARGET_Adjusted correspond to an adjustment? How does TARGET_Adjusted relate to RISK_Adjustment?

12. Explore the dataset further from the Explore tab.

13. First, simply click Execute with the Summary option selected.

14. Explore some of the different summaries that we can generate.

15. Textual summaries are comprehensive but sometimes take a little getting used to. Visual summaries can convey information more quickly. Select the Distributions option and then Execute (without selecting any distributions) to view a family of scatter plots.

16. Exercise: What do the numbers in the plots represent?

17. Choose one of each of the plot types for Income, then Execute.

18. Exercise: Research Benford's law. How might it be useful in the context of the Benford's plot of Income? Discuss in one or two paragraphs.

19. Rattle is migrating to a more sophisticated collection of graphics. To access this work in progress, enable the Advanced Graphics option from the Settings menu, then select Execute again. This is experimental; if it fails, un-select the option.

20. Now have a look at the distribution of Age against the target variable.

21. Exercise: Are the differences between the distributions significant? Explain each of the elements of the plot.
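Benford's law (item 18) predicts how often each digit appears as the leading digit of many real-world numeric variables, such as Income. A small illustrative sketch (the helper functions are our own, not part of Rattle):

```r
# Expected proportion of first digit d under Benford's law.
benford <- function(d) log10(1 + 1/d)

round(benford(1:9), 3)
# digit 1 is expected about 30.1% of the time, digit 9 about 4.6%

# First significant digit of positive numbers, e.g. an Income column.
first_digit <- function(x) as.integer(substr(formatC(x, format = "e"), 1, 1))
first_digit(c(123.4, 56.7, 0.089))
# → 1 5 8
```

Comparing the observed first-digit frequencies of Income against benford(1:9) is what the Benford's plot in Rattle visualises.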

3 Naïvely Building Our First Decision Tree

22. Move to the Model tab and click the Execute button.

23. A decision tree is produced. Take a moment to understand what the description of the decision tree means. Click the Draw button to see a visual presentation of the tree.

24. Go to the Evaluate tab and click the Execute button.

25. Exercise: How accurate is this model? How many true positives are there? How many false positives? Which are better: false positives or false negatives? What are the consequences of false positives and of true positives in the audit scenario?

26. What is the fundamental flaw in the model we have just built?

27. Go back to the Explore tab and look at the distribution of RISK_Adjustment against the target variable. Does this explain the model's performance?

28. Go back to the Data tab and change RISK_Adjustment to be a Risk variable. Set to Ignore any variables that you feel are not suitable for decision tree classification. After building further decision trees (or any models, in fact) you might want to come back to the Data tab and change your variable selections. Be sure to click the Execute button each time.

29. Exercise: What do you notice about the variables chosen in the new model? Why are categoric variables favoured over numeric variables? Research this issue and discuss in one or two paragraphs.
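Behind the Model tab, Rattle calls rpart() from the rpart package. A minimal stand-alone sketch of such a call, using the kyphosis dataset bundled with rpart (the audit data itself is loaded through the GUI as described above):

```r
library(rpart)

# Build a classification tree, as Rattle does when Execute is
# clicked on the Model tab.
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class")

print(fit)   # the textual description shown in the Rattle textview
```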

4 Building a Useful Decision Tree

30. Go to the Model tab and make sure the Tree radio button is selected.

31. Note the various parameters that can be set and modified. Read the Rattle documentation on decision trees for more information. You can also get help on these parameters from R by typing into the R console of RStudio:

      help(rpart.control)

32. Which control options noted in the documentation of the rpart() command correspond to which Rattle options?

33. Generate a new decision tree by clicking Execute, and inspect what is printed in the Rattle textview.

34. Click Draw and a window with a decision tree will be shown.

35. Which leaf node is most predictive of a tax return requiring an adjustment? Which is most predictive of not requiring an adjustment?

36. Compare the decision tree drawing with the summary of the rpart model in the main Rattle textview. Each leaf node in the drawing has a coloured number (corresponding to the leaf node number), a 0 or 1 (the class label from the audit dataset according to the target variable TARGET_Adjusted), and a percentage (the accuracy of the classified training records in that leaf node).
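The Rattle tree options correspond to arguments of rpart.control() (item 32). A sketch of the mapping, using rpart's documented default values:

```r
library(rpart)

# Rattle option     rpart.control() argument
#   Min Split   ->  minsplit
#   Min Bucket  ->  minbucket
#   Complexity  ->  cp
#   Max Depth   ->  maxdepth
ctrl <- rpart.control(minsplit = 20, minbucket = 7,
                      cp = 0.01, maxdepth = 30)

fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class", control = ctrl)
```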

5 Evaluating the Model

37. On the Evaluate tab, examine the different options for evaluating the accuracy of the decision tree we generated. Make sure the Validation and the Error Matrix radio buttons are both selected, then click Execute.

38. Check the error matrix that is printed (and write down the four numbers for each tree you generate).

39. What is the error rate of the new decision tree? What is its accuracy?

40. Click the Training radio button and again click Execute.

41. What is the error rate and what is the accuracy now?

42. Why do the error rate and accuracy differ between the validation and training settings?

43. Select different Evaluate options, then click Execute.

44. You can read more on these evaluation measures in the Rattle book.

45. Investigate the ROC Curve graphical measure, as this is a widely used method to assess the performance of classifiers.
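The error matrix on the Evaluate tab is a cross-tabulation of actual against predicted classes, from which the accuracy and error rate follow. A sketch, again using the kyphosis stand-in and evaluating on the training data for illustration only:

```r
library(rpart)

fit  <- rpart(Kyphosis ~ Age + Number + Start,
              data = kyphosis, method = "class")
pred <- predict(fit, kyphosis, type = "class")

# Error matrix: rows are actual classes, columns are predictions.
cm <- table(Actual = kyphosis$Kyphosis, Predicted = pred)
print(cm)

accuracy <- sum(diag(cm)) / sum(cm)   # proportion correctly classified
error    <- 1 - accuracy              # error rate
```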

6 Fine Tuning the Decision Tree

46. Go back to the Model tab and experiment with different values for the parameters Complexity, Min Bucket, Min Split and Max Depth.

47. Which tree gives you the best accuracy, and which the worst? Which tree is the easiest to interpret, and which the hardest? Remember that you need to click Execute each time you change a parameter setting and want to generate a new tree.

48. Click the Rules button to see the rules generated from a given tree.

49. Which is easier to understand: the tree diagram or the rules?

50. What is the highest accuracy you can achieve on the audit dataset using the Rattle decision tree classifier? Which is the best tree you can generate?

7 Finishing Up

51. When finished we might like to save our project. Click the Save icon.

52. Give the project a name, such as audit_140609.rattle.

53. Quit from Rattle and from R.

54. Now start up R and Rattle again, and load the project you saved.

8 Further Reading

- The Rattle book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon.

- Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Data Mining Desktop Survival Guide.

- This module is one of many OnePageR modules available from http://onepager.togaware.com. In particular, follow the links on the website marked with a *, which indicates the generally more developed OnePageR modules.

9 References

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.r-project.org/.

Williams GJ (2009). "Rattle: A Data Mining GUI for R." The R Journal, 1(2), 45-55. URL http://journal.r-project.org/archive/2009-2/rjournal_2009-2_williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. Use R! Springer, New York. URL http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=togaware-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=1441998896.

Williams GJ (2014). rattle: Graphical User Interface for Data Mining in R. R package version 3.0.4. URL http://rattle.togaware.com/.

This document, sourced from DTreesG.Rnw revision 419, was processed by KnitR version 1.6 of 2014-05-24 and took 1 second to process. It was generated by gjw on nyx running Ubuntu 14.04 LTS with an Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 4 cores and 12.3GB of RAM. It completed processing at 2014-06-09 10:37:50.