Decision Trees in Weka, Data Formats


CS 4510/9010 Applied Machine Learning
Slide 1: Decision Trees in Weka, Data Formats
Paula Matuszek, Fall 2016

Slide 2: J48: Decision Tree in Weka
NAME: weka.classifiers.trees.J48
SYNOPSIS: Class for generating a pruned or unpruned C4.5 decision tree. For more information, see Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
CAPABILITIES:
Class -- nominal class, binary class, missing class values
Attributes -- empty nominal attributes, nominal attributes, date attributes, numeric attributes, unary attributes, missing values, binary attributes
Minimum number of instances: 0
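The Explorer drives this class through its GUI, but the same tree can be built programmatically. A minimal sketch, not from the slides, assuming a local copy of iris.arff (the file path is hypothetical):

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BuildJ48 {
        public static void main(String[] args) throws Exception {
            // Load a dataset; adjust the (hypothetical) path to your install.
            Instances data = new DataSource("data/iris.arff").getDataSet();
            // Weka convention: the last attribute is the class.
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();        // default options: pruned, minNumObj = 2
            tree.buildClassifier(data);  // grow (and prune) the tree
            System.out.println(tree);    // prints the text version of the tree
        }
    }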

Slide 3: Some of the Options
unpruned -- whether an unpruned tree is produced. Default is false (the tree is pruned).
minNumObj -- the minimum number of instances per leaf (default is 2). Note that this is separate from the unpruned setting.
For pruned trees:
subtreeRaising -- whether to consider raising an entire subtree up a level when pruning. Default is true.
confidenceFactor (-C) -- the confidence factor used for pruning; smaller values incur more pruning (default is 0.25). Weka builds the full tree and then works back from the leaves, applying a statistical test at each stage.
reducedErrorPruning -- whether reduced-error pruning is used instead.
numFolds -- determines the amount of data used for reduced-error pruning. One fold is used for pruning, the rest for growing the tree.
doNotCheckCapabilities -- if set, classifier capabilities are not checked before the classifier is built. (Use with caution, to reduce runtime.)
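These options can also be set from code, either with the command-line flags Weka prints in the Run information (e.g. -C for the confidence factor, -M for minNumObj) or with typed setters. A hedged sketch of both styles:

    import weka.classifiers.trees.J48;
    import weka.core.Utils;

    public class ConfigureJ48 {
        public static void main(String[] args) throws Exception {
            // Flag form, matching the scheme line Weka prints:
            J48 a = new J48();
            a.setOptions(Utils.splitOptions("-C 0.1 -M 5"));  // heavier pruning, larger leaves

            // Typed-setter form:
            J48 b = new J48();
            b.setUnpruned(true);  // unpruned = true
            b.setMinNumObj(5);    // minimum instances per leaf

            System.out.println(Utils.joinOptions(a.getOptions()));
        }
    }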

Slide 4: Looking at the Results
For all classifiers, Weka will show you:
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: weather.symbolic
Instances: 14
Attributes: 5 (outlook, temperature, humidity, windy, play)
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Model-specific information; for J48, the decision tree.
Time taken to build model: 0.02 seconds
=== Evaluation ===
This gives the evaluation method, possibly the time it took, and then the Summary, Detailed Accuracy By Class, and Confusion Matrix sections. Next time we will look at these statistics in detail.
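The same numbers can be reproduced outside the GUI. A minimal sketch, assuming the standard Weka sample file weather.nominal.arff is available locally; Evaluation.crossValidateModel performs the 10-fold cross-validation named in the Test mode line:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class EvaluateJ48 {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("data/weather.nominal.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));

            System.out.println(eval.toSummaryString());       // === Summary ===
            System.out.println(eval.toClassDetailsString());  // Detailed Accuracy By Class
            System.out.println(eval.toMatrixString());        // Confusion Matrix
        }
    }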

Slide 5: Decision Tree Model
Text version, with number of leaves, size of tree, and counts:
outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)
Number of Leaves: 5
Size of the tree: 8

Slide 6: J48 on Iris
J48 pruned tree:
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
Number of Leaves: 5
Size of the tree: 9

Slide 7: Weka Pruning Exercise
Open the breast-cancer dataset in a text editor. Determine from the comments how many possible values there are for the age attribute, and how many are actually used.
Open the dataset in the Explorer, go to the Classify tab, and select J48. Set the unpruned switch to True. Experiment with the following values of minNumObj, noting the number of leaves and the size of the tree in each case: 1, 2, 3, 5, 10, 20, 50, 100. Which value produces the same results as J48 with default parameters (unpruned = false, minNumObj = 2)?
In general, J48's confidenceFactor parameter is best left alone, but it is interesting to see its effect. With default values for the other parameters, experiment with the following values of confidenceFactor, recording the performance in each case (evaluated using 10-fold cross-validation): 0.005, 0.05, 0.1, 0.25, 0.5. Which value or values produce the greatest accuracy?
https://weka.waikato.ac.nz/dataminingwithweka/activity?unit=3&lesson=5
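If you prefer to script the minNumObj sweep rather than click through the Explorer, here is a sketch (the dataset path is an assumption about your setup, not part of the original exercise). J48 reports its leaf count and size via measureNumLeaves() and measureTreeSize():

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MinNumObjSweep {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("data/breast-cancer.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            for (int m : new int[] {1, 2, 3, 5, 10, 20, 50, 100}) {
                J48 tree = new J48();
                tree.setUnpruned(true);  // as the exercise specifies
                tree.setMinNumObj(m);
                tree.buildClassifier(data);
                System.out.printf("minNumObj=%3d  leaves=%.0f  size=%.0f%n",
                        m, tree.measureNumLeaves(), tree.measureTreeSize());
            }
        }
    }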

CS 4510/9010 Applied Machine Learning
Slide 8: Data Formats in Weka
Paula Matuszek, Fall 2016

Slide 9: Weka-Supported Formats
Weka's native format is called ARFF: Attribute-Relation File Format.
It will also input various other formats:
- Compressed ARFF files (.arff.gz)
- Comma-separated value files (.csv)
- JSON serialized attribute/relation pair objects (.json)
- Various ML tool outputs
Formats are chosen on the Preprocess tab, via the Open File button.
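The converters behind the Open File button can also be called directly, for instance to turn a .csv into an .arff once and for all. A sketch with hypothetical file names, using Weka's CSVLoader and ArffSaver:

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.CSVLoader;

    public class CsvToArff {
        public static void main(String[] args) throws Exception {
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File("Restaurant1.csv"));  // hypothetical input file
            Instances data = loader.getDataSet();           // Weka guesses attribute types

            ArffSaver saver = new ArffSaver();
            saver.setInstances(data);
            saver.setFile(new File("Restaurant1.arff"));    // hypothetical output file
            saver.writeBatch();                             // writes header and data sections
        }
    }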

Slide 10: Weka Input Menu
[Screenshot of the Explorer's file-input menu showing the supported formats.]

Slide 11: ARFF Format
Header section: information about the data --
- the name of the relation
- a list of the attributes (the columns in the data) and their types
Data section: a comma-separated list of values, one line per instance.
Comments begin with %. It is a good idea to describe the class, the source of the data, and sometimes the meanings of the attributes.

Slide 12: Header Section
@RELATION declaration: names what we are talking about. A string; quote it if it includes spaces.
@RELATION iris
@ATTRIBUTE declarations: name each attribute and give its type; one per attribute, including the class. A name must start with a letter; quote it if it includes spaces.
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

Slide 13: Attribute Types
Numeric: can be real or integer.
@ATTRIBUTE sepallength NUMERIC
Nominal: a named set of values, listed in braces.
@ATTRIBUTE color {red, green, blue}
@ATTRIBUTE class {versicolor, setosa}
String: arbitrary text.
@ATTRIBUTE emailbody STRING
Date: give the date format.
@ATTRIBUTE timestamp DATE "yyyy-MM-dd"
Note that these types are Weka-specific, but the concepts are not.

Slide 14: Data Section
@DATA
One line per instance, comma separated. For example, given the attributes
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE class {setosa, versicolor}
@ATTRIBUTE description STRING
@ATTRIBUTE timestamp DATE "yyyy MM dd"
we might have the instances
5.1, setosa, 'Lovely big flowers', '2014 09 10'
4.9, setosa, 'Nice', '2014 06 03'
(Values containing spaces, such as these strings and dates, must be quoted.)
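Putting slides 12-14 together, a complete ARFF file using all four attribute types might look like the following. This is a constructed illustration, not a real dataset; note the % comments and the quoting of values that contain spaces:

    % Hypothetical example combining the header and data sections above.
    % Class: setosa vs. versicolor. Source: constructed for illustration.
    @RELATION flowers

    @ATTRIBUTE sepallength NUMERIC
    @ATTRIBUTE class {setosa, versicolor}
    @ATTRIBUTE description STRING
    @ATTRIBUTE timestamp DATE "yyyy MM dd"

    @DATA
    5.1, setosa, 'Lovely big flowers', '2014 09 10'
    4.9, setosa, 'Nice', '2014 06 03'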

Slide 15: Examples
Iris: detailed, very nice comments; numeric and nominal attributes.
Weather (nominal): no comments, all nominal attributes.
Reuters: includes a string attribute.

Slide 16: Importing
Restaurant1.csv: import it and look at the imported data on the right. Does the class look correct? Use the Edit button to examine it further.
Restaurant2.csv: import it and look again. Are all of these attributes useful? Remove any that look inappropriate.

Slide 17: Decision Tree on Restaurants
Try it with the defaults and examine the results. See if you can get a reasonably accurate tree.

Slide 18: Decision Tree on Restaurants
See if you can get a reasonable tree. Try modifying the following:
- Change the minimum number of objects to 1.
- Don't prune.
- Evaluate against the training set.
Basic conclusion: you need data to learn well, and we don't have enough here. The only way to get decent performance out of this dataset is to massively overfit.

Slide 19: Summary
J48 in Weka provides a rich implementation of Quinlan's decision tree algorithm, with many options. In general, the default options, which include pruning and a minimum leaf size of 2, work very well.
Weka's native data format is ARFF. It provides the name of a relation and a description of each attribute, one of which (normally the last) serves as the class for classifiers. It is good practice to add comments about the source of the data and the meaning of the attributes.
Weka can also import other formats, such as .csv, and will make a reasonable guess about the attributes.