CS 4510/9010 Applied Machine Learning 1
Decision Trees in Weka; Data Formats
Paula Matuszek
Fall, 2016
J48: Decision Tree in Weka 2
NAME: weka.classifiers.trees.J48
SYNOPSIS: Class for generating a pruned or unpruned C4.5 decision tree. For more information, see Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
CAPABILITIES
Class -- Nominal class, Binary class, Missing class values
Attributes -- Empty nominal attributes, Nominal attributes, Date attributes, Numeric attributes, Unary attributes, Missing values, Binary attributes
Min # of instances: 0
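C4.5 chooses its splits by entropy-based information gain. A minimal sketch of that computation, using the class counts from the standard 14-instance weather data (9 yes / 5 no); the function names here are illustrative, not Weka's:

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, child_counts):
    """Information gain of a split: parent entropy minus the
    instance-weighted entropy of the child nodes."""
    total = sum(parent_counts)
    remainder = sum(sum(c) / total * entropy(c) for c in child_counts)
    return entropy(parent_counts) - remainder

# Weather data: 9 yes / 5 no overall; splitting on outlook gives
# sunny (2 yes, 3 no), overcast (4 yes, 0 no), rainy (3 yes, 2 no).
print(round(entropy([9, 5]), 3))                              # 0.94
print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247
```

C4.5 actually normalizes this into a gain ratio to avoid favoring many-valued attributes, but the entropy core is the same.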
Some of the Options: 3
unpruned -- Whether pruning is performed. Default is false (pruned).
minNumObj -- The minimum number of instances per leaf (default is 2). Note that this is separate from the value of unpruned.
For pruned trees:
subtreeRaising -- Whether to raise an entire subtree up a level when pruning. Default is true.
confidenceFactor (-C) -- The confidence factor used for pruning; smaller values incur more pruning (default is 0.25). Build the full tree and then work back from the leaves, applying a statistical test at each stage.
reducedErrorPruning -- Whether reduced-error pruning is used instead.
numFolds -- Determines the amount of data used for reduced-error pruning. One fold is used for pruning, the rest for growing the tree.
doNotCheckCapabilities -- If set, classifier capabilities are not checked before the classifier is built. (Use with caution, to reduce runtime.)
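Reduced-error pruning is conceptually simple: grow the full tree, then, working bottom-up, replace a subtree with its majority-class leaf whenever that does not increase error on the held-out pruning fold. A toy sketch of the idea over nested-dict trees with nominal attributes (this is an illustration of the technique, not Weka's implementation):

```python
# A tree is either a class label (leaf) or a dict:
# {"attr": name, "branches": {value: subtree}, "majority": label}

def classify(tree, instance):
    while isinstance(tree, dict):
        value = instance[tree["attr"]]
        tree = tree["branches"].get(value, tree["majority"])
    return tree

def errors(tree, data):
    return sum(1 for inst, label in data if classify(tree, inst) != label)

def reduced_error_prune(tree, prune_data):
    """Bottom-up: prune each child first, then replace this subtree with
    its majority-class leaf if that does not increase held-out error."""
    if not isinstance(tree, dict):
        return tree
    for value, sub in tree["branches"].items():
        subset = [(i, l) for i, l in prune_data if i.get(tree["attr"]) == value]
        tree["branches"][value] = reduced_error_prune(sub, subset)
    leaf = tree["majority"]
    if errors(leaf, prune_data) <= errors(tree, prune_data):
        return leaf
    return tree

# A windy-based split that the pruning data does not support:
tree = {"attr": "windy", "branches": {"TRUE": "no", "FALSE": "yes"},
        "majority": "yes"}
prune_data = [({"windy": "TRUE"}, "yes"), ({"windy": "FALSE"}, "yes")]
print(reduced_error_prune(tree, prune_data))  # yes
```

Confidence-factor pruning uses a statistical estimate on the training data instead of a held-out fold, which is why it does not need numFolds.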
Looking At The Results 4
For all classifiers, Weka will show you:
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: weather.symbolic
Instances: 14
Attributes: 5 (outlook, temperature, humidity, windy, play)
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Model-specific information. For J48, the decision tree.
Time taken to build model: 0.02 seconds
=== Evaluation ===
This will give the evaluation method and possibly the time it took:
Summary, Detailed Accuracy By Class, Confusion Matrix
Next time we will look in detail at these statistics.
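The summary statistics Weka reports can be reproduced directly from the list of actual and predicted labels. A minimal sketch of accuracy and a confusion matrix (function names are illustrative, not Weka's):

```python
def confusion_matrix(actual, predicted, labels):
    """matrix[i][j] = count of instances whose true class is labels[i]
    and whose predicted class is labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

actual    = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "yes", "yes"]
print(accuracy(actual, predicted))                         # 0.6
print(confusion_matrix(actual, predicted, ["yes", "no"]))  # [[2, 1], [1, 1]]
```

In Weka's printout the rows of the confusion matrix are likewise the actual classes and the columns the predicted classes.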
Decision Tree Model 5
Text version, number of leaves, size of tree, counts:
outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)
Number of Leaves : 5
Size of the tree : 8
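The printed tree reads as a set of nested tests. A sketch that transcribes it literally and classifies an instance (a hypothetical helper, not Weka code; windy is passed as a boolean for simplicity):

```python
def classify_weather(outlook, humidity, windy):
    """The J48 weather tree above, transcribed as nested conditionals."""
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    # outlook == "rainy"
    return "no" if windy else "yes"

print(classify_weather("sunny", "normal", False))  # yes
print(classify_weather("rainy", "high", True))     # no
```

The counts in parentheses, e.g. (3.0), are how many training instances reached that leaf.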
J48 on Iris 6
J48 pruned tree:
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9
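Unlike the all-nominal weather tree, this one splits on numeric thresholds. The same transcription exercise (again a hypothetical helper, not Weka code):

```python
def classify_iris(petallength, petalwidth):
    """The J48 iris tree above, transcribed as nested threshold tests."""
    if petalwidth <= 0.6:
        return "Iris-setosa"
    if petalwidth <= 1.7:
        if petallength <= 4.9:
            return "Iris-versicolor"
        return "Iris-virginica" if petalwidth <= 1.5 else "Iris-versicolor"
    return "Iris-virginica"

print(classify_iris(1.4, 0.2))  # Iris-setosa
print(classify_iris(4.5, 1.5))  # Iris-versicolor
print(classify_iris(6.0, 2.5))  # Iris-virginica
```

Annotations like (48.0/1.0) mean 48 training instances reached the leaf, 1 of them misclassified.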
Weka Pruning Exercise 7
Open the breast-cancer dataset in a text editor. Determine from the comments how many possible values there are for the age attribute, and how many are actually used.
Open the dataset in the Explorer, go to the Classify tab, and select J48. Set the unpruned switch to True. Experiment with the following values of minNumObj, noting the number of leaves and the size of the tree in each case: 1, 2, 3, 5, 10, 20, 50, 100. Which value produces the same results as J48 with default parameters (i.e., unpruned=False, minNumObj=2)?
In general, J48's confidenceFactor parameter is best left alone, but it is interesting to see its effect. With default values for the other parameters, experiment with the following values of confidenceFactor, recording the performance in each case (evaluated using 10-fold cross-validation): 0.005, 0.05, 0.1, 0.25, 0.5. Which value or values produce the greatest accuracy?
https://weka.waikato.ac.nz/dataminingwithweka/activity?unit=3&lesson=5
CS 4510/9010 Applied Machine Learning 8 Data Format in Weka Paula Matuszek Fall, 2016
Weka-Supported Formats 9 Weka s native format is called ARFF: Attribute Relation File Format It will also input various other formats: Compressed ARFF files (.arff.gz) Comma-separated value files (.csv) JSON (serialized attribute/relation pair objects)(.json) Various ML tool outputs Chosen on the Preprocess tab, for the Open File button.
Weka Input Menu 10
ARFF Format 11
Header section: information about the data: the name of the relation; a list of the attributes (the columns in the data); their types.
Data section: comma-separated list, one line per instance.
Comments begin with %. It is a good idea to describe the class, the source, and sometimes the meanings of the attributes.
Header Section 12
@RELATION declaration: names what we are talking about. String. Quote it if it includes spaces.
@RELATION iris
@ATTRIBUTE declarations: name each attribute and give its type. One per attribute, including the class. Must start with a letter. Quote it if it includes spaces.
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
Attribute Types 13
Numeric: can be real or integer.
@ATTRIBUTE sepallength NUMERIC
Nominal: a set of named values in {}.
@ATTRIBUTE color {red, green, blue}
@ATTRIBUTE class {versicolor, setosa}
String: arbitrary text.
@ATTRIBUTE emailbody STRING
Date: give the date format (a Java SimpleDateFormat pattern, so MM is month and mm is minute).
@ATTRIBUTE timestamp DATE "yyyy-MM-dd"
Note that these types are Weka-specific, but the concepts are not.
Data section 14
@DATA
One line per instance, comma separated. Values containing spaces must be quoted.
Example: for attributes
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE class {setosa, versicolor}
@ATTRIBUTE description STRING
@ATTRIBUTE timestamp DATE "yyyy MM dd"
we might have instances
5.1, setosa, 'Lovely big flowers', '2014 09 10'
4.9, setosa, Nice, '2014 06 03'
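Reading the format back is straightforward for simple files. A minimal parser sketch, stdlib only, handling just the unquoted comma-separated subset shown above (not a full ARFF implementation; it skips % comments and ignores quoting):

```python
def parse_arff(text):
    """Split a simple ARFF file into (relation, attribute names, data rows)."""
    relation, attributes, rows, in_data = None, [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # blank line or comment
            continue
        upper = line.upper()
        if upper.startswith("@RELATION"):
            relation = line.split(None, 1)[1]
        elif upper.startswith("@ATTRIBUTE"):
            attributes.append(line.split()[1])  # name only; type ignored
        elif upper.startswith("@DATA"):
            in_data = True
        elif in_data:
            rows.append([v.strip() for v in line.split(",")])
    return relation, attributes, rows

text = """% toy example
@RELATION iris-mini
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE class {setosa,versicolor}
@DATA
5.1,setosa
6.4,versicolor
"""
print(parse_arff(text))
```

Real ARFF readers (Weka's, or Python's liac-arff) also handle quoting, sparse data, and missing values marked with ?.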
Examples 15
Iris: detailed, very nice comments; numeric and nominal attributes.
Weather.nominal: no comments, all nominal attributes.
Reuters: a string attribute.
Importing 16
Restaurant1.csv: import, and look at the data imported on the right. Does the class look correct? Use the Edit button to examine further.
Restaurant2.csv: import, and look again. Are all of these attributes useful? Remove any that look inappropriate.
Decision Tree on Restaurants 17 Try it with the defaults. Examine the results. See if you can get to a reasonably accurate tree.
Decision Tree on Restaurants 18
See if you can get to a reasonable tree. Try modifying the following:
Change the minimum number of objects to 1.
Don't prune.
Evaluate against the training set.
Basic conclusion: you need data to learn well, and we don't have enough here. The only way to get decent performance out of this is to massively overfit.
Summary: 19
J48 in Weka provides a rich implementation of Quinlan's decision tree algorithm, with many options. In general, the default options, which include pruning and a minimum leaf size of 2, work very well.
Weka's native data format is ARFF. It provides the name of the relation and a description of each attribute, including the class (normally the last attribute for classifiers). It is good practice to add comments about the source of the data and the meaning of the attributes.
Weka can import other formats, such as .csv, and will make a reasonable guess about the attributes.