Machine Learning via Decision Trees: C4.5


Jack Hall
6 years ago

1 Machine Learning via Decision Trees: C4.5

2 C4.5: Algorithms for Machine Learning
Main task: learning decision trees from data.
Software development has ended (in favor of C5.0, which is commercial), but C4.5 is still one of the reference algorithms for this task.
Last release: 8.0
C implementation for Unix systems, available at: http://
C4.5 tutorial available at: http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html
The man page for the C4.5 software is available on the course web site.

3 Building the Tree: Base Algorithm
Base algorithm (same as in the CLS system).
T: training set; {C_1, C_2, ..., C_k}: set of all classes.
Consider the set T:
- If T contains only examples of the same class, build a single leaf labeled with that class.
- If T contains examples of several classes:
  - Build a partition of T based on a test on the value of a particular attribute.
  - Build a node associated with the test, with one child for each subset in the partition.
  - Recursively call the algorithm on each subset.
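The recursion above can be sketched in a few lines of Python. This is only an illustration: the dictionary-based tree, the `build_tree` name, and the "take the first remaining attribute" test selection are assumptions made here, not how C4.5 picks its tests. The data are the Outlook, Windy, and Class columns of the golf table.

```python
from collections import Counter

def build_tree(examples, attributes):
    """examples: list of (features_dict, label); attributes: list of names."""
    labels = [label for _, label in examples]
    # All examples in the same class: single leaf labeled with that class.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Simplification of this sketch: no attribute left, use the majority class.
    if not attributes:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Hypothetical test selection: simply take the first remaining attribute.
    attr, rest = attributes[0], attributes[1:]
    # Partition the examples by the value of the tested attribute.
    partition = {}
    for features, label in examples:
        partition.setdefault(features[attr], []).append((features, label))
    # One child per subset, each built by a recursive call.
    children = {value: build_tree(subset, rest)
                for value, subset in partition.items()}
    return {"test": attr, "children": children}

# (outlook, windy, class) for examples D1..D14 of the golf table.
GOLF = [
    ("sunny", "T", "Play"), ("sunny", "T", "Don't Play"),
    ("sunny", "F", "Don't Play"), ("sunny", "F", "Don't Play"),
    ("sunny", "F", "Play"), ("overcast", "T", "Play"),
    ("overcast", "F", "Play"), ("overcast", "T", "Play"),
    ("overcast", "F", "Play"), ("rain", "T", "Don't Play"),
    ("rain", "T", "Don't Play"), ("rain", "F", "Play"),
    ("rain", "F", "Play"), ("rain", "F", "Play"),
]
examples = [({"outlook": o, "windy": w}, c) for o, w, c in GOLF]
tree = build_tree(examples, ["outlook", "windy"])
```

With these two attributes only, the overcast branch becomes a pure leaf, while the sunny branch cannot be separated completely because the humidity column is not used here.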

4 Stop conditions
In practice, C4.5 stops if:
- T contains examples of a single class (main stop condition). Yields a single leaf, labeled with that class.
- T is empty. Yields a single leaf, labeled with the most frequent class in the parent node/set.
- No test can generate at least two sets with a minimum of 2*MINOBJ examples. Yields a single leaf, labeled with the most frequent class (some examples in the corresponding set will be misclassified).
- Other conditions (omitted for the sake of simplicity).

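5 Choosing the Attribute for the Test
Entropy for a set S of examples:
$$\mathrm{info}(S) = -\sum_{j=1}^{k} \frac{\mathrm{freq}(C_j, S)}{|S|} \log_2 \frac{\mathrm{freq}(C_j, S)}{|S|}$$
Entropy of a partition P = (T_1, T_2, ..., T_n) of a set T:
$$\mathrm{info}_P(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \, \mathrm{info}(T_i)$$
Gain of a partition P of a set T:
$$\mathrm{gain}(P) = \mathrm{info}(T) - \mathrm{info}_P(T)$$
Splitinfo:
$$\mathrm{splitinfo}_P(T) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}$$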
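As a worked example, the four quantities can be computed for the split on Outlook in the golf table (sunny: 2 Play and 3 Don't Play; overcast: 4 and 0; rain: 3 and 2). The function names and the "list of per-class counts" representation are choices made for this sketch.

```python
from math import log2

def info(counts):
    """Entropy of a set, given its per-class example counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_p(partition):
    """Weighted entropy of a partition (one tuple of class counts per subset)."""
    total = sum(sum(counts) for counts in partition)
    return sum(sum(counts) / total * info(counts) for counts in partition)

def gain(partition):
    """info(T) - info_P(T); T's class counts are the column sums."""
    whole = [sum(column) for column in zip(*partition)]
    return info(whole) - info_p(partition)

def split_info(partition):
    """Entropy of the subset sizes themselves."""
    return info([sum(counts) for counts in partition])

# (Play, Don't Play) counts for outlook = sunny, overcast, rain.
outlook = [(2, 3), (4, 0), (3, 2)]

print(round(info([9, 5]), 3))         # entropy of the whole set
print(round(info_p(outlook), 3))      # entropy after the split
print(round(gain(outlook), 3))        # gain of the split
print(round(split_info(outlook), 3))  # splitinfo of the split
```

The values come out at roughly 0.940, 0.694, 0.247, and 1.577 respectively.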

6 Choosing the Attribute for the Test
Criterion to choose the attribute for the test (gain ratio):
$$\frac{\mathrm{info}(T) - \mathrm{info}_P(T)}{\mathrm{splitinfo}_P(T)}$$
Dividing the gain by splitinfo discourages branching on attributes with many possible values (high risk of overfitting).
This differs from the behavior you have seen before: if you want to replicate those results, you will need to force C4.5 to use the unmodified gain for choosing the attribute.
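A small sketch of the criterion, comparing the Outlook split of the golf table with a hypothetical identifier-like attribute that puts each of the 14 examples in its own subset. The `record_id` attribute does not exist in the dataset; it is invented here only to show the effect of the division.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain_and_ratio(partition):
    """Return (gain, gain / splitinfo) for a partition given as class counts.

    Assumes at least two non-empty subsets, otherwise splitinfo is zero.
    """
    sizes = [sum(counts) for counts in partition]
    total = sum(sizes)
    class_totals = [sum(column) for column in zip(*partition)]
    info_p = sum(n / total * entropy(c) for n, c in zip(sizes, partition))
    gain = entropy(class_totals) - info_p
    return gain, gain / entropy(sizes)

outlook = [(2, 3), (4, 0), (3, 2)]        # (Play, Don't Play) per value
record_id = [(1, 0)] * 9 + [(0, 1)] * 5   # hypothetical: one example per value

gain_outlook, ratio_outlook = gain_and_ratio(outlook)
gain_id, ratio_id = gain_and_ratio(record_id)
```

On plain gain the identifier-like attribute scores about 0.940 against 0.247 for Outlook; after dividing by splitinfo the scores are about 0.247 and 0.156. The penalty narrows the gap considerably, although in this tiny example it does not close it.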

7 Exercise 1
- Download the golf dataset from the course web site (this is the example that you already know!).
- C4.5 is pre-installed on the lab machines. You can run it with:
    c4.5 -f <filestem> -v <verbosity level>
- Experiment with different verbosity levels.
- Try to identify, in the software output:
  - The decision tree itself
  - The steps when the algorithm is choosing the splitting attribute
  - The gain values for the considered attributes
- Try to understand how continuous attributes are treated.

8 Moving to a more user-friendly environment
1. The WEKA system for data mining provides an implementation of the C4.5 algorithm.
2. Download the weather dataset from the course web site.
3. Open the weather.arff file with a text editor.
4. Run WEKA (from the command line or from the GUI):
    weka
5. Open the weather.arff file from the preprocessing tab in the WEKA explorer.

9 What is the content of a .arff file?
1. The name of the main relation (with the same meaning as in relational databases): weather
2. The attributes (by default, the last one is the class): outlook {sunny, overcast, rainy}, temperature, humidity, windy {TRUE, FALSE}, play {yes, no}
3. The data (you should know them already):
    sunny,85,85,false,no
    sunny,80,90,true,no
    ...
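To make the three parts concrete, here is a minimal reader for this kind of file. It is a sketch, not a full ARFF parser: it ignores quoting, sparse data, and other features of the format. In the sample text, the `numeric` type for temperature and humidity is an assumption, and the windy values are written in upper case so that they match the declared nominal values.

```python
def parse_arff(text):
    """Split a simple ARFF text into (relation, attributes, data rows)."""
    relation, attributes, rows, in_data = None, [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

# Sample text; the "numeric" declarations are an assumption of this sketch.
SAMPLE = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
"""

relation, attributes, rows = parse_arff(SAMPLE)
class_name = attributes[-1][0]  # by default, the last attribute is the class
```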

10 What is the content of a .arff file?
No    Outlook    Temp (°F)   Humid (%)   Windy   Class
D1    sunny                              T       Play
D2    sunny                              T       Don't Play
D3    sunny                              F       Don't Play
D4    sunny                              F       Don't Play
D5    sunny                              F       Play
D6    overcast                           T       Play
D7    overcast                           F       Play
D8    overcast                           T       Play
D9    overcast                           F       Play
D10   rain                               T       Don't Play
D11   rain                               T       Don't Play
D12   rain                               F       Play
D13   rain                               F       Play
D14   rain                               F       Play

11 Let's start!

12 Histograms
1. A panel on the lower right contains a histogram with the class distribution over the attribute currently selected in the preprocessing tab.
   For discrete attributes, there is one distribution per value.
   For continuous attributes, the domain is split into bins.
2. The color-class mapping can be inferred by selecting the class attribute.
3. "Visualize all" shows the histograms for all attributes.

13 Classification
1. In the Classify tab you can choose:
   - The classifier to be trained
   - The evaluation method
   - The class attribute
2. Select the J48 classifier (J4.8, a Java implementation of C4.5).
3. Perform the evaluation on the training set.
4. Choose play as the class attribute.

14 Classification
1. You can access more options by clicking on the classifier name:
   - binarySplits: use binary splits on nominal attributes.
   - confidenceFactor: the confidence factor used for pruning (smaller values incur more pruning).
   - minNumObj: the minimum number of instances per leaf.
   - saveInstanceData: save the training data for visualization.
   - unpruned: no pruning is performed.
2. Try to run the classification task.

15 Output (1)
=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C M 2
Relation:     weather
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play
Test mode:    evaluate on training data

This first part of the output is the preamble.

16 Output (2)
J48 pruned tree

outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

This is the decision tree we already know, and it can be visualized: right-click on the result list, then choose "Visualize tree".
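Reading the textual tree is easier once you notice that it is just a set of nested tests. As an illustration, the tree above can be transcribed by hand into a Python function (the `classify` name and signature are made up for this sketch; the temperature attribute is absent because the tree never tests it).

```python
def classify(outlook, humidity, windy):
    """Hand transcription of the J48 tree printed for the weather data."""
    if outlook == "sunny":
        return "yes" if humidity <= 75 else "no"
    if outlook == "overcast":
        return "yes"
    if outlook == "rainy":
        return "no" if windy else "yes"
    raise ValueError("unknown outlook value: %r" % (outlook,))

# The two data rows shown earlier: sunny,85,85,false,no and sunny,80,90,true,no
first = classify("sunny", humidity=85, windy=False)
second = classify("sunny", humidity=90, windy=True)
```

Both rows are classified as "no", in agreement with their labels.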

17 Output (3)
=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances          14      100 %
Incorrectly Classified Instances         0        0 %
Kappa statistic                          1
Mean absolute error                      0
Root mean squared error                  0
Relative absolute error                  0 %
Root relative squared error              0 %
Total Number of Instances               14

These are the classification results on the training set: no mistakes were made.

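18 Output (4)
=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no

Taking yes as the positive class, the top-right cell counts the false negatives and the bottom-left cell counts the false positives.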
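The summary figures can be recomputed from the confusion matrix alone. The sketch below derives the accuracy and Cohen's kappa statistic from the matrix above; the function names are chosen here for illustration and are not WEKA's.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def kappa(matrix):
    """Cohen's kappa: agreement beyond what is expected by chance."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    # Chance agreement: product of row total and column total per class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    return (observed - expected) / (1 - expected)

# Rows are the actual classes (yes, no), columns the predicted ones.
confusion = [[9, 0],
             [0, 5]]

acc = accuracy(confusion)
k = kappa(confusion)
```

For this matrix both values are 1, matching the 100 % accuracy and "Kappa statistic 1" in the summary.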

19 Error-Based Pruning
C4.5 employs a technique called error-based pruning.
Before the tree is considered final, the algorithm attempts to simplify its structure.
For each leaf, C4.5 estimates a misclassification probability on unseen examples, based on statistical reasoning.
For each node, the misclassification probability is given by the sum of the probabilities of the underlying leaves.
At each node:
- If the error estimate can be reduced by replacing the node with a leaf, then C4.5 performs the replacement.
- If the error estimate can be reduced by replacing the node with the branch having the most examples, then C4.5 performs the replacement.
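The "replace the node with a leaf" check can be sketched as follows. Important assumptions: the pessimistic estimate used here is the upper limit of a normal-approximation confidence interval on the observed error rate, which is a common textbook simplification and not necessarily the exact calculation in the C4.5 source; the default confidence of 0.25 and all the example counts are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def upper_error(errors, n, confidence=0.25):
    """Pessimistic error rate: upper confidence limit on errors / n."""
    z = NormalDist().inv_cdf(1 - confidence)
    f = errors / n
    numerator = (f + z * z / (2 * n)
                 + z * sqrt(f / n - f * f / n + z * z / (4 * n * n)))
    return numerator / (1 + z * z / n)

def predicted_errors(errors, n, confidence=0.25):
    """Expected number of misclassified examples among n unseen ones."""
    return n * upper_error(errors, n, confidence)

def should_replace_with_leaf(node, leaves, confidence=0.25):
    """node and each leaf are (training errors, number of examples) pairs."""
    as_leaf = predicted_errors(node[0], node[1], confidence)
    as_subtree = sum(predicted_errors(e, n, confidence) for e, n in leaves)
    return as_leaf <= as_subtree

# Hypothetical node with 14 examples: as a leaf it would make 5 errors,
# while its three leaves make 2, 1 and 2 errors on 6, 2 and 6 examples.
prune_small_leaves = should_replace_with_leaf((5, 14), [(2, 6), (1, 2), (2, 6)])
# Same node, but its two leaves are pure: the subtree is worth keeping.
prune_pure_leaves = should_replace_with_leaf((5, 14), [(0, 9), (0, 5)])
```

In the first case the small leaves are penalized heavily, so the single leaf has the lower estimate and the node is replaced; in the second case the subtree is kept.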

20 Exercise 3
1. Try to learn a tree for the cars2004 dataset, using the sport car attribute as the class, both with and without error-based pruning.
   How do the two trees look? What is their performance on the training set?
2. Try to perform the evaluation using cars2004_test_noname.arff as a test set. What happens?

21 Exercise 4: Language Detection

22 Exercise 4: Language Detection
We want to design a software system for automatically detecting the language of a short text.
Examples:
- "The doctor, who was the family physician, saluted him, but he scarcely took any notice." --> english
- "J'ai couru chez toi, je ne t'ai plus trouvée, tu sais la parole que je t'avais donnée, je la tiens." --> french
- "Conosci tu qualche hossanieh poco scrupoloso che si possa comperare con un bel pugno d'oro?" --> italian

23 Exercise 4: Language Detection
Download the language.zip file from the course web site.
The archive contains the files tset_data.txt and vset_data.txt, corresponding to the training and the test set respectively.
The training and test sets are not in the ARFF format, because we still don't know which attributes should be used as input for the classifier!
This is what happens in 99% of practical cases.
Identifying a good set of features and training a classifier are two components of a single design problem.
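To give an idea of what a feature is in this setting, here is one possible (and deliberately naive) set of attributes computed from a raw sentence. Everything in it is an assumption for illustration: the feature names, the `extract_features` and `to_arff_row` helpers, and the idea that these particular features are useful. Choosing better ones is the point of the exercise.

```python
ASCII_LETTERS = "abcdefghijklmnopqrstuvwxyz"

def extract_features(text):
    """Map a sentence to a small dictionary of numeric features."""
    lowered = text.lower()
    for mark in ",.;:!?":
        lowered = lowered.replace(mark, " ")
    words = lowered.split()
    letters = [ch for ch in lowered if ch.isalpha()]
    accented = sum(ch not in ASCII_LETTERS for ch in letters)
    vowel_endings = sum(word[-1] in "aeiou" for word in words)
    return {
        "accented_ratio": accented / max(len(letters), 1),
        "apostrophes": lowered.count("'"),
        "has_the": int("the" in words),
        "vowel_ending_ratio": vowel_endings / max(len(words), 1),
    }

def to_arff_row(features, label):
    """One comma-separated data line, with the class as last value."""
    values = [str(features[name]) for name in sorted(features)]
    return ",".join(values + [label])

english = extract_features("The doctor, who was the family physician, "
                           "saluted him, but he scarcely took any notice.")
french = extract_features("J'ai couru chez toi, je ne t'ai plus trouvée")
row = to_arff_row(english, "english")
```

The matching attribute declarations would list the same four names in alphabetical order, followed by the class attribute.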

24 Exercise 4: Language Detection
Together with the dataset, you will find the generate_arff.py script, which can process the raw dataset to produce an .arff file.
The script does not specify which features should be used: that's your task!
The generate_arff script is written in Python (we will use Python on two more occasions):
- Python is (as a first approximation) an interpreted language. It's not very fast, but it allows for fast code writing.
- Python interpreters are pre-installed on most *nix systems (including OSX). It can also be installed on Windows.
- For OSX users: my advice is to override the system Python and install it via the homebrew tool.

25 Python Basics
Python is dynamically typed (variables lack a fixed type).
Primitive types:
    a = 2 (int)
    b = 2.4 (float)
    s = "hello world" or 'hello world' (strings)
    Boolean: True, False
Data structures:
    Lists (dynamic sequences): l = [1, 3, 5, 7]
    Tuples (immutable sequences): t = (1, 3, 5, 7)
Indexing:
    Lists and tuples: l[0] (first item), l[-1] (last item)
    Strings: s[0] (first letter), s[-1] (last letter)

26 Python Basics
Instructions end when the line ends (no final ;).
When you need to write an instruction on multiple lines, you can end the partial lines with \
The \ character is not needed between pairs of brackets, e.g.:
    l = [1, 2,
         3, 4]
No {} to delimit instruction blocks: they are instead defined via indentation.

27 Python Basics
Conditional instructions:
    if <condition>:
        <instruction block>
Example:
    if a == 0:
        a += 1
        print("I have just incremented a")
    elif a == 1:
        a -= 1
        print("I have just decremented a")
    else:
        print("No increment")

28 Python Basics
Loops:
    for <variable> in <enumerable object>:
        <instruction block>
Examples:
    for a in [1, 2, 3]:
        print(a)
    for i in range(3):
        print(i)
Lists, strings, and tuples are all enumerable (iterable) objects.
range(n) yields all integers from 0 to n-1 (as a list in Python 2, as a lazy range object in Python 3).

29 Python Basics
List comprehension:
    [<expression> for <variable> in <enumerable> if <condition>]
Examples:
    Even numbers from 0 to 8: [2*i for i in range(5)]
    Squares of integers from 0 to 4: [i**2 for i in range(5)]
    Even numbers from 0 to 8 (again): [i for i in range(10) if i % 2 == 0]

30 Python Basics
Function definition:
    def <function name>(<parameter>, <parameter>, ...):
        <instruction block>
Example:
    def even(n):
        return n % 2 == 0
Functions are objects! They can be passed as parameters.
There is a vast collection of external modules:
    import <module name>

31 Exercise 6: Shape Recognition

32 Exercise 6: Shape Recognition
An industrial wood-processing plant employs a machine that should process only square boards.
The input slot of the machine has been instrumented with an array of optical sensors, plus a software unit, which can provide a description of the board as a polygon.

33 Exercise 6: Shape Recognition
Boards can have different sizes and slightly irregular shapes. They can also be positioned differently at the input slot.
Devise and implement a software system based on decision trees to classify the boards as "square" or "not square".
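Since boards vary in size and position, useful features should not depend on scale, translation, or rotation. The sketch below computes three such features; it assumes the polygon is given as a list of (x, y) vertices in order, with no repeated consecutive vertices. The actual format of the dataset may differ, and the feature names are invented here.

```python
from math import acos, degrees, hypot

def shape_features(polygon):
    """Scale-, position- and rotation-independent features of a polygon."""
    count = len(polygon)
    sides, angles = [], []
    for i in range(count):
        px, py = polygon[i - 1]            # previous vertex
        cx, cy = polygon[i]                # current vertex
        nx, ny = polygon[(i + 1) % count]  # next vertex
        ax, ay = px - cx, py - cy
        bx, by = nx - cx, ny - cy
        length_a, length_b = hypot(ax, ay), hypot(bx, by)
        sides.append(length_b)
        # Angle between the two sides meeting at the current vertex.
        cosine = (ax * bx + ay * by) / (length_a * length_b)
        angles.append(degrees(acos(max(-1.0, min(1.0, cosine)))))
    return {
        "vertices": count,
        "side_ratio": min(sides) / max(sides),  # 1.0 when all sides are equal
        "max_angle_dev": max(abs(angle - 90.0) for angle in angles),
    }

rotated_square = shape_features([(0, 1), (1, 0), (0, -1), (-1, 0)])
rectangle = shape_features([(0, 0), (2, 0), (2, 1), (0, 1)])
```

A square rotated by 45 degrees still gets a side ratio of 1 and no angle deviation, while a 2-by-1 rectangle gets a side ratio of 0.5. Slightly irregular boards would fall somewhere in between, which is where the decision tree's thresholds come in.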

34 Exercise 6: Shape Recognition
Download the shapes dataset from the course web site. The archive contains:
- Two dataset files (tset_data.txt and vset_data.txt) for the training and test set, in raw format.
- Two directories (tset and vset), containing an image file for each example in the training and test set.
- A Python script to generate the .arff file, to be customized as in the previous exercise.
