Machine Learning via Decision Trees: C4.5


2 C4.5: Algorithms for Machine Learning
Main task: learning Decision Trees from data.
Software development has ended (in favor of C5.0, which is commercial), but C4.5 is still one of the reference algorithms for this task.
Last release: 8.0
C implementation, for Unix systems, available at: http://
C4.5 tutorial available at: http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html
The man page for the C4.5 software is available on the course web site.

3 Building the Tree: Base Algorithm
Base algorithm (same as in the CLS system)
T: training set
{C_1, C_2, ..., C_k}: set of all classes
Consider the set T:
  If T contains examples that all belong to the same class, then build a single leaf, labeled with that class.
  If T contains examples of several classes, then:
    Build a partition of T based on a test on the value of a particular attribute.
    Build a node associated with the test, with one child for each subset in the partition.
    Recursively call the algorithm on each subset.
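The recursion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree is stored as nested dictionaries, the test attribute is simply the first one available (C4.5 picks it with the criterion discussed in the next slides), and the name `build_tree` and the fallback to the most frequent class when no attribute is left are choices made for this sketch, not part of the C4.5 code.

```python
from collections import Counter

def build_tree(examples, attributes):
    """examples: list of (features_dict, label); attributes: list of attribute names."""
    labels = [label for _, label in examples]
    # All examples in the same class: build a single leaf with that class.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Assumption of this sketch: with no attribute left, use the most frequent class.
    if not attributes:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Placeholder choice of the test attribute (C4.5 would use the gain ratio here).
    attr, remaining = attributes[0], attributes[1:]
    # Partition T according to the value of the chosen attribute.
    partition = {}
    for features, label in examples:
        partition.setdefault(features[attr], []).append((features, label))
    # One child per subset, built by a recursive call.
    children = {value: build_tree(subset, remaining)
                for value, subset in partition.items()}
    return {"test": attr, "children": children}

# Rows D6-D14 of the golf data, nominal attributes only.
data = [
    ({"outlook": "overcast", "windy": "T"}, "Play"),
    ({"outlook": "overcast", "windy": "F"}, "Play"),
    ({"outlook": "overcast", "windy": "T"}, "Play"),
    ({"outlook": "overcast", "windy": "F"}, "Play"),
    ({"outlook": "rain", "windy": "T"}, "Don't Play"),
    ({"outlook": "rain", "windy": "T"}, "Don't Play"),
    ({"outlook": "rain", "windy": "F"}, "Play"),
    ({"outlook": "rain", "windy": "F"}, "Play"),
    ({"outlook": "rain", "windy": "F"}, "Play"),
]
tree = build_tree(data, ["outlook", "windy"])
print(tree)
```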

4 Stop conditions
Actually, C4.5 stops if:
  T contains examples of a single class (main stop condition)
    Yields a single leaf, labeled with the class
  T is empty
    Yields a single leaf, labeled with the most frequent class in the parent node/set
  No test can generate at least two sets with a minimum of 2*MINOBJ examples
    Yields a single leaf, labeled with the most frequent class (some examples in the corresponding set will be misclassified)
  Other conditions (omitted for the sake of simplicity)

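5 Choosing the Attribute for the Test
Entropy for a set S of examples:
$$\mathrm{info}(S) = -\sum_{j=1}^{k} \frac{\mathrm{freq}(C_j, S)}{|S|} \log_2 \frac{\mathrm{freq}(C_j, S)}{|S|}$$
Entropy of a partition P = (T_1, ..., T_n) of a set T:
$$\mathrm{info}_P(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|}\, \mathrm{info}(T_i)$$
Gain of a partition P = (T_1, ..., T_n) of a set T:
$$\mathrm{gain}(P) = \mathrm{info}(T) - \mathrm{info}_P(T)$$
Splitinfo:
$$\mathrm{splitinfo}_P(T) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}$$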

6 Choosing the Attribute for the Test
Criterion to choose the attribute for the test (the gain divided by splitinfo):
$$\frac{\mathrm{gain}(P)}{\mathrm{splitinfo}_P(T)} = \frac{\mathrm{info}(T) - \mathrm{info}_P(T)}{\mathrm{splitinfo}_P(T)}$$
Dividing the gain by splitinfo avoids branching on attributes with many possible values (high risk of overfitting).
This differs from the behavior you have seen before: if you want to replicate the results, you will need to force C4.5 to use the unmodified gain for choosing the attribute.
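As a worked example, the quantities above can be computed for the outlook attribute of the golf data. The sketch below assumes each subset of the partition is given as a list of per-class counts ([Play, Don't Play]); the function names mirror the formulas but are otherwise choices made for this illustration.

```python
from math import log2

def info(counts):
    """Entropy of a set described by its per-class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_partition(subsets):
    """Weighted entropy of a partition; each subset is a list of class counts."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * info(s) for s in subsets)

def gain(subsets):
    # Class counts of the whole set T are the column sums of the subsets.
    whole = [sum(col) for col in zip(*subsets)]
    return info(whole) - info_partition(subsets)

def split_info(subsets):
    # Same form as the entropy, applied to the subset sizes.
    return info([sum(s) for s in subsets])

def gain_ratio(subsets):
    return gain(subsets) / split_info(subsets)

# outlook on the golf data: sunny, overcast, rain as [Play, Don't Play]
outlook = [[2, 3], [4, 0], [3, 2]]
print(round(gain(outlook), 3), round(split_info(outlook), 3),
      round(gain_ratio(outlook), 3))
```

Note that `gain_ratio` divides by zero when the partition has a single non-empty subset; a real implementation has to exclude such tests.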

7 Exercise 1
Download the golf dataset from the course web site (this is the example that you already know!).
C4.5 is pre-installed on the lab machines. You can run it with:
    c4.5 -f <filestem> -v <verbosity level>
Experiment with different verbosity levels.
Try to identify, in the software output:
  The decision tree itself
  The steps where the algorithm chooses the splitting attribute
  The gain values for the considered attributes
Try to understand how continuous attributes are treated.

8 Moving to a more user-friendly environment
1. The WEKA system for data mining provides an implementation of the C4.5 algorithm
2. Download the weather dataset from the course web site
3. Open the weather.arff file with a text editor
4. Run WEKA (from the command line or from the GUI):
    weka
5. Open the weather.arff file from the preprocessing tab in the WEKA explorer

9 What is the content of an .arff file?
1. The name of the main relation (with the same meaning as in relational databases):
    @relation weather
2. The attributes (by default, the last one is the class):
    @attribute outlook {sunny, overcast, ...}
    @attribute temperature ...
    @attribute humidity ...
    @attribute windy {TRUE, ...}
    @attribute play {yes, no}
3. The data (you should already know them):
    @data
    sunny,85,85,false,no
    sunny,80,90,true,no
    ...
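To make the three sections concrete, here is a minimal sketch of a reader for simple ARFF files. It is not WEKA's parser: `parse_arff` is a hypothetical helper that ignores quoting, sparse data and other parts of the format. In the sample text, the value lists `{sunny, overcast, rainy}` and `{TRUE, FALSE}` and the `numeric` type are assumptions used to complete the declarations.

```python
def parse_arff(text):
    """Return (relation name, [(attribute name, type spec)], data rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        keyword = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            attributes.append((name, spec))
        elif keyword.startswith("@data"):
            in_data = True
    return relation, attributes, rows

# Sample file: attribute value lists and numeric types are assumed.
sample = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
"""
relation, attributes, rows = parse_arff(sample)
print(relation, [name for name, _ in attributes], rows)
```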

10 What is the content of an .arff file?
No    Outlook    Temp (°F)  Humid (%)  Windy  Class
D1    sunny                            T      Play
D2    sunny                            T      Don't Play
D3    sunny                            F      Don't Play
D4    sunny                            F      Don't Play
D5    sunny                            F      Play
D6    overcast                         T      Play
D7    overcast                         F      Play
D8    overcast                         T      Play
D9    overcast                         F      Play
D10   rain                             T      Don't Play
D11   rain                             T      Don't Play
D12   rain                             F      Play
D13   rain                             F      Play
D14   rain                             F      Play

11 Let's start!

12 Histograms
1. A panel on the lower right contains a histogram with the class distribution over the attribute currently selected in the preprocessing tab
   For discrete attributes: the distribution for each value
   For continuous attributes: the domain is split into bins
2. The color-class mapping can be inferred by selecting the class attribute
3. "Visualize all" shows the histograms for all attributes

13 Classification
1. In the Classify tab you can choose:
   The classifier to be trained
   The evaluation method
   The class attribute
2. Select the J4.8 classifier (a Java implementation of C4.5, listed as J48 in WEKA)
3. Perform the evaluation on the training set
4. Choose play as the class attribute

14 Classification
1. You can access yet more options by clicking on the classifier name:
   binarySplits: use binary splits on nominal attributes.
   confidenceFactor: the confidence factor used for pruning (smaller values incur more pruning).
   minNumObj: the minimum number of instances per leaf.
   saveInstanceData: save the training data for visualization.
   unpruned: no pruning is performed.
2. Try to run the classification task

15 Output (1)
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C M 2
Relation: weather
Instances: 14
Attributes: 5
    outlook
    temperature
    humidity
    windy
    play
Test mode: evaluate on training data
(Preamble)

16 Output (2)
J48 pruned tree
outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)
This is the decision tree we already know... and it can be visualized! Right-click on the result list, then choose "Visualize tree".
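Reading the printed tree as code can help: each indentation level is a nested test, and each "value: class" pair is a leaf. The function below is a hand translation of the tree shown above; the name `classify` and the use of a Python boolean for windy are choices made for this sketch.

```python
def classify(outlook, humidity, windy):
    """Hand translation of the J48 tree printed above."""
    if outlook == "sunny":
        return "yes" if humidity <= 75 else "no"
    if outlook == "overcast":
        return "yes"
    if outlook == "rainy":
        return "no" if windy else "yes"
    raise ValueError("unknown outlook value: %r" % (outlook,))

# First two rows of the data section: sunny,85,85,false,no and sunny,80,90,true,no
print(classify("sunny", 85, False), classify("sunny", 90, True))
```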

17 Output (3)
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      14    100 %
Incorrectly Classified Instances     0      0 %
Kappa statistic                      1
Mean absolute error                  0
Root mean squared error              0
Relative absolute error              0 %
Root relative squared error          0 %
Total Number of Instances           14
These are the classification results on the training set: no mistakes have been made.

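18 Output (4)
=== Confusion Matrix ===
 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no
Taking yes as the positive class, the top-right cell counts the false negatives and the bottom-left cell counts the false positives.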
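The accuracy and the kappa statistic reported in the summary can be recomputed from the confusion matrix. The sketch below uses the usual definition of Cohen's kappa (observed agreement corrected by the agreement expected from the row and column totals); `accuracy_and_kappa` is a name chosen for this illustration, and rows are taken as actual classes, columns as predicted classes.

```python
def accuracy_and_kappa(matrix):
    """matrix[i][j]: number of instances of class i classified as class j."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

accuracy, kappa = accuracy_and_kappa([[9, 0], [0, 5]])
print(accuracy, kappa)
```

The formula is undefined when the expected agreement is 1 (all instances in a single class, all predicted as that class).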

19 Error-Based Pruning
C4.5 employs a technique called error-based pruning.
Before the tree is considered final, the algorithm attempts to simplify its structure.
For each leaf, C4.5 estimates a misclassification probability on unknown examples, based on statistical reasoning.
For each node, the misclassification probability is given by the sum of the probabilities of the underlying leaves.
At each node:
  If the error estimate can be reduced by replacing the node with a leaf, then C4.5 performs the replacement.
  If the error estimate can be reduced by replacing the node with the branch having the most examples, then C4.5 performs the replacement.
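A rough sketch of the first replacement rule follows. It is not the C4.5 code: the leaf estimate used here is the normal-approximation upper confidence bound commonly given in textbook presentations of C4.5, with z = 0.69 corresponding to a confidence factor of 25%, whereas C4.5 itself works from the binomial distribution. The function names and the example counts are assumptions made for this illustration.

```python
from math import sqrt

def pessimistic_error(errors, n, z=0.69):
    """Upper confidence bound on the error rate of a leaf with `errors` mistakes out of n."""
    f = errors / n
    numerator = f + z * z / (2 * n) + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    return numerator / (1 + z * z / n)

def estimated_errors(errors, n, z=0.69):
    # Expected number of mistakes on unseen examples reaching this leaf.
    return n * pessimistic_error(errors, n, z)

def should_collapse(parent, leaves, z=0.69):
    """parent and each leaf are (errors, n) pairs; True if a single leaf is estimated no worse."""
    subtree = sum(estimated_errors(e, n, z) for e, n in leaves)
    return estimated_errors(parent[0], parent[1], z) <= subtree

# Hypothetical node: three leaves with (errors, examples) = (2, 6), (1, 2), (2, 6);
# as a single leaf it would make 5 errors on its 14 examples.
leaves = [(2, 6), (1, 2), (2, 6)]
print(should_collapse((5, 14), leaves))
```

The second rule (replacing the node with its largest branch) would compare the same kind of estimate for the grafted branch and is omitted here.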

20 Exercise 3
1. Try to learn a tree for the cars2004 dataset, using the sport car attribute as class, both with and without error-based pruning
   What do the two trees look like?
   What is their performance on the training set?
2. Try to perform the evaluation using cars2004_test_noname.arff as a test set. What happens?

21 Exercise 4: Language Detection

22 Exercise 4: Language Detection
We want to design a software system for automatically detecting the language of a short text.
Examples:
  The doctor, who was the family physician, saluted him, but he scarcely took any notice. --> english
  J'ai couru chez toi, je ne t'ai plus trouvée, tu sais la parole que je t'avais donnée, je la tiens. --> french
  Conosci tu qualche hossanieh poco scrupoloso che si possa comperare con un bel pugno d'oro? --> italian

23 Exercise 4: Language Detection
Download the language.zip file from the course web site.
The archive contains the files tset_data.txt and vset_data.txt, corresponding to the training and the validation set, respectively.
The two sets are not in the arff format, because we still don't know which attributes should be used as input for the classifier!
This is what happens in 99% of the practical cases.
Identifying a good set of features and training a classifier are two components of a single design problem.

24 Exercise 4: Language Detection
Together with the dataset, you will find the generate_arff.py script, which can process the raw dataset to produce an .arff file.
The script does not specify which features should be used: that's your task!
The generate_arff script is written in Python (we will use Python on two more occasions):
  Python is (as a first approximation) an interpreted language.
  It's not very fast, but it allows for fast code writing.
  Python interpreters are pre-installed on most *nix systems (including OSX). It can be installed on Windows.
  For OSX users: my advice is to override the system Python and install it via the homebrew tool.
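To give an idea of what such a feature-extraction step might look like, here is a minimal Python 3 sketch. It is not the course's generate_arff.py, whose interface is not shown here: the four features, the names `features`, `to_arff` and `FEATURE_NAMES`, and the choice of passing the samples as (text, language) pairs are all hypothetical. Choosing better features is the point of the exercise.

```python
# Hypothetical feature set: illustrative guesses, not the ones expected by the course.
FEATURE_NAMES = ["freq_th", "freq_qu", "freq_accented", "vowel_endings"]

def features(text):
    lower = text.lower()
    n_chars = max(len(lower), 1)
    # Words without surrounding punctuation; empty leftovers are dropped.
    words = [w.strip(".,;:!?'\"") for w in lower.split()]
    words = [w for w in words if w]
    vowel_endings = sum(1 for w in words if w[-1] in "aeio")
    return [
        lower.count("th") / n_chars,                       # frequency of "th"
        lower.count("qu") / n_chars,                       # frequency of "qu"
        sum(lower.count(c) for c in "àèéìòù") / n_chars,   # accented vowels
        vowel_endings / max(len(words), 1),                # words ending in a, e, i, o
    ]

def to_arff(samples, relation="language"):
    """samples: list of (text, language) pairs; returns the ARFF file as a string."""
    languages = sorted(set(lang for _, lang in samples))
    lines = ["@relation " + relation]
    lines += ["@attribute %s numeric" % name for name in FEATURE_NAMES]
    lines.append("@attribute lang {%s}" % ",".join(languages))
    lines.append("@data")
    for text, lang in samples:
        values = ["%.4f" % v for v in features(text)]
        lines.append(",".join(values + [lang]))
    return "\n".join(lines)

samples = [
    ("The doctor, who was the family physician, saluted him, "
     "but he scarcely took any notice.", "english"),
    ("J'ai couru chez toi, je ne t'ai plus trouvée, tu sais la parole "
     "que je t'avais donnée, je la tiens.", "french"),
    ("Conosci tu qualche hossanieh poco scrupoloso che si possa comperare "
     "con un bel pugno d'oro?", "italian"),
]
arff = to_arff(samples)
print(arff)
```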

25 Python Basics
Python is dynamically typed (variables lack a fixed type).
Primitive types:
    a = 2 (int)
    b = 2.4 (float)
    s = "hello world" or 'hello world' (strings)
    Boolean: True, False
Data structures:
    Lists (dynamic sequences): l = [1, 3, 5, 7]
    Tuples (immutable sequences): t = (1, 3, 5, 7)
Indexing:
    Lists and tuples: l[0] (first item), l[-1] (last item)
    Strings: s[0] (first letter), s[-1] (last letter)

26 Python Basics
Instructions end when the line ends (no final ";").
When you need to write an instruction on multiple lines, you can end the partial lines with \
The \ character is not needed between pairs of brackets, e.g.:
    l = [1, 2,
         3, 4]
No {} to delimit instruction blocks: they are instead defined via indentation.

27 Python Basics
Conditional instructions:
    if <condition>:
        <instruction block>
Example:
    if a == 0:
        a += 1
        print("I have just incremented a")
    elif a == 1:
        a -= 1
        print("I have just decremented a")
    else:
        print("No increment")

28 Python Basics
Loops:
    for <variable> in <enumerable object>:
        <instruction block>
Examples:
    for a in [1, 2, 3]:
        print(a)
    for i in range(3):
        print(i)
Lists, strings, and tuples are all enumerables.
range(n) yields all integers between 0 and n-1 (as a list in Python 2, as a lazy range object in Python 3).

29 Python Basics
List comprehension:
    [<expression> for <variable> in <enumerable> if <condition>]
Examples:
    Even numbers from 0 to 8: [2*i for i in range(5)]
    Squares of integers from 0 to 4: [i**2 for i in range(5)]
    Even numbers from 0 to 8 (again): [i for i in range(10) if i % 2 == 0]

30 Python Basics
Function definition:
    def <function name>(<parameter>, <parameter>, ...):
        <instruction block>
Example:
    def even(n):
        return n % 2 == 0
Functions are objects! They can be passed as parameters.
There is a vast collection of external modules:
    import <module name>
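A short example of the last two points: a function passed as a parameter, and a function taken from an imported module. The helper name `keep` is made up for this illustration; `math.sqrt` comes from the standard library.

```python
import math

def even(n):
    return n % 2 == 0

def keep(predicate, items):
    """Return the items for which predicate(item) is true."""
    return [x for x in items if predicate(x)]

# `even` is passed as an object, without calling it.
evens = keep(even, range(10))
# Functions from imported modules are accessed as <module name>.<function name>.
roots = [math.sqrt(x) for x in [0, 1, 4, 9]]
print(evens, roots)
```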

31 Exercise 6: Shape Recognition

32 Exercise 6: Shape Recognition
An industrial wood-processing plant employs a machine that should process only square boards.
The input slot of the machine has been instrumented with an array of optical sensors, plus a software unit, which can provide a description of the board as a polygon:

33 Exercise 6: Shape Recognition
Boards can have different sizes and slightly irregular shapes. They can also be positioned differently at the input slot.
Devise and implement a software system based on Decision Trees to classify the boards as "square" and "not square".

34 Exercise 6: Shape Recognition
Download the shapes dataset from the course web site. The archive contains:
  Two dataset files ("tset_data.txt" and "vset_data.txt") for the training and test set, in raw format
  Two directories ("tset" and "vset"), containing an image file for each example in the training and test set
  A Python script to generate the .arff file, to be customized as in the previous exercise
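Since boards differ in size and position, useful features should not depend on scale, translation or rotation. The sketch below shows one possible set of such features, assuming a board is available as a list of (x, y) vertices in order around the polygon; the raw file format is not described here, so that representation, the function name `polygon_features` and the three features are all assumptions made for this illustration.

```python
from math import acos, degrees, hypot

def polygon_features(vertices):
    """vertices: list of (x, y) points in order; consecutive points must be distinct."""
    n = len(vertices)
    sides, angles = [], []
    for i in range(n):
        px, py = vertices[i - 1]          # previous vertex (wraps around)
        cx, cy = vertices[i]              # current vertex
        nx, ny = vertices[(i + 1) % n]    # next vertex (wraps around)
        sides.append(hypot(nx - cx, ny - cy))
        # Interior angle at the current vertex, from the two edge vectors.
        ax, ay = px - cx, py - cy
        bx, by = nx - cx, ny - cy
        cos_angle = (ax * bx + ay * by) / (hypot(ax, ay) * hypot(bx, by))
        angles.append(degrees(acos(max(-1.0, min(1.0, cos_angle)))))
    return {
        "n_vertices": n,
        "side_ratio": min(sides) / max(sides),                 # 1.0 for equal sides
        "max_angle_dev": max(abs(a - 90.0) for a in angles),   # 0.0 for right angles
    }

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
rectangle = [(0, 0), (2, 0), (2, 1), (0, 1)]
print(polygon_features(square))
print(polygon_features(rectangle))
```

Each dictionary could then become one data row of the .arff file, with the class label appended; slightly irregular squares will give values close to, but not exactly, 1.0 and 0.0, which is where the thresholds learned by the tree come in.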


More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Decision Tree Example Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short} Class: Country = {Gromland, Polvia} CS4375 --- Fall 2018 a

More information

Extra readings beyond the lecture slides are important:

Extra readings beyond the lecture slides are important: 1 Notes To preview next lecture: Check the lecture notes, if slides are not available: http://web.cse.ohio-state.edu/~sun.397/courses/au2017/cse5243-new.html Check UIUC course on the same topic. All their

More information

COMP s1 - Getting started with the Weka Machine Learning Toolkit

COMP s1 - Getting started with the Weka Machine Learning Toolkit COMP9417 16s1 - Getting started with the Weka Machine Learning Toolkit Last revision: Thu Mar 16 2016 1 Aims This introduction is the starting point for Assignment 1, which requires the use of the Weka

More information

Hierarchical Clustering Lecture 9

Hierarchical Clustering Lecture 9 Hierarchical Clustering Lecture 9 Marina Santini Acknowledgements Slides borrowed and adapted from: Data Mining by I. H. Witten, E. Frank and M. A. Hall 1 Lecture 9: Required Reading Witten et al. (2011:

More information

Lists, loops and decisions

Lists, loops and decisions Caltech/LEAD Summer 2012 Computer Science Lecture 4: July 11, 2012 Lists, loops and decisions Lists Today Looping with the for statement Making decisions with the if statement Lists A list is a sequence

More information

Machine Learning. Decision Trees. Le Song /15-781, Spring Lecture 6, September 6, 2012 Based on slides from Eric Xing, CMU

Machine Learning. Decision Trees. Le Song /15-781, Spring Lecture 6, September 6, 2012 Based on slides from Eric Xing, CMU Machine Learning 10-701/15-781, Spring 2008 Decision Trees Le Song Lecture 6, September 6, 2012 Based on slides from Eric Xing, CMU Reading: Chap. 1.6, CB & Chap 3, TM Learning non-linear functions f:

More information

Data Mining. 3.3 Rule-Based Classification. Fall Instructor: Dr. Masoud Yaghini. Rule-Based Classification

Data Mining. 3.3 Rule-Based Classification. Fall Instructor: Dr. Masoud Yaghini. Rule-Based Classification Data Mining 3.3 Fall 2008 Instructor: Dr. Masoud Yaghini Outline Using IF-THEN Rules for Classification Rules With Exceptions Rule Extraction from a Decision Tree 1R Algorithm Sequential Covering Algorithms

More information

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018

MIT 801. Machine Learning I. [Presented by Anna Bosman] 16 February 2018 MIT 801 [Presented by Anna Bosman] 16 February 2018 Machine Learning What is machine learning? Artificial Intelligence? Yes as we know it. What is intelligence? The ability to acquire and apply knowledge

More information

Nominal Data. May not have a numerical representation Distance measures might not make sense. PR and ANN

Nominal Data. May not have a numerical representation Distance measures might not make sense. PR and ANN NonMetric Data Nominal Data So far we consider patterns to be represented by feature vectors of real or integer values Easy to come up with a distance (similarity) measure by using a variety of mathematical

More information

Data Structures III: K-D

Data Structures III: K-D Lab 6 Data Structures III: K-D Trees Lab Objective: Nearest neighbor search is an optimization problem that arises in applications such as computer vision, pattern recognition, internet marketing, and

More information

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #19: Machine Learning 1

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #19: Machine Learning 1 CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #19: Machine Learning 1 Supervised Learning Would like to do predicbon: esbmate a func3on f(x) so that y = f(x) Where y can be: Real number:

More information

Nearest neighbor classification DSE 220

Nearest neighbor classification DSE 220 Nearest neighbor classification DSE 220 Decision Trees Target variable Label Dependent variable Output space Person ID Age Gender Income Balance Mortgag e payment 123213 32 F 25000 32000 Y 17824 49 M 12000-3000

More information

Induction of Decision Trees

Induction of Decision Trees Induction of Decision Trees Blaž Zupan and Ivan Bratko magixfriuni-ljsi/predavanja/uisp An Example Data Set and Decision Tree # Attribute Class Outlook Company Sailboat Sail? 1 sunny big small yes 2 sunny

More information

Homework #4 RELEASE DATE: 04/22/2014 DUE DATE: 05/06/2014, 17:30 (after class) in CSIE R217

Homework #4 RELEASE DATE: 04/22/2014 DUE DATE: 05/06/2014, 17:30 (after class) in CSIE R217 Homework #4 RELEASE DATE: 04/22/2014 DUE DATE: 05/06/2014, 17:30 (after class) in CSIE R217 As directed below, you need to submit your code to the designated place on the course website. Any form of cheating,

More information

Data Mining and Analytics

Data Mining and Analytics Data Mining and Analytics Aik Choon Tan, Ph.D. Associate Professor of Bioinformatics Division of Medical Oncology Department of Medicine aikchoon.tan@ucdenver.edu 9/22/2017 http://tanlab.ucdenver.edu/labhomepage/teaching/bsbt6111/

More information

Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux.

Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux. 1 Introduction Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux. The gain chart is an alternative to confusion matrix for the evaluation of a classifier.

More information

Classifica(on and Clustering with WEKA. Classifica*on and Clustering with WEKA

Classifica(on and Clustering with WEKA. Classifica*on and Clustering with WEKA Classifica(on and Clustering with WEKA 1 Schedule: Classifica(on and Clustering with WEKA 1. Presentation of WEKA. 2. Your turn: perform classification and clustering. 2 WEKA Weka is a collec*on of machine

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

ISSUES IN DECISION TREE LEARNING

ISSUES IN DECISION TREE LEARNING ISSUES IN DECISION TREE LEARNING Handling Continuous Attributes Other attribute selection measures Overfitting-Pruning Handling of missing values Incremental Induction of Decision Tree 1 DECISION TREE

More information

List of Exercises: Data Mining 1 December 12th, 2015

List of Exercises: Data Mining 1 December 12th, 2015 List of Exercises: Data Mining 1 December 12th, 2015 1. We trained a model on a two-class balanced dataset using five-fold cross validation. One person calculated the performance of the classifier by measuring

More information

Decision Trees: Discussion

Decision Trees: Discussion Decision Trees: Discussion Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 This lecture: Learning Decision Trees 1. Representation: What are decision trees? 2. Algorithm: Learning

More information

Lazy Rule Learning. Lazy Rule Learning Bachelor-Thesis von Nikolaus Korfhage Januar ngerman

Lazy Rule Learning. Lazy Rule Learning Bachelor-Thesis von Nikolaus Korfhage Januar ngerman ngerman Lazy Rule Learning Lazy Rule Learning Bachelor-Thesis von Nikolaus Korfhage Januar 2012 Fachbereich Informatik Fachgebiet Knowledge Engineering Lazy Rule Learning Lazy Rule Learning Vorgelegte

More information

CSE 115. Introduction to Computer Science I

CSE 115. Introduction to Computer Science I CSE 115 Introduction to Computer Science I Progress In UBInfinite? A. Haven't started B. Earned 3 stars in "Calling Functions" C. Earned 3 stars in "Defining Functions" D. Earned 3 stars in "Conditionals"

More information

Basic Python 3 Programming (Theory & Practical)

Basic Python 3 Programming (Theory & Practical) Basic Python 3 Programming (Theory & Practical) Length Delivery Method : 5 Days : Instructor-led (Classroom) Course Overview This Python 3 Programming training leads the student from the basics of writing

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

ARTIFICIAL INTELLIGENCE AND PYTHON

ARTIFICIAL INTELLIGENCE AND PYTHON ARTIFICIAL INTELLIGENCE AND PYTHON DAY 1 STANLEY LIANG, LASSONDE SCHOOL OF ENGINEERING, YORK UNIVERSITY WHAT IS PYTHON An interpreted high-level programming language for general-purpose programming. Python

More information

CONCEPT FORMATION AND DECISION TREE INDUCTION USING THE GENETIC PROGRAMMING PARADIGM

CONCEPT FORMATION AND DECISION TREE INDUCTION USING THE GENETIC PROGRAMMING PARADIGM 1 CONCEPT FORMATION AND DECISION TREE INDUCTION USING THE GENETIC PROGRAMMING PARADIGM John R. Koza Computer Science Department Stanford University Stanford, California 94305 USA E-MAIL: Koza@Sunburn.Stanford.Edu

More information

cs1114 REVIEW of details test closed laptop period

cs1114 REVIEW of details test closed laptop period python details DOES NOT COVER FUNCTIONS!!! This is a sample of some of the things that you are responsible for do not believe that if you know only the things on this test that they will get an A on any

More information

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers James P. Biagioni Piotr M. Szczurek Peter C. Nelson, Ph.D. Abolfazl Mohammadian, Ph.D. Agenda Background

More information

WEB BASED DATA-MINING ASSISTANT

WEB BASED DATA-MINING ASSISTANT P. J. Safarik University Faculty of Science WEB BASED DATA-MINING ASSISTANT THESIS Field of Study: Institute: Tutor: Computer Science Institute of Computer Science RNDr. Tomáš Horváth, PhD. Košice 2015

More information

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa KNIME TUTORIAL Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it Outline Introduction on KNIME KNIME components Exercise: Data Understanding Exercise: Market Basket Analysis Exercise:

More information

Crash Dive into Python

Crash Dive into Python ECPE 170 University of the Pacific Crash Dive into Python 2 Lab Schedule Today Ac:vi:es Endianness Python Thursday Network programming Lab 8 Network Programming Lab 8 Assignments Due Due by Mar 30 th 5:00am

More information

WEKA homepage.

WEKA homepage. WEKA homepage http://www.cs.waikato.ac.nz/ml/weka/ Data mining software written in Java (distributed under the GNU Public License). Used for research, education, and applications. Comprehensive set of

More information

S2 Text. Instructions to replicate classification results.

S2 Text. Instructions to replicate classification results. S2 Text. Instructions to replicate classification results. Machine Learning (ML) Models were implemented using WEKA software Version 3.8. The software can be free downloaded at this link: http://www.cs.waikato.ac.nz/ml/weka/downloading.html.

More information