Decision Trees In Weka, Data Formats


1 CS 4510/9010 Applied Machine Learning
Decision Trees In Weka, Data Formats
Paula Matuszek
Fall, 2016

2 J48: Decision Tree in Weka
NAME: weka.classifiers.trees.J48
SYNOPSIS: Class for generating a pruned or unpruned C4.5 decision tree. For more information, see Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
CAPABILITIES:
  Class -- Nominal class, Binary class, Missing class values
  Attributes -- Empty nominal attributes, Nominal attributes, Date attributes, Numeric attributes, Unary attributes, Missing values, Binary attributes
  Min # of instances: 0

3 Some of the Options
unpruned -- Whether pruning is skipped. Default is false, i.e. the tree is pruned.
minNumObj -- The minimum number of instances per leaf (default is 2). Note that this is separate from the value of unpruned.
For pruned trees:
  Subtree raising: raising an entire subtree up a level. Default is true.
  C: confidenceFactor -- The confidence factor used for pruning (smaller values incur more pruning). Default is 0.25. Weka builds the full tree and then works back from the leaves, applying a statistical test at each stage.
  reducedErrorPruning -- Whether reduced-error pruning is used instead.
  numFolds -- Determines the amount of data used for reduced-error pruning. One fold is used for pruning, the rest for growing the tree.
doNotCheckCapabilities -- If set, classifier capabilities are not checked before the classifier is built (use with caution to reduce runtime).

4 Looking At The Results
For all classifiers, Weka will show you:
=== Run information ===
  Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
  Relation: weather.symbolic
  Instances: 14
  Attributes: 5 (outlook, temperature, humidity, windy, play)
  Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
  Model-specific information. For J48, the decision tree.
  Time taken to build model: 0.02 seconds
=== Evaluation ===
  This will give the evaluation method and possibly the time it took.
  Summary, Detailed Accuracy By Class, Confusion Matrix.
Next time we will look in detail at these statistics.
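The Scheme line in the run information encodes the classifier's option settings as command-line flags. The sketch below shows how such a string could be assembled from the options on the previous slide. It is an illustration, not Weka code: the function name and its Python parameter names are hypothetical, the flag letters (-U, -S, -R, -N, -C, -M) and the default of 3 folds are assumed from the usual J48 command-line options, and the flag ordering is only illustrative.

```python
def j48_scheme(unpruned=False, min_num_obj=2, confidence_factor=0.25,
               reduced_error_pruning=False, num_folds=3, subtree_raising=True):
    """Build a Scheme-style string for J48 (hypothetical helper).

    Flag letters are assumed from J48's command-line options:
    -U unpruned, -S no subtree raising, -R reduced-error pruning,
    -N folds for reduced-error pruning, -C confidence factor,
    -M minimum number of instances per leaf.
    """
    flags = []
    if unpruned:
        # Pruning-related settings do not apply to an unpruned tree.
        flags.append("-U")
    else:
        if not subtree_raising:
            flags.append("-S")
        if reduced_error_pruning:
            flags += ["-R", "-N", str(num_folds)]
        else:
            flags += ["-C", str(confidence_factor)]
    # The leaf-size limit applies whether or not the tree is pruned.
    flags += ["-M", str(min_num_obj)]
    return "weka.classifiers.trees.J48 " + " ".join(flags)


print(j48_scheme())                               # default settings
print(j48_scheme(unpruned=True, min_num_obj=1))   # unpruned, leaves of size 1
```

Note that minNumObj is emitted in every branch, which mirrors the slide's point that the minimum leaf size is separate from the pruning switch.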

5 Decision Tree Model
Text version, number of leaves, size of tree, counts

outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves : 5
Size of the tree : 8
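To make the "number of leaves" and "size of the tree" figures concrete, here is a minimal sketch that encodes the weather tree as a nested dictionary and counts its nodes. The dictionary layout and the function names are illustrative choices, not anything Weka produces.

```python
# Internal nodes map one attribute name to {attribute value: subtree};
# leaves are plain class labels.
WEATHER_TREE = {
    "outlook": {
        "sunny": {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"windy": {"TRUE": "no", "FALSE": "yes"}},
    }
}


def classify(tree, instance):
    """Follow the branch matching the instance until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[instance[attribute]]
    return tree


def count_leaves(tree):
    """Number of leaf nodes (class labels) in the tree."""
    if not isinstance(tree, dict):
        return 1
    (branches,) = tree.values()
    return sum(count_leaves(sub) for sub in branches.values())


def tree_size(tree):
    """Total number of nodes: internal test nodes plus leaves."""
    if not isinstance(tree, dict):
        return 1
    (branches,) = tree.values()
    return 1 + sum(tree_size(sub) for sub in branches.values())


print(count_leaves(WEATHER_TREE), tree_size(WEATHER_TREE))
print(classify(WEATHER_TREE, {"outlook": "sunny", "humidity": "normal", "windy": "TRUE"}))
```

The size of 8 is the 5 leaves plus the 3 test nodes (outlook, humidity, windy).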

6 J48 on Iris
J48 pruned tree

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves : 5
Size of the tree : 9
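The iris tree reads directly as nested conditionals. The sketch below writes it out that way and also derives the training-set accuracy from the leaf annotations, assuming the usual reading of J48's (N/E) notation: N instances reach the leaf and E of them are misclassified. The function and variable names are illustrative.

```python
def classify_iris(petallength, petalwidth):
    """Hand-written equivalent of the pruned J48 iris tree above."""
    if petalwidth <= 0.6:
        return "Iris-setosa"
    if petalwidth <= 1.7:
        if petallength <= 4.9:
            return "Iris-versicolor"
        if petalwidth <= 1.5:
            return "Iris-virginica"
        return "Iris-versicolor"
    return "Iris-virginica"


# (instances reaching the leaf, instances misclassified there),
# copied from the leaf annotations in tree order.
LEAF_COUNTS = [(50.0, 0.0), (48.0, 1.0), (3.0, 0.0), (3.0, 1.0), (46.0, 1.0)]


def training_accuracy(leaf_counts):
    """Fraction of training instances the tree classifies correctly."""
    total = sum(n for n, _ in leaf_counts)
    errors = sum(e for _, e in leaf_counts)
    return (total - errors) / total


print(classify_iris(petallength=1.4, petalwidth=0.2))
print(training_accuracy(LEAF_COUNTS))
```

The leaf counts add up to 150 instances with 3 errors. This is accuracy on the training data, so it will generally be more optimistic than a cross-validated estimate.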

7 Weka Pruning Exercise
Open the breast-cancer dataset in a text editor. Determine from the comments how many possible values there are for the age attribute, and how many are actually used.
Open the dataset in the Explorer, go to the Classify tab, and select J48.
Set the unpruned switch to True. Experiment with the following values of minNumObj, noting the number of leaves and the size of the tree in each case: 1, 2, 3, 5, 10, 20, 50, 100.
  Which value produces the same numbers as J48 with default parameters (i.e., unpruned=false, minNumObj=2)?
In general, J48's confidenceFactor parameter is best left alone, but it is interesting to see its effect. With default values for the other parameters, experiment with the following values of confidenceFactor, recording the performance in each case (evaluated using 10-fold cross-validation): 0.005, 0.05, 0.1, 0.25, 0.5.
  Which value or values produce the greatest accuracy?

8 CS 4510/9010 Applied Machine Learning
Data Format in Weka
Paula Matuszek
Fall, 2016

9 Weka-Supported Formats
Weka's native format is called ARFF: Attribute Relation File Format.
It will also input various other formats:
  Compressed ARFF files (.arff.gz)
  Comma-separated value files (.csv)
  JSON (serialized attribute/relation pair objects) (.json)
  Various ML tool outputs
The format is chosen on the Preprocess tab, via the Open File button.

10 Weka Input Menu

11 ARFF Format
Header Section: information about the data
  the name of the relation
  a list of the attributes (the columns in the data)
  their types
Data Section
  comma-separated list, one line/instance
Comments
  Begin with %
  Good idea to describe class, source, sometimes meanings of attributes

12 Header Section
@relation declaration: names what we are talking about. String. Quote it if it includes spaces.
@attribute declarations: name each attribute and give its type. One per attribute, including the class. The name must start with a letter. Quote it if it includes spaces.
  @attribute sepallength NUMERIC
  @attribute 'petal width' NUMERIC
  @attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}

13 Attribute Types
Numeric. Can be real or integer.
  @attribute sepallength NUMERIC
Nominal specification: named values.
  @attribute color {red, green, ...}
  @attribute class {versicolor, setosa}
String: arbitrary strings.
  @attribute body string
Date. Give the date format.
  @attribute timestamp DATE "yyyy-MM-dd"
Note that these are Weka-specific, but the concepts are not.

14 Data section
One line/instance, comma separated.
Example: For
  @attribute sepallength NUMERIC
  @attribute class {setosa, ...}
  @attribute description string
  @attribute timestamp DATE "yyyy-MM-dd"
we might have instances
  5.1, setosa, 'Lovely big flowers', ...
  ..., setosa, Nice, ...
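The header and data sections described above can be read with a few lines of code. The following is a minimal sketch of a reader for a small subset of ARFF (comments, @relation, @attribute, @data), not a replacement for Weka's own loader. The sample relation and its values are hypothetical toy data, and `parse_arff` is an illustrative name.

```python
import csv

# Hypothetical toy relation, for illustration only.
ARFF_TEXT = """\
% Toy iris-style relation (made-up values)
@relation flowers
@attribute sepallength NUMERIC
@attribute 'petal width' NUMERIC
@attribute class {setosa, versicolor}
@data
5.1, 0.2, setosa
6.4, 1.5, versicolor
"""


def parse_arff(text):
    """Return (relation name, [(attribute name, type)], data rows as strings)."""
    relation = None
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and % comments
        if in_data:
            # One instance per line; single-quoted values may contain commas.
            values = next(csv.reader([line], quotechar="'", skipinitialspace=True))
            rows.append([v.strip() for v in values])
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1].strip("'\"")
        elif keyword == "@attribute":
            rest = line.split(None, 1)[1].strip()
            if rest[0] in "'\"":
                # Quoted name: take everything up to the closing quote.
                end = rest.index(rest[0], 1)
                name, type_spec = rest[1:end], rest[end + 1:].strip()
            else:
                name, type_spec = rest.split(None, 1)
            if type_spec.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                type_spec = [v.strip() for v in type_spec.strip("{}").split(",")]
            attributes.append((name, type_spec))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation)
print(attributes)
print(rows)
```

Values are left as strings here. A fuller reader would convert them according to the declared type, and would also handle missing values and date formats.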

15 Examples
Iris. Detailed, very nice comments. Numeric and nominal attributes.
Weather, nominal. No comments, all nominal.
Reuters. A string attribute.

16 Importing
Restaurant1.csv
  Import, look at the data imported on the right.
  Does the class look correct?
  Use the Edit button to examine further.
Restaurant2.csv
  Import, look again.
  Are all of these attributes useful? Remove any that look inappropriate.

17 Decision Tree on Restaurants
Try it with the defaults. Examine the results.
See if you can get to a reasonably accurate tree.

18 Decision Tree on Restaurants
See if you can get to a reasonable tree. Try modifying the following:
  Change the minimum number of objects to 1.
  Don't prune.
  Evaluate against the training set.
Basic conclusion: you need data to learn well. We don't have enough here. The only way to get decent performance out of this is to massively overfit.

19 Summary
J48 in Weka provides a rich implementation of Quinlan's decision tree algorithm, with many options.
In general, the default options, which include pruning and a minimum leaf size of 2, work very well.
Weka's native data format is ARFF. It provides the name of a relation, which will normally be the class for classifiers, and a description of each attribute.
It is good practice to add comments about the source of the data and the meaning of the attributes.
Weka can import other formats, such as .csv, and will make a reasonable guess about the attributes.


More information

Induction of Decision Trees

Induction of Decision Trees Induction of Decision Trees Blaž Zupan and Ivan Bratko magixfriuni-ljsi/predavanja/uisp An Example Data Set and Decision Tree # Attribute Class Outlook Company Sailboat Sail? 1 sunny big small yes 2 sunny

More information

Data Mining. Part 1. Introduction. 1.4 Input. Spring Instructor: Dr. Masoud Yaghini. Input

Data Mining. Part 1. Introduction. 1.4 Input. Spring Instructor: Dr. Masoud Yaghini. Input Data Mining Part 1. Introduction 1.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Instances Attributes References Instances Instance: Instances Individual, independent example of the concept to be

More information

CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 9, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 9, 2014 1 / 47

More information

CONCEPT FORMATION AND DECISION TREE INDUCTION USING THE GENETIC PROGRAMMING PARADIGM

CONCEPT FORMATION AND DECISION TREE INDUCTION USING THE GENETIC PROGRAMMING PARADIGM 1 CONCEPT FORMATION AND DECISION TREE INDUCTION USING THE GENETIC PROGRAMMING PARADIGM John R. Koza Computer Science Department Stanford University Stanford, California 94305 USA E-MAIL: Koza@Sunburn.Stanford.Edu

More information

Back-to-Back Stem-and-Leaf Plots

Back-to-Back Stem-and-Leaf Plots Chapter 195 Back-to-Back Stem-and-Leaf Plots Introduction This procedure generates a stem-and-leaf plot of a batch of data. The stem-and-leaf plot is similar to a histogram and its main purpose is to show

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 27.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 3.11. (2) A.1 Linear Regression Fri. 10.11. (3) A.2 Linear Classification Fri. 17.11. (4) A.3 Regularization

More information

CS4618: Artificial Intelligence I. Accuracy Estimation. Initialization

CS4618: Artificial Intelligence I. Accuracy Estimation. Initialization CS4618: Artificial Intelligence I Accuracy Estimation Derek Bridge School of Computer Science and Information echnology University College Cork Initialization In [1]: %reload_ext autoreload %autoreload

More information

SWETHA ENGINEERING COLLEGE (Approved by AICTE, New Delhi, Affiliated to JNTUA) DATA MINING USING WEKA

SWETHA ENGINEERING COLLEGE (Approved by AICTE, New Delhi, Affiliated to JNTUA) DATA MINING USING WEKA SWETHA ENGINEERING COLLEGE (Approved by AICTE, New Delhi, Affiliated to JNTUA) DATA MINING USING WEKA LAB RECORD N99A49G70E68S51H Data Mining using WEKA 1 WEKA [ Waikato Environment for Knowledge Analysis

More information

CISC 4631 Data Mining

CISC 4631 Data Mining CISC 4631 Data Mining Lecture 03: Introduction to classification Linear classifier Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside) 1 Classification:

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Decision trees Extending previous approach: Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank to permit numeric s: straightforward

More information

STAT 1291: Data Science

STAT 1291: Data Science STAT 1291: Data Science Lecture 18 - Statistical modeling II: Machine learning Sungkyu Jung Where are we? data visualization data wrangling professional ethics statistical foundation Statistical modeling:

More information

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier Data Mining 3.2 Decision Tree Classifier Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Basic Algorithm for Decision Tree Induction Attribute Selection Measures Information Gain Gain Ratio

More information

BITS F464: MACHINE LEARNING

BITS F464: MACHINE LEARNING BITS F464: MACHINE LEARNING Lecture-16: Decision Tree (contd.) + Random Forest Dr. Kamlesh Tiwari Assistant Professor Department of Computer Science and Information Systems Engineering, BITS Pilani, Rajasthan-333031

More information

Machine Learning (CSE 446): Concepts & the i.i.d. Supervised Learning Paradigm

Machine Learning (CSE 446): Concepts & the i.i.d. Supervised Learning Paradigm Machine Learning (CSE 446): Concepts & the i.i.d. Supervised Learning Paradigm Sham M Kakade c 2018 University of Washington cse446-staff@cs.washington.edu 1 / 17 Review 1 / 17 Decision Tree: Making a

More information

Using Weka for Classification. Preparing a data file

Using Weka for Classification. Preparing a data file Using Weka for Classification Preparing a data file Prepare a data file in CSV format. It should have the names of the features, which Weka calls attributes, on the first line, with the names separated

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Eric Medvet 16/3/2017 1/77 Outline Machine Learning: what and why? Motivating example Tree-based methods Regression trees Trees aggregation 2/77 Teachers Eric Medvet Dipartimento

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 6 Implementation: Real machine learning schemes Decision trees: from ID3 to C4.5 Pruning, missing values, numeric attributes, efficiency Decision rules:

More information

2012 Fall, CENG 514 Data Mining, Homework 3 Key by Dilek Önal

2012 Fall, CENG 514 Data Mining, Homework 3 Key by Dilek Önal 2012 Fall, CENG 514 Data Mining, Homework 3 Key by Dilek Önal SOLUTIONS Task 1 (Data conversion 15 points, Weka commands 10 points = 25 points) You should have implemented a piece of code which converts

More information

Unsupervised: no target value to predict

Unsupervised: no target value to predict Clustering Unsupervised: no target value to predict Differences between models/algorithms: Exclusive vs. overlapping Deterministic vs. probabilistic Hierarchical vs. flat Incremental vs. batch learning

More information

Clojure & Incanter. Introduction to Datasets & Charts. Data Sorcery with. David Edgar Liebke

Clojure & Incanter. Introduction to Datasets & Charts. Data Sorcery with. David Edgar Liebke Data Sorcery with Clojure & Incanter Introduction to Datasets & Charts National Capital Area Clojure Meetup 18 February 2010 David Edgar Liebke liebke@incanter.org Outline Overview What is Incanter? Getting

More information

Visualizing class probability estimators

Visualizing class probability estimators Visualizing class probability estimators Eibe Frank and Mark Hall Department of Computer Science University of Waikato Hamilton, New Zealand {eibe, mhall}@cs.waikato.ac.nz Abstract. Inducing classifiers

More information

In stochastic gradient descent implementations, the fixed learning rate η is often replaced by an adaptive learning rate that decreases over time,

In stochastic gradient descent implementations, the fixed learning rate η is often replaced by an adaptive learning rate that decreases over time, Chapter 2 Although stochastic gradient descent can be considered as an approximation of gradient descent, it typically reaches convergence much faster because of the more frequent weight updates. Since

More information

Decision Trees: Discussion

Decision Trees: Discussion Decision Trees: Discussion Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 This lecture: Learning Decision Trees 1. Representation: What are decision trees? 2. Algorithm: Learning

More information

An Introduction to Cluster Analysis. Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs

An Introduction to Cluster Analysis. Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs An Introduction to Cluster Analysis Zhaoxia Yu Department of Statistics Vice Chair of Undergraduate Affairs zhaoxia@ics.uci.edu 1 What can you say about the figure? signal C 0.0 0.5 1.0 1500 subjects Two

More information

h=[3,2,5,7], pos=[2,1], neg=[4,4]

h=[3,2,5,7], pos=[2,1], neg=[4,4] 2D1431 Machine Learning Lab 1: Concept Learning & Decision Trees Frank Hoffmann e-mail: hoffmann@nada.kth.se November 8, 2002 1 Introduction You have to prepare the solutions to the lab assignments prior

More information