Homework 1 Sample Solution

1. Iris: All attributes of the iris data set are numeric, and Weka's ID3 handles only nominal attributes, so ID3 cannot be applied to this data set.

Contact-lenses: ID3 produces the tree

tear-prod-rate = reduced: none
tear-prod-rate = normal
|  astigmatism = no
|  |  age = young: soft
|  |  age = pre-presbyopic: soft
|  |  age = presbyopic
|  |  |  spectacle-prescrip = myope: none
|  |  |  spectacle-prescrip = hypermetrope: soft
|  astigmatism = yes
|  |  spectacle-prescrip = myope: hard
|  |  spectacle-prescrip = hypermetrope
|  |  |  age = young: hard
|  |  |  age = pre-presbyopic: none
|  |  |  age = presbyopic: none

Error rates:

              Training set   2-fold CV    5-fold CV    10-fold CV
Error rate    0 %            33.3333 %    16.6667 %    29.1667 %

Notes: Details of k-fold cross-validation are in the textbook, pp. 111-112. When evaluating on the training set there is no separate test set: the model is tested on the same instances it was built from, and since ID3 without pruning fits the training data exactly, the training-set error rate is zero.

Weather-nominal: ID3 produces the tree

outlook = sunny
|  humidity = high: no
|  humidity = normal: yes
outlook = overcast: yes
outlook = rainy
|  windy = TRUE: no
|  windy = FALSE: yes

Error rates:

              Training set   2-fold CV    5-fold CV    10-fold CV
Error rate    0 %            71.4286 %    14.2857 %    21.4286 %
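
As an aside, the error rates in the tables above could be reproduced programmatically along the following lines. This is only a minimal sketch, assuming Weka 3.x with the Id3 learner (weka.classifiers.trees.Id3) on the classpath; the ARFF path and the class name Id3ErrorRates are placeholders, and the exact cross-validation percentages depend on the random seed used to form the folds.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.Id3;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Id3ErrorRates {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("contact-lenses.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);             // class is the last attribute

        // Error rate when testing on the training set itself.
        Id3 id3 = new Id3();
        id3.buildClassifier(data);
        Evaluation trainEval = new Evaluation(data);
        trainEval.evaluateModel(id3, data);
        System.out.println("Training set: " + trainEval.pctIncorrect() + " %");

        // k-fold cross-validation for k = 2, 5, 10.
        for (int folds : new int[]{2, 5, 10}) {
            Evaluation cvEval = new Evaluation(data);
            cvEval.crossValidateModel(new Id3(), data, folds, new Random(1));
            System.out.println(folds + "-fold CV: " + cvEval.pctIncorrect() + " %");
        }
    }
}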

2. (a) One dataset that forces a decision-tree learning algorithm without pruning to build a complete decision tree can be defined as follows. With n binary-valued attributes, the dataset contains all 2^n tuples, one for each possible instantiation of the n attributes, and each tuple gets a distinct class value (here, the decimal value of the attribute pattern read as a binary number, plus one). For example, for n = 5 the generated dataset is:

@RELATION b

@ATTRIBUTE winter {0,1}
@ATTRIBUTE sunny {0,1}
@ATTRIBUTE morning {0,1}
@ATTRIBUTE cool {0,1}
@ATTRIBUTE play {0,1}
@ATTRIBUTE class {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32}

@DATA
0,0,0,0,0,1
0,0,0,0,1,2
0,0,0,1,0,3
0,0,0,1,1,4
0,0,1,0,0,5
0,0,1,0,1,6
0,0,1,1,0,7
0,0,1,1,1,8
0,1,0,0,0,9
0,1,0,0,1,10
0,1,0,1,0,11
0,1,0,1,1,12
0,1,1,0,0,13
0,1,1,0,1,14
0,1,1,1,0,15
0,1,1,1,1,16
1,0,0,0,0,17
1,0,0,0,1,18
1,0,0,1,0,19
1,0,0,1,1,20
1,0,1,0,0,21
1,0,1,0,1,22
1,0,1,1,0,23
1,0,1,1,1,24
1,1,0,0,0,25
1,1,0,0,1,26
1,1,0,1,0,27
1,1,0,1,1,28
1,1,1,0,0,29
1,1,1,0,1,30
1,1,1,1,0,31
1,1,1,1,1,32

(b) One dataset for which a decision-tree learning algorithm without pruning generates a tree with only one internal node can be defined as follows. Again the dataset contains all 2^n tuples over the n binary-valued attributes, but now the class value depends on a single attribute: it is the opposite of that attribute's value. For example, for n = 5 the generated dataset is (here the class is the opposite of the attribute play):

@RELATION a

@ATTRIBUTE winter {0,1}
@ATTRIBUTE sunny {0,1}
@ATTRIBUTE morning {0,1}
@ATTRIBUTE cool {0,1}
@ATTRIBUTE play {0,1}
@ATTRIBUTE class {yes,no}

@DATA
0,0,0,0,0,yes
0,0,0,0,1,no
0,0,0,1,0,yes
0,0,0,1,1,no
0,0,1,0,0,yes
0,0,1,0,1,no
0,0,1,1,0,yes
0,0,1,1,1,no
0,1,0,0,0,yes
0,1,0,0,1,no
0,1,0,1,0,yes
0,1,0,1,1,no
0,1,1,0,0,yes
0,1,1,0,1,no
0,1,1,1,0,yes
0,1,1,1,1,no
1,0,0,0,0,yes
1,0,0,0,1,no
1,0,0,1,0,yes
1,0,0,1,1,no
1,0,1,0,0,yes
1,0,1,0,1,no
1,0,1,1,0,yes
1,0,1,1,1,no
1,1,0,0,0,yes
1,1,0,0,1,no
1,1,0,1,0,yes
1,1,0,1,1,no
1,1,1,0,0,yes
1,1,1,0,1,no
1,1,1,1,0,yes
1,1,1,1,1,no
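
Because the construction in (a) is purely mechanical (enumerate all 2^n attribute patterns and assign each one its own class), the ARFF file can be generated rather than typed by hand. The following is only a minimal sketch, not part of the required solution; the class name GenerateArff and the fixed attribute names are placeholders, and it writes the dataset from (a) to standard output.

// Sketch: print an ARFF file in the style of (a) for n binary attributes,
// where every one of the 2^n attribute patterns gets its own class value.
public class GenerateArff {
    public static void main(String[] args) {
        int n = 5;                                     // number of binary attributes
        String[] names = {"winter", "sunny", "morning", "cool", "play"};  // must have n entries
        int rows = 1 << n;                             // 2^n tuples

        System.out.println("@RELATION b");
        for (int i = 0; i < n; i++) {
            System.out.println("@ATTRIBUTE " + names[i] + " {0,1}");
        }
        StringBuilder classValues = new StringBuilder();
        for (int c = 1; c <= rows; c++) {
            classValues.append(c).append(c < rows ? "," : "");
        }
        System.out.println("@ATTRIBUTE class {" + classValues + "}");

        System.out.println("@DATA");
        for (int r = 0; r < rows; r++) {
            StringBuilder row = new StringBuilder();
            for (int i = n - 1; i >= 0; i--) {         // most significant bit first
                row.append((r >> i) & 1).append(",");
            }
            row.append(r + 1);                         // class = binary pattern as decimal, plus one
            System.out.println(row);
        }
    }
}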

(c) For (a), the size of the resulting tree is 2^(n+1) - 1, which is of order O(2^n), and the size of the dataset that creates such a complete decision tree on n features is 2^n. The reason: a complete binary tree over n binary attributes has n+1 layers, with 2^i nodes in layer i (i = 0, 1, ..., n) and 2^n leaves in the bottom layer. The total number of nodes is therefore 1 + 2 + 2^2 + ... + 2^n = 2^(n+1) - 1, i.e. O(2^n); for n = 5 this gives 2^6 - 1 = 63 nodes, 32 of which are leaves. The size of the dataset equals the number of leaves of the complete tree, which is likewise O(2^n).

For (b), the resulting tree has only three nodes (one internal node plus two leaves), but the dataset still has 2^n tuples.

(d) For n = 5, running ID3 on the datasets defined above produces the desired decision trees (a short sketch of how these trees can be reproduced in Weka follows the two trees):

For (a):

winter = 0
|  sunny = 0
|  |  morning = 0
|  |  |  cool = 0
|  |  |  |  play = 0: 1
|  |  |  |  play = 1: 2
|  |  |  cool = 1
|  |  |  |  play = 0: 3
|  |  |  |  play = 1: 4
|  |  morning = 1
|  |  |  cool = 0
|  |  |  |  play = 0: 5
|  |  |  |  play = 1: 6
|  |  |  cool = 1
|  |  |  |  play = 0: 7
|  |  |  |  play = 1: 8
|  sunny = 1
|  |  morning = 0
|  |  |  cool = 0
|  |  |  |  play = 0: 9
|  |  |  |  play = 1: 10
|  |  |  cool = 1
|  |  |  |  play = 0: 11
|  |  |  |  play = 1: 12
|  |  morning = 1
|  |  |  cool = 0
|  |  |  |  play = 0: 13
|  |  |  |  play = 1: 14
|  |  |  cool = 1
|  |  |  |  play = 0: 15
|  |  |  |  play = 1: 16
winter = 1
|  sunny = 0
|  |  morning = 0
|  |  |  cool = 0
|  |  |  |  play = 0: 17
|  |  |  |  play = 1: 18
|  |  |  cool = 1
|  |  |  |  play = 0: 19
|  |  |  |  play = 1: 20
|  |  morning = 1
|  |  |  cool = 0
|  |  |  |  play = 0: 21
|  |  |  |  play = 1: 22
|  |  |  cool = 1
|  |  |  |  play = 0: 23
|  |  |  |  play = 1: 24
|  sunny = 1
|  |  morning = 0
|  |  |  cool = 0
|  |  |  |  play = 0: 25
|  |  |  |  play = 1: 26
|  |  |  cool = 1
|  |  |  |  play = 0: 27
|  |  |  |  play = 1: 28
|  |  morning = 1
|  |  |  cool = 0
|  |  |  |  play = 0: 29
|  |  |  |  play = 1: 30
|  |  |  cool = 1
|  |  |  |  play = 0: 31
|  |  |  |  play = 1: 32

For (b):

play = 0: yes
play = 1: no
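
A minimal sketch, assuming Weka 3.x with weka.classifiers.trees.Id3 available, of how the tree for (a) above can be reproduced; the file name b.arff and the class name PrintId3Tree are placeholders for wherever the dataset from (a) is saved:

import weka.classifiers.trees.Id3;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PrintId3Tree {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("b.arff");    // placeholder path to the dataset from (a)
        data.setClassIndex(data.numAttributes() - 1);  // "class" is the last attribute
        Id3 tree = new Id3();
        tree.buildClassifier(data);                    // Id3 does no pruning, so the tree is complete
        System.out.println(tree);                      // prints the tree in the indented form shown above
    }
}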

3. To change the splitting criterion of ID3 from information gain to the Gini index, modify the code as follows:

(1) In the method makeTree(data):

a) Change

   m_Attribute = data.attribute(Utils.maxIndex(infoGains));

   to

   m_Attribute = data.attribute(Utils.minIndex(infoGains));

   because the weighted Gini index of a split should be minimized rather than maximized.

b) In the while loop over the attributes, instead of calling computeInfoGain(data, att), call a new method computeGini(data, att) that calculates the weighted Gini index of the split.

(2) Implement the method computeGini(data, att) as follows:

// Weighted Gini index of splitting 'data' on attribute 'att':
// the sum over the branches of (branch size / total size) * Gini(branch).
private double computeGini(Instances data, Attribute att) throws Exception {
    double gini = 0;
    Instances[] splitData = splitData(data, att);
    for (int j = 0; j < att.numValues(); j++) {
        if (splitData[j].numInstances() > 0) {
            gini += ((double) splitData[j].numInstances() /
                     (double) data.numInstances()) *
                    computeGiniIndex(splitData[j]);
        }
    }
    return gini;
}

(3) Implement the method computeGiniIndex(data) as follows (it computes Gini(D) = 1 - sum_j p_j^2, where p_j is the proportion of instances in the data that belong to class j):

// Gini index of a set of instances: 1 minus the sum of squared class proportions.
private double computeGiniIndex(Instances data) throws Exception {
    double[] classCounts = new double[data.numClasses()];
    Enumeration instEnum = data.enumerateInstances();
    while (instEnum.hasMoreElements()) {
        Instance inst = (Instance) instEnum.nextElement();
        classCounts[(int) inst.classValue()]++;
    }
    double gini = 1;
    for (int j = 0; j < data.numClasses(); j++) {
        if (classCounts[j] > 0) {
            double p = classCounts[j] / (double) data.numInstances();
            gini -= p * p;
        }
    }
    return gini;
}
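
As a quick arithmetic check on the formulas above (not required by the assignment), a node with class counts [9, 5] has Gini index 1 - (9/14)^2 - (5/14)^2, which is about 0.459, while a pure node has Gini index 0. The following is a small standalone sketch, independent of Weka, that mirrors the computations in computeGiniIndex and computeGini; the class name GiniCheck and the example counts are placeholders.

// Standalone check: Gini index of one class distribution and the weighted Gini of a split.
public class GiniCheck {

    // Gini(D) = 1 - sum_j p_j^2 for the class counts of one node.
    static double gini(double[] classCounts) {
        double total = 0;
        for (double c : classCounts) total += c;
        double g = 1;
        for (double c : classCounts) {
            if (c > 0) {
                double p = c / total;
                g -= p * p;
            }
        }
        return g;
    }

    // Weighted Gini of a split: sum over branches of (branch size / total size) * Gini(branch).
    static double weightedGini(double[][] branches) {
        double total = 0;
        double[] sizes = new double[branches.length];
        for (int b = 0; b < branches.length; b++) {
            for (double c : branches[b]) sizes[b] += c;
            total += sizes[b];
        }
        double g = 0;
        for (int b = 0; b < branches.length; b++) {
            if (sizes[b] > 0) g += (sizes[b] / total) * gini(branches[b]);
        }
        return g;
    }

    public static void main(String[] args) {
        System.out.println(gini(new double[]{9, 5}));                      // about 0.459
        System.out.println(weightedGini(new double[][]{{6, 2}, {3, 3}}));  // weighted Gini of a 2-way split
    }
}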