Decision Trees: Discussion


Decision Trees: Discussion. Machine Learning, Spring 2018. The slides are mainly from Vivek Srikumar.

This lecture: Learning Decision Trees
1. Representation: What are decision trees?
2. Algorithm: Learning decision trees (the ID3 algorithm, a greedy heuristic)
3. Some extensions

Tips and Tricks
1. Decision tree variants
2. Handling examples with missing feature values
3. Non-Boolean features
4. Avoiding overfitting

1. Variants of information gain
Information gain is defined using entropy to measure the purity of the labels in the split datasets. There are other ways to measure purity. Example: the MajorityError, which computes: suppose the tree were not grown below this node and the majority label were chosen; what would the error be?
Suppose at some node there are 15 positive and 5 negative examples. What is the MajorityError? Answer: 5/20 = 1/4. The MajorityError works like entropy as an impurity measure.
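To make the comparison concrete, here is a minimal Python sketch (not from the slides; the function names are my own) that evaluates both impurity measures at the 15+/5- node:

```python
from math import log2

def entropy(pos, neg):
    """Entropy of the label distribution at a node, in bits."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            result -= p * log2(p)
    return result

def majority_error(pos, neg):
    """Error if we stopped at this node and predicted the majority label."""
    return min(pos, neg) / (pos + neg)

# The example from the slide: 15 positive and 5 negative examples.
print(entropy(15, 5))         # ~0.811 bits
print(majority_error(15, 5))  # 0.25
```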

2. Missing feature values
Suppose an example is missing the value of an attribute. What can we do at training time?

Day  Outlook  Temperature  Humidity  Wind    PlayTennis
1    Sunny    Hot          High      Weak    No
2    Sunny    Hot          High      Strong  No
8    Sunny    Mild         ???       Weak    No
9    Sunny    Cool         High      Weak    Yes
11   Sunny    Mild         Normal    Strong  Yes

2. Missing feature values
Suppose an example is missing the value of an attribute. What can we do at training time? Complete the example by:
- Using the most common value of the attribute in the data
- Using the most common value of the attribute among all examples with the same output
- Using fractional counts of the attribute values, e.g., Outlook = {5/14 Sunny, 4/14 Overcast, 5/14 Rain}
Exercise: Will this change probability computations?
At test time? Use the same method (see the sketch below).
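A minimal sketch of the first two completion strategies, assuming a toy list-of-dicts dataset; the column names and helper below are my own, not the slides':

```python
from collections import Counter

data = [
    {"Outlook": "Sunny", "Humidity": "High",   "PlayTennis": "No"},
    {"Outlook": "Sunny", "Humidity": "High",   "PlayTennis": "No"},
    {"Outlook": "Sunny", "Humidity": None,     "PlayTennis": "No"},   # missing value
    {"Outlook": "Sunny", "Humidity": "High",   "PlayTennis": "Yes"},
    {"Outlook": "Sunny", "Humidity": "Normal", "PlayTennis": "Yes"},
]

def most_common(attribute, examples):
    """Most frequent non-missing value of `attribute` among `examples`."""
    counts = Counter(ex[attribute] for ex in examples if ex[attribute] is not None)
    return counts.most_common(1)[0][0]

for ex in data:
    if ex["Humidity"] is None:
        # Strategy 1: most common value of the attribute in the whole data set.
        fill_global = most_common("Humidity", data)
        # Strategy 2: most common value among examples with the same output.
        same_label = [e for e in data if e["PlayTennis"] == ex["PlayTennis"]]
        fill_by_label = most_common("Humidity", same_label)
        print(fill_global, fill_by_label)  # both "High" for this toy data
```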

3. Non-Boolean features
If a feature can take multiple values, we have seen one edge per value (i.e., a multi-way split): for example, Outlook with branches Sunny, Overcast, and Rain.
Another option: make the attributes Boolean by testing for each value. Convert Outlook=Sunny into {Outlook:Sunny=True, Outlook:Overcast=False, Outlook:Rain=False}. Or, perhaps, group values into disjoint sets.
For numeric features, use thresholds or ranges to get Boolean/discrete alternatives.
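A minimal sketch of both conversions (the helper names are my own, not from the slides):

```python
def booleanize_categorical(name, value, all_values):
    """One Boolean feature per possible value of a categorical attribute."""
    return {f"{name}:{v}": (value == v) for v in all_values}

def booleanize_numeric(name, value, threshold):
    """A Boolean test of a numeric attribute against a chosen threshold."""
    return {f"{name}>{threshold}": value > threshold}

print(booleanize_categorical("Outlook", "Sunny", ["Sunny", "Overcast", "Rain"]))
# {'Outlook:Sunny': True, 'Outlook:Overcast': False, 'Outlook:Rain': False}
print(booleanize_numeric("Temperature", 78, threshold=75))
# {'Temperature>75': True}
```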

4. Overfitting

The First Bit function
A Boolean function with n inputs that simply returns the value of the first input; all other inputs are irrelevant.

X0  X1  Y
F   F   F
F   T   F
T   F   T
T   T   T

What is the decision tree for this function? Y = X0, and X1 is irrelevant: the tree is a single test on X0, predicting T on the T branch and F on the F branch.
Exercise: Convince yourself that ID3 will generate this tree.

The best case scenario: perfect data
Suppose we have all 2^n examples for training. What will the error be on any future examples? Zero! Because we have seen every possible input, the decision tree can represent the function, and ID3 will build a consistent tree.

Noisy data
What if the data is noisy, and we have all 2^n examples?

X0  X1  X2  Y
F   F   F   F
F   F   T   F
F   T   F   F
F   T   T   F
T   F   F   T
T   F   T   T
T   T   F   T
T   T   T   T

Suppose the outputs of both the training and test sets are randomly corrupted (some of the Y values above get flipped). The train and test sets are no longer identical: both have noise, possibly different.

E.g.: Output corrupted with probability 0.25
The data is noisy, and we have all 2^n examples.

[Figure: test accuracy for different input sizes; y-axis: test accuracy (0 to 1), x-axis: number of features (2 to 15). The error bars are generated by running the same experiment multiple times for the same setting.]

We can analytically compute the test error in this case; it is approximately 0.375.
Correct prediction:
P(training example uncorrupted AND test example uncorrupted) = 0.75 × 0.75
P(training example corrupted AND test example corrupted) = 0.25 × 0.25
P(correct prediction) = 0.625
Incorrect prediction:
P(training example uncorrupted AND test example corrupted) = 0.75 × 0.25
P(training example corrupted AND test example uncorrupted) = 0.25 × 0.75
P(incorrect prediction) = 0.375

What about the training accuracy? Training accuracy = 100%, because the learning algorithm will find a tree that agrees with the training data.

Then why is the classifier not perfect? The classifier overfits the training data.
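A minimal simulation sketch (my own, not the slides' experiment code) that checks the 0.625 figure empirically: a tree grown on all 2^n examples simply memorizes the corrupted training label, so test accuracy is the probability that the independently corrupted test label agrees with it.

```python
import random

random.seed(0)
p_corrupt = 0.25
n_bits = 10
trials = 200
accuracies = []

for _ in range(trials):
    correct = 0
    total = 2 ** n_bits
    for x in range(total):
        true_label = (x >> (n_bits - 1)) & 1                  # the "first bit" function
        train_label = true_label ^ (random.random() < p_corrupt)
        test_label = true_label ^ (random.random() < p_corrupt)
        # With all 2^n examples, the learned tree reproduces the training label.
        correct += (train_label == test_label)
    accuracies.append(correct / total)

print(sum(accuracies) / trials)   # approximately 0.625
```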

Overfitting
The learning algorithm fits the noise in the data: irrelevant attributes or noisy examples influence the choice of the hypothesis. This may lead to poor performance on future examples.

Overfitting: One definition
Data comes from a probability distribution D(X, Y), and we are using a hypothesis space H.
Errors:
- Training error for a hypothesis h ∈ H: error_train(h)
- True error for h ∈ H: error_D(h)
A hypothesis h overfits the training data if there is another hypothesis h' such that
1. error_train(h) < error_train(h')
2. error_D(h) > error_D(h')

Decision trees will overfit. [Graph from Mitchell.]

Avoiding overfitting with decision trees
Some approaches:
1. Fix the depth of the tree. A decision stump is a decision tree with only one level. It typically will not be very good by itself, but short decision trees can make very good features for a second layer of learning (see the sketch below).
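A minimal sketch of capping tree depth, assuming scikit-learn is available (this is not the slides' code, and the dataset choice is arbitrary): max_depth=1 yields a decision stump.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)   # one-level tree
full = DecisionTreeClassifier().fit(X_train, y_train)               # unrestricted depth

print("stump test accuracy:", stump.score(X_test, y_test))
print("full tree test accuracy:", full.score(X_test, y_test))
# The stump is weak on its own, but stumps are common base learners for
# boosting, i.e., a "second layer of learning".
```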

Avoiding overfitting with decision trees
Some approaches:
2. Optimize on a held-out set (also called development set or validation set) while growing the tree. Split your data into two parts, a training set and a held-out set. Grow your tree on the training split and check the performance on the held-out set after every new node is added. If growing the tree hurts validation set performance, stop growing (a simplified sketch follows).
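A minimal sketch of the held-out idea, assuming scikit-learn (this simplifies the node-by-node early stopping described above into choosing a maximum depth by held-out accuracy):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_heldout, y_train, y_heldout = train_test_split(X, y, random_state=0)

best_depth, best_acc = None, -1.0
for depth in range(1, 16):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    acc = tree.score(X_heldout, y_heldout)   # performance on the held-out split
    if acc > best_acc:
        best_depth, best_acc = depth, acc

print(f"best depth on the held-out set: {best_depth} (accuracy {best_acc:.3f})")
```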

Avoiding overfitting with decision trees
Some approaches:
3. Grow the full tree and then prune it as a post-processing step, in one of several ways:
   1. Use a validation set for pruning from the bottom up, greedily
   2. Convert the tree into a set of rules (one rule per path from root to leaf) and prune each rule independently
A post-pruning sketch follows.
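A minimal post-pruning sketch using scikit-learn's cost-complexity pruning, which is a different pruning criterion from the two methods listed above but follows the same workflow (this is not the slides' code): grow the full tree, then pick the pruning strength on a held-out set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths (ccp_alpha values) for this training set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_acc = 0.0, -1.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    acc = pruned.score(X_val, y_val)         # validate each pruned tree
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

print(f"chosen ccp_alpha: {best_alpha:.5f} (held-out accuracy {best_acc:.3f})")
```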

Reminder: Decision trees are rules
If Color=Blue and Shape=triangle, then output = B
If Color=Red, then output = B
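Written directly as code, the two rules from the slide look like this (a hypothetical example only, to make the point concrete):

```python
def classify(color, shape):
    # Rule 1: one root-to-leaf path of the tree.
    if color == "Blue" and shape == "triangle":
        return "B"
    # Rule 2: another path.
    if color == "Red":
        return "B"
    return None  # the slide does not specify the remaining cases

print(classify("Blue", "triangle"))  # B
print(classify("Red", "circle"))     # B
```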

Summary: Decision trees
A popular machine learning tool:
- Prediction is easy
- Can represent any Boolean function
- Greedy heuristics for learning, e.g., the ID3 algorithm (using information gain)
- Can be used for regression too
Decision trees are prone to overfitting unless you take care to avoid it.