Decision tree learning

Andrea Passerini (passerini@disi.unitn.it), Machine Learning

Learning the concept "go to lesson"
[Figure: example decision tree for the concept "go to lesson". The root node tests OUTLOOK (Rain, Overcast, Sunny); internal nodes test TRANSPORTATION (Uncovered, Covered) and LESSON (Theoretical, Practical); the leaves carry the YES/NO decision.]

Decision trees encode logical formulas
A decision tree represents a disjunction of conjunctions of constraints over attribute values. Each path from the root to a leaf is the conjunction of the constraints specified in the nodes along it, e.g.:

OUTLOOK = Overcast ∧ LESSON = Theoretical

The leaf contains the label to be assigned to instances reaching it. The disjunction of all paths is the logical formula represented by the tree.
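As an illustration, the Python sketch below enumerates the root-to-leaf paths of such a tree and prints each one as a conjunction; the tree as a whole is the disjunction of the paths ending in a positive leaf. The nested-dict representation and the YES/NO leaf labels are illustrative assumptions of mine, not taken verbatim from the figure.

# A hedged sketch: a tree is a nested dict {attribute: {value: subtree_or_leaf}};
# leaves are label strings. Structure and leaf labels are illustrative only.
example_tree = {
    "OUTLOOK": {
        "Rain":     {"TRANSPORTATION": {"Covered": "YES", "Uncovered": "NO"}},
        "Overcast": {"LESSON": {"Theoretical": "YES", "Practical": "NO"}},
        "Sunny":    "NO",
    }
}

def paths_to_formula(tree, constraints=()):
    """Yield (conjunction, label) for every root-to-leaf path."""
    if isinstance(tree, str):                      # leaf: emit the accumulated conjunction
        yield " AND ".join(constraints) or "TRUE", tree
        return
    (attribute, branches), = tree.items()          # a single attribute test per node
    for value, subtree in branches.items():
        yield from paths_to_formula(subtree, constraints + (f"{attribute} = {value}",))

# The tree encodes the disjunction of the conjunctions whose leaf label is YES.
for conjunction, label in paths_to_formula(example_tree):
    print(f"({conjunction}) -> {label}")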

Appropriate problems for decision trees
Binary or multiclass classification tasks (extensions to regression also exist).
Instances represented as attribute-value pairs.
Different explanations for the concept are possible (disjunction).
Some instances may have missing attribute values.
There is a need for an interpretable explanation of the output.

Learning decision trees
Greedy top-down strategy (ID3, Quinlan 1986; C4.5, Quinlan 1993). For each node, starting from the root with the full training set:
1. Choose the best attribute to be evaluated.
2. Add a child for each value of the attribute.
3. Split the node's training set among the children according to the value of the chosen attribute.
4. Stop splitting a node if it contains examples from a single class, or if there are no more attributes to test.
This is a divide-and-conquer approach; a minimal sketch of the recursion is given below.
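The sketch is my own illustrative Python, not Quinlan's code; it reuses the entropy and information_gain helpers sketched after the next two slides, and represents a tree as a nested dict {attribute: {value: subtree}} with label strings as leaves.

from collections import Counter

def id3(examples, labels, attributes):
    """Greedy top-down construction in the style of ID3.
    `examples` is a list of dicts mapping attribute names to values,
    `labels` is the parallel list of class labels,
    `attributes` is the list of attribute names still available for testing."""
    # Stop splitting: the node is pure, or there are no more attributes to test.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]       # majority label
    # 1. choose the best attribute; 2. add a child per value; 3. split and recurse.
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    remaining = [a for a in attributes if a != best]
    children = {}
    for v in set(x[best] for x in examples):
        subset = [(x, y) for x, y in zip(examples, labels) if x[best] == v]
        children[v] = id3([x for x, _ in subset], [y for _, y in subset], remaining)
    return {best: children}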

Choosing the best attribute: entropy
A measure of the amount of information contained in a collection of instances S whose elements can take one of c possible values:

H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

where p_i is the fraction of S taking value i. In our case the instances are training examples and the values are class labels. The entropy of a set of labelled examples measures its label inhomogeneity.
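A direct translation of this definition into Python (the function name and the use of plain label lists are my own conventions):

import math
from collections import Counter

def entropy(labels):
    """H(S): entropy, in bits, of a collection of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

# A perfectly mixed binary set has entropy 1 bit; a pure set has entropy zero.
print(entropy(["yes", "yes", "no", "no"]))    # 1.0
print(entropy(["yes", "yes", "yes", "yes"]))  # -0.0 (i.e. no uncertainty)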

Choosing the best attribute: information gain
The expected reduction in entropy obtained by partitioning a set S according to the value of an attribute A:

IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)

where Values(A) is the set of possible values taken by A and S_v is the subset of S taking value v for attribute A. The second term is the sum of the entropies of the subsets obtained by partitioning S on the values of A, weighted by their relative sizes. An attribute with high information gain tends to produce groups that are homogeneous in their labels, thus favouring their classification.
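A corresponding sketch of the information gain computation, reusing the entropy function above (the data layout and the example values are my own assumptions):

def information_gain(examples, labels, attribute):
    """IG(S, A): entropy of S minus the size-weighted entropies of the subsets S_v
    obtained by partitioning S on the values of `attribute`."""
    total = len(labels)
    gain = entropy(labels)
    for v in set(x[attribute] for x in examples):
        subset_labels = [y for x, y in zip(examples, labels) if x[attribute] == v]
        gain -= (len(subset_labels) / total) * entropy(subset_labels)
    return gain

# Hypothetical usage on toy data in the style of the "go to lesson" example:
examples = [{"OUTLOOK": "Rain"}, {"OUTLOOK": "Rain"}, {"OUTLOOK": "Sunny"}, {"OUTLOOK": "Sunny"}]
labels = ["YES", "NO", "NO", "NO"]
print(information_gain(examples, labels, "OUTLOOK"))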

Issues in decision tree learning: overfitting avoidance
Requiring that each leaf contains only examples of a single class can lead to very complex trees. A complex tree can easily overfit the training set, incorporating random regularities that are not representative of the full distribution, or noise in the data. It is therefore possible to accept impure leaves, assigning them the majority label of their training examples. There are two possible strategies for pruning a decision tree:
pre-pruning: decide whether to stop splitting a node even if it contains training examples with different labels;
post-pruning: learn a full tree and subsequently prune it by removing subtrees.

Reduced error pruning
A post-pruning strategy. It assumes a separate labelled validation set for the pruning stage. The procedure (a runnable sketch follows):
1. For each internal node in the tree, evaluate the performance on the validation set when the subtree rooted at it is removed.
2. If all node removals worsen performance, STOP.
3. Otherwise, choose the node whose removal yields the best performance improvement.
4. Replace the subtree rooted at it with a leaf.
5. Assign to the leaf the majority label of all training examples in the removed subtree.
6. Return to step 1.
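The sketch below implements this loop for the nested-dict trees used in the earlier snippets. It is a hedged illustration, not the original algorithm's code: the default label, the (example, label) pair layout of the data sets, and the helper names are my own assumptions.

import copy
from collections import Counter

def predict(tree, x, default="NO"):
    """Classify one example with a nested-dict tree; `default` covers unseen values."""
    while not isinstance(tree, str):
        (attribute, branches), = tree.items()
        tree = branches.get(x.get(attribute), default)
    return tree

def reduced_error_pruning(tree, train, val):
    """Repeatedly replace with a majority-label leaf the subtree whose removal most
    improves validation accuracy; stop when every removal worsens it.
    `train` and `val` are lists of (example_dict, label) pairs."""
    def accuracy(t, data):
        return sum(predict(t, x) == y for x, y in data) / len(data)

    def internal_nodes(t, path=()):
        """Paths (tuples of (attribute, value) branch choices) to internal nodes."""
        if isinstance(t, str):
            return []
        (attr, branches), = t.items()
        nodes = [path]
        for v, sub in branches.items():
            nodes += internal_nodes(sub, path + ((attr, v),))
        return nodes

    def pruned(t, path):
        """Copy of t with the subtree at `path` replaced by a majority-label leaf."""
        reaching = [y for x, y in train if all(x.get(a) == v for a, v in path)]
        leaf = Counter(reaching).most_common(1)[0][0] if reaching else "NO"
        if not path:                      # pruning the root yields a single leaf
            return leaf
        new = copy.deepcopy(t)
        node = new
        for a, v in path[:-1]:
            node = node[a][v]
        a, v = path[-1]
        node[a][v] = leaf
        return new

    while True:
        base = accuracy(tree, val)
        candidates = [pruned(tree, p) for p in internal_nodes(tree)]
        best = max(candidates, key=lambda t: accuracy(t, val), default=None)
        if best is None or accuracy(best, val) < base:   # all removals worsen performance
            return tree
        tree = best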

Issues in decision tree learning: dealing with continuous-valued attributes
Continuous-valued attributes need to be discretized in order to be used in internal node tests. The discretization threshold can be chosen so as to maximize the attribute quality criterion (e.g. information gain). Procedure:
1. Sort the examples according to their values of the continuous attribute.
2. For each pair of successive examples with different labels, place a candidate threshold at the average of the two attribute values.
3. For each candidate threshold, compute the information gain achieved by splitting the examples according to it.
4. Use the threshold producing the highest information gain to discretize the attribute.
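A small sketch of this procedure, assuming the entropy helper above (function names and the example values are mine):

def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted values whose examples have different labels."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, y1), (v2, y2) in zip(pairs, pairs[1:])
            if y1 != y2 and v1 != v2]

def best_threshold(values, labels):
    """Candidate threshold whose binary split (<= t vs > t) maximizes information gain."""
    def gain(t):
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        return entropy(labels) - (len(left) / len(labels)) * entropy(left) \
                               - (len(right) / len(labels)) * entropy(right)
    return max(candidate_thresholds(values, labels), key=gain)

# Hypothetical usage: thresholds are only placed where the label changes.
print(best_threshold([10, 15, 21, 27, 30], ["NO", "NO", "YES", "YES", "NO"]))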

Issues in decision tree learning: alternative attribute test measures
The information gain criterion tends to prefer attributes with a large number of possible values. As an extreme case, the unique ID of each example is an attribute that perfectly splits the data into singletons, but it is of no use on new examples. A measure of such spread is the entropy of the dataset with respect to the attribute values instead of the class labels:

H_A(S) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}

The gain ratio measure downweights the information gain by this attribute-value entropy:

IGR(S, A) = \frac{IG(S, A)}{H_A(S)}
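Since H_A(S) is just the entropy of S computed over attribute values rather than class labels, the gain ratio can reuse the helpers above (a sketch with my own naming):

def gain_ratio(examples, labels, attribute):
    """IGR(S, A): information gain divided by the entropy of the attribute-value
    distribution, which penalizes attributes with many distinct values."""
    split_info = entropy([x[attribute] for x in examples])   # H_A(S)
    if split_info == 0:                 # the attribute takes a single value: useless split
        return 0.0
    return information_gain(examples, labels, attribute) / split_info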

Issues in decision tree learning: handling attributes with missing values
Assume example x with class c(x) has a missing value for attribute A, and A is to be tested at node n.
Simple solution: assign to x the most common value of A among the examples in n, or (during training) the most common value among the examples in n with class c(x).
Complex solution: propagate x to each child of n, with a fractional weight equal to the proportion of examples taking the corresponding attribute value. The complex solution implies that at test time, for each candidate class, all fractions of the test example that reached a leaf with that class are summed, and the example is assigned the class with the highest overall value.
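Two corresponding sketches, assuming as above that examples are stored as dicts with None for missing values (the function names and weighting details are my own, not from the slides):

from collections import Counter, defaultdict

def impute_most_common(examples, labels, attribute, target_label=None):
    """Simple solution: the most common value of `attribute` among the examples at the
    node, optionally restricted (during training) to those with class `target_label`."""
    pool = [x[attribute] for x, y in zip(examples, labels)
            if x.get(attribute) is not None
            and (target_label is None or y == target_label)]
    return Counter(pool).most_common(1)[0][0]

def fractional_split(examples, weights, attribute):
    """Complex solution: examples with a missing value are sent down every branch with
    a weight proportional to the fraction of (weighted) examples taking that value."""
    value_weight = defaultdict(float)
    for x, w in zip(examples, weights):
        if x.get(attribute) is not None:
            value_weight[x[attribute]] += w
    total = sum(value_weight.values())
    branches = defaultdict(list)                      # value -> list of (example, weight)
    for x, w in zip(examples, weights):
        if x.get(attribute) is not None:
            branches[x[attribute]].append((x, w))
        else:
            for v, vw in value_weight.items():        # propagate a fraction to each child
                branches[v].append((x, w * vw / total))
    return branches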