Machine Learning. Decision Trees. Manfred Huber, 2015


Decision Trees
The classifiers covered so far have been:
- Non-parametric (KNN)
- Probabilistic with independence (Naïve Bayes)
- Linear in the features (logistic regression, SVM)
Decision trees use a different representation that is inherently non-linear:
- The hypothesis space contains logic sentences over boolean variables and feature functions (e.g. =, >)
- Intuitive representation
- Can be converted into rules relatively easily

Decision Trees
Hypotheses are represented as trees:
- Nodes are variables (features)
- Leaves are class predictions (or class probabilities); leaves can also contain subsets of the training set
- Arcs represent node evaluations
[Example tree from Tom Mitchell]
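To make this representation concrete, here is a minimal sketch of such a tree in Python; the names (Node, classify) and the example tree are illustrative choices of mine, not from the slides:

    # Minimal decision tree representation: internal nodes test a feature,
    # arcs correspond to feature values, leaves store a class prediction.
    class Node:
        def __init__(self, feature=None, children=None, prediction=None):
            self.feature = feature          # feature tested at this node (None for a leaf)
            self.children = children or {}  # feature value -> child Node
            self.prediction = prediction    # class label stored at a leaf

    def classify(node, example):
        """Follow the arcs matching the example's feature values down to a leaf."""
        while node.feature is not None:
            node = node.children[example[node.feature]]
        return node.prediction

    # Example: the classic play-tennis tree (Outlook / Humidity / Wind).
    tree = Node("Outlook", {
        "Sunny": Node("Humidity", {"High": Node(prediction="No"),
                                   "Normal": Node(prediction="Yes")}),
        "Overcast": Node(prediction="Yes"),
        "Rain": Node("Wind", {"Strong": Node(prediction="No"),
                              "Weak": Node(prediction="Yes")}),
    })
    print(classify(tree, {"Outlook": "Sunny", "Humidity": "Normal"}))  # -> Yes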

Decision Trees
Decision trees represent logical sentences in disjunctive normal form (DNF):
- Consecutive elements along a branch form conjuncts
- Branches with the same leaf label form disjuncts
- Variables/features are assumed to be independent
They can be converted into rules by making the sentences into implications with the class as the right-hand side, as illustrated below.
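For the play-tennis tree sketched above (a standard example; the slide itself does not spell it out), the branches with leaf label Yes form the DNF sentence

(Outlook = Overcast) ∨ (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Rain ∧ Wind = Weak)

and each disjunct becomes one rule, e.g. Outlook = Sunny ∧ Humidity = Normal ⇒ Yes.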

Decision Trees
Decision trees work with continuous and discrete variables and form complex decision boundaries:
- Non-continuous
- Non-differentiable
Decision trees can dramatically overfit noisy data.

Decision Trees
- Variables can occur multiple times on a branch of the tree if they are non-boolean
- Decision trees with the predicates =, < represent all possible DNF sentences with arbitrary thresholds
- Trees can have very different depths: the hypothesis space is unlimited, and there are generally many decision trees for the same classification
Decision trees can be extended by:
- Using additional features
- Including additional predicates for evaluation

Learning Decision Trees
There is no continuous interpretation of either the probability or the decision boundary:
- No derivative with respect to the tree parameters
To optimize performance, decision tree algorithms generally use search:
- Since the space of all trees is intractable, they typically use greedy, fixed-lookahead search
- Each greedy search step adds one node to the tree
- The resulting tree is generally not optimal

Learning Decision Trees
Performance function:
- Classification performance (purity of the prediction)
Tree construction criterion:
- Tree simplicity, usually in terms of the number of nodes or the depth of the tree
The basic algorithm iterates as follows (a sketch is given below):
- Do a lookahead search to find the best feature node
- Terminate if the feature does not improve performance
- Add the node and split the parent data set on the node's values
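A minimal sketch of this greedy loop in Python, building on the Node class above; score_split stands for whatever split criterion is used (misclassification error, information gain, ...), and all helper names are mine, not the lecture's:

    from collections import Counter

    def majority_class(examples):
        """Most common label in a list of (features, label) pairs."""
        return Counter(label for _, label in examples).most_common(1)[0][0]

    def build_tree(examples, features, score_split):
        """Greedy top-down induction: pick the best feature, split, recurse."""
        labels = {label for _, label in examples}
        if len(labels) == 1 or not features:          # pure node, or nothing left to test
            return Node(prediction=majority_class(examples))
        # One-step lookahead: score every remaining feature on this data subset.
        best = max(features, key=lambda f: score_split(examples, f))
        if score_split(examples, best) <= 0:          # best feature does not improve performance
            return Node(prediction=majority_class(examples))
        children = {}
        for value in {x[best] for x, _ in examples}:  # split the parent data on the node's values
            subset = [(x, y) for x, y in examples if x[best] == value]
            children[value] = build_tree(subset, features - {best}, score_split)
        return Node(feature=best, children=children)

It would be called, for example, as build_tree(data, {"Outlook", "Humidity", "Wind"}, information_gain), with the gain function defined a few slides below.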

Learning Decision Trees
How can we pick the best feature?
- Error in terms of the number of misclassified data points
- This does not consider whether the result of the split might allow for a better split later
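As a sketch, the misclassification-error criterion from this slide could look as follows (helper names are illustrative):

    from collections import Counter

    def misclassification_error(examples):
        """Fraction of (features, label) examples not in the majority class."""
        counts = Counter(label for _, label in examples)
        return 1.0 - counts.most_common(1)[0][1] / len(examples)

    def error_reduction(examples, feature):
        """Decrease in weighted misclassification error when splitting on feature."""
        after = 0.0
        for value in {x[feature] for x, _ in examples}:
            subset = [(x, y) for x, y in examples if x[feature] == value]
            after += len(subset) / len(examples) * misclassification_error(subset)
        return misclassification_error(examples) - after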

Learning Decision Trees
How can we pick the best feature?
- Information gain (reduction in the entropy)
The performance function is the entropy

H_D(Y) = -\sum_{y \in C} P_D(y) \log P_D(y)

The best feature is the one that reduces the entropy the most (has the highest information gain):

Gain(D, A) = I_D(A; Y) = H_D(Y) - H_D(Y \mid A) = H_D(Y) - \sum_{a \in \mathrm{range}(A)} P_D(A = a) \, H_D(Y \mid A = a)

This prefers purer sets.
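In Python, a minimal sketch of these two quantities (function and variable names are mine):

    import math
    from collections import Counter

    def entropy(examples):
        """H_D(Y): entropy of the class labels in a list of (features, label) pairs."""
        counts = Counter(label for _, label in examples)
        total = len(examples)
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    def information_gain(examples, feature):
        """Gain(D, A) = H_D(Y) - sum_a P_D(A=a) * H_D(Y | A=a)."""
        total = len(examples)
        remainder = 0.0
        for value in {x[feature] for x, _ in examples}:
            subset = [(x, y) for x, y in examples if x[feature] == value]
            remainder += len(subset) / total * entropy(subset)
        return entropy(examples) - remainder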

Example
[Example figure from Tom Mitchell]

Example
Splitting the 14-example set S (entropy 0.940) on Outlook gives subsets with entropies E = 0.971 (Sunny, 5 examples), E = 0 (Overcast, 4 examples), and E = 0.971 (Rain, 5 examples):

Gain(S, Outlook) = 0.940 - (5/14)·0.971 - (4/14)·0 - (5/14)·0.971 = 0.246
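A quick check of this arithmetic in Python; the class counts (9+/5- overall, 2+/3- for Sunny, 4+/0- for Overcast, 3+/2- for Rain) are from the standard Mitchell play-tennis data, which the slide does not repeat:

    import math

    def H(p, n):
        """Entropy of a set with p positive and n negative examples."""
        total = p + n
        return -sum(c / total * math.log2(c / total) for c in (p, n) if c > 0)

    print(round(H(9, 5), 3))   # 0.940  (entropy of S)
    print(round(H(2, 3), 3))   # 0.971  (Sunny and, symmetrically, Rain)
    gain = H(9, 5) - (5/14) * H(2, 3) - (4/14) * H(4, 0) - (5/14) * H(3, 2)
    print(round(gain, 3))      # 0.246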

Multi-Valued Features
Features with multiple values have to be treated differently:
- Many-way splits result in very small sets and thus unreliable estimates of the entropy
The gain ratio provides one possible way to overcome this: it looks at the information gain relative to the intrinsic information of the feature,

GainRatio(D, A) = \frac{Gain(D, A)}{-\sum_{a \in \mathrm{range}(A)} \frac{|D_a|}{|D|} \log \frac{|D_a|}{|D|}}

This tries to compensate for the benefit of having many split options.
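A sketch of the gain ratio on top of the information_gain function above (again with illustrative names):

    import math
    from collections import Counter

    def split_information(examples, feature):
        """Intrinsic information of a split: entropy of the proportions |D_a| / |D|."""
        total = len(examples)
        counts = Counter(x[feature] for x, _ in examples)
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    def gain_ratio(examples, feature):
        """Information gain normalized by the intrinsic information of the feature."""
        si = split_information(examples, feature)
        return information_gain(examples, feature) / si if si > 0 else 0.0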

Continuous Features
Continuous features have to be converted into boolean (or multi-valued) ones:
- For a continuous variable, a threshold for the < predicate has to be picked
- Sort the data elements and evaluate every possible threshold at which the classification changes; other points cannot result in the maximum gain
- Use the threshold with the highest gain (a sketch is given below)
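A sketch of this threshold search, reusing the entropy function above (names are mine):

    def best_threshold(examples, feature):
        """Pick the threshold t for 'feature < t' with the highest information gain.

        Only midpoints between consecutive sorted values where the class label
        changes are evaluated; other candidate thresholds cannot be optimal.
        """
        ordered = sorted(examples, key=lambda e: e[0][feature])
        best_t, best_gain = None, -1.0
        for (x1, y1), (x2, y2) in zip(ordered, ordered[1:]):
            if y1 == y2 or x1[feature] == x2[feature]:
                continue                                  # classification does not change here
            t = (x1[feature] + x2[feature]) / 2           # candidate threshold
            below = [(x, y) for x, y in examples if x[feature] < t]
            above = [(x, y) for x, y in examples if x[feature] >= t]
            n = len(examples)
            gain = entropy(examples) - (len(below) / n * entropy(below)
                                        + len(above) / n * entropy(above))
            if gain > best_gain:
                best_t, best_gain = t, gain
        return best_t, best_gain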

Overfitting in Decision Trees
If there is noise in the data, decision trees can significantly overfit:
- To detect overfitting, we need to use test data that was not used for training

Overfitting in Decision Trees
There are two basic mechanisms to address overfitting:
- Early stopping when a split is not significant
- Post-pruning after complete learning: evaluate the benefit of pruning on the accuracy on a test set, and remove the node (branch) with the highest accuracy gain (a simplified sketch follows)
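A simplified bottom-up variant of such post-pruning in Python, building on Node, classify, and majority_class from the sketches above (the lecture's version instead repeatedly removes the single node with the highest accuracy gain):

    def accuracy(tree, examples):
        """Fraction of held-out examples the tree classifies correctly."""
        return sum(classify(tree, x) == y for x, y in examples) / len(examples)

    def prune(tree, node, validation, train_at_node):
        """Replace node by a majority-class leaf unless that hurts validation accuracy.

        tree is the root (used to measure accuracy); train_at_node is the
        training data that reached this node. Called as prune(root, root, val, train).
        """
        if node.feature is None:
            return
        for value, child in node.children.items():
            subset = [(x, y) for x, y in train_at_node if x[node.feature] == value]
            prune(tree, child, validation, subset)
        before = accuracy(tree, validation)
        saved = node.feature, node.children
        node.feature, node.prediction = None, majority_class(train_at_node)
        if accuracy(tree, validation) < before:       # pruning hurt: undo it
            node.feature, node.children = saved
            node.prediction = None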

Decision Trees and Rules
Decision trees can easily be converted into a set of rules (one rule per branch), as sketched below. In C4.5 (and a number of other decision tree algorithms) pruning is performed on the rules:
- Prune each rule individually
- Sort the rules into the desired sequence for faster use
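A minimal rule-extraction sketch, reusing the Node class and the example tree from above:

    def extract_rules(node, conditions=()):
        """One rule per branch: the conjunction of arc tests implies the leaf class."""
        if node.feature is None:
            return [(list(conditions), node.prediction)]
        rules = []
        for value, child in node.children.items():
            rules += extract_rules(child, conditions + ((node.feature, value),))
        return rules

    # For the play-tennis tree above this prints rules such as
    #   Outlook = Sunny AND Humidity = Normal => Yes
    for conds, label in extract_rules(tree):
        print(" AND ".join(f"{f} = {v}" for f, v in conds), "=>", label)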

Decision Trees
Decision trees are a very frequently used type of classifier, in particular when discrete features are present:
- Can represent highly non-linear classes
- Can be translated into rules
- The result is easy to understand and interpret
- Do not represent continuous relationships
- Can be augmented by including logistic regression predicates and multi-variate feature functions
Many practical algorithms exist: ID3, C4.5, ...