Decision Trees / Discrete Variables

Decision trees

Decision Trees / Discrete Variables
Training examples:
  Season  Location   Fun?
  summer  prison     -1
  summer  beach      +1
  winter  ski-slope  +1
  winter  beach      -1
Tree: split on Location. Prison -> -1; Ski Slope -> +1; Beach -> split on Season: Summer -> +1, Winter -> -1.
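
The tree above is just a set of IF-THEN-ELSE rules. As an illustration (not part of the original slides), here is a minimal Python sketch of that tree; the function name predict_fun is made up for the example.

def predict_fun(season, location):
    # The Location/Season tree from the slide, written as nested IF-THEN-ELSE rules.
    if location == "prison":
        return -1
    if location == "ski-slope":
        return +1
    # location == "beach": the label depends on the season
    return +1 if season == "summer" else -1

# Reproduces the four training examples:
# predict_fun("summer", "prison") == -1,    predict_fun("summer", "beach") == +1,
# predict_fun("winter", "ski-slope") == +1, predict_fun("winter", "beach") == -1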

Decision Trees / Discrete Variables
Training examples:
  Mass  Temperature  Explosion?
  1     100          -1
  3.4   945          -1
  10    32           -1
  11.5  1202         +1
Tree: Mass>8? no -> -1; yes -> Temperature>500? no -> -1; yes -> +1.

Decision Trees
Tree: X>3? no -> -1; yes -> Y>5? no -> -1; yes -> +1.
(Figure: the (X, Y) plane partitioned by the lines X = 3 and Y = 5; the region with X > 3 and Y > 5 is labeled +1, the rest -1.)

Decision trees are popular because they are very flexible and easy to interpret.
Learning a decision tree = finding a tree with small error on the training set.
1. Start with the root node.
2. At each step, split one of the leaves.
3. Repeat until a termination criterion is met.

Which node to split? We want the children to be more pure than the parent. Example: the parent node is 50%+, 50%-; the child nodes are (90%+, 10%-) and (10%+, 90%-). How can we quantify the improvement in purity?

First approach: minimize error.
P(+1) = probability of label +1 in the parent node.
P(A), P(B) = probabilities of each one of the children, with P(A) + P(B) = 1.
P(+1|A), P(+1|B) = probability of label +1 conditioned on each of the children.
P(+1) = P(+1|A)P(A) + P(+1|B)P(B)
(Diagram: the parent splits on a yes/no rule; the "no" branch leads to child A, the "yes" branch to child B.)
At the parent: if P(+1) > P(-1) then predict +1, else predict -1. Error rate = min(P(+1), P(-1)) = min(P(+1), 1 - P(+1)).
At node A: if P(+1|A) > P(-1|A) then predict +1, else predict -1. Error rate = min(P(+1|A), 1 - P(+1|A)).
At node B: if P(+1|B) > P(-1|B) then predict +1, else predict -1. Error rate = min(P(+1|B), 1 - P(+1|B)).
Combined error of A and B: P(A) min(P(+1|A), 1 - P(+1|A)) + P(B) min(P(+1|B), 1 - P(+1|B)).
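
As a quick check of these formulas, here is a small Python sketch (names are illustrative) that computes the combined error of a split and applies it to the 50%/50% parent with (90%+, 10%-) and (10%+, 90%-) children from the earlier slide.

def err(p):
    # Error rate of the majority-vote prediction when the probability of +1 is p.
    return min(p, 1 - p)

def split_error(p_A, p_plus_A, p_plus_B):
    # Combined error of the children: P(A)*err(P(+1|A)) + P(B)*err(P(+1|B)), with P(B) = 1 - P(A).
    return p_A * err(p_plus_A) + (1 - p_A) * err(p_plus_B)

print(err(0.5))                    # parent error     = 0.5
print(split_error(0.5, 0.9, 0.1))  # children's error = 0.5*0.1 + 0.5*0.1 = 0.1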

The problem with classification error. Define err(p) = min(p, 1-p). Then
error rate at parent - error rate at children = err(P(+1)) - ( P(A) err(P(+1|A)) + P(B) err(P(+1|B)) ).
We also know that P(+1) = P(+1|A)P(A) + P(+1|B)P(B).
Therefore, if P(+1|A) > 1/2 and P(+1|B) > 1/2, or P(+1|A) < 1/2 and P(+1|B) < 1/2, then the change in the error is zero.

The problem with classification error (pictorially): P(+1|A) = 0.7, P(+1) = 0.8, P(+1|B) = 0.9. (Figure: err(p) with these three probabilities marked; because all three lie on the same linear piece of err, the weighted error of the children equals the error of the parent.)

Fixing the problem: instead of err(p) = min(p, 1-p), use H(p)/2 = -(1/2)( p log2(p) + (1-p) log2(1-p) ), half the binary entropy, which like err(p) reaches its maximum value 1/2 at p = 1/2 but is strictly concave. Same example as before: P(+1|A) = 0.7, P(+1) = 0.8, P(+1|B) = 0.9.
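
A worked version of this example (an illustration, not from the slides): the numbers P(+1|A) = 0.7, P(+1) = 0.8, P(+1|B) = 0.9 force P(A) = P(B) = 1/2, and the sketch below compares the reduction in impurity under err and under H/2.

import math

def err(p):
    return min(p, 1 - p)

def half_entropy(p):
    # H(p)/2 = -(1/2) * (p*log2(p) + (1-p)*log2(1-p))
    if p in (0.0, 1.0):
        return 0.0
    return -0.5 * (p * math.log2(p) + (1 - p) * math.log2(1 - p))

for impurity in (err, half_entropy):
    reduction = impurity(0.8) - (0.5 * impurity(0.7) + 0.5 * impurity(0.9))
    print(impurity.__name__, round(reduction, 4))
# err:          0.2   - (0.15  + 0.05 ) = 0      -> the split looks useless
# half_entropy: 0.361 - (0.220 + 0.117) = 0.023  -> the split is rewarded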

Any strictly concave function of p can be used in place of err:
H(p) = -( p log(p) + (1-p) log(1-p) )
Circle(p) = sqrt( 1/4 - (p - 1/2)^2 )
Gini(p) = p(1 - p)
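
For reference, a Python sketch of the three impurity functions above (it assumes base-2 logs for H, as on the earlier slide); all three are concave on [0, 1] and maximized at p = 1/2.

import math

def entropy(p):
    # H(p) = -(p log p + (1-p) log(1-p)), base-2 logs.
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def circle(p):
    # Circle(p) = sqrt(1/4 - (p - 1/2)^2): the upper half of a circle of radius 1/2.
    return math.sqrt(0.25 - (p - 0.5) ** 2)

def gini(p):
    # Gini(p) = p(1 - p)
    return p * (1 - p)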

Decision tree learning algorithm. Learning a decision tree = finding a tree with small error on the training set.
1. Start with the root node.
2. At each step, split one of the leaves.
3. Repeat until a termination criterion is met.

The splitting step. Given: the current tree. For each leaf and each feature, find all possible splitting rules (finite, because the data is finite) and compute the reduction in entropy. Find the leaf x feature x splitting-rule combination that minimizes the resulting entropy (i.e., maximizes the reduction). Add the selected rule to split the selected leaf.
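
A minimal, single-machine sketch of this search for one leaf (assuming the leaf's examples are given as a list of numeric feature vectors rows with labels in {+1, -1}; all names are illustrative):

import math

def entropy(labels):
    # Binary entropy (base 2) of a list of +1/-1 labels; an empty list contributes 0.
    if not labels:
        return 0.0
    p = sum(1 for y in labels if y == +1) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_split(rows, labels):
    # Try every (feature, threshold) rule on this leaf's examples and return
    # the one with the largest reduction in weighted entropy.
    n = len(labels)
    parent = entropy(labels)
    best = (None, None, 0.0)                    # (feature index, threshold, reduction)
    for j in range(len(rows[0])):
        for t in sorted({x[j] for x in rows}):  # candidate thresholds = observed values
            left = [y for x, y in zip(rows, labels) if x[j] <= t]
            right = [y for x, y in zip(rows, labels) if x[j] > t]
            children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
            if parent - children > best[2]:
                best = (j, t, parent - children)
    return best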

Enumerating splitting rules. If a feature has a fixed, small number of values, then either: split on all values (Location is beach / prison / ski-slope), or split on equality to one value (Location = beach). If a feature is continuous (Temperature), then either: sort the records by feature value and search for the best split, or split on percentiles: 1%, 2%, ..., 99%.

Splitting on percentiles. Suppose the data is in an RDD with 100 million examples. Sorting according to each feature value is very expensive. Instead: use sample(false, 0.0001).collect() to get a sample of about 10,000 examples. Sort the sample (it is small, so sort it on the head node). Pick the examples at positions 100, 200, ... as boundaries; call those feature values T1, T2, ..., T99. Broadcast the boundaries to all partitions. Each partition computes its contribution to P(+1 | T_i < f <= T_{i+1}).
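
A rough PySpark sketch of this recipe (an assumption-laden illustration, not the course's actual code): rdd is taken to hold (feature value, label) pairs for a single continuous feature, sc is the SparkContext, and the helper names are made up.

import bisect

# rdd: an RDD of (feature_value, label) pairs with labels in {+1, -1}.
sample = rdd.sample(False, 0.0001).collect()       # roughly 10,000 of 100 million examples
sample.sort(key=lambda pair: pair[0])              # the sample is small: sort on the head node
boundaries = [sample[i][0] for i in range(100, len(sample), 100)]   # T1, T2, ..., T99
bc = sc.broadcast(boundaries)

def bucket_counts(pairs):
    # Per-partition counts of (bucket, label), where bucket i means T_i < f <= T_{i+1}.
    counts = {}
    for f, y in pairs:
        i = bisect.bisect_left(bc.value, f)
        counts[(i, y)] = counts.get((i, y), 0) + 1
    return counts.items()

totals = rdd.mapPartitions(bucket_counts).reduceByKey(lambda a, b: a + b).collectAsMap()
# P(+1 | T_i < f <= T_{i+1}) is then estimated as
#   totals.get((i, +1), 0) / (totals.get((i, +1), 0) + totals.get((i, -1), 0))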

Pruning trees. Trees are very flexible. A fully grown tree is one where all leaves are pure, i.e. each leaf contains only +1 examples or only -1 examples. A fully grown tree has training error zero. If the tree is large and the data is limited, the test error of the tree is likely to be high: the tree overfits the data. Statisticians say that trees are high variance, or unstable. One way to reduce overfitting is pruning, which means that the fully grown tree is made smaller by removing leaves that have few examples and contribute little to the performance on the training set.
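
The slide describes pruning generically; one common concrete variant is cost-complexity pruning, which scikit-learn exposes through the ccp_alpha parameter. The sketch below illustrates that variant on synthetic data and is not necessarily the exact procedure the slides have in mind.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)                      # fully grown: training error ~0
pruned = DecisionTreeClassifier(ccp_alpha=0.005, random_state=0).fit(X_tr, y_tr)   # cost-complexity pruned

print("full  :", full.get_n_leaves(), "leaves, test accuracy", full.score(X_te, y_te))
print("pruned:", pruned.get_n_leaves(), "leaves, test accuracy", pruned.score(X_te, y_te))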

Bagging. Bagging, invented by Leo Breiman in the 1990s, is a different way to reduce the variance of trees. Instead of pruning the tree, we generate many trees, each trained on a randomly selected subset of the training data, and we predict using the majority vote over the trees. A more sophisticated method for reducing variance, which is currently very popular, is Random Forests, which we will discuss in a later lesson.
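
A minimal Python sketch of bagging (illustrative names; it uses scikit-learn trees and bootstrap samples, i.e. random subsets drawn with replacement, which is Breiman's standard recipe):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=100, seed=0):
    # Train n_trees fully grown trees, each on a bootstrap sample of (X, y).
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)          # sample n indices with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    # Majority vote over the trees, assuming labels are +1 / -1.
    votes = np.sum([t.predict(X) for t in trees], axis=0)
    return np.where(votes >= 0, +1, -1)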