An introduction to random forests

An introduction to random forests Eric Debreuve / Team Morpheme Institutions: University Nice Sophia Antipolis / CNRS / Inria Labs: I3S / Inria CRI SA-M / ibv

Outline: Machine learning; Decision tree; Random forest (bagging, random decision trees); Kernel-Induced Random Forest (KIRF); Byproducts (out-of-bag error, variable importance)

Machine learning. Learning/training: build a classification or regression rule from a set of samples (the learning set): samples -> machine learning algorithm -> rule, with class/category = rule(sample) or value = rule(sample). Prediction: assign a class or value to new samples: new samples -> learned rule -> predicted class or predicted value.
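As a concrete illustration of this learn-then-predict workflow, here is a minimal sketch using scikit-learn's RandomForestClassifier; the library, the toy data, and all parameter values are my own choices, not part of the original slides.

```python
# Illustrative sketch only (scikit-learn and the toy data are assumptions,
# not from the slides).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Learning set: samples (feature vectors) together with their classes
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)

# Learning/training: build the classification rule from the learning set
rule = RandomForestClassifier(n_estimators=100, random_state=0)
rule.fit(X_train, y_train)

# Prediction: assign a class to new samples, i.e. class = rule(sample)
X_new = np.random.default_rng(1).normal(size=(3, 5))
print(rule.predict(X_new))
```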

(Un)supervised learning. Supervised: learning set = { (sample [acquisition], class [expert]) }. Unsupervised: learning set = unlabeled samples. Semi-supervised: learning set = some labeled samples + many unlabeled samples.

Ensemble learning. Combining weak classifiers (of the same type) in order to produce a strong classifier. Condition: diversity among the weak classifiers. Example: boosting, where each new weak classifier is trained focusing on the samples misclassified by the previous ones; the weak classifiers only need to be better than a random guess. Popular implementation: AdaBoost.
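A minimal boosting sketch, assuming scikit-learn (the library and data are not referenced in the slides): AdaBoost combines many decision stumps, each trained with more weight on the samples the previous ones misclassified.

```python
# Illustrative sketch (scikit-learn is an assumption, not from the slides).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# AdaBoost's default weak classifier is a depth-1 decision tree (a "stump"),
# which only needs to be better than a random guess.
strong = AdaBoostClassifier(n_estimators=50, random_state=0)
strong.fit(X, y)  # each new stump focuses on samples misclassified so far
```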

Outline: Machine learning; Decision tree; Random forest (bagging, random decision trees); Kernel-Induced Random Forest (KIRF); Byproducts (out-of-bag error, variable importance)

Decision tree. Root node: entry point to a collection of data. Inner nodes (among which the root node): a question is asked about the data, with one child node per possible answer. Leaf nodes: correspond to the decision to take (or conclusion to make) if reached. [Figure: a small tree with question nodes Q1-Q3 and decision leaves D1-D5.] Example: CART (Classification and Regression Tree). Labeled sample: vector of variable/feature values + class label. Binary decision tree, built top-down and greedily by recursively partitioning the feature space into hyper-rectangles; some similarity with weighted kNN. Normally pruned, to avoid over-fitting the learning data and to achieve a trade-off between prediction accuracy and complexity.

Decision tree > CART > Building. All labeled samples are initially assigned to the root node N (= root node). With node N, do:
- Find the feature F and threshold value T that split the samples assigned to N into 2 subsets Sleft and Sright so as to maximize the label purity within these subsets
- Assign (F, T) to N
- If Sleft and Sright are too small to be split: attach child leaf nodes Lleft and Lright to N; tag the leaves with the most present label in Sleft and Sright, resp.
- Else: attach child nodes Nleft and Nright to N; assign Sleft and Sright to them, resp.; repeat the procedure for N = Nleft and N = Nright
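The procedure above can be sketched in a few lines of Python (an illustrative reimplementation, not the author's code); the Gini index is used as the purity measure, and all names are mine.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Find the feature F and threshold T minimizing the weighted child impurity."""
    best_f, best_t, best_score = None, None, np.inf
    n, D = X.shape
    for f in range(D):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best_f, best_t, best_score = f, t, score
    return best_f, best_t

def majority_label(y):
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]

def build(X, y, min_samples=5):
    """Recursively partition the feature space into hyper-rectangles."""
    if len(y) < min_samples or gini(y) == 0.0:
        return {"leaf": majority_label(y)}   # tag the leaf with the most present label
    f, t = best_split(X, y)
    if f is None:                            # no valid split found
        return {"leaf": majority_label(y)}
    mask = X[:, f] <= t                      # assign (F, T) to the node
    return {"feature": f, "threshold": t,
            "left": build(X[mask], y[mask], min_samples),
            "right": build(X[~mask], y[~mask], min_samples)}
```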

Decision tree > CART > Building > Purity. (Im)purity: quality measure applied to each subset Sleft and Sright, then combination of the two measures (e.g., weighted average). Examples, with p_i the proportion of class i within the subset: Gini index = 1 - sum_i p_i^2; Entropy = - sum_i p_i log(p_i); Misclassification error = 1 - max_i p_i.
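A small illustrative helper (not from the slides) that evaluates the three measures for a vector of class proportions:

```python
import numpy as np

def impurities(p):
    """Gini index, entropy, and misclassification error for class proportions p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                  # ignore empty classes (0 log 0 := 0)
    gini = 1.0 - np.sum(p ** 2)
    entropy = -np.sum(p * np.log2(p))
    misclassification = 1.0 - np.max(p)
    return gini, entropy, misclassification

print(impurities([0.5, 0.5]))   # maximally impure 2-class subset: (0.5, 1.0, 0.5)
print(impurities([1.0, 0.0]))   # pure subset: all three measures are zero
```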

Decision tree > CART > Properties. [Table comparing CART, kNN, and SVM on: intrinsically multiclass; handles apple-and-orange (mixed-type) features; robustness to outliers; works with "small" learning sets; scalability (large learning sets); prediction accuracy; parameter tuning.]

Outline: Machine learning; Decision tree; Random forest (bagging, random decision trees); Kernel-Induced Random Forest (KIRF); Byproducts (out-of-bag error, variable importance)

Random forest. Definition: a collection of unpruned CARTs, plus a rule to combine the individual tree decisions. Purpose: improve prediction accuracy. Principle: encourage diversity among the trees; solution: randomness, through bagging and random decision trees (rcarts).

Random forest > Bagging. Bagging: bootstrap aggregation. A technique of ensemble learning used to avoid over-fitting (important since the trees are unpruned) and to improve stability and accuracy. Two steps: bootstrap sample sets, then aggregation.

Random forest > Bagging > Bootstrap. L: original learning set composed of p samples. Generate K learning sets Lk, each composed of q samples (q <= p), obtained by uniform sampling with replacement from L. As a consequence, Lk may contain repeated samples. Random forest: q = p; the asymptotic proportion of unique samples in Lk is 100 x (1 - 1/e) ~ 63%. The remaining samples can be used for testing.
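A quick numerical check of the ~63% figure (illustrative sketch; NumPy and the sample count are my own choices):

```python
import numpy as np

p = 10_000
rng = np.random.default_rng(0)
bootstrap = rng.integers(0, p, size=p)       # q = p indices drawn with replacement
unique_fraction = len(np.unique(bootstrap)) / p
print(unique_fraction)                       # close to 1 - 1/e ~ 0.632
```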

Random forest > Bagging > Aggregation. Learning: for each Lk, one classifier Ck (an rcart) is learned. Prediction: for a new sample S, aggregation = majority vote among the K predictions/votes Ck(S).
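A minimal sketch of the majority-vote aggregation, assuming `classifiers` is a list of learned rcart prediction functions (hypothetical names, not from the slides):

```python
from collections import Counter

def predict_forest(classifiers, S):
    """Aggregate the K individual predictions Ck(S) by majority vote."""
    votes = [Ck(S) for Ck in classifiers]
    return Counter(votes).most_common(1)[0][0]   # the most voted class
```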

Random forest > Random decision tree. All labeled samples are initially assigned to the root node N (= root node). With node N, do:
- Find the feature F, among a random subset of the features, and a threshold value T that split the samples assigned to N into 2 subsets Sleft and Sright so as to maximize the label purity within these subsets
- Assign (F, T) to N
- If Sleft and Sright are too small to be split: attach child leaf nodes Lleft and Lright to N; tag the leaves with the most present label in Sleft and Sright, resp.
- Else: attach child nodes Nleft and Nright to N; assign Sleft and Sright to them, resp.; repeat the procedure for N = Nleft and N = Nright
Random subset of features: the random drawing is repeated at each node; for D-dimensional samples, the typical subset size is round(sqrt(D)) (also round(log2(D))). This increases diversity among the rcarts and reduces the computational load. Typical purity measure: Gini index.
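Compared with the CART split search sketched earlier, the only change is restricting the candidate features to a fresh random subset at each node; a hedged sketch (all names are mine):

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_random_split(X, y, rng):
    """Split search restricted to a random feature subset of size round(sqrt(D))."""
    n, D = X.shape
    subset = rng.choice(D, size=max(1, int(round(np.sqrt(D)))), replace=False)
    best_f, best_t, best_score = None, None, np.inf
    for f in subset:                                 # only the randomly drawn features
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best_f, best_t, best_score = f, t, score
    return best_f, best_t

# The drawing is repeated at every node, e.g. with rng = np.random.default_rng(0).
```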

Random forest > Properties. [Table comparing RF, CART, kNN, and SVM on the same criteria as before: intrinsically multiclass; handles apple-and-orange features; robustness to outliers; works with "small" learning sets; scalability (large learning sets); prediction accuracy; parameter tuning.]

Random forest > Illustration. [Figure panels: 1 rcart, 10 rcarts, 100 rcarts, 500 rcarts.]

Random forest > Limitations. Oblique/curved frontiers: staircase effect (many pieces of hyperplanes). Fundamentally discrete. Functional data? (Example: curves.)

Outline: Machine learning; Decision tree; Random forest (bagging, random decision trees); Kernel-Induced Random Forest (KIRF); Byproducts (out-of-bag error, variable importance)

Kernel-Induced Random Forest (KIRF). Random forest: a sample S is a vector, and the features of S are the components of S. Kernel-induced features: given a learning set L = { Si, i in [1..N] } and a kernel K(x, y), the features of a sample S are { Ki(S) = K(Si, S), i in [1..N] }. Samples S and Si can be vectors or functional data.
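A minimal sketch of the kernel-induced feature construction (the Gaussian kernel and the parameter sigma are illustrative assumptions, not prescribed by the slides):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); sigma is an illustrative choice."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma ** 2))

def kernel_features(S, learning_set, kernel=gaussian_kernel):
    """Features of S: { K_i(S) = K(S_i, S), i in [1..N] }."""
    return np.array([kernel(Si, S) for Si in learning_set])

# A random forest can then be grown on these N-dimensional kernel-induced features.
```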

Kernel > Kernel trick. The kernel trick maps samples into an inner product space, usually of higher dimension (possibly infinite), in which classification (or regression) is easier, typically linear. Kernel K(x, y): symmetric and positive semi-definite (Mercer's condition): integral of K(x, y) g(x) g(y) dx dy >= 0 for any square-integrable function g. Note: the mapping need not be known (it might not even have an explicit representation, e.g., for the Gaussian kernel).

Kernel > Examples. Polynomial (homogeneous): K(x, y) = (x . y)^d. Polynomial (inhomogeneous): K(x, y) = (x . y + c)^d. Hyperbolic tangent: K(x, y) = tanh(kappa x . y + c). Gaussian: K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), a function of the distance between samples, hence with a straightforward application to functional data of a metric space (e.g., curves).

KIRF > Illustration. Gaussian kernel; some similarity with a vantage-point tree. [Figure panels: reminder of RF with 100 rcarts vs. KIRF with 100 rcarts.]

KIRF > Limitations. Which kernel? Which kernel parameters? No apple-and-orange feature handling anymore (the kernel mixes features through x . y or (x - y)^2). Computational load (kernel evaluations), especially during learning. Needs to store the samples (instead of feature indices as in a random forest).

Outline: Machine learning; Decision tree; Random forest (bagging, random decision trees); Kernel-Induced Random Forest (KIRF); Byproducts (out-of-bag error, variable importance)

Byproduct > Reminder. To grow one rcart, a bootstrap sample set is drawn from the learning set L; the remaining samples are called out-of-bag samples and can be used for testing. Two points of view: for one rcart, the out-of-bag samples = L \ bootstrap samples (used for variable importance); for one sample S of L, the set of rcarts for which S was out-of-bag (used for out-of-bag error).

Byproduct > Out-of-bag error. For each sample S of the learning set: look for all the rcarts for which S was out-of-bag, build the corresponding sub-forest, and predict the class of S with it; error = is the prediction correct? Out-of-bag error = average of these errors over all samples S of L. Note: the predictions are not made using the whole forest, but still with some aggregation. Provides an estimate of the generalization error; can be used to decide when to stop adding trees to the forest.
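A hedged sketch of the out-of-bag error computation, assuming `trees` is a list of (prediction function, in-bag index set) pairs; this representation and the names are hypothetical, not from the slides.

```python
import numpy as np
from collections import Counter

def oob_error(trees, X, y):
    """trees: list of (predict_one, in_bag_indices) pairs; X, y: the learning set."""
    errors = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Sub-forest: only the rcarts for which sample i was out-of-bag
        votes = [predict_one(xi) for predict_one, in_bag in trees if i not in in_bag]
        if not votes:
            continue                       # sample i was in-bag for every rcart
        predicted = Counter(votes).most_common(1)[0][0]
        errors.append(predicted != yi)
    return float(np.mean(errors))
```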

Byproduct > Variable importance. For each rcart: compute the out-of-bag error OOBoriginal (fraction of misclassified out-of-bag samples); consider the i-th feature/variable of the samples, randomly permute its values among the out-of-bag samples, and re-compute the out-of-bag error OOBpermutation; rcart-level importance(i) = OOBpermutation - OOBoriginal. Variable importance(i) = average over all rcarts. Note: the errors are rcart-based (no aggregation), to avoid attenuation of individual errors.
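A hedged sketch of the rcart-level permutation importance, where `predict`, `X_oob`, and `y_oob` are hypothetical names for one rcart's (vectorized) prediction function and its out-of-bag samples:

```python
import numpy as np

def rcart_importance(predict, X_oob, y_oob, i, rng):
    """Increase of one rcart's out-of-bag error when feature i is permuted."""
    oob_original = np.mean(predict(X_oob) != y_oob)
    X_perm = X_oob.copy()
    X_perm[:, i] = rng.permutation(X_perm[:, i])   # permute feature i among the OOB samples
    oob_permutation = np.mean(predict(X_perm) != y_oob)
    return oob_permutation - oob_original

# Variable importance(i) = average of rcart_importance(..., i, ...) over all rcarts.
```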

An introduction to random forests Thank you for your attention